Posts

Showing posts from April, 2016

k-Nearest Neighbors (kNN) for Flink

Apache Flink, a young open source tool for distributed data processing, has recently emerged as a player in the data engineering ecosystem. Similar data processing tools already exist, most notably Spark, Storm and Hadoop MapReduce. What sets Flink apart is that it treats batch and stream processing under a single, unified streaming framework. In contrast, Spark is a batch processing tool, and Spark Streaming lumps relatively small amounts of data into “micro-batches”. Storm can process data one record at a time in a purely streaming way, though it does not have a batch processing framework. Flink, on the other hand, operates in a purely streaming framework and realizes Jay Kreps's vision of the kappa architecture. The quick rise in popularity and development of Flink is notable: Flink started as a university project in Berlin, and in a mere eight months it went from Incubator status to becoming a Top-Level Apache

Adopting Self-Service BI with Tableau - Notes from the field

(I originally wrote and posted this article on March 7, 2016 at datasciencecentral.com; I am now transferring it here.) I have spent many hours planning and executing an in-company self-service BI implementation, which gave me several insights. Now that the ideas have matured and been proven in the field, I believe they are worth sharing. No matter how far along you are in exploring potential approaches (possibly you are already in the thick of it!), I hope my attempt to describe feasible scenarios provides a decent foundation. All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance. Scenario 1. Tableau Desktop + departmental/cross-functional data schemas. In this scenario, data analysts gain insights on a daily basis. They might be either independent individuals or a team. Business users can interact with published workbooks, but only through simple filtering.