k-Nearest Neighbors (kNN) for Flink

A young and exciting open source tool for distributed data processing known as Apache Flink has recently emerged as a player in the data engineering ecosystem. Similar data processing tools do indeed already exist, most notably Spark, Storm and Hadoop MapReduce. Compared to existing technologies, Flink has a unique framework, placing batch and streaming into a unified streaming framework. In contrast, Spark is a batch processing tool and the Spark Streaming lumps relatively small amounts of data into “micro-batches”. Storm is able to process data one-by-one in a purely streaming way, though does not have a batch processing framework.

Flink, on the other hand, operates in a purely streaming framework, and instantiates the vision of Jay Kreps of the kappa architecture. The quick rise in popularity and development of Flink should be noted: Flink started as a university project in Berlin, and in a matter of a mere eight months Flink went from Incubator status to becoming a Top-Level Apache project in December 2014. Even more recently, Yahoo wrote a blog about their experiences in using Flink in comparison to other tools.

For my Insight project, I improved on an initial brute-force approach to add an exact k-nearest neighbors (kNN) algorithm by writing a quadtree data structure. I have since made a pull-request, and the Flink community intends to soon merge the work with its main body of source code.

Details: https://blog.insightdatascience.com/planting-quadtrees-for-apache-flink-b396ebc80d35

Comments