k-Nearest Neighbors (kNN) for Flink
A young and exciting open source tool for distributed data processing known as Apache Flink has recently emerged as a player in the data engineering ecosystem. Similar data processing tools do indeed already exist, most notably Spark, Storm and Hadoop MapReduce. Compared to existing technologies, Flink has a unique framework, placing batch and streaming into a unified streaming framework. In contrast, Spark is a batch processing tool and the Spark Streaming lumps relatively small amounts of data into “micro-batches”. Storm is able to process data one-by-one in a purely streaming way, though does not have a batch processing framework. Flink, on the other hand, operates in a purely streaming framework, and instantiates the vision of Jay Kreps of the kappa architecture. The quick rise in popularity and development of Flink should be noted: Flink started as a university project in Berlin, and in a matter of a mere eight months Flink went from Incubator status to becoming a Top-Level Apache...