Apache Beam: A unified model for batch and stream processing data

Apache Beam, a new distributed processing tool currently being incubated at the ASF, provides an abstraction layer that lets developers focus on their processing logic, expressed with the Beam programming model. Thanks to Apache Beam, an implementation is agnostic to the underlying runtime technology, meaning you can switch technologies quickly and easily.

Apache Beam also offers a programming model that is unified across processing styles, which allows developers to implement both batch and streaming data processing with the same model. That's actually where the Apache Beam name comes from: B (for Batch) and EAM (for strEAM).

To implement your data processes using the Beam programming model, you use an SDK or DSL provided by Beam. Today there is really only one SDK: the Java SDK. However, a Python SDK is expected to be released, and Beam will soon provide a Scala SDK and additional DSLs (a declarative XML DSL, for instance).
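
For instance, here is a minimal word count pipeline written with the Java SDK. This is just a sketch: the input and output paths are placeholders, and the transform names are arbitrary.

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read each line of the input file (path is a placeholder).
        .apply("ReadLines", TextIO.read().from("input.txt"))
        // Split lines into individual words.
        .apply("ExtractWords", FlatMapElements
            .into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("\\s+"))))
        // Count the occurrences of each word.
        .apply("CountWords", Count.perElement())
        // Format each (word, count) pair as a line of text.
        .apply("FormatResults", MapElements
            .into(TypeDescriptors.strings())
            .via((KV<String, Long> wc) -> wc.getKey() + ": " + wc.getValue()))
        // Write the results (path prefix is a placeholder).
        .apply("WriteCounts", TextIO.write().to("word-counts"));

    pipeline.run().waitUntilFinish();
  }
}
```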

With Apache Beam, you first choose a Beam SDK and then implement your data processes as Beam pipelines. You don't need to worry about the actual runtime where your processes will be deployed and run: that is the job of the runners, which translate your pipelines to the target runtime. Apache Beam provides turnkey runners for Apache Spark, Apache Flink, and the Google Cloud Dataflow platform, and will soon provide new runners for Apache Hadoop MapReduce, Apache Karaf, and more.
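
To make the runner swap concrete, here is a hedged sketch of how a runner is selected: the choice is made at launch time through pipeline options, so the pipeline code itself never changes. The runner names shown are the ones used by recent Beam releases.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelection {
  public static void main(String[] args) {
    // The runner is chosen at launch time via pipeline options, e.g.:
    //   --runner=DirectRunner                            (local testing)
    //   --runner=SparkRunner                             (Apache Spark)
    //   --runner=FlinkRunner                             (Apache Flink)
    //   --runner=DataflowRunner --project=<gcp-project>  (Google Cloud Dataflow)
    // The corresponding runner artifact must be on the classpath.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // ... apply the same transforms here, regardless of the runner ...

    pipeline.run().waitUntilFinish();
  }
}
```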

To summarize, Apache Beam offers several unique capabilities for advancing your big data environment, including:

  • Portability - Your data processing jobs are written once, decoupled from the actual runtime. You can use the same code with different runners (abstraction) and backends: on premises, in the cloud, or locally.
  • Unified processing - Apache Beam provides the same unified model for batch and stream processing.
  • Advanced features - The Apache Beam programming model supports advanced features that implement the latest data processing patterns: event-time windowing, triggering, watermarks, handling of late-arriving data, and so on (see the sketch after this list).
  • Extensible model and SDK - Most parts of Apache Beam can be extended to meet your varying environmental needs. Say you want to support a new runtime: fine, you can create your own runner. Perhaps you wish to create a DSL: no problem, just build it on top of one of the provided SDKs. Or say you need to support a new data source: just create a new Beam IO (a custom data source and sink).
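
To give a feel for those advanced features, here is a minimal sketch of event-time windowing with triggering and allowed lateness, using the Java SDK. The `events` collection, the window size, and the lateness durations are illustrative assumptions, not prescriptions.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingSketch {
  // `events` is assumed to be an unbounded PCollection whose elements
  // carry event timestamps (for example, read from a message broker).
  static PCollection<String> applyWindowing(PCollection<String> events) {
    return events.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Fire when the watermark passes the end of each window...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...and fire again for late data, shortly after it arrives.
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(5))))
            // Accept data arriving up to 10 minutes late; drop anything later.
            .withAllowedLateness(Duration.standardMinutes(10))
            // Each firing re-emits everything seen so far in the window.
            .accumulatingFiredPanes());
  }
}
```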
Details: https://www.slideshare.net/HadoopSummit/apache-beam-a-unified-model-for-batch-and-stream-processing-data
