Apache Arrow - In-Memory Columnar Data Structure

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

  • A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms which process Arrow data structures.
  • Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.
  • A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.

Arrow isn’t a standalone piece of software but rather a component used to accelerate analytics within a particular system and to allow Arrow-enabled systems to exchange data with low overhead. It is sufficiently flexible to support most complex data models.

Performance Advantage of Columnar In-Memory
SIMD
Advantages of a Common Data Layer
Details:

Comments