Apache Arrow - In-Memory Columnar Data Structure

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.

Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:

A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms which process Arrow data structures.
Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.
A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.

Arrow isn’t a standalone piece of software but rather a component used to accelerate analytics within a particular system and to allow Arrow-enabled systems to exchange data with low overhead. It is sufficiently flexible to support most complex data models.

Performance Advantage of Columnar In-Memory

Advantages of a Common Data Layer

Details:

https://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/

Search This Blog

Notes about Cutting-Edge Technologies and Everything

Apache Arrow - In-Memory Columnar Data Structure

Comments

Post a Comment