Posts

Showing posts with the label Apache Arrow

What's the future of the pandas library?

Pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit. I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But numerous data scientists have asked me questions like these over the years: "Is pandas reliable?" "Will it keep working in the future?" "Is it buggy? They haven't even released version 1.0!" Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability ...

Dremio 2.1 is shipped with many new features!

This is a major release that includes many new features, performance improvements, and hundreds of stability enhancements. See the highlights and more details below.
• Elasticsearch 6. Dremio now supports the latest versions of Elasticsearch. Enjoy full SQL support, including JOINs, window functions, and accelerated analytics through any BI tool, including Tableau and Power BI. We also added support for compressing Elasticsearch responses to minimize network traffic.
• Approximate count distinct acceleration. Dremio now supports accelerating count distinct queries with an approximation algorithm (HyperLogLog). This is a faster and more memory-efficient way of computing distinct counts and is especially useful in high-cardinality scenarios with very large datasets.
• Faster ORC performance. Data encoded in ORC is now significantly faster to access and more memory efficient for ORC managed in Hive sources.
• Support for AWS GovClou...

What Open Source Software Do You Use?

To gather insights on the current and future state of open source software (OSS), we talked to 31 executives. This is nearly double the number we speak to for a research guide, and we believe this reiterates the popularity of, acceptance of, and demand for OSS. We began by asking, "What Open Source software do you use?" As you would expect, most respondents are using several versions of open source software. Here's what they told us:
Apache
Apache Cassandra, Elassandra (ElasticSearch + Cassandra), Spark, and Kafka (as the core tech we provide through our managed service) are the big ones for us. We find that the governance arrangements and independence of the Apache Foundation make a great foundation for strong open source projects. 95% of what we do with big data is open source. We use Apache Hadoop and contribute back to grow skills and expertise. We use so much that it would be impossible to list. The core of our software is based on Apache So...

Apache Arrow - In-Memory Columnar Data Structure

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de facto standard for columnar in-memory processing and interchange. Here’s how it works. Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:
• A columnar memory layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors. Developers can create very fast algorithms which process Arrow data structures.
• Efficient and fast data interchange between systems without the serialization costs associated with other systems like Thrift, Avro, and Protocol Buffers.
• A flexible structured data model supporting complex types that handles flat tables as well as real-world JSON-like data engineering workloads.
Arrow isn’t a standalone piece of software but rather a component used to accelerate analytics within a particular system and to...
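
The columnar idea in the Arrow excerpt can be illustrated with a small sketch. This is not Arrow's API or its actual memory format: it uses only Python's standard `array` module, and the field names (`id`, `price`) and the helper `value_at` are hypothetical, chosen for illustration. It shows each field stored as one contiguous fixed-width buffer, which is what makes constant-time random access and cheap single-column scans possible.

```python
from array import array

# Row-oriented records, as they might arrive from a JSON-like source.
# (Hypothetical sample data, for illustration only.)
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented layout: one contiguous, fixed-width buffer per field.
# Assumption: this is a simplification. The real Arrow specification also
# covers nulls, variable-length values, and nested types.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}


def value_at(column_name, i):
    """O(1) random access: index directly into the fixed-width buffer."""
    return columns[column_name][i]


# Scanning one column reads only that column's contiguous memory.
total_price = sum(columns["price"])

# A memoryview exposes the same buffer without copying the values.
price_view = memoryview(columns["price"])

print(value_at("id", 1), total_price, price_view[2])
```

Because every value in a column has the same width, the position of element `i` is a simple offset computation, and a consumer that can read the buffer directly does not need to re-serialize it.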