
Showing posts with the label Avro

Real-Time Stock Processing With Apache NiFi and Apache Kafka

Implementing Streaming Use Case From REST to Hive With Apache NiFi and Apache Kafka, Part 1

With Apache Kafka 2.0 and Apache NiFi 1.8, many new features and abilities are coming out. It's time to put them to the test. To plan out what we are going to do, I have a high-level architecture diagram. We are going to ingest a number of sources, including REST feeds, social feeds, messages, images, documents, and relational data. We will ingest with NiFi and then filter, process, and segment the data into Kafka topics. Kafka data will be in Apache Avro format, with schemas specified in the Hortonworks Schema Registry. Spark and NiFi will do additional event processing along with machine learning and deep learning. The results will be stored in Druid for real-time analytics and summaries. Hive, HDFS, and S3 will store the data permanently. We will build dashboards with Superset and Spark SQL + Zeppelin. We will also push cleaned and aggregated data back to subscribers via Kafka ...
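As a rough illustration of the Kafka + Avro piece of that pipeline, here is a minimal sketch (not from the post, which uses NiFi and the Hortonworks Schema Registry) that publishes Avro-encoded stock events to Kafka with fastavro and kafka-python. The broker address, the "stocks" topic, and the StockTick schema are all assumptions for the example:

import io

from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer

# Hypothetical stock-tick schema; in the post's setup, schemas live in
# the Hortonworks Schema Registry rather than inline in the producer.
schema = parse_schema({
    "type": "record",
    "name": "StockTick",
    "fields": [
        {"name": "symbol", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "timestamp", "type": "long"},
    ],
})

def to_avro_bytes(record: dict) -> bytes:
    """Serialize one record to schemaless Avro bytes."""
    buf = io.BytesIO()
    schemaless_writer(buf, schema, record)
    return buf.getvalue()

# Assumed local broker and topic name, for illustration only.
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=to_avro_bytes)
producer.send("stocks", {"symbol": "HDP", "price": 12.34,
                         "timestamp": 1541520000000})
producer.flush()

In the article's actual flow, NiFi's record processors handle this serialization declaratively; the sketch just shows what an Avro-over-Kafka message amounts to on the wire.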

File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks from Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:
1. NYC Taxi Data (every taxi cab ride in NYC from 2009)
2. GitHub Logs (all actions on GitHub public repositories)
3. Sales (generated data)

3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown

Compression results [chart]

Use case results

Full Table Scans [chart]

Column Projection

Dataset  Format   Compression  Full read (µs/row)  Projection (µs/row)  Percent of full-read time
github   orc      zlib         21.319              0.185                0.87%
github   parquet  zlib         72.494              0.585                0.81%
sales    orc      zlib          1.866              0.056                3.00%
sales    parquet  zlib         12.893              0.329                2.55%
taxi     orc      zlib          2.766              0.063                2.28%
taxi     parquet  zlib          3.496              ...
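The table shows why columnar formats win on projection: reading two columns costs roughly 1-3% of a full scan. As a rough sketch (not the talk's benchmark harness), here is what the three access patterns look like with pyarrow; the file name "taxi.parquet" and the column names are assumptions:

import pyarrow.parquet as pq

# Full table scan: every column of every row is decoded.
full = pq.read_table("taxi.parquet")

# Column projection: only the requested columns are decoded, which is
# why the projected reads above cost ~1-3% of a full scan.
# Column names here are hypothetical.
projected = pq.read_table("taxi.parquet",
                          columns=["pickup_datetime", "fare_amount"])

# Predicate pushdown: row groups whose min/max statistics cannot
# satisfy the filter are skipped before any decoding happens.
filtered = pq.read_table("taxi.parquet",
                         columns=["fare_amount"],
                         filters=[("fare_amount", ">", 100.0)])

ORC readers expose the same three patterns; the benchmark's point is that both formats keep per-column data contiguous and indexed, so projection and pushdown skip most of the I/O.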