Showing posts from October, 2016

File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks from Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:
1. NYC Taxi Data (every taxi cab ride in NYC from 2009)
2. GitHub Logs (all actions on GitHub public repositories)
3. Sales (generated data)

3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown

Compression results

Use case results

Full Table Scans

Column Projection

Dataset  Format   Compression  us/row  Projection  Percent time
github   orc      zlib         21.319  0.185       0.87%
github   parquet  zlib         72.494  0.585       0.81%
sales    orc      zlib         1.866   0.056       3.00%
sales    parquet  zlib         12.893  0.329       2.55%
taxi     orc      zlib         2.766   0.063       2.28%
taxi     parquet  zlib         3.496   0.718       20.54%
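The post does not say how the percent time column in the Column Projection results is derived. A minimal sketch, assuming "us/row" is the full-read time and "projection" is the time when reading only the projected columns (both taken to be microseconds per row), recomputes it as projection divided by us/row. The names `ROWS`, `percent_time` and `report` are hypothetical helpers for illustration and do not come from the benchmark code.

```python
# Values copied from the Column Projection results:
# (dataset, format, us_per_row, projection_us_per_row)
ROWS = [
    ("github", "orc", 21.319, 0.185),
    ("github", "parquet", 72.494, 0.585),
    ("sales", "orc", 1.866, 0.056),
    ("sales", "parquet", 12.893, 0.329),
    ("taxi", "orc", 2.766, 0.063),
    ("taxi", "parquet", 3.496, 0.718),
]


def percent_time(us_per_row, projection_us_per_row):
    """Projection time as a percentage of the us/row time (assumed relation)."""
    return 100.0 * projection_us_per_row / us_per_row


def report(rows):
    """Return (dataset, format, percent) tuples, rounded to two decimals."""
    return [
        (dataset, fmt, round(percent_time(full, proj), 2))
        for dataset, fmt, full, proj in rows
    ]


for dataset, fmt, pct in report(ROWS):
    print(f"{dataset:<7} {fmt:<8} {pct:.2f}%")
```

Under that assumption the recomputed percentages agree with the published column for all six rows, which suggests the reading is correct. It also makes the outlier easy to see: taxi with Parquet spends about a fifth of its us/row time on the projection, against roughly 1% to 3% for every other row.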