File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks by Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:

1. NYC Taxi Data (Every taxi cab ride in NYC from 2009)
2. GitHub Logs (All actions on public GitHub repositories)
3. Sales (Generated data)
 

3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown 
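To make the three access patterns concrete, here is a minimal toy sketch in plain Python. It is not the ORC or Parquet API: the stripe layout, the function names (make_stripes, full_scan, project, scan_with_pushdown) and the sample rows are all hypothetical, chosen only to show the idea that a columnar layout with per-stripe min/max statistics lets a reader skip columns (projection) or whole stripes (predicate pushdown).

```python
# Toy columnar layout: each stripe stores its columns separately,
# plus (min, max) statistics per column. Hypothetical names throughout.

def make_stripes(rows, stripe_size):
    stripes = []
    for start in range(0, len(rows), stripe_size):
        chunk = rows[start:start + stripe_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        stripes.append({"columns": columns, "stats": stats})
    return stripes

def full_scan(stripes):
    """Full table scan: read every column of every row."""
    out = []
    for s in stripes:
        names = list(s["columns"])
        row_count = len(s["columns"][names[0]])
        for i in range(row_count):
            out.append({name: s["columns"][name][i] for name in names})
    return out

def project(stripes, wanted):
    """Column projection: touch only the requested columns."""
    out = []
    for s in stripes:
        cols = [s["columns"][name] for name in wanted]
        out.extend(dict(zip(wanted, values)) for values in zip(*cols))
    return out

def scan_with_pushdown(stripes, column, lower):
    """Predicate pushdown for `column >= lower`.

    Stripes whose recorded max is below `lower` are skipped without
    reading their data. Returns (matching rows, stripes actually read).
    """
    out, stripes_read = [], 0
    for s in stripes:
        _, col_max = s["stats"][column]
        if col_max < lower:
            continue  # statistics prove no row in this stripe can match
        stripes_read += 1
        names = list(s["columns"])
        for i, value in enumerate(s["columns"][column]):
            if value >= lower:
                out.append({name: s["columns"][name][i] for name in names})
    return out, stripes_read

# Made-up sample data: 100 rides, stored in 4 stripes of 25 rows.
rows = [{"ride_id": i, "fare": float(i % 50), "borough": "B%d" % (i % 5)}
        for i in range(100)]
stripes = make_stripes(rows, 25)

matched, stripes_read = scan_with_pushdown(stripes, "ride_id", 80)
print(len(full_scan(stripes)), len(project(stripes, ["fare"])),
      len(matched), stripes_read)
```

In this sketch, pushdown only helps because ride_id is sorted, so the min/max ranges of the stripes do not overlap; on an unsorted column most stripes would still have to be read.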


Compression results

 

 

Use case results

Full Table Scans



 



Column Projection


Dataset  Format   Compression  µs/row  Projection  Percent time
github   orc      zlib         21.319  0.185        0.87%
github   parquet  zlib         72.494  0.585        0.81%
sales    orc      zlib          1.866  0.056        3.00%
sales    parquet  zlib         12.893  0.329        2.55%
taxi     orc      zlib          2.766  0.063        2.28%
taxi     parquet  zlib          3.496  0.718       20.54%
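
The Percent time figures above appear to be the projection figure expressed as a share of the per-row figure. That reading is an inference from the numbers, not something the source states. A short sketch that recomputes the column under that assumption (the function name percent_time is hypothetical):

```python
# (dataset, format, per-row figure, projection figure) copied from the results above.
results = [
    ("github", "orc", 21.319, 0.185),
    ("github", "parquet", 72.494, 0.585),
    ("sales", "orc", 1.866, 0.056),
    ("sales", "parquet", 12.893, 0.329),
    ("taxi", "orc", 2.766, 0.063),
    ("taxi", "parquet", 3.496, 0.718),
]

def percent_time(per_row, projection):
    """Assumed definition: projection as a percentage of the per-row figure."""
    return 100.0 * projection / per_row

for dataset, fmt, per_row, projection in results:
    print(f"{dataset:8s}{fmt:9s}{percent_time(per_row, projection):6.2f}%")
```

Under this assumption the recomputed values agree with the published column to two decimal places, including the taxi/parquet row, where the projection figure is a much larger share of the per-row figure than in the other rows.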

Recommendations

(Disclaimer – Everything changes! Both these benchmarks and the formats will change)
  • Don’t use JSON for processing.
  • If your use case needs column projection or predicate pushdown, use ORC or Parquet.
  • For complex tables with common strings, Avro with Snappy is a good fit (without projection).
  • For other tables, ORC with Zlib or Snappy is a good fit.
Details: https://www.safaribooksonline.com/library/view/strata-hadoop/9781491944660/video282727.html
Slides: https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51952
