File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)
Benchmarks from Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.
3 datasets:
1. NYC Taxi Data (Every taxi cab ride in NYC from 2009)
2. Github Logs (All actions on Github public repositories)
3. Sales (Generated data)
3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)
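To get a feel for what these settings mean in practice, here is a minimal sketch using only Python's standard library. It compresses some made-up, repetitive rows with zlib at the default level and compares the result with the uncompressed size. Snappy is not in the standard library (it needs a third-party package), so it is left out. The sample rows and the function name `compressed_sizes` are assumptions for illustration, not the benchmark's data or code.

```python
import json
import zlib


def compressed_sizes(rows):
    """Return (raw_bytes, zlib_bytes) for rows serialized as JSON lines."""
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in rows).encode("utf-8")
    packed = zlib.compress(raw)  # default compression level
    # Round-trip check: decompression must give back the exact input.
    assert zlib.decompress(packed) == raw
    return len(raw), len(packed)


# Hypothetical, highly repetitive sample rows (not the benchmark datasets).
rows = [
    {"vendor": "A" if i % 2 else "B", "passengers": i % 4, "fare": 10.0 + i % 7}
    for i in range(10_000)
]

raw_size, zlib_size = compressed_sizes(rows)
print(f"none: {raw_size} bytes, zlib: {zlib_size} bytes")
```

The ratio you see depends heavily on how repetitive the data is, which is one reason the benchmark uses three quite different datasets.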
3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown
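The three use cases above can be illustrated with a toy columnar layout in plain Python. This is only a sketch of the ideas, not how ORC or Parquet are implemented: each "chunk" stores its values column by column and keeps min/max statistics for one column, so a reader can fetch a single column (projection) or skip chunks that cannot match a filter (predicate pushdown). All names here (`make_chunk`, `full_scan`, `project`, `fares_above`) and the sample rows are hypothetical.

```python
def make_chunk(rows):
    """Pivot rows into columns and record min/max of 'fare' at write time."""
    columns = {name: [r[name] for r in rows] for name in rows[0]}
    stats = {"fare": (min(columns["fare"]), max(columns["fare"]))}
    return {"columns": columns, "stats": stats}


def full_scan(chunks):
    """Use case 1: read every column of every row."""
    for chunk in chunks:
        names = list(chunk["columns"])
        for values in zip(*(chunk["columns"][n] for n in names)):
            yield dict(zip(names, values))


def project(chunks, name):
    """Use case 2: read a single column and never touch the others."""
    for chunk in chunks:
        yield from chunk["columns"][name]


def fares_above(chunks, threshold):
    """Use case 3: skip whole chunks whose max fare cannot satisfy the filter."""
    chunks_read = 0
    matches = []
    for chunk in chunks:
        _, fare_max = chunk["stats"]["fare"]
        if fare_max <= threshold:
            continue  # the statistics rule this chunk out without reading it
        chunks_read += 1
        matches.extend(f for f in chunk["columns"]["fare"] if f > threshold)
    return matches, chunks_read


# Hypothetical data: three chunks of taxi-like rows.
chunks = [
    make_chunk([{"id": 1, "fare": 5.0}, {"id": 2, "fare": 7.5}]),
    make_chunk([{"id": 3, "fare": 12.0}, {"id": 4, "fare": 30.0}]),
    make_chunk([{"id": 5, "fare": 8.0}, {"id": 6, "fare": 9.0}]),
]

print(next(full_scan(chunks)))       # first row, all columns
print(list(project(chunks, "fare")))  # one column only
print(fares_above(chunks, 10.0))      # ([12.0, 30.0], 1): only one chunk was read
```

In the last call, two of the three chunks are skipped based on their statistics alone, which is the saving predicate pushdown is after.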
Compression results
Use cases results
Full Table Scans
Column Projection
| Dataset | Format | Compression | Full read (µs/row) | Projection (µs/row) | Percent time |
|---|---|---|---|---|---|
| github | orc | zlib | 21.319 | 0.185 | 0.87% |
| github | parquet | zlib | 72.494 | 0.585 | 0.81% |
| sales | orc | zlib | 1.866 | 0.056 | 3.00% |
| sales | parquet | zlib | 12.893 | 0.329 | 2.55% |
| taxi | orc | zlib | 2.766 | 0.063 | 2.28% |
| taxi | parquet | zlib | 3.496 | 0.718 | 20.54% |
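The "Percent time" column appears to be the projection time divided by the full-read time for the same dataset and format. A quick check in Python, using the numbers copied from the table above, reproduces the published percentages; the helper name `percent_time` is just for illustration.

```python
# (dataset, format, full-read us/row, projection us/row), copied from the table.
rows = [
    ("github", "orc", 21.319, 0.185),
    ("github", "parquet", 72.494, 0.585),
    ("sales", "orc", 1.866, 0.056),
    ("sales", "parquet", 12.893, 0.329),
    ("taxi", "orc", 2.766, 0.063),
    ("taxi", "parquet", 3.496, 0.718),
]


def percent_time(full_us, projection_us):
    """Projection cost as a percentage of the full-read cost."""
    return 100.0 * projection_us / full_us


for dataset, fmt, full_us, proj_us in rows:
    print(f"{dataset:7s} {fmt:8s} {percent_time(full_us, proj_us):6.2f}%")
```

Read this way, reading one projected column costs only a few percent of a full read in every case except Parquet on the taxi data, where it is about a fifth.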
Recommendations
(Disclaimer – Everything changes! Both these benchmarks and the formats will change)
- Don’t use JSON for processing.
- If your use case needs column projection or predicate pushdown, use ORC or Parquet.
- For complex tables with common strings, Avro with Snappy is a good fit (without projection).
- For other tables, ORC with Zlib or Snappy is a good fit.
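As a rough way to remember these recommendations, they can be written as a small rule list. This is only a sketch: the function name, the parameters, and especially the order in which the rules are checked are assumptions (the talk's Avro suggestion is explicitly for the case without projection, so the projection/pushdown rule is checked first here).

```python
def recommend_format(needs_projection_or_pushdown=False,
                     complex_with_common_strings=False):
    """Map the recommendations above onto simple rules (ordering is an assumption)."""
    if needs_projection_or_pushdown:
        return "ORC or Parquet"
    if complex_with_common_strings:
        return "Avro with Snappy"
    return "ORC with Zlib or Snappy"


print(recommend_format(needs_projection_or_pushdown=True))
print(recommend_format(complex_with_common_strings=True))
print(recommend_format())
```

JSON never appears as an answer, in line with the first recommendation.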
Slides: https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51952