File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks from Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:

1. NYC Taxi Data (Every taxi cab ride in NYC from 2009)
2. Github Logs (All actions on Github public repositories)
3. Sales (Generated data)

3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown
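
To make the three access patterns concrete, here is a minimal pure-Python sketch. The in-memory layout (row groups holding separate column lists plus min/max statistics), the column names `fare` and `tip`, and all function names are assumptions made for illustration. This is not the actual ORC or Parquet on-disk format or reader API, only the general idea behind each use case.

```python
# Toy columnar table: each row group stores its columns separately plus
# min/max statistics for one column. This layout is an illustrative
# assumption, not the actual ORC or Parquet on-disk format.
ROW_GROUPS = [
    {"columns": {"fare": [5.0, 7.5, 9.0], "tip": [1.0, 0.0, 2.0]},
     "stats": {"fare": (5.0, 9.0)}},
    {"columns": {"fare": [20.0, 35.0, 50.0], "tip": [4.0, 5.0, 10.0]},
     "stats": {"fare": (20.0, 50.0)}},
]


def full_table_scan(row_groups):
    """Read every column of every row and rebuild the rows."""
    rows = []
    for group in row_groups:
        names = list(group["columns"])
        for values in zip(*(group["columns"][n] for n in names)):
            rows.append(dict(zip(names, values)))
    return rows


def column_projection(row_groups, column):
    """Read a single column and never touch the others."""
    values = []
    for group in row_groups:
        values.extend(group["columns"][column])
    return values


def predicate_pushdown(row_groups, column, minimum):
    """Return values >= minimum, skipping row groups whose max is too small."""
    matches, groups_read = [], 0
    for group in row_groups:
        _, group_max = group["stats"][column]
        if group_max < minimum:
            continue  # statistics prove no row can match, so skip the group
        groups_read += 1
        matches.extend(v for v in group["columns"][column] if v >= minimum)
    return matches, groups_read
```

In this sketch, `predicate_pushdown(ROW_GROUPS, "fare", 30.0)` reads only the second row group, because the first group's maximum fare is below the threshold.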

Compression results

Use cases results

Full Table Scans

Column Projection

dataset	format	compression	full scan (us/row)	projection (us/row)	projection as % of full scan
github	orc	zlib	21.319	0.185	0.87%
github	parquet	zlib	72.494	0.585	0.81%
sales	orc	zlib	1.866	0.056	3.00%
sales	parquet	zlib	12.893	0.329	2.55%
taxi	orc	zlib	2.766	0.063	2.28%
taxi	parquet	zlib	3.496	0.718	20.54%
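
The last column of the table appears to be the projection time divided by the us/row time, expressed as a percentage. That reading is an inference from the numbers, not something the post states. A short sketch that recomputes it from the other two columns (the names `RESULTS` and `percent_of_full_scan` are hypothetical):

```python
# (dataset, format, us/row, projection) copied from the table above.
RESULTS = [
    ("github", "orc", 21.319, 0.185),
    ("github", "parquet", 72.494, 0.585),
    ("sales", "orc", 1.866, 0.056),
    ("sales", "parquet", 12.893, 0.329),
    ("taxi", "orc", 2.766, 0.063),
    ("taxi", "parquet", 3.496, 0.718),
]


def percent_of_full_scan(full_us, projection_us):
    """Projection time as a percentage of the us/row time."""
    return 100.0 * projection_us / full_us


for dataset, fmt, full_us, projection_us in RESULTS:
    pct = percent_of_full_scan(full_us, projection_us)
    print(f"{dataset:8s}{fmt:9s}{pct:6.2f}%")
```

Rounded to two decimals, each recomputed value matches the table, including the outlier: Parquet on the taxi data spends about a fifth of the full time on the projection, while every other row stays at or below 3%.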

Recommendations

(Disclaimer – Everything changes! Both these benchmarks and the formats will change)

Don’t use JSON for processing.
If your use case needs column projection or predicate pushdown, use ORC or Parquet.
For complex tables with common strings, Avro with Snappy is a good fit (without projection).
For other tables, ORC with Zlib or Snappy is a good fit.
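
The recommendations above can be read as a small decision rule. The sketch below encodes them as a function. The function name, its parameters, and the order in which the rules are checked are assumptions for illustration; the talk only gives the guidelines themselves.

```python
def recommend_format(needs_projection_or_pushdown,
                     complex_with_common_strings=False):
    """Map the post's recommendations to a suggested format.

    Checking projection/pushdown first is an assumption of this sketch;
    the source lists the guidelines without ranking them.
    """
    if needs_projection_or_pushdown:
        return "ORC or Parquet"
    if complex_with_common_strings:
        return "Avro with Snappy"
    return "ORC with Zlib or Snappy"


print(recommend_format(needs_projection_or_pushdown=True))
print(recommend_format(False, complex_with_common_strings=True))
print(recommend_format(False))
```

JSON never appears as a result, which matches the first recommendation.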

Details: https://www.safaribooksonline.com/library/view/strata-hadoop/9781491944660/video282727.html
Slides: https://conferences.oreilly.com/strata/strata-ny-2016/public/schedule/detail/51952
