Cost-Efficient Open Source Big Data Platform at Uber

In this blog post, we share efforts and ideas for improving the efficiency of Uber’s Big Data Platform, including file format improvements, HDFS erasure coding, YARN scheduling policy improvements, load balancing, query engines, and Apache Hudi. These improvements have resulted in significant savings. We also explore some open challenges, such as colocating analytics and online workloads, and pricing mechanisms. However, as the framework outlined in our previous post established, platform efficiency improvements alone do not guarantee efficient operation: controlling the supply and the demand of data is equally important, and we will address it in an upcoming post.

As Uber’s business has expanded, the underlying pool of data that powers it has grown exponentially, and has become ever more expensive to process. When Big Data rose to become one of our largest operational expenses, we began an initiative to reduce the cost of our data platform, dividing the challenges into 3 broad pillars: platform efficiency, supply, and demand. In this post, we will discuss our efforts to improve the efficiency of our data platform and bring down costs.

HDFS Erasure Coding

Erasure Coding can dramatically reduce the replication factor of HDFS files. Due to the potentially increased IOPS workload, at Uber we are mainly looking at the 3+2 and 6+3 schemes, with effective replication factors of 1.67x and 1.5x, respectively. Given that the default HDFS replication factor is 3x, we can reduce HDD space needs by nearly half!
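To make the arithmetic concrete, here is a small Python sketch (illustrative only, not part of HDFS) that computes the effective replication factor and HDD savings of the two Reed-Solomon schemes above, compared to the default 3x replication:

```python
# Illustrative only: storage-overhead arithmetic for Reed-Solomon erasure coding.
def effective_replication(data_blocks, parity_blocks):
    """Bytes stored per byte of user data under a (data + parity) scheme."""
    return (data_blocks + parity_blocks) / data_blocks

for data_blocks, parity_blocks in [(3, 2), (6, 3)]:
    factor = effective_replication(data_blocks, parity_blocks)
    savings = 1 - factor / 3.0  # relative to the default 3x HDFS replication
    print(f"RS {data_blocks}+{parity_blocks}: {factor:.2f}x storage, "
          f"{savings:.0%} less HDD space than 3x replication")
```

Running this prints roughly 1.67x (44% savings) for 3+2 and 1.50x (50% savings) for 6+3, which is where the “nearly half” figure comes from.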

YARN Scheduling Policy Improvements

At Uber, we use Apache YARN™ to run the majority of our Big Data compute workload (except Presto, which runs directly on dedicated servers). Like many other companies, we started with the standard Capacity Scheduler inside YARN. The Capacity Scheduler allows us to configure a hierarchical queue structure with MIN and MAX settings for each queue. We have created a 2-level queue structure with organizations at the first level, which allows users to create second-level queues based on sub-teams, priority tiers, or job types.
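As a hedged sketch of what such a hierarchy might look like, the Python dict below mimics a handful of capacity-scheduler.xml entries. The property keys are standard Capacity Scheduler settings (capacity is the MIN guarantee, maximum-capacity the MAX cap, both as percentages of the parent queue); the queue names and numbers are made up for illustration and are not Uber’s actual configuration:

```python
# Hypothetical capacity-scheduler.xml entries, shown as a Python dict for brevity.
# Queue names and percentages are invented; the property keys are standard.
capacity_scheduler_conf = {
    "yarn.scheduler.capacity.root.queues": "org_a,org_b",        # level 1: organizations
    "yarn.scheduler.capacity.root.org_a.capacity": 40,           # MIN share of the cluster
    "yarn.scheduler.capacity.root.org_a.maximum-capacity": 70,   # MAX share of the cluster
    "yarn.scheduler.capacity.root.org_a.queues": "etl,adhoc",    # level 2: sub-teams / job types
    "yarn.scheduler.capacity.root.org_a.etl.capacity": 75,       # MIN share of org_a
    "yarn.scheduler.capacity.root.org_a.etl.maximum-capacity": 100,
    "yarn.scheduler.capacity.root.org_a.adhoc.capacity": 25,
    "yarn.scheduler.capacity.root.org_a.adhoc.maximum-capacity": 50,
    "yarn.scheduler.capacity.root.org_b.capacity": 60,
    "yarn.scheduler.capacity.root.org_b.maximum-capacity": 90,
}
```

The MIN settings at each level add up to 100% of the parent, while the MAX settings let a queue borrow idle capacity up to its cap.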

Avoid the Rush Hours

Another problem we have with YARN resource utilization is that there is still a daily pattern at the whole-cluster level. Many teams decided to run their ETL pipelines between 00:00 and 01:00 UTC, since that is when the previous day’s logs are expected to be complete. Those pipelines may run for 1-2 hours, which makes the YARN cluster extremely busy during those rush hours.

Instead of adding more machines to the YARN cluster, which would reduce the average utilization and hurt cost efficiency, we plan to implement time-based rates. Essentially, when we calculate a queue’s average usage over the last 24 hours, we apply a scaling factor that varies with the hour of the day. For example, the scaling factor will be 2x for the rush hours from 00:00 to 04:00 UTC, and 0.8x for the rest of the day.
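A minimal sketch of how such time-based rates could be applied when computing a queue’s average usage; the weights follow the example above, while the function and data layout are assumptions made for illustration:

```python
# Assumed illustration of time-based rates: rush-hour usage is weighted more heavily.
RUSH_HOURS = range(0, 4)    # 00:00-04:00 UTC, per the example above
RUSH_FACTOR = 2.0
OFF_PEAK_FACTOR = 0.8

def scaled_average_usage(hourly_usage):
    """hourly_usage[h] is a queue's resource usage during UTC hour h (24 values)."""
    weighted = [
        usage * (RUSH_FACTOR if hour in RUSH_HOURS else OFF_PEAK_FACTOR)
        for hour, usage in enumerate(hourly_usage)
    ]
    return sum(weighted) / len(weighted)

# Two queues with the same total usage, placed differently in the day:
rush_heavy = [100.0] * 4 + [0.0] * 20     # everything in the rush hours
off_peak   = [0.0] * 4 + [20.0] * 20      # spread across the rest of the day
print(scaled_average_usage(rush_heavy))   # ~33.3
print(scaled_average_usage(off_peak))     # ~13.3
```

Two queues with the same total usage end up with very different scaled averages, which nudges teams to move flexible pipelines out of the rush hours.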

Cluster of Clusters

As our YARN and HDFS clusters continued to grow, we started to notice a performance bottleneck: both the HDFS NameNode and the YARN ResourceManager began to slow down due to the ever-increasing cluster size. While this is mainly a scalability challenge, it also dramatically affects our cost efficiency goals.

Generalized Load Balancing

We described the P99 vs. average utilization challenge earlier. The Cheap and Big HDDs solution, covered in Part 3, will touch on the importance of IOPS P99.
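As a hedged, synthetic illustration of why the P99 matters alongside the average: a small number of hot disks can sit near their IOPS limit while the fleet-wide average still looks comfortable. The numbers below are made up:

```python
# Synthetic example: a few hot disks dominate the IOPS tail even when the average is low.
import random
import statistics

random.seed(0)
# Assume 1,000 disks: most lightly loaded, plus a handful of hotspots.
disk_iops = [random.gauss(40, 10) for _ in range(990)] + \
            [random.gauss(180, 20) for _ in range(10)]

avg = statistics.fmean(disk_iops)
p99 = statistics.quantiles(disk_iops, n=100)[98]   # 99th percentile
print(f"average IOPS per disk: {avg:.0f}, P99 IOPS per disk: {p99:.0f}")
```

Generalized load balancing aims to spread data and traffic so that the P99 stays close to the average, instead of being set by a few hotspots.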

Query Engines

We use several query engines in Uber’s Big Data ecosystem: Hive-on-Spark, Spark, and Presto. These query engines, in combination with the file formats (Parquet and ORC), create an interesting trade-off matrix for our cost efficiency effort. Other options, including SparkSQL and Hive-on-Tez, make the decision even more complex.

Apache Hudi

One of the biggest cost efficiency opportunities we had in the Big Data Platform is efficient incremental processing. Many of our fact data sets can arrive late or be changed. For example, in many cases a rider doesn’t rate the driver of the last trip until they are about to request the next ride, and a credit card chargeback against a trip can sometimes take a month to process.
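To illustrate the kind of incremental upsert Hudi enables, here is a minimal PySpark sketch modeled on the Apache Hudi quickstart, not Uber’s actual pipelines; the table name, field names, and paths are hypothetical:

```python
# Minimal PySpark sketch of upserting late-arriving records into a Hudi table
# (e.g., a rating that shows up hours after the trip). Table name, fields, and
# paths are hypothetical; the Hudi Spark bundle is assumed to be on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Hypothetical late-arriving fact rows.
late_updates = spark.createDataFrame(
    [("trip-123", 5, "2021-08-01 09:15:00", "2021/08/01")],
    ["trip_uuid", "rating", "updated_at", "datestr"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_uuid",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "datestr",
    "hoodie.datasource.write.operation": "upsert",
}

(late_updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")                 # upsert into the existing table
    .save("/warehouse/trips"))      # hypothetical table path
```

Because the write is an upsert keyed on the record key, Hudi only touches the file groups containing the affected records rather than recomputing entire partitions, which is where the incremental-processing savings come from.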
