Deep dive into how Uber uses Spark


Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day across multiple compute environments. Spark’s versatility, which lets us build applications and run them wherever we need, makes this scale possible.


Diagram: uSCS gateway, Apache Livy, and Resource Manager

However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used.

We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, currently undergoing Incubation at the Apache Software Foundation, to provide applications with the configurations they need, and then schedules them across our Spark infrastructure using a rules-based approach.
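To make the Livy-based workflow concrete, here is a minimal sketch of submitting a Spark application through a Livy-compatible batch endpoint. The gateway URL, artifact path, class name, and configuration values are illustrative assumptions, not Uber's actual setup; only the `/batches` REST interface itself comes from Apache Livy.

```python
# Minimal sketch: submitting a Spark application via Apache Livy's batch API,
# which uSCS builds on. All names and values below are hypothetical examples.
import requests

GATEWAY_URL = "http://uscs-gateway.example.com:8998"  # hypothetical gateway address

payload = {
    "file": "hdfs:///apps/pricing/pricing-job.jar",   # illustrative application artifact
    "className": "com.example.pricing.PricingJob",    # illustrative entry point
    "args": ["--date", "2019-10-01"],
    "conf": {
        # With uSCS, much of this configuration can be supplied by the service
        # itself rather than hard-coded by each user.
        "spark.executor.memory": "4g",
        "spark.dynamicAllocation.enabled": "true",
    },
}

# Livy's batch sessions API: POST /batches creates the application,
# GET /batches/{id} reports its current state.
resp = requests.post(f"{GATEWAY_URL}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

state = requests.get(f"{GATEWAY_URL}/batches/{batch_id}").json()["state"]
print(f"Batch {batch_id} is {state}")
```

Because submissions flow through a single gateway like this, the service can rewrite or enrich the request (for example, injecting cluster-specific configuration) before the application ever reaches a Resource Manager.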

uSCS now handles the Spark applications that power business tasks such as rider and driver pricing computation, demand prediction, and restaurant recommendations, as well as important behind-the-scenes tasks like ETL operations and data exploration. uSCS has also introduced other useful features into our Spark infrastructure, including observability, performance tuning, and migration automation.

