Posts

Showing posts with the label Apache Spark

Cost-Efficient Open Source Big Data Platform at Uber

In this blog post, we shared efforts and ideas for improving the platform efficiency of Uber’s Big Data Platform, including file format improvements, HDFS erasure coding, YARN scheduling policy improvements, load balancing, query engines, and Apache Hudi. These improvements have resulted in significant savings. In addition, we explored some open challenges, such as colocation of analytics and online workloads, and pricing mechanisms. However, as the framework outlined in our previous post established, platform efficiency improvements alone do not guarantee efficient operation. Controlling the supply and the demand of data is equally important, which we will address in an upcoming post. As Uber’s business has expanded, the underlying pool of data that powers it has grown exponentially, and has thus become ever more expensive to process. When Big Data rose to become one of our largest operational expenses, we began an initiative to reduce costs on our data platform, which divides challe...

Turbocharging Analytics at Uber with Data Science Workbench

Millions of Uber trips take place each day across nearly 80 countries, generating information on traffic, preferred routes, estimated times of arrival/delivery, drop-off locations, and more, enabling us to facilitate better experiences for users. To make our data exploration and analysis more streamlined and efficient, we built Uber’s data science workbench (DSW), an all-in-one toolbox for interactive analytics and machine learning that leverages aggregate data. DSW centralizes everything a data scientist needs to perform data exploration, data preparation, ad-hoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface (GUI). Leveraged by data science, engineering, and operations teams across the company, DSW has quickly scaled to become Uber’s go-to data analytics solution. Current DSW use cases include pricing, safety, fraud detection, and navigation, among other foundational elements of the trip experi...

The Apache Spark 3.0 Preview is here!

Preview release of Spark 3.0: To enable wide-scale community testing of the upcoming Spark 3.0 release, the Apache Spark community has posted a preview release of Spark 3.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code that will become Spark 3.0. If you would like to test the release, please download it and send feedback using either the mailing lists or JIRA. The Spark issue tracker already contains a list of features in 3.0.

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, cu...
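The post describes uSCS fronting Apache Livy. As a rough sketch of the kind of request such a gateway proxies, here is a batch submission to Livy’s REST API using Java 11’s HttpClient from Scala; the endpoint, jar path, class name, and Spark conf below are hypothetical placeholders, not Uber’s actual setup:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SubmitViaLivy {
  def main(args: Array[String]): Unit = {
    // Hypothetical Livy endpoint; a gateway like uSCS would forward a request like this.
    val livyUrl = "http://livy.example.com:8998/batches"

    // Minimal batch payload: application jar, entry-point class, and Spark conf.
    val payload =
      """{
        |  "file": "hdfs:///user/example/app.jar",
        |  "className": "com.example.SparkJob",
        |  "conf": { "spark.executor.memory": "4g" }
        |}""".stripMargin

    val request = HttpRequest.newBuilder(URI.create(livyUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // Livy responds with the batch id and state, which the client can poll.
    println(s"${response.statusCode()}: ${response.body()}")
  }
}
```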

Understanding Apache Spark Failures and Bottlenecks

Apache Spark is a powerful open-source distributed computing framework for scalable and efficient analysis of big data on commodity compute clusters. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance while hiding the underlying complexities of using distributed systems. Spark has seen a massive spike in adoption by enterprises across a wide swath of verticals, applications, and use cases. Spark provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components (write apps in R, Python, Scala, and Java) via unified high-level APIs. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. Spark is also replacing MapReduce as the processing engine component of Hadoop. Spark applications are easy to write and easy to understa...
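To make the “easy to write” claim concrete, here is a minimal, self-contained batch application using the unified DataFrame/SQL API; the input path, column name, and app name are placeholders for illustration:

```scala
import org.apache.spark.sql.SparkSession

object TripCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("trip-counts")   // placeholder application name
      .getOrCreate()

    // Read a CSV of trips (placeholder path) and infer the schema.
    val trips = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/trips.csv")

    // The same computation is available through SQL or the DataFrame API.
    trips.createOrReplaceTempView("trips")
    spark.sql("SELECT city, COUNT(*) AS n FROM trips GROUP BY city ORDER BY n DESC")
      .show(10)

    spark.stop()
  }
}
```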

Using AWK and R to parse 25tb

Intro: Recently I was put in charge of setting up a workflow for dealing with a large amount of raw DNA sequencing (well, technically a SNP chip) data for my lab. The goal was to be able to quickly get data for a given genetic location (called a SNP) for use in modeling, etc. Using vanilla R and AWK I was able to clean up and organize the data in a natural way, massively speeding up the querying. It certainly wasn’t easy and it took lots of iterations. This post is meant to help others avoid some of the same mistakes and show what did eventually work. The Data: The data was delivered to us by our university’s genetics processing center as 25 TB of TSVs. Before handing it off to me, my advisor split and gzipped these files into five batches, each composed of roughly 240 four-gigabyte files. Each row contained data for a single SNP for a single person. There were ~2.5 million SNPs and ~60 thousand people. Along with the SNP value there were multiple numeric columns on things like intensity o...
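The post itself solves this with AWK and R; purely for contrast with the rest of this page’s Spark theme, here is a rough Spark sketch of the naive “filter the gzipped TSVs by one SNP” query that motivates the reorganization. The paths, column name, and SNP id are guesses for illustration, not the author’s schema:

```scala
import org.apache.spark.sql.SparkSession

object SnpLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("snp-lookup").getOrCreate()

    // Hypothetical layout: tab-separated, gzipped files with a snp_id column.
    val genotypes = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("hdfs:///data/snp_chips/*.tsv.gz")

    // The naive query: scan everything to pull out a single SNP.
    // gzip is not splittable, so each ~4 GB file is read by a single task,
    // which is exactly the kind of cost the post's data reorganization avoids.
    genotypes.filter(genotypes("snp_id") === "rs123456")
      .write.parquet("hdfs:///tmp/rs123456")

    spark.stop()
  }
}
```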

Test data quality at scale with AWS Deequ

You generally write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues are: missing values, which can lead to failures in production systems that require non-null values (NullPointerException); changes in the distribution of data, which can lead to unexpected outputs of machine learning models; and aggregations of incorrect data, which can lead to wrong business decisions. In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (th...
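A minimal Deequ verification in Scala looks roughly like this: you declare the constraints, and Deequ computes the metrics on Spark and reports whether they hold. The dataset, path, and column names below are illustrative, not from the AWS post:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.SparkSession

object DataQualityChecks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("deequ-checks").getOrCreate()

    // Illustrative dataset: order events with an id, a status, and an amount.
    val orders = spark.read.parquet("hdfs:///data/orders")

    val result = VerificationSuite()
      .onData(orders)
      .addCheck(
        Check(CheckLevel.Error, "basic order checks")
          .hasSize(_ > 0)                 // dataset is non-empty
          .isComplete("order_id")         // no missing ids
          .isUnique("order_id")           // ids are unique
          .isNonNegative("amount")        // amounts are never negative
          .isContainedIn("status", Array("PLACED", "SHIPPED", "DELIVERED")))
      .run()

    if (result.status != CheckStatus.Success) {
      println("Data quality checks failed")
    }

    spark.stop()
  }
}
```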

What’s Behind Lyft’s Choices in Big Data Tech

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in terms of architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computer infrastructure. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connectin...

Scalable Log Analytics with Apache Spark: A Comprehensive Case-Study

Introduction: One of the most popular and effective enterprise case studies that leverage analytics today is log analytics. Almost every organization today, small or large, has multiple systems and infrastructure running day in and day out. To effectively keep their business running, organizations need to know if their infrastructure is performing to its maximum potential. This involves analyzing system and application logs and maybe even applying predictive analytics to log data. The amount of log data is typically massive, depending on the type of organizational infrastructure and the applications running on it. Gone are the days when we were limited to analyzing a sample of data on a single machine due to compute constraints. Powered by better distributed computing and open-source big data processing and analytics frameworks like Spark, we can perform scalable log analytics on potentially millions and billions of log messages daily. The i...
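As a sketch of what scalable log analytics can look like in practice, here is a small Spark job that parses raw log lines with regular expressions and aggregates error counts. The log path and the assumption of Apache/NCSA-style access logs are mine for illustration, not taken from the case study:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

object AccessLogAnalytics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log-analytics").getOrCreate()

    // Raw access logs, one request per line (placeholder path).
    val raw = spark.read.text("hdfs:///logs/access/*.log")

    // Pull a few fields out of common Apache/NCSA-style log lines.
    val logs = raw.select(
      regexp_extract(col("value"), """^(\S+)""", 1).as("host"),
      regexp_extract(col("value"), "\"(\\S+) (\\S+)", 2).as("endpoint"),
      regexp_extract(col("value"), """\s(\d{3})\s""", 1).cast("int").as("status"))

    // Which endpoints return the most server errors?
    logs.filter(col("status") >= 500)
      .groupBy("endpoint")
      .count()
      .orderBy(col("count").desc)
      .show(20)

    spark.stop()
  }
}
```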

Building and Scaling Data Lineage at Netflix

Netflix Data Landscape: Freedom & Responsibility (F&R) is the lynchpin of Netflix’s culture, empowering teams to move fast to deliver on innovation and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted and supported options) and guard rails to help reduce variance in the choices of tools and technologies available to support the development of scalable technical architectures. Nonetheless, the Netflix data landscape is complex, and many teams collaborate effectively to share responsibility for our data system management. Therefore, building a complete and accurate data lineage system to map out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is a monumental task and requires a scalable architecture, robust design, a strong engineering team and, above all, amazing cross-f...

Effective Spark DataFrames With Alluxio

Many organizations deploy Alluxio together with Spark for performance gains and data manageability benefits. Qunar recently deployed Alluxio in production, and their Spark streaming jobs sped up by 15x on average and up to 300x during peak times. They noticed that some Spark jobs would slow down or would not finish, but with Alluxio those jobs could finish quickly. In this blog post, we investigate how Alluxio helps Spark be more effective. Alluxio increases the performance of Spark jobs, helps Spark jobs perform more predictably, and enables multiple Spark jobs to share the same data from memory. Previously, we investigated how Alluxio is used for Spark RDDs. In this article, we investigate how to effectively use Spark DataFrames with Alluxio. Alluxio and Spark Cache: Storing Spark DataFrames in Alluxio memory is simple: it only requires saving the DataFrame as a file to Alluxio via the Spark DataFrame write API. DataFrames are commonly written as parquet fi...
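Concretely, writing and re-reading a DataFrame through Alluxio is just the normal DataFrame read/write API pointed at an alluxio:// URI. In this sketch the master hostname, paths, and column name are placeholders; 19998 is Alluxio’s default master port:

```scala
import org.apache.spark.sql.SparkSession

object AlluxioDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-dataframes").getOrCreate()

    // Any DataFrame; here we just read one from HDFS (placeholder path).
    val people = spark.read.parquet("hdfs:///data/people.parquet")

    // Save the DataFrame as a Parquet file in Alluxio memory.
    people.write.parquet("alluxio://alluxio-master:19998/people.parquet")

    // Other Spark jobs (or later runs of this one) can read it back from
    // Alluxio instead of recomputing it or re-reading slower storage.
    val cached = spark.read.parquet("alluxio://alluxio-master:19998/people.parquet")
    cached.groupBy("country").count().show()   // "country" is an illustrative column

    spark.stop()
  }
}
```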

The Forrester Wave™: Cloud Hadoop/Spark Platforms, Q1 2019

Cloud Hadoop/Spark (HARK) platforms accelerate insights by automating the storage, processing, and accessing of big data. In our 25-criterion evaluation of HARK providers, we identified the 11 most significant ones — Amazon Web Services (AWS), Cloudera, Google, Hortonworks, Huawei, MapR, Microsoft, Oracle, Qubole, Rackspace, and SAP — and researched, analyzed, and scored them. This report shows how each provider measures up and helps enterprise architecture (EA) professionals select the right one for their needs. Note: Cloudera and Hortonworks completed their planned merger on January 3, 2019, and will continue as Cloudera. This Forrester Wave reflects our evaluation of each company's independent HARK platforms prior to the completion of the merger. Full report available here.

Google announces Kubernetes Operator for Apache Spark

The beta release of "Spark Operator" allows native execution of Spark applications on Kubernetes clusters -- no Hadoop or Mesos required. Apache Spark is a hugely popular execution framework for running data engineering and machine learning workloads. It powers the Databricks platform and is available in both on-premises and cloud-based Hadoop services, like Azure HDInsight, Amazon EMR and Google Cloud Dataproc. It can run on Mesos clusters too. But what if you just want to run your Spark workloads on a Kubernetes (k8s) cluster sans Mesos, and without the Hadoop YARN strings attached? While Spark first added Kubernetes-specific features in its 2.3 release, and improved them in 2.4, getting Spark to run natively on k8s, in a fully integrated fashion, can still be a challenge. Kube Operator: Today, Google, which created Kubernetes in the first place, is announcing the beta release of the Kubernetes Operator for Apache Spark -- "Spark Operator" for short. Sp...

Real-Time Stock Processing With Apache NiFi and Apache Kafka

Implementing a Streaming Use Case From REST to Hive With Apache NiFi and Apache Kafka, Part 1: With Apache Kafka 2.0 and Apache NiFi 1.8, there are many new features and abilities coming out. It's time to put them to the test. So to plan out what we are going to do, I have a high-level architecture diagram. We are going to ingest a number of sources including REST feeds, Social Feeds, Messages, Images, Documents, and Relational Data. We will ingest with NiFi and then filter, process, and segment it into Kafka topics. Kafka data will be in Apache Avro format with schemas specified in the Hortonworks Schema Registry. Spark and NiFi will do additional event processing along with machine learning and deep learning. This will be stored in Druid for real-time analytics and summaries. Hive, HDFS, and S3 will store the data for permanent storage. We will do dashboards with Superset and Spark SQL + Zeppelin. We will also push back cleaned and aggregated data to subscribers via Kafka ...
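For the Spark side of that pipeline, a Structured Streaming job reading one of the Kafka topics could be sketched as below. The broker address and topic name are made up, and the article's data is Avro with schemas in the Hortonworks Schema Registry, which this sketch sidesteps by treating keys and values as plain strings; it also assumes the spark-sql-kafka connector is on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object StockStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stock-stream").getOrCreate()

    // Subscribe to a (hypothetical) Kafka topic populated by NiFi.
    val stocks = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "stocks")
      .load()
      .selectExpr("CAST(key AS STRING) AS symbol", "CAST(value AS STRING) AS payload")

    // Lightweight event processing with results written to the console;
    // a real job would parse the payload and write to Druid, Hive, or HDFS.
    val query = stocks
      .groupBy(col("symbol"))
      .count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```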

Azure HDInsight brings next generation Apache Hadoop 3.0

Preview of Apache Hadoop 3.0 in Azure HDInsight 4.0: Led by Hortonworks, Apache Hadoop 3.0 represents over 5 years of work across the community since the last major update to the Hadoop stack. Enterprises can now realize their data lake vision while efficiently incorporating deep learning frameworks into their applications, all on the same Hadoop stack that they are comfortable with. Some of the key enhancements include: With ACID semantics enabled by default, Apache Hive 3.0 becomes more like a traditional database, making it easier for customers to build LOB applications on top of very large data sets. Apache Druid is an open source data store with indexing/caching capabilities on top of a column-oriented storage layout. With Apache Hive and Apache Druid (now available by default), customers can do near real-time exploratory analytics on incoming data. With TensorFlow available by default, and GPU support, Apache Hadoop 3.0 squarely targets the machine learning...

Processing streams of data with Apache Kafka and Spark

Data: Data is produced every second; it comes from millions of sources and is constantly growing. Have you ever thought about how much data you personally generate every day? Data: direct result of our actions. There’s data generated as a direct result of our actions and activities: browsing Twitter, using mobile apps, performing financial transactions, using a navigator in your car, booking a train ticket, creating an online document, starting a YouTube live stream. Obviously, that’s not it. Data: produced as a side effect. For example, performing a purchase where it seems like we’re buying just one thing – might generat...