Posts

Showing posts from August, 2019

Gartner Hype Cycle for Emerging Technologies, 2019

The Gartner Hype Cycle highlights the 29 emerging technologies CIOs should experiment with over the next year. Today, companies detect insurance fraud using a combination of claim analysis, computer programs, and private investigators. The FBI estimates the total cost of non-healthcare-related insurance fraud at around $40 billion per year. But a maturing emerging technology called emotion artificial intelligence (AI) might make it possible to detect insurance fraud based on audio analysis of the caller, one of several ways Gartner expects these technologies to provide “superhuman capabilities.” In addition to catching fraud, emotion AI could improve customer experience by tracking happiness, directing callers more accurately, enabling better diagnostics for dementia, detecting distracted drivers, and even adapting education to a student’s current emotional state. Though still relatively new, emotion AI is one of 21 new technologies added to the Gartner Hype Cycle for Emerging Technologies, 2019. Original article >

Who is interested in Apache Pulsar?

With companies producing data from an increasing number of systems and devices, messaging and event streaming solutions, particularly Apache Kafka, have gained widespread adoption. Over the past year, we’ve been tracking the progress of Apache Pulsar (Pulsar), a less well-known but highly capable open source solution that originated at Yahoo. Pulsar is designed to intelligently process, analyze, and deliver data from an expanding array of services and applications, so it fits nicely into modern data platforms. It is also designed to ease the operational burdens normally associated with complex, distributed systems. Of the thousands of recent visitors to the site, 33% are from the Americas, 36% from Asia-Pacific, and 27% from EMEA. While Apache Kafka is by far the most popular pub/sub solution, over the last year we’ve started to come across numerous companies that use Pulsar, and it turns out Pulsar has a few features these companies value…
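For readers new to Pulsar, here is a minimal sketch of its publish/subscribe model using the pulsar-client Python package; the service URL, topic, and subscription name are placeholders, not values from the article.

```python
import pulsar

# Connect to a Pulsar broker (URL is a placeholder for your cluster).
client = pulsar.Client('pulsar://localhost:6650')

# Producer side: publish a message to a topic.
producer = client.create_producer('persistent://public/default/events')
producer.send('order-created'.encode('utf-8'))

# Consumer side: subscribe, receive, and acknowledge.
consumer = client.subscribe('persistent://public/default/events',
                            subscription_name='analytics-sub')
msg = consumer.receive()
print(msg.data().decode('utf-8'))
consumer.acknowledge(msg)

client.close()
```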

The Future of Data Engineering

Data engineering’s job is to help an organization move and process data. Broadly speaking, this requires two different systems: a data pipeline and a data warehouse. The data pipeline is responsible for moving the data, and the data warehouse is responsible for processing it. I acknowledge that this is a bit overly simplistic: you can do processing in the pipeline itself, with batch and stream transformations between extraction and loading, and the “data warehouse” now includes many storage and processing systems (Flink, Spark, Presto, Hive, BigQuery, Redshift, etc.), as well as auxiliary systems such as data catalogs, job schedulers, and so on. Still, I believe the paradigm holds. The industry is working through changes in how these systems are built and managed, and there are four areas in particular where I expect to see shifts over the next few years:

- Timeliness: From batch to realtime
- Connectivity: From one:one bespoke integrations to many:many
- Cen…
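To make the pipeline-moves, warehouse-processes split concrete, here is a deliberately toy sketch in Python, using SQLite purely as a stand-in for a real warehouse; the table and records are invented for illustration.

```python
import sqlite3

# --- Pipeline: move raw records from a source into the warehouse. ---
raw_events = [
    ('2019-08-01', 'signup', 3),
    ('2019-08-01', 'purchase', 1),
    ('2019-08-02', 'signup', 5),
]

warehouse = sqlite3.connect(':memory:')  # stand-in for BigQuery/Redshift/etc.
warehouse.execute('CREATE TABLE events (day TEXT, kind TEXT, cnt INTEGER)')
warehouse.executemany('INSERT INTO events VALUES (?, ?, ?)', raw_events)

# --- Warehouse: processing happens where the data now lives. ---
for row in warehouse.execute(
        'SELECT day, SUM(cnt) FROM events GROUP BY day ORDER BY day'):
    print(row)
```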

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple compute environments. Spark’s versatility, which allows us to build applications and run them everywhere we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy…
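Because uSCS builds on Apache Livy, Spark jobs are submitted over HTTP rather than via spark-submit on a gateway host. Below is a hedged sketch of a submission to Livy's standard POST /batches endpoint; the host, jar path, and class name are placeholders, not Uber's actual configuration.

```python
import json
import requests

LIVY_URL = 'http://livy.example.com:8998'  # placeholder Livy server

# Livy's POST /batches accepts the application artifact and Spark config.
payload = {
    'file': 'hdfs:///apps/example/spark-job.jar',  # placeholder artifact
    'className': 'com.example.SparkJob',           # placeholder entry point
    'conf': {'spark.executor.memory': '4g'},
}

resp = requests.post(LIVY_URL + '/batches',
                     data=json.dumps(payload),
                     headers={'Content-Type': 'application/json'})
batch = resp.json()
print('Submitted batch', batch['id'], 'state:', batch['state'])

# The batch state can then be polled until the job finishes.
state = requests.get('{}/batches/{}/state'.format(LIVY_URL, batch['id'])).json()
print(state)
```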

Understanding Apache Spark Failures and Bottlenecks

Apache Spark is a powerful open-source distributed computing framework for scalable, efficient analysis of big data on clusters of commodity hardware. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance, while hiding the underlying complexities of distributed systems. Spark has seen a massive spike in enterprise adoption across a wide swath of verticals, applications, and use cases. It provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components via unified high-level APIs in R, Python, Scala, and Java. Spark handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. It is also replacing MapReduce as the processing engine component of Hadoop. Spark applications are easy to write and easy to understand…
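To ground the “easy to write” claim, here is a minimal PySpark example that builds a DataFrame from in-memory rows and runs an interactive SQL query over it; the data and column names are illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# In-memory rows; in practice this would be files on HDFS, S3, etc.
df = spark.createDataFrame(
    [('alice', 34), ('bob', 45), ('carol', 29)],
    ['name', 'age'],
)

# The same data is available to both the DataFrame API and SQL.
df.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE age > 30').show()

spark.stop()
```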