Posts

Showing posts with the label usecase

Data Reliability at Scale: How Fox Digital Architected its Modern Data Stack

As distributed architectures become the new gold standard for data-driven organizations, this kind of self-serve motion would be a dream come true for many data leaders. So when the Monte Carlo team got the chance to sit down with Alex, we took a deep dive into how he made it happen. Here’s how his team architected a hybrid data architecture that prioritizes democratization and access while ensuring reliability and trust at every turn.

Exercise “controlled freedom” when dealing with stakeholders

Alex has built decentralized access to data at Fox on a foundation he calls “controlled freedom.” In fact, he believes using your data team as the single source of truth within an organization actually creates the biggest silo. So instead of becoming a guardian and bottleneck, Alex and his data team focus on setting certain parameters around how data is ingested and supplied to stakeholders. Within that framework, internal data consumers at Fox have the freedom to cr...

Cost-Efficient Open Source Big Data Platform at Uber

In this blog post, we shared our efforts and ideas for improving the platform efficiency of Uber’s Big Data Platform, including file format improvements, HDFS erasure coding, YARN scheduling policy improvements, load balancing, query engines, and Apache Hudi. These improvements have resulted in significant savings. In addition, we explored some open challenges like analytics and online colocation, and pricing mechanisms. However, as the framework outlined in our previous post established, platform efficiency improvements alone do not guarantee efficient operation. Controlling the supply and the demand of data is equally important, which we will address in an upcoming post. As Uber’s business has expanded, the underlying pool of data that powers it has grown exponentially, and thus ever more expensive to process. When Big Data rose to become one of our largest operational expenses, we began an initiative to reduce costs on our data platform, which divides challe...

Visualizing Data Timeliness at Airbnb

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind. Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness:

- When should a dataset be considered late?
- How frequently are datasets late?
- Why is a dataset late?

This project is a critical part of our efforts to achieve high data quality, and building it required overcoming many technical, product, and organizational challenges. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness. Continue reading >>>
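The first of those questions is worth making concrete: one common way to decide when a dataset "should" land is to derive the threshold from its own landing-time history. Below is a minimal sketch of that idea, not Airbnb's SLA Tracker itself; the sample landing times and the 90th-percentile cutoff are illustrative assumptions.

```python
# Sketch: proposing an SLA from historical landing times. Illustrative only --
# not Airbnb's SLA Tracker. Sample data and the p90 cutoff are assumptions.
import numpy as np

# Hours after midnight at which a daily dataset landed over recent runs.
landing_times = np.array([3.2, 3.5, 3.1, 4.8, 3.3, 3.4, 6.0, 3.2, 3.6, 3.5])

# Heuristic: set the SLA at a high percentile of typical behavior, so only
# genuinely unusual runs are flagged as late.
sla_hours = np.percentile(landing_times, 90)

def is_late(landed_at: float) -> bool:
    """A run is late if it lands after the proposed SLA."""
    return landed_at > sla_hours

print(f"Proposed SLA: ready by {sla_hours:.1f}h after midnight")
print(is_late(5.5))  # True for the sample data above
```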

Ten Use Cases to Enable an Organization with Metadata and Catalogs

Enterprises are modernizing their data platforms and associated tool-sets to serve the fast-changing needs of data practitioners, including data scientists, data analysts, business intelligence and reporting analysts, and self-service-embracing business and technology personnel. However, as the tool-stack in most organizations gets modernized, so does the variety of metadata generated. And as the volume of data increases every day, the metadata associated with it expands, as does the need to manage it. The first thought that strikes us when we look at a data landscape and hear about a catalog is, “It scans any database ranging from Relational to NoSQL or Graph and gives out useful information”:

- Name
- Modeled data type
- Inferred data types
- Patterns of data
- Length, with minimum and maximum thresholds
- Minimum and maximum values
- Other profiling characteristics of data, such as frequency of values and their distribution

What Is the Basic Benefit of Metadata Managed in Catal...
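To make that list concrete, here is a minimal sketch of the kind of per-column profile a catalog scanner might compute. It is illustrative only: real catalogs infer types and patterns far more robustly than this.

```python
# Sketch: a per-column profile of the kind a catalog scanner produces.
# Illustrative only; real scanners are far more robust.
from collections import Counter

def profile_column(name: str, values: list) -> dict:
    lengths = [len(v) for v in values]
    return {
        "name": name,
        "inferred_type": "int" if all(v.isdigit() for v in values) else "string",
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(values),
        "max_value": max(values),
        # Frequency of values and their distribution (top 3 shown).
        "value_distribution": Counter(values).most_common(3),
    }

print(profile_column("country", ["US", "US", "CA", "DE", "US"]))
```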

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix has more than 195 million subscribers who generate petabytes of data every day. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behavior with the goal of maximizing user joy. Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services often have the requirement to do a fast lookup for this fine-grained data which is generated periodically. For example, in order to enha...
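The general pattern the post describes, periodically reading a warehouse table and bulk-writing it into a key-value store for fast lookups, can be sketched in a few lines of PySpark. This is not Bulldozer itself: the table name, the key design, and the dict standing in for a real KV client are all assumptions.

```python
# Sketch of the warehouse-to-KV pattern -- not Netflix's Bulldozer.
# The table name and the dict-based "store" are stand-ins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-to-kv").getOrCreate()

# A periodically recomputed warehouse table (e.g. Iceberg on S3); name is made up.
df = spark.table("warehouse.member_video_stats")

def write_partition(rows):
    # In a real job this would be a Cassandra/Redis/DynamoDB client;
    # a plain dict keeps the sketch self-contained.
    kv = {}
    for row in rows:
        # Key design mirrors the fine-grained lookups online services need.
        kv[f"{row['member_id']}:{row['video_id']}"] = row["score"]

# One client (here, one dict) per partition amortizes connection setup.
df.foreachPartition(write_partition)
```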

Turbocharging Analytics at Uber with Data Science Workbench

Millions of Uber trips take place each day across nearly 80 countries, generating information on traffic, preferred routes, estimated times of arrival/delivery, drop-off locations, and more that enables us to facilitate better experiences for users. To make our data exploration and analysis more streamlined and efficient, we built Uber’s data science workbench (DSW), an all-in-one toolbox for interactive analytics and machine learning that leverages aggregate data. DSW centralizes everything a data scientist needs to perform data exploration, data preparation, ad-hoc analyses, model exploration, workflow scheduling, dashboarding, and collaboration in a single-pane, web-based graphical user interface (GUI). Leveraged by data science, engineering, and operations teams across the company, DSW has quickly scaled to become Uber’s go-to data analytics solution. Current DSW use cases include pricing, safety, fraud detection, and navigation, among other foundational elements of the trip experi...

Shopify's approach to data discovery

Humans generate a lot of data. Every two days we create as much data as we did from the beginning of time until 2003! The International Data Corporation estimates the global datasphere totaled 33 zettabytes (a zettabyte is one trillion gigabytes) in 2018. The estimate for 2025 is 175 ZB, an increase of 430%. This growth is challenging organizations across all industries to rethink their data pipelines. The nature of data usage is problem-driven, meaning data assets (tables, reports, dashboards, etc.) are aggregated from underlying data assets to support decision-making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. This process is repeated multiple times, sometimes for the same problems, and results in a large number of data assets serving a wide variety of purposes. Data discovery and management is the practice of cataloguing these data assets and all of the applicable metadata that saves time for data professionals, increasing data ...

Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally raises the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.? In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion on the early approaches for open sourcing WhereHows, and a comparison of our inte...

What is Microsoft's Team Data Science Process?

The Team Data Science Process (TDSP) is an agile, iterative data science methodology for delivering predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning by suggesting how team roles work best together. TDSP includes best practices and structures from Microsoft and other industry leaders to help ensure the successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics program. This article provides an overview of TDSP and its main components. We provide a generic description of the process here that can be implemented with different kinds of tools. A more detailed description of the project tasks and roles involved in the lifecycle of the process is provided in additional linked topics. Guidance on how to implement the TDSP using a specific set of Microsoft tools and infrastructure that we use to implement the TDSP in our teams is also provi...

Operating a Large, Distributed System in a Reliable Way

"The article is the collection of the practices I've found useful to reliably operate a large system at Uber, while working here. My experience is not unique - people working on similar sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of running on own data centers (like Uber mostly does) or on the cloud (where Uber sometimes scales to). However, the practices might be an overkill for smaller or less mission-critical systems." There's much ground to cover: Monitoring Oncall, Anomaly Detection & Alerting Outages & Incident Management Processes Postmortems, Incident Reviews & a Culture of Ongoing Improvements Failover Drills, Capacity Planning & Blackbox Testing SLOs, SLAs & Reporting on Them SRE as an Independent Team Reliability as an...

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple different compute environments. Spark’s versatility, which allows us to build applications and run them everywhere that we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, cu...
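uSCS itself is internal to Uber, but the Apache Livy layer it builds on exposes a small REST API for submitting Spark applications, which gives a feel for how a Spark gateway works. A minimal sketch follows; the Livy host, job artifact, and configuration values are placeholders.

```python
# Sketch: submitting a Spark application through Apache Livy's REST API,
# the open-source gateway uSCS builds on. Host and paths are placeholders.
import requests

livy_url = "http://livy.example.com:8998"  # placeholder endpoint

# POST /batches starts a batch session running the given application.
resp = requests.post(
    f"{livy_url}/batches",
    json={
        "file": "s3://my-bucket/jobs/etl_job.py",  # placeholder artifact
        "args": ["--date", "2021-06-01"],
        "conf": {"spark.executor.memory": "4g"},
    },
)
batch = resp.json()

# Poll GET /batches/{id}/state until the application finishes.
state = requests.get(f"{livy_url}/batches/{batch['id']}/state").json()
print(state)  # e.g. {"id": 0, "state": "running"}
```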

Microsoft best practices of software engineering for machine learning

This paper explains best practices that Microsoft teams discovered and compiled in creating large-scale AI solutions for the marketplace.

What’s Behind Lyft’s Choices in Big Data Tech

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing details of their computing infrastructure. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connectin...

Python at Netflix

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there. Open Connect Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens after...

Scalable Log Analytics with Apache Spark: A Comprehensive Case-Study

Introduction

One of the most popular and effective enterprise use cases for analytics today is log analytics. Almost every small and big organization today has multiple systems and infrastructure running day in and day out. To keep their business running effectively, organizations need to know if their infrastructure is performing to its maximum potential. This involves analyzing system and application logs, and maybe even applying predictive analytics on log data. The amount of log data is typically massive, depending on the type of organizational infrastructure and the applications running on it. Gone are the days when we were limited to analyzing a sample of data on a single machine due to compute constraints. Powered by better distributed computing, big data processing, and open-source analytics frameworks like Spark, we can perform scalable log analytics on potentially millions or even billions of log messages daily. The i...
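As a taste of what that looks like in practice, here is a minimal PySpark sketch that parses Common Log Format lines into a queryable DataFrame; the log path and the follow-up query are illustrative, not the article's exact code.

```python
# Sketch: parsing raw web-server logs into a DataFrame with PySpark.
# The log path and example query are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("log-analytics").getOrCreate()
logs = spark.read.text("s3://my-bucket/logs/*.gz")  # placeholder path

# Common Log Format: host, timestamp, request line, status, size.
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\w+) (\S+) [^"]*" (\d{3}) (\d+|-)'

parsed = logs.select(
    regexp_extract("value", pattern, 1).alias("host"),
    regexp_extract("value", pattern, 2).alias("timestamp"),
    regexp_extract("value", pattern, 3).alias("method"),
    regexp_extract("value", pattern, 4).alias("endpoint"),
    regexp_extract("value", pattern, 5).cast("int").alias("status"),
)

# Example analysis: which endpoints fail most often with server errors?
parsed.filter(col("status") >= 500).groupBy("endpoint").count().show()
```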

Amundsen — Lyft’s data discovery & metadata engine

The problem

Unprecedented growth in data volumes has led to two big challenges:

- Productivity — Whether it’s building a new model, instrumenting a new metric, or doing ad-hoc analysis, how can I most productively and effectively make use of this data?
- Compliance — When collecting data about a company’s users, how do organizations comply with increasing regulatory and compliance demands and uphold the trust of their users?

The key to solving these problems lies not in the data, but in the metadata. And to show you how, let’s go through a journey of how we solved a part of the productivity problem at Lyft using metadata.

Productivity

At a 50,000-foot level, the data scientist workflow looks like the following. Read full article >>>

Building and Scaling Data Lineage at Netflix

Netflix Data Landscape

Freedom & Responsibility (F&R) is the linchpin of Netflix’s culture, empowering teams to move fast to deliver on innovation and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted, and supported options) and guardrails to help reduce variance in the choices of tools and technologies available to support the development of scalable technical architectures. Nonetheless, the Netflix data landscape (see below) is complex, and many teams collaborate effectively to share the responsibility of our data system management. Therefore, building a complete and accurate data lineage system to map out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is a monumental task and requires a scalable architecture, robust design, a strong engineering team and, above all, amazing cross-f...
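Stripped to its essentials, lineage is a directed graph over data artifacts, and the transitive closure of that graph answers "where did this report's data come from?". A minimal sketch of that core representation, with made-up artifact names, not Netflix's actual system:

```python
# Sketch: lineage as a directed graph over data artifacts. Artifact names
# are made up; Netflix's system is far richer than this core idea.
from collections import defaultdict

upstream = defaultdict(set)  # artifact -> artifacts it reads from

def add_edge(producer: str, consumer: str) -> None:
    upstream[consumer].add(producer)

add_edge("kafka://play-events", "table://warehouse.play_events")
add_edge("table://warehouse.play_events", "report://daily_engagement")

def lineage(artifact: str) -> set:
    """All transitive upstream dependencies of an artifact."""
    seen, stack = set(), [artifact]
    while stack:
        for dep in upstream[stack.pop()]:
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(lineage("report://daily_engagement"))
# e.g. {'table://warehouse.play_events', 'kafka://play-events'}
```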

How Pinterest runs Kafka at scale

Pinterest runs one of the largest Kafka deployments in the cloud. We use Apache Kafka extensively as a message bus to transport data and to power real-time streaming services, ultimately helping more than 250 million Pinners around the world discover and do what they love. As mentioned in an earlier post, we use Kafka to transport data to our data warehouse, including critical events like impressions, clicks, close-ups, and repins. We also use Kafka to transport visibility metrics for our internal services. If the metrics-related Kafka clusters have any glitches, we can’t accurately monitor our services or generate alerts that signal issues. On the real-time streaming side, Kafka is used to power many streaming applications, such as fresh content indexing and recommendation, spam detection and filtering, real-time advertiser budget computation, and so on. We’ve shared our experiences at Kafka Summit 2018 on incremental db ingestion using Kafka, and building real-time ads pl...
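The message-bus pattern described here is easy to picture with a small producer. Below is a minimal sketch using the kafka-python client; the broker address, topic, and event fields are illustrative, not Pinterest's actual schema.

```python
# Sketch: producing an event onto a Kafka message bus with kafka-python.
# Broker, topic, and event fields are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An impression-like event bound for the warehouse and streaming apps.
event = {"type": "impression", "pin_id": 12345, "user_id": 678}

# Keying by user keeps one user's events ordered within a partition.
producer.send("events", value=event, key=str(event["user_id"]).encode())
producer.flush()
```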

Long short-term memory (LSTM) networks with TensorFlow

How to build a multilayered LSTM network to infer stock market sentiment from social conversation using TensorFlow: https://www.oreilly.com/ideas/introduction-to-lstms-with-tensorflow
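For orientation, a multilayered (stacked) LSTM of the kind the tutorial builds looks roughly like this in tf.keras; the sequence length, vocabulary size, layer widths, and binary sentiment head are illustrative assumptions, not the tutorial's exact model.

```python
# Sketch: a stacked LSTM sentiment classifier in tf.keras. Dimensions and
# the binary output head are illustrative, not the tutorial's exact model.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),           # sequences of 100 token ids
    tf.keras.layers.Embedding(10_000, 64),  # token ids -> dense vectors
    # return_sequences=True feeds the full sequence to the next LSTM layer,
    # which is what makes the network "multilayered".
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```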

Adopting Self-Service BI with Tableau - Notes from the field

(Originally this article was created and posted by me on March 7, 2016 at datasciencecentral.com; now I am transferring it here.) I have spent many hours planning and executing an in-company self-service BI implementation. This enabled me to gain several insights. Now that the ideas have become mature enough and field-proven, I believe they are worth sharing. No matter how far you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt at describing feasible scenarios provides a decent foundation. All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance.

Scenario 1. Tableau Desktop + departmental/cross-functional data schemas. This scenario involves data analysts gaining insights on a daily basis. They might be either independent individuals or a team. Business users’ interaction with published workbooks is applicable, but limited to simple filterin...