Posts

Showing posts with the label Data Platform

Emerging Architectures for Modern Data Infrastructure

Image
As an industry, we’ve gotten exceptionally good at building large, complex software systems. We’re now starting to see the rise of massive, complex systems built around data – where the primary business value of the system comes from the analysis of data, rather than the software directly. We’re seeing quick-moving impacts of this trend across the industry, including the emergence of new roles, shifts in customer spending, and the emergence of new startups providing infrastructure and tooling around data. In fact, many of today’s fastest growing infrastructure startups build products to manage data. These systems enable data-driven decision making (analytic systems) and drive data-powered products, including with machine learning (operational systems). They range from the pipes that carry data, to storage solutions that house data, to SQL engines that analyze data, to dashboards that make data easy to understand – from data science and machine learning libraries, to automated data pipe...

A Rudimentary Guide to Metadata Management

Image
 According to IDC, the size of the global datasphere is projected to reach 163 ZB by 2025, leading to disparate data sources in legacy systems, new system deployments, and the creation of data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making.  Identifying, classifying, and analyzing data historically has relied on manual processes and therefore, in the current age consumes a lot of resources, with respect to time and monitory value. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data’s maximum potential.  The numerous data types and data sources that are embedded in different systems and technologies over time are seldomly designed to work together. Thus, the applications or models used on multiple data types and data sources can potentially be compromised, rendering inaccurate analysis and conclusions.  Havin...

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Image
Netflix has more than 195 million subscribers that generate petabytes of data everyday. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. Usually Data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services often have the requirement to do a fast lookup for this fine-grained data which is generated periodically. For example, in order to enha...

Data Mesh Principles and Logical Architecture v2

Image
Our aspiration to augment and improve every aspect of business and life with data, demands a paradigm shift in how we manage data at scale. While the technology advances of the past decade have addressed the scale of volume of data and data processing compute, they have failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure. The original writeup, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh - which I encourage you to read before joining me back here - empathized with today’s pain points of architectural and or...

Data Processing Pipeline Patterns

Image
Data produced by applications, devices, or humans must be processed before it is consumed. By definition, a data pipeline represents the flow of data between two or more systems. It is a set of instructions that determine how and when to move data between these systems. My last blog conveyed how connectivity is foundational to a data platform. In this blog, I will describe the different data processing pipelines that leverage different capabilities of the data platform, such as connectivity and data engines for processing. There are many data processing pipelines. One may: “Integrate” data from multiple sources Perform data quality checks or standardize data Apply data security-related transformations, which include masking, anonymizing, or encryption Match, merge, master, and do entity resolution Share data with partners and customers in the required format, such as HL7 Consumers or “targets” of data pipelines may include: Data warehouses like ...

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Image
Many enterprises are investing in their next generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift from the centralized paradigm of a lake, or its predecessor data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi...