Posts

Showing posts from February, 2021

Data Management maturity models: a comparative analysis

Image
From the first glance, you can see that there are seven key Subject Areas where the Subject domains are located. These are: Data Data and System Design Technology, Governance Data Quality Security Related Capabilities. You can see that the difference in approaches to define the key Domains are rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and deliverables of these domains are being mixed with one another.  For example, let us have a look at Data governance. The domain ‘Data governance’ exists in four different models. Some other domains like ‘Data management strategy’, that appears in three models, is considered as a deliverable of Data Governance domain in other models, for example in DAMA model. Such a big difference of opinions on key Subject domains is rather confusing. Subject domain dimensions Subject domain dimensions are characteristics of (sub-) domains. It is i

Data Discovery Platforms and Their Open Source Solutions

Image
In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations. I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in: The questions these platforms help answer The features developed to answer these questions How they compare with each other What open source solutions are available By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at open source solutions available. Questions we ask in the data discovery process Before discussing platform features, let’s briefly go over some common questions in the data discovery process. Where can I find data about ____? If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “c

A Rudimentary Guide to Metadata Management

Image
 According to IDC, the size of the global datasphere is projected to reach 163 ZB by 2025, leading to disparate data sources in legacy systems, new system deployments, and the creation of data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making.  Identifying, classifying, and analyzing data historically has relied on manual processes and therefore, in the current age consumes a lot of resources, with respect to time and monitory value. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data’s maximum potential.  The numerous data types and data sources that are embedded in different systems and technologies over time are seldomly designed to work together. Thus, the applications or models used on multiple data types and data sources can potentially be compromised, rendering inaccurate analysis and conclusions.  Having consistency acros

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Image
Netflix has more than 195 million subscribers that generate petabytes of data everyday. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behaviour with the goal of maximizing user joy. Usually Data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services often have the requirement to do a fast lookup for this fine-grained data which is generated periodically. For example, in order to enha