Posts

The Growing Importance of Metadata Management Systems

As companies embrace digital technologies to transform their operations and products, many are using best-of-breed software, open source tools, and software as a service (SaaS) platforms to rapidly and efficiently integrate new technologies. This often means that data required for reports, analytics, and machine learning (ML) resides on disparate systems and platforms. As such, IT initiatives in companies increasingly involve tools and frameworks for data fusion and integration. Examples include tools for building data pipelines, data quality and data integration solutions, customer data platforms (CDPs), master data management, and data markets. Collecting, unifying, preparing, and managing data from diverse sources and formats has become imperative in this era of rapid digital transformation. Organizations that invest in foundational data technologies are much more likely to build a solid foundation for applications, ranging from BI and analytics to machine learn...

Visualizing Data Timeliness at Airbnb

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind. Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness: When should a dataset be considered late? How frequently are datasets late? Why is a dataset late? This project is a critical part of our efforts to achieve high data quality, and building it required overcoming many technical, product, and organizational challenges. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data of data timeliness. Continue reading >>>
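The first of those questions, when a dataset should be considered late, can be made concrete as a simple check against a landing deadline. The sketch below is illustrative only; the dataset names and SLA offsets are made-up assumptions, not Airbnb's actual configuration:

```python
from datetime import datetime, timedelta

# Hypothetical SLA config: each dataset must land within a fixed offset
# after the start of its data interval (e.g. its daily partition).
SLA_OFFSETS = {
    "bookings_daily": timedelta(hours=6),   # must land by 06:00
    "listings_daily": timedelta(hours=9),   # must land by 09:00
}

def is_late(dataset: str, interval_start: datetime, landed_at: datetime) -> bool:
    """A dataset is late if it lands after interval_start plus its SLA offset."""
    deadline = interval_start + SLA_OFFSETS[dataset]
    return landed_at > deadline

midnight = datetime(2021, 5, 1)
print(is_late("bookings_daily", midnight, datetime(2021, 5, 1, 5, 30)))  # False: on time
print(is_late("bookings_daily", midnight, datetime(2021, 5, 1, 7, 15)))  # True: missed SLA
```

The other two questions ("how frequently" and "why") then become aggregations over records like these: counting late landings per dataset, and tracing lateness back through upstream dependencies.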

ThoughtWorks Decoder puts tech into a business context

The tech landscape changes pretty fast. There are always new terms, techniques and tools emerging. But don't let tech be an enigma: ThoughtWorks Decoder is here to help. Simply search for the term you're interested in, and we'll give you the lowdown on what it is, what it can do for your enterprise and what the potential drawbacks are. ThoughtWorks Decoder >>>

Ten Use Cases to Enable an Organization with Metadata and Catalogs

Enterprises are modernizing their data platforms and associated tool-sets to serve the fast-changing needs of data practitioners, including data scientists, data analysts, business intelligence and reporting analysts, and self-service-embracing business and technology personnel. However, as the tool-stack in most organizations gets modernized, so does the variety of metadata generated. As the volume of data increases every day, the metadata associated with it expands as well, as does the need to manage it. The first thought that strikes us when we look at a data landscape and hear about a catalog is, “It scans any database, ranging from relational to NoSQL or graph, and gives out useful information,” such as:

- Name
- Modeled data type
- Inferred data types
- Patterns of data
- Length, with minimum and maximum thresholds
- Minimum and maximum values
- Other profiling characteristics of data, like frequency of values and their distribution

What Is the Basic Benefit of Metadata Managed in Catal...
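The profiling characteristics in the list above can be sketched as a tiny column profiler. This is an illustrative sketch of the idea, not any particular catalog's implementation, and the field names are assumptions:

```python
from collections import Counter

def profile_column(name, values):
    """Compute basic profiling stats for one column, as a catalog scanner might."""
    inferred = {type(v).__name__ for v in values}      # inferred data types
    lengths = [len(str(v)) for v in values]            # length thresholds
    freq = Counter(values)                             # frequency of values
    return {
        "name": name,
        "inferred_types": sorted(inferred),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(values),
        "max_value": max(values),
        "value_frequencies": dict(freq.most_common(3)),
    }

stats = profile_column("age", [34, 29, 34, 41, 29, 34])
print(stats["min_value"], stats["max_value"])  # 29 41
print(stats["value_frequencies"])              # {34: 3, 29: 2, 41: 1}
```

A real scanner would additionally pull the modeled type from the database's own schema and detect string patterns (dates, emails, IDs), but the shape of the output is the same.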

Gartner Magic Quadrant for Data Science and Machine Learning Platforms 2021

This report assesses 20 vendors of platforms that data scientists and others can use to source data, build models and operationalize machine learning. It will help them make the right choice from a crowded field in a maturing DSML platform market that continues to show rapid product development. Market Definition/Description: Gartner defines a data science and machine learning (DSML) platform as a core product and supporting portfolio of coherently integrated products, components, libraries and frameworks (including proprietary, partner-sourced and open-source). Its primary users are data science professionals, including expert data scientists, citizen data scientists, data engineers, application developers and machine learning (ML) specialists. The core product and supporting portfolio:

- Are sufficiently well-integrated to provide a consistent “look and feel.”
- Create a user experience in which all components are reasonably interoperable in support of an analytics pipeline.

The...

Data Management maturity models: a comparative analysis

At first glance, you can see that there are seven key Subject Areas where the Subject domains are located. These are:

- Data
- Data and System Design
- Technology
- Governance
- Data Quality
- Security
- Related Capabilities

You can see that the difference in approaches to defining the key Domains is rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and the deliverables of these domains are being mixed with one another. For example, let us have a look at Data governance. The domain ‘Data governance’ exists in four different models. Other domains, like ‘Data management strategy’, which appears in three models, are considered a deliverable of the Data governance domain in other models, for example in the DAMA model. Such a big difference of opinions on key Subject domains is rather confusing. Subject domain dimensions: Subject domain dimensions are characteristics of (sub-)domains. It ...

Data Discovery Platforms and Their Open Source Solutions

In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations. I hadn’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:

- The questions these platforms help answer
- The features developed to answer these questions
- How they compare with each other
- What open source solutions are available

By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at the open source solutions available. Questions we ask in the data discovery process: Before discussing platform features, let’s briefly go over some common questions in the data discovery process. Where can I find data about ____? If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “c...

A Rudimentary Guide to Metadata Management

According to IDC, the size of the global datasphere is projected to reach 163 ZB by 2025, leading to disparate data sources in legacy systems, new system deployments, and the creation of data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making. Identifying, classifying, and analyzing data has historically relied on manual processes and therefore, in the current age, consumes a lot of resources, in terms of both time and monetary value. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data’s maximum potential. The numerous data types and data sources that have been embedded in different systems and technologies over time are seldom designed to work together. Thus, the applications or models used on multiple data types and data sources can potentially be compromised, rendering inaccurate analysis and conclusions. Havin...
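That first step, defining metadata for each dataset the organization owns, can start as simply as a structured record per dataset. The field choices below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """A minimal metadata record for one organizational dataset."""
    name: str
    owner: str            # team accountable for the data
    source_system: str    # where the data originates
    description: str = ""
    tags: list = field(default_factory=list)

orders = DatasetMetadata(
    name="orders",
    owner="sales-analytics",
    source_system="legacy_erp",
    description="Daily order facts exported from the ERP system.",
    tags=["pii:none", "refresh:daily"],
)
print(orders.name, orders.owner)
```

Even a record this small makes the disparate-sources problem tractable: once every legacy system, lake, and warehouse table carries one, the records can be collected into a searchable catalog.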

Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores

Netflix has more than 195 million subscribers who generate petabytes of data every day. Data scientists and engineers collect this data from our subscribers and videos, and implement data analytics models to discover customer behavior with the goal of maximizing user joy. Usually, data scientists and engineers write Extract-Transform-Load (ETL) jobs and pipelines using big data compute technologies, like Spark or Presto, to process this data and periodically compute key information for a member or a video. The processed data is typically stored as data warehouse tables in AWS S3. Iceberg is widely adopted in Netflix as a data warehouse table format that addresses many of the usability and performance problems with Hive tables. At Netflix, we also heavily embrace a microservice architecture that emphasizes separation of concerns. Many of these services often have the requirement to do a fast lookup for this fine-grained data which is generated periodically. For example, in order to enha...
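The pattern described here, periodically moving batch-computed warehouse rows into a key-value store for fast per-member or per-video lookup, can be sketched roughly as follows. This is a generic illustration using a plain dict as a stand-in for the KV store client; it is not Bulldozer's actual API, and the column names are made up:

```python
def batch_load(rows, kv_store, key_column, batch_size=2):
    """Move warehouse table rows into a key-value store in batches.

    rows: iterable of dicts, one per warehouse table row.
    kv_store: anything with a dict-like update(); stand-in for a real KV client.
    """
    batch = {}
    for row in rows:
        key = row[key_column]
        batch[key] = {c: v for c, v in row.items() if c != key_column}
        if len(batch) >= batch_size:
            kv_store.update(batch)  # one bulk write per batch, not per row
            batch = {}
    if batch:
        kv_store.update(batch)      # flush the final partial batch

# Rows as an ETL job might periodically materialize them (hypothetical schema).
warehouse_rows = [
    {"member_id": "m1", "top_genre": "drama"},
    {"member_id": "m2", "top_genre": "comedy"},
    {"member_id": "m3", "top_genre": "sci-fi"},
]
store = {}
batch_load(warehouse_rows, store, key_column="member_id")
print(store["m2"])  # {'top_genre': 'comedy'}
```

Batching the writes is the design point: online stores handle a few large bulk writes far better than millions of single-row puts, which is why such moves run as scheduled batch jobs rather than streaming row by row.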
Over the past few years, companies have been massively shifting their data and applications to the cloud, which has given rise to a growing community of data users. They are encouraged to capture, gather, analyze, and save data for business insights and decision-making. More organizations are moving towards the use of multi-cloud, and the threat of losing data, and of securing it, has become challenging. Therefore, managing security policies, rules, metadata details, and content traits is becoming critical for the multi-cloud. In this regard, enterprises are in search of expertise and cloud tool vendors that are capable of providing the fundamental cloud security data governance competencies with excellence. Start by building policies and writing them into code, or scripts that can be executed. This requires compliance and cloud security experts working together to build a framework for your complex business. You cannot start from scratch, as it will be error-prone and will take too long. Try to in...
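Writing policies into executable code, as suggested above, can be as simple as expressing each rule as a predicate over a resource's configuration and running all rules on every resource. A minimal sketch, with made-up policy names and resource fields:

```python
def check_policies(resource, policies):
    """Return the names of the policies a cloud resource violates."""
    return [name for name, rule in policies.items() if not rule(resource)]

# Hypothetical policy rules expressed as executable predicates.
POLICIES = {
    "storage-must-be-encrypted": lambda r: r.get("encrypted", False),
    "no-public-access": lambda r: not r.get("public", False),
    "owner-tag-required": lambda r: "owner" in r.get("tags", {}),
}

bucket = {"name": "analytics-raw", "encrypted": True, "public": True, "tags": {}}
print(check_policies(bucket, POLICIES))
# ['no-public-access', 'owner-tag-required']
```

Because the rules are plain code, the same checks can run unchanged against resource inventories pulled from each cloud provider, which is what makes this approach workable in a multi-cloud setup.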