
Showing posts with the label metadata

The Growing Importance of Metadata Management Systems

As companies embrace digital technologies to transform their operations and products, many are using best-of-breed software, open source tools, and software-as-a-service (SaaS) platforms to integrate new technologies rapidly and efficiently. This often means that the data required for reports, analytics, and machine learning (ML) resides on disparate systems and platforms. As such, IT initiatives in companies increasingly involve tools and frameworks for data fusion and integration. Examples include tools for building data pipelines, data quality and data integration solutions, customer data platforms (CDPs), master data management, and data markets. Collecting, unifying, preparing, and managing data from diverse sources and formats has become imperative in this era of rapid digital transformation. Organizations that invest in foundational data technologies are much more likely to build solid applications, ranging from BI and analytics to machine learn...

Ten Use Cases to Enable an Organization with Metadata and Catalogs

Enterprises are modernizing their data platforms and associated toolsets to serve the fast-evolving needs of data practitioners, including data scientists, data analysts, business intelligence and reporting analysts, and self-service-embracing business and technology personnel. However, as the tool stack in most organizations is modernized, so is the variety of metadata generated. As the volume of data increases every day, the metadata associated with it expands, as does the need to manage it. The first thought that strikes us when we look at a data landscape and hear about a catalog is, “It scans any database, from relational to NoSQL or graph, and gives out useful information:”

- Name
- Modeled data type
- Inferred data types
- Patterns of data
- Length, with minimum and maximum thresholds
- Minimum and maximum values
- Other profiling characteristics of the data, such as the frequency of values and their distribution

What Is the Basic Benefit of Metadata Managed in Catal...
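The profiling pass a catalog runs over a scanned column can be sketched in a few lines; the function name and sample data below are illustrative, not from any particular catalog product:

```python
from collections import Counter

def profile_column(name, values):
    """Compute basic catalog-style profiling stats for one column (a sketch)."""
    non_null = [v for v in values if v is not None]
    inferred = {type(v).__name__ for v in non_null}   # inferred data types
    lengths = [len(str(v)) for v in non_null]         # value lengths
    freq = Counter(non_null)                          # frequency of values
    return {
        "name": name,
        "inferred_types": sorted(inferred),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(non_null),
        "max_value": max(non_null),
        "distribution": {v: n / len(non_null) for v, n in freq.items()},
    }

stats = profile_column("status", ["new", "open", "open", "closed", None])
print(stats["inferred_types"], stats["max_length"])  # -> ['str'] 6
```

Real catalogs compute these statistics inside the source database where possible, but the shape of the output is essentially this.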

Data Management maturity models: a comparative analysis

At first glance, you can see that there are seven key Subject Areas where the Subject domains are located:

- Data
- Data and System Design
- Technology
- Governance
- Data Quality
- Security
- Related Capabilities

You can see that the differences in approaches to defining the key Domains are rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and the deliverables of these domains are being mixed with one another. For example, let us have a look at Data governance. The domain ‘Data governance’ exists in four different models. Other domains, such as ‘Data management strategy’, which appears in three models, are considered a deliverable of the Data governance domain in other models, for example in the DAMA model. Such a wide divergence of opinion on the key Subject domains is rather confusing.

Subject domain dimensions

Subject domain dimensions are characteristics of (sub-)domains. It ...

A Rudimentary Guide to Metadata Management

According to IDC, the global datasphere is projected to reach 163 ZB by 2025, a growth that spreads data across disparate sources in legacy systems, new system deployments, and newly created data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making. Identifying, classifying, and analyzing data has historically relied on manual processes and therefore consumes a lot of resources, in both time and monetary value. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data’s maximum potential. The numerous data types and data sources that have been embedded in different systems and technologies over time are seldom designed to work together. Thus, the applications or models used across multiple data types and data sources can potentially be compromised, rendering inaccurate analyses and conclusions. Havin...
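What "defining metadata for the data owned by the organization" can look like in practice is a structured record per dataset; a minimal sketch, with field names that are illustrative rather than any standard schema:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetMetadata:
    """A minimal, hypothetical metadata record for one dataset."""
    name: str
    owner: str
    source_system: str
    classification: str  # e.g. "public", "internal", "confidential"
    columns: dict = field(default_factory=dict)  # column name -> declared type

meta = DatasetMetadata(
    name="orders",
    owner="finance-team",
    source_system="erp_legacy",
    classification="internal",
    columns={"order_id": "bigint", "amount": "decimal(12,2)"},
)
print(asdict(meta)["classification"])  # -> internal
```

Even a record this small makes a dataset discoverable and classifiable without manual spelunking through the source system.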

Gartner - 2020 Magic Quadrant for Metadata Management Solutions

Metadata management is a core aspect of an organization’s ability to manage its data and information assets. The term “metadata” describes the various facets of an information asset that can improve its usability throughout its life cycle. Metadata and its uses go far beyond technical matters. Metadata is used as a reference for business-oriented and technical projects, and lays the foundations for describing, inventorying, and understanding data for multiple use cases. Use-case examples include data governance, security and risk, data analysis, and data value.

The market for metadata management solutions is complex because these solutions are not all identical in scope or capability. Vendors include companies with one or more of the following functional capabilities in their stand-alone metadata management products (not all vendors offer all these capabilities, and not all vendor solutions offer these capabilities in one product):

- Metadata repositories — used to document and manage meta...

The Forrester Wave™: Machine Learning Data Catalogs, Q4 2020

Key Takeaways

Alation, Collibra, Alex Solutions, And IBM Lead The Pack
Forrester’s research uncovered a market in which Alation, Collibra, Alex Solutions, and IBM are Leaders; data.world, Informatica, Io-Tahoe, and Hitachi Vantara are Strong Performers; and Infogix and erwin are Contenders.

Collaboration, Lineage, And Data Variety Are Key Differentiators
As metadata and business glossary technology becomes outdated and less effective, improved machine learning will dictate which providers lead the pack. Vendors that can provide scale-out collaboration, offer detailed data lineage, and interpret any type of data will position themselves to successfully deliver contextualized, trusted, accessible data to their customers.

Read full report >>>

Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend toward adopting or building ML platforms naturally raises the question: how do you internally discover ML features, models, metrics, datasets, and so on? In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion of the early approaches to open sourcing WhereHows, and a comparison of our inte...

Amundsen — Lyft’s data discovery & metadata engine

The problem

Unprecedented growth in data volumes has led to two big challenges:

- Productivity — Whether it’s building a new model, instrumenting a new metric, or doing ad hoc analysis, how can I most productively and effectively make use of this data?
- Compliance — When collecting data about a company’s users, how do organizations comply with increasing regulatory and compliance demands and uphold the trust of their users?

The key to solving these problems lies not in the data, but in the metadata. And, to show you how, let’s go through a journey of how we solved a part of the productivity problem at Lyft using metadata.

Productivity

At a 50,000-foot level, the data scientist workflow looks like the following.

Read full article >>>

The role of Apache Atlas in the open metadata ecosystem

Introducing Apache Atlas

Apache Atlas emerged as an Apache incubator project in May 2015. It is scoped to provide an open source implementation for metadata management and governance. The initial focus was the Apache Hadoop environment, although Apache Atlas has no dependencies on the Hadoop platform itself. At its core, Apache Atlas has a graph database for storing metadata, a search capability based on Apache Lucene, and a simple notification service based on Apache Kafka. There is a type definition language for describing the metadata stored in the graph, and standard APIs for populating metadata: business glossary terms, classification tags, data sources, and lineage.

The start of an ecosystem

What makes Apache Atlas different from other metadata solutions is that it is designed to ship with the platform where the data is stored. It is, in fact, a core component of the data platform. This means the different processes and engines that run on the ...
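The APIs for populating metadata mentioned above are exposed over REST. A minimal sketch of building an entity payload for Atlas's v2 API follows; the payload shape follows the public Atlas docs, while the table name, qualifiedName, and endpoint details are placeholders you would replace for your own cluster:

```python
import json

def hive_table_entity(name, qualified_name, description=""):
    """Build an Atlas v2 entity payload for a Hive table (a sketch)."""
    return {
        "entity": {
            "typeName": "hive_table",  # a type defined in Atlas's type system
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "description": description,
            },
        }
    }

payload = hive_table_entity("orders", "warehouse.orders@cluster1", "Orders fact table")
# POST this JSON to http://<atlas-host>:21000/api/atlas/v2/entity
# (e.g. requests.post(url, json=payload, auth=(user, password)))
print(json.dumps(payload["entity"]["attributes"], indent=2))
```

In a real deployment, most such entities are not posted by hand: the engines that ship with the platform publish them automatically, often via the Kafka notification channel rather than direct REST calls.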