Posts

Showing posts with the label Data Governance

Data Reliability at Scale: How Fox Digital Architected its Modern Data Stack

As distributed architectures continue to become a new gold standard for data-driven organizations, this kind of self-serve motion would be a dream come true for many data leaders. So when the Monte Carlo team got the chance to sit down with Alex, we took a deep dive into how he made it happen. Here’s how his team architected a hybrid data architecture that prioritizes democratization and access, while ensuring reliability and trust at every turn.

Exercise “Controlled Freedom” when dealing with stakeholders

Alex has built decentralized access to data at Fox on a foundation he calls “controlled freedom.” In fact, he believes using your data team as the single source of truth within an organization actually creates the biggest silo. So instead of becoming a guardian and bottleneck, Alex and his data team focus on setting certain parameters around how data is ingested and supplied to stakeholders. Within the framework, internal data consumers at Fox have the freedom to cr...

Top 9 Data Modeling Tools & Software 2021

Data modeling is the process of crafting a visual representation of an entire information system, or portions of it, in order to convey the connections between data points and structures. The objective is to portray the types of data used and stored within the system, the ways the data can be organized and grouped, the relationships among these data types, and their attributes and formats. Data modeling uses abstraction to better understand and represent the flow of data within an enterprise-level information system. The types of data models include conceptual, logical, and physical data models. Database and information system design begins with the creation of these data models.

What is a Data Modeling Tool?

A data modeling tool enables quick and efficient database design while minimizing human error. Data modeling software helps craft a high-performance database, generate reports that are useful for stakeholders, and create data de...
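To make the distinction between a logical model and the physical schema generated from it concrete, here is a minimal sketch (not from the article) using SQLAlchemy; the entities, columns, and relationship are hypothetical.

```python
# A small logical model expressed in code: two entities and one relationship.
# Entity and column names are invented for illustration.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)       # identifying attribute
    email = Column(String(255), nullable=False)  # required attribute with a bounded length

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))  # relationship between entities
    customer = relationship("Customer")

# Deriving the physical model: emit the DDL for a concrete database engine.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```

A dedicated modeling tool performs a similar translation visually, typically adding diagramming, reverse engineering of existing databases, and stakeholder-facing reports on top.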

The Growing Importance of Metadata Management Systems

As companies embrace digital technologies to transform their operations and products, many are using best-of-breed software, open source tools, and software as a service (SaaS) platforms to rapidly and efficiently integrate new technologies. This often means that the data required for reports, analytics, and machine learning (ML) resides on disparate systems and platforms. As such, IT initiatives in companies increasingly involve tools and frameworks for data fusion and integration. Examples include tools for building data pipelines, data quality and data integration solutions, customer data platforms (CDP), master data management, and data markets. Collecting, unifying, preparing, and managing data from diverse sources and formats has become imperative in this era of rapid digital transformation. Organizations that invest in foundational data technologies are much more likely to build solid foundational applications, ranging from BI and analytics to machine learn...

Ten Use Cases to Enable an Organization with Metadata and Catalogs

Enterprises are modernizing their data platforms and associated tool-sets to serve the fast-changing needs of data practitioners, including data scientists, data analysts, business intelligence and reporting analysts, and self-service-embracing business and technology personnel. However, as the tool stack in most organizations gets modernized, so does the variety of metadata generated. As the volume of data increases every day, the metadata associated with that data expands, as does the need to manage it. The first thought that strikes us when we look at a data landscape and hear about a catalog is, “It scans any database, ranging from relational to NoSQL or graph, and gives out useful information”: the column name, the modeled data type, inferred data types, patterns in the data, lengths with minimum and maximum thresholds, minimum and maximum values, and other profiling characteristics of the data such as the frequency of values and their distribution.

What Is the Basic Benefit of Metadata Managed in Catal...
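The profiling characteristics listed above are essentially what a catalog’s scanner computes for each column when it connects to a database. As a rough illustration (not from the article), here is a minimal sketch of such a column profile using pandas; the table and column names are invented.

```python
# Compute a catalog-style profile for each column of a (hypothetical) table.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "country": ["US", "US", "DE", "FR"],
})

def profile_column(series: pd.Series) -> dict:
    lengths = series.astype(str).str.len()
    return {
        "name": series.name,
        "inferred_dtype": str(series.dtype),                  # inferred data type
        "min_value": series.min(),                            # minimum value
        "max_value": series.max(),                            # maximum value
        "min_length": int(lengths.min()),                     # shortest value
        "max_length": int(lengths.max()),                     # longest value
        "value_frequencies": series.value_counts().to_dict(), # distribution of values
    }

for column in df.columns:
    print(profile_column(df[column]))
```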

Data Management maturity models: a comparative analysis

At first glance, you can see that there are seven key Subject Areas where the Subject domains are located: Data, Data and System Design, Technology, Governance, Data Quality, Security, and Related Capabilities. You can see that the differences in how the key Domains are defined are rather big. It is not the purpose of this article to deliver a detailed analysis, but there is one striking observation I would like to share: the Subject domains and the deliverables of these domains are being mixed with one another. For example, let us have a look at Data governance. The domain ‘Data governance’ exists in four different models. Other domains, like ‘Data management strategy’, which appears in three models, are considered a deliverable of the Data Governance domain in other models, for example in the DAMA model. Such a big difference of opinions on key Subject domains is rather confusing.

Subject domain dimensions

Subject domain dimensions are characteristics of (sub-)domains. It ...

A Rudimentary Guide to Metadata Management

According to IDC, the size of the global datasphere is projected to reach 163 ZB by 2025, leading to disparate data sources in legacy systems, new system deployments, and the creation of data lakes and data warehouses. Most organizations do not utilize the entirety of the data at their disposal for strategic and executive decision making. Identifying, classifying, and analyzing data has historically relied on manual processes and therefore consumes a lot of resources, in terms of both time and money. Defining metadata for the data owned by the organization is the first step in unleashing the organizational data’s maximum potential. The numerous data types and data sources that have been embedded in different systems and technologies over time are seldom designed to work together. Thus, the applications or models used on multiple data types and data sources can potentially be compromised, leading to inaccurate analyses and conclusions. Havin...
Over the past few years, companies have been massively shifting their data and applications to the cloud, which has given rise to a growing community of data users. They are encouraged to capture, gather, analyze, and save data for business insights and decision-making. More organizations are moving toward multi-cloud, and the threat of losing data and the challenge of securing it have grown. Therefore, managing security policies, rules, metadata details, and content traits is becoming critical in the multi-cloud. In this regard, enterprises are in search of expertise and cloud tool vendors capable of providing the fundamental cloud security and data governance competencies with excellence. Start by building policies and writing them as code or scripts that can be executed. This requires compliance and cloud security experts working together to build a framework for your complex business. You cannot start from scratch, as that will be error-prone and take too long. Try to in...
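As a rough illustration of policies written as executable code (not from the article), here is a minimal sketch of a hypothetical rule that flags cloud storage buckets that are publicly accessible or unencrypted; the bucket metadata is invented, and real deployments typically rely on a dedicated policy engine rather than hand-rolled scripts.

```python
# A tiny "policy as code" check: evaluate each bucket against two governance rules.
from dataclasses import dataclass

@dataclass
class Bucket:
    name: str
    public: bool
    encrypted: bool

def check_bucket_policy(bucket: Bucket) -> list[str]:
    """Return a list of policy violations for one bucket."""
    violations = []
    if bucket.public:
        violations.append(f"{bucket.name}: bucket must not be publicly accessible")
    if not bucket.encrypted:
        violations.append(f"{bucket.name}: data at rest must be encrypted")
    return violations

inventory = [
    Bucket("analytics-raw", public=False, encrypted=True),
    Bucket("marketing-exports", public=True, encrypted=False),
]

for bucket in inventory:
    for violation in check_bucket_policy(bucket):
        print(violation)
```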

Protegrity Announces Support for Amazon Redshift to Secure Sensitive Cloud Data

Protegrity, the data-security solutions provider, today announced support for Amazon Redshift, a fully managed, petabyte-scale cloud data warehouse. Organizations with high data-security and IT requirements can now deploy Protegrity’s data de-identification technology in the Amazon Redshift environment. With its format-preserving vaultless tokenization capabilities, Protegrity goes beyond encryption to ensure data is protected at every step of the data lifecycle—from storing and moving to analyzing—no matter where it lives. By allowing protected data to be fully utilized without risk, Protegrity for Amazon Redshift enables customers to drive significantly more value and insights from sensitive data in the cloud. Building on Amazon Redshift’s comprehensive, built-in security capabilities available to customers at no extra cost, Protegrity protects the privacy of individuals by anonymizing data before it reaches Amazon Redshift. The combination of Protegrity and Amazon Redshift allows bus...

Gartner - 2020 Magic Quadrant for Metadata Management Solutions

Metadata management is a core aspect of an organization’s ability to manage its data and information assets. The term “metadata” describes the various facets of an information asset that can improve its usability throughout its life cycle. Metadata and its uses go far beyond technical matters. Metadata is used as a reference for business-oriented and technical projects, and lays the foundations for describing, inventorying and understanding data for multiple use cases. Use-case examples include data governance, security and risk, data analysis and data value. The market for metadata management solutions is complex because these solutions are not all identical in scope or capability. Vendors include companies with one or more of the following functional capabilities in their stand-alone metadata management products (not all vendors offer all these capabilities, and not all vendor solutions offer these capabilities in one product): Metadata repositories — Used to document and manage meta...

Barriers to Effective Information Asset Management (Research)

In the knowledge-based economy, the wealth-creating capacity of organisations is no longer based on tangible assets such as buildings, equipment, and vehicles alone. Intangible assets are key contributors to securing sustainable competitive advantage. It is therefore critically important that intangible Information Assets (IA) such as data, documents, content on web sites, and knowledge are understood and well managed. The sound management of these assets allows an organisation to run faster and better, resulting in products and services that are of a higher quality at a lower cost, with the benefits of reduced risk, improved competitive position, and higher return on investment. The initial stage of this research found that executive-level managers acknowledge the existence and importance of Information Assets in their organisations, but that hardly any mechanisms are in place for the management and governance of these valuable assets. This paper discusses the reasons for this situation...

The Forrester Wave™: Machine Learning Data Catalogs, Q4 2020

Key Takeaways

Alation, Collibra, Alex Solutions, And IBM Lead The Pack
Forrester’s research uncovered a market in which Alation, Collibra, Alex Solutions, and IBM are Leaders; data.world, Informatica, Io-Tahoe, and Hitachi Vantara are Strong Performers; and Infogix and erwin are Contenders.

Collaboration, Lineage, And Data Variety Are Key Differentiators
As metadata and business glossary technology becomes outdated and less effective, improved machine learning will dictate which providers lead the pack. Vendors that can provide scale-out collaboration, offer detailed data lineage, and interpret any type of data will position themselves to successfully deliver contextualized, trusted, accessible data to their customers.

Read full report >>>

Only 3% of Companies’ Data Meets Basic Quality Standards

Our analyses confirm that data is in far worse shape than most managers realize — and than we feared — and carry enormous implications for managers everywhere: On average, 47% of newly created data records have at least one critical (e.g., work-impacting) error. A full quarter of the scores in our sample are below 30%, and half are below 57%. In today’s business world, work and data are inextricably tied to one another. No manager can claim that his area is functioning properly in the face of data quality issues. It is hard to see how businesses can survive, never mind thrive, under such conditions. Only 3% of the DQ scores in our study can be rated “acceptable” using the loosest possible standard. We often ask managers (both in these classes and in consulting engagements) how good their data needs to be. While a fine-grained answer depends on their uses of the data, how much an error costs them, and other company- and department-specific considerations, none has ever thought...
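The scores quoted above rest on a simple calculation: a sample’s data quality score is the share of records containing no critical error. A minimal sketch of that arithmetic, with invented error flags:

```python
# Each entry says whether a record contains at least one critical (work-impacting) error.
records_have_error = [False, True, False, False, True,
                      False, False, True, False, False]

error_free = sum(not has_error for has_error in records_have_error)
dq_score = 100 * error_free / len(records_have_error)

print(f"DQ score: {dq_score:.0f}%")  # 70% of these records are error-free
```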

Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally raises the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.? In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, followed by a discussion on the early approaches for open sourcing WhereHows, and a comparison of our inte...

The Ins and Outs of Data Acquisition: Beliefs and Best Practices

Data acquisition involves the set of activities required to qualify and obtain external data — and also data that may be available elsewhere in an organization — and then to arrange for it to be brought into or accessed by the company. This strategy is on the rise as organizations leverage this data to get access to prospects, learn more about customers they already work with, be more competitive, develop new products, and more. Standardizing data acquisition can be an afterthought at the tail end of a costly data journey — if it’s considered at all. Now we see companies starting to pay more attention to the finer, critical points of data acquisition as the need for more data grows. And this is a good thing, because ignoring acquisition best practices and proper oversight will lead to a whole host of problems that can outweigh the benefits of bringing in new data. The costly, problematic issues we see organizations grapple with include: Data purchases that, once brought i...

Model governance and model operations: building and deploying robust, production-ready machine learning models

O'Reilly's surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts. With the s...

Test data quality at scale with AWS Deequ

You generally write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues include: missing values can lead to failures in production systems that require non-null values (NullPointerException); changes in the distribution of data can lead to unexpected outputs of machine learning models; and aggregations of incorrect data can lead to wrong business decisions. In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ supports you by suggesting checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (th...
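Deequ itself is a Scala library that runs on Apache Spark; as a rough illustration of the constraint-verification workflow described above, here is a minimal sketch using PyDeequ, the Python wrapper, with a hypothetical DataFrame and column names.

```python
import os
os.environ.setdefault("SPARK_VERSION", "3.3")  # recent PyDeequ releases read this to pick a matching Deequ jar

from pyspark.sql import SparkSession, Row
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

# Hypothetical dataset with one missing value in "amount".
df = spark.createDataFrame([Row(id=1, amount=10.0),
                            Row(id=2, amount=5.5),
                            Row(id=3, amount=None)])

check = Check(spark, CheckLevel.Error, "basic data quality checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("id")           # no missing values in "id"
                    .isUnique("id")             # "id" values are distinct
                    .isComplete("amount")       # fails: one NULL amount above
                    .isNonNegative("amount"))   # no negative amounts
          .run())

VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)
```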

Building and Scaling Data Lineage at Netflix

Netflix Data Landscape Freedom & Responsibility (F&R) is the linchpin of Netflix’s culture, empowering teams to move fast to deliver on innovation and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted and supported options) and guard rails to help reduce variance in the choices of tools and technologies available to support the development of scalable technical architectures. Nonetheless, the Netflix data landscape is complex, and many teams collaborate effectively to share responsibility for managing our data systems. Therefore, building a complete and accurate data lineage system to map out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is a monumental task and requires a scalable architecture, robust design, a strong engineering team, and, above all, amazing cross-f...
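As a rough illustration (not Netflix’s implementation), a lineage system can be viewed as a directed graph whose nodes are data artifacts such as Kafka topics, tables, ML models, reports, and dashboards, and whose edges record which artifact feeds which; here is a minimal sketch with invented artifact names.

```python
# Lineage as a directed graph: each artifact maps to the artifacts it reads from.
lineage = {
    "kafka://playback-events":   [],
    "table://events_cleaned":    ["kafka://playback-events"],
    "model://churn-predictor":   ["table://events_cleaned"],
    "dashboard://weekly-report": ["table://events_cleaned"],
}

def upstream(artifact: str, graph: dict) -> set:
    """Return every artifact the given artifact ultimately depends on."""
    seen, stack = set(), list(graph.get(artifact, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

print(upstream("dashboard://weekly-report", lineage))
# -> {'table://events_cleaned', 'kafka://playback-events'} (order may vary)
```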

Collection of data governance resources

Learning about data governance Use these introductory books, videos, and articles to understand the basics of data governance. Data Governance: What You Need to Know  — Jon Bruner explains how a data governance program provides the intellectual and institutional grounding to address the data needs across an organization, anticipate new issues, and provide for development according to the company’s strategic plan. Data Governance  — John Adler leads you through the maze of data governance issues facing companies today—security breaches, regulatory agencies, in-house turf battles over who controls the data, monetizing data, and more. The Rise of Big Data Governance: Insight on this Emerging Trend from Active Open Source Initiatives  — John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance. Understanding the Chief Data Officer  — Through interviews with current and former chief data officers (CDO), Julie Steele lo...

The DGI Data Governance Framework

The DGI Data Governance Framework is a logical structure for classifying, organizing, and communicating complex activities involved in making decisions about and taking action on enterprise data.

Gartner - Market Guide for Information Stewardship Applications

The critical need for information governance continues to drive a diversified market for information stewardship solutions that support it. Data and analytics leaders must assess the capabilities these solutions offer to select vendors that will best suit their needs.

Key Findings

Policy setting in information governance programs is still so varied and inconsistent that no market of offerings is forming as yet. Furthermore, policy enforcement in information stewardship initiatives is coalescing into a market, but now across a wider set of use cases. Information stewardship applications available in the market do not yet fully support the information steward's wider role and tasks. Growth in the market for information stewardship applications is being disrupted by new technology capabilities in adjacent markets, such as data quality and metadata management, and by new regulatory requirements, such as GDPR.

Recommendations

For data and analytics leaders working with dat...