Data Discovery Platforms and Their Open Source Solutions
In the past year or two, many companies have shared their data discovery platforms (the latest being Facebook’s Nemo). Based on this list, we now know of more than 10 implementations.
I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:
- The questions these platforms help answer
- The features developed to answer these questions
- How they compare with each other
- What open source solutions are available
By the end of this post, we’ll know the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at the open source solutions available.
Questions we ask in the data discovery process
Before discussing platform features, let’s briefly go over some common questions in the data discovery process.
Where can I find data about ____? If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “click”, “page views”, or “browse”? A common solution is free-text search on table names and even columns. (We’ll see how Nemo improves on this in the next section.)
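To make the free-text search idea concrete, here is a minimal sketch in Python. It is not how any of the platforms implement search; the catalog contents and the `search_tables` function are made up for illustration, and matching is plain substring matching on table and column names.

```python
# Hypothetical catalog: table name -> column names (illustrative only).
catalog = {
    "web_page_views": ["user_id", "url", "viewed_at"],
    "ad_clicks": ["user_id", "ad_id", "click_ts"],
    "orders": ["order_id", "user_id", "total"],
}


def search_tables(catalog, query):
    """Rank tables by how many query terms appear in their name or columns."""
    terms = query.lower().split()
    scored = []
    for name, columns in catalog.items():
        # Search over the table name and all of its column names together.
        text = " ".join([name, *columns]).lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, name))
    # Best match first; ties broken alphabetically so results are stable.
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


print(search_tables(catalog, "click"))
print(search_tables(catalog, "page views"))
print(search_tables(catalog, "browse"))
```

Searching for “browse” returns nothing here even though two of the tables hold browsing behavior, which is exactly the terminology problem described above.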
What is the data about? What columns does the data have? What are the data types? What do they mean? Displaying table schemas and column descriptions goes a long way here.
Who can I ask for access? Ownership and how to get permissions should be part of the metadata displayed for each table.
How is the data created? Can I trust it? Before using the data in production, we’ll want to ensure its reliability and quality. Who’s creating the data? Is it a scheduled data cleaning pipeline? Or does an analyst manually run it for monthly reporting? Also, how widely is the data used? Displaying usage statistics and data lineage helps with this.
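As a rough illustration of what data lineage gives us, the sketch below walks a lineage graph to find every table that feeds into a given table. The `lineage` mapping and the table names are assumptions for the example; real platforms typically derive lineage from pipeline or query metadata.

```python
# Hypothetical lineage: each table maps to the tables it is built from.
lineage = {
    "monthly_report": ["clean_orders"],
    "clean_orders": ["raw_orders", "raw_customers"],
}


def upstream_tables(lineage, table):
    """Return all direct and indirect sources of `table`."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream_tables(lineage, "monthly_report")))
```

If one of the upstream tables turns out to be a manually refreshed extract, that is a signal to be careful before relying on the downstream table in production.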
How should I use the data? Which columns are relevant? What tables should I join on? What filters should I apply to clean the data? One way to address this is to display the most frequent users of each table so people can ask them. Alternatively, we can provide statistics on column usage.
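Both ideas can be sketched from a query log. The example below assumes a simplified log where each entry already records the user, the table, and the columns referenced; in practice this would need to be parsed out of SQL, and the field names here are hypothetical.

```python
from collections import Counter

# Hypothetical, pre-parsed query log (illustrative only).
query_log = [
    {"user": "alice", "table": "orders", "columns": ["order_id", "total"]},
    {"user": "bob", "table": "orders", "columns": ["order_id"]},
    {"user": "alice", "table": "orders", "columns": ["order_id", "user_id"]},
    {"user": "carol", "table": "ad_clicks", "columns": ["ad_id"]},
]


def top_users(query_log, table, n=3):
    """Most frequent users of a table, as (user, query_count) pairs."""
    users = Counter(q["user"] for q in query_log if q["table"] == table)
    return users.most_common(n)


def column_usage(query_log, table):
    """Count how often each column of a table appears in queries."""
    counts = Counter()
    for q in query_log:
        if q["table"] == table:
            counts.update(q["columns"])
    return counts


print(top_users(query_log, "orders"))
print(column_usage(query_log, "orders"))
```

The top users are the people to ask about a table, and the column counts hint at which columns matter most.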
How frequently does the data refresh? If delays are common, how long are they? Stale data can reduce the effectiveness of time-sensitive machine learning systems. Also, what period does the data cover? If the table is only a few weeks old, we won’t have enough data for machine learning. A simple solution is to show table creation dates, partition dates, and when the table was last updated.
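Given those dates, freshness and coverage checks are simple to derive. Here is a small sketch, assuming we know each table’s last update time, its expected refresh interval, and its partition dates; the function names and thresholds are illustrative, not taken from any platform.

```python
from datetime import date, datetime, timedelta


def is_stale(last_updated, expected_interval, now):
    """True if the table has gone longer than expected without a refresh."""
    return now - last_updated > expected_interval


def coverage_days(partition_dates):
    """Number of days spanned by the earliest and latest partitions, inclusive."""
    return (max(partition_dates) - min(partition_dates)).days + 1


# A daily table last refreshed 30 hours ago is overdue.
print(is_stale(
    last_updated=datetime(2021, 1, 1, 0, 0),
    expected_interval=timedelta(hours=24),
    now=datetime(2021, 1, 2, 6, 0),
))

# Three weeks of partitions may be too little history for model training.
print(coverage_days([date(2021, 1, 1), date(2021, 1, 21)]))
```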