Open sourcing DataHub: LinkedIn’s metadata search and discovery platform

Finding the right data quickly is critical for any company that relies on big data insights to make data-driven decisions. Not only does this impact the productivity of data users (including analysts, machine learning developers, data scientists, and data engineers), but it also has a direct impact on end products that rely on a quality machine learning (ML) pipeline. Additionally, the trend towards adopting or building ML platforms naturally raises the question: what is your method for internal discovery of ML features, models, metrics, datasets, etc.?

In this blog post, we will share the journey of open sourcing DataHub, our metadata search and discovery platform, starting with the project’s early days as WhereHows. LinkedIn maintains an in-house version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, then discuss the early approaches to open sourcing WhereHows and compare our internal (production) version of DataHub with the version on GitHub. We’ll also share details about our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we’ll provide instructions on how to get started with the open source DataHub and briefly discuss its architecture.


Strata 2019 recording: https://bit.ly/36P25GY
