Data Discovery for Data Scientists at Spotify


Music for everyone - Spotify

Diagnosing the problem

In 2016, as we started migrating to the Google Cloud Platform, we saw an explosion of dataset creation in BigQuery. At this time, we also drastically increased our hiring of insights specialists (data scientists, analysts, user researchers, etc.) at Spotify, resulting in more research and insights being produced across the company. However, research would often only have a localized impact in certain parts of the business, going unseen by others that might find it useful to influence their decision making. Datasets lacked clear ownership or documentation making it difficult for data scientists to find them. We believed that the crux of the problem was that we lacked a centralized catalog of these data and insights resources.

In early 2017, we released Lexikon, a library for data and insights, as the solution to this problem. The first release allowed users to search and browse available BigQuery tables (i.e. datasets)— as well as discover knowledge generated through past research and analysis. The insights community at Spotify was quite excited to have this new tool and it quickly became one of the most widely used tools amongst data scientists, with ~75% of data scientists using it regularly, and ~550 monthly active users.

However, months after the initial launch, we surveyed the insights community and learned that data scientists still reported data discovery as a major pain point, reporting significant time spent on finding the right dataset. The typical data scientist at Spotify works with ~25-30 different datasets in a month. If data discovery is time-consuming, it significantly increases the time it takes to produce insights, which means either it might take longer to make a decision informed by those insights, or worse, we won’t have enough data and insights to inform a decision.
Our team decided to focus on this specific issue by iterating on Lexikon, with the goal to improve the data discovery experience for data scientists and ultimately accelerate insights production. We were able to significantly improve the data discovery experience by (1) gaining a better understanding of our users intent, (2) enabling knowledge exchange through people, and (3) helping users get started with a dataset they’ve discovered.

Continue reading >>>

Comments