Nemo: Data discovery at Facebook


Large-scale companies serve millions or even billions of people who depend on the services these companies provide for their everyday needs. To keep these services running and delivering meaningful experiences, the teams behind them need to find the most relevant and accurate information quickly so that they can make informed decisions and take action. Finding the right information can be hard for several reasons. The problem might be discovery — the relevant table might have an obscure or nondescript name, or different teams might have constructed overlapping data sets. Or, the problem could be one of confidence — the dashboard someone is looking at might have been superseded by another source six months ago. 

Many companies, such as Airbnb, Lyft, Netflix, and Uber, have built their own custom solutions for this challenge. For us, it was important to make the data discovery process simple and fast. Funneling everything through data experts to locate the necessary data each time we needed to make a decision was not scalable. So we built Nemo, an internal data discovery engine. Nemo allows engineers to quickly discover the information they need, with high confidence in the accuracy of the results.

We have more than a dozen different types of data artifacts, including Hive tables that store raw data, Scuba tables, dashboards, AI data sets, and Cubrick. Before Nemo, internal surveys indicated that finding the right data was a major pain point for data engineers. Nemo has dramatically improved that, increasing the data search success rate by more than 50 percent, even as the total number of artifacts has more than tripled and queries per second (QPS) has more than doubled.

