Showing posts with the label BigQuery

Data Discovery for Data Scientists at Spotify

Diagnosing the problem

In 2016, as we started migrating to the Google Cloud Platform, we saw an explosion of dataset creation in BigQuery. At the same time, we drastically increased our hiring of insights specialists (data scientists, analysts, user researchers, etc.) at Spotify, resulting in more research and insights being produced across the company. However, research would often have only a localized impact in certain parts of the business, going unseen by others who might have found it useful for their own decision-making. Datasets lacked clear ownership or documentation, making it difficult for data scientists to find them. We believed that the crux of the problem was that we lacked a centralized catalog of these data and insights resources. In early 2017, we released Lexikon, a library for data and insights, as the solution to this problem. The first release allowed users to search and browse available BigQuery tables (i.e. datasets), as well as discover k...

Beast: Moving Data from Kafka to BigQuery

To serve customers across 19+ products, GOJEK places a lot of emphasis on data. Our Data Warehouse, built by integrating data from multiple applications and sources, helps our team of data scientists, as well as business and product analysts, make solid, data-driven decisions. This post explains our open source solution for easy movement of data from Kafka to BigQuery.

Data Warehouse setup at GOJEK

We use Google BigQuery (BQ) as our Data Warehouse, which serves as a powerful tool for interactive analysis. This has proven extremely valuable for our use cases. Our approach to pushing data to our warehouse is to first push it to Kafka. We rely on multiple Kafka clusters to ingest relevant events across teams. A common approach to moving data from Kafka to BigQuery is to first push it to GCS, and then import it into BigQuery from GCS. While this solves the use case of running analytics on historical data, we also use BigQuery for near-real-time analytics & r...