Beast: Moving Data from Kafka to BigQuery


In order to serve customers across 19+ products, GOJEK places a lot of emphasis on data. Our Data Warehouse, built by integrating data from multiple applications and sources, helps our team of data scientists, as well as business and product analysts, make solid, data-driven decisions.


This post explains our open source solution for easy movement of data from Kafka to BigQuery.

Data Warehouse setup at GOJEK

We use Google BigQuery (BQ) as our Data Warehouse; it serves as a powerful tool for interactive analysis and has proven extremely valuable for our use cases.
Our approach to moving data into the warehouse is to first publish it to Kafka. We rely on multiple Kafka clusters to ingest relevant events across teams.
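To make the flow concrete, here is a minimal sketch of how a service might publish an event to one of these Kafka topics, using the kafka-python client. The broker address, topic name, and payload are hypothetical placeholders, not GOJEK's actual setup:

    import json
    from kafka import KafkaProducer

    # Hypothetical broker address; real deployments span multiple
    # clusters, as described above.
    producer = KafkaProducer(
        bootstrap_servers="kafka-broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Publish a single event; downstream pipelines consume it from the topic.
    producer.send("booking-log", {"order_id": "123", "status": "COMPLETED"})
    producer.flush()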

A common approach to moving data from Kafka to BigQuery is to first push it to GCS, and then import it into BigQuery from GCS. While this solves the use case of running analytics on historical data, we also use BigQuery for near-real-time analytics and reporting. This analysis in turn provides valuable insights for making the right business decisions within a short time frame.
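As an illustration of that two-hop batch path, the sketch below loads newline-delimited JSON files already staged in GCS into a BigQuery table using the google-cloud-bigquery client. The bucket, dataset, and table names are made up for the example:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Batch-load files staged in GCS into the warehouse table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://events-archive/booking-log/2019-01-01/*.json",  # hypothetical bucket
        "my-project.warehouse.booking_log",                   # hypothetical table
        job_config=job_config,
    )
    load_job.result()  # blocks until the load job completes

The latency of this path is bounded by how often such load jobs run, which is why it falls short for near-real-time reporting.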

The initial approach
Our original implementation used an individual code base for each Kafka topic to push its data to BQ.
This required a lot of maintenance to keep up with new topics, and with new fields being added to existing topics. Each such change required manual intervention by a dev/analyst to update the schema in both the code and the BQ table. We also witnessed incidents of data loss on a few occasions, which required manually loading data from GCS.
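To illustrate the manual step involved, this is roughly what adding one new field to a BQ table looks like with the google-cloud-bigquery client; the table and field names are hypothetical, and the same change also had to be mirrored in that topic's ingestion code:

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my-project.warehouse.booking_log")  # hypothetical table

    # Append the new column to the existing schema and patch the table.
    new_schema = list(table.schema)
    new_schema.append(bigquery.SchemaField("cancellation_reason", "STRING"))
    table.schema = new_schema
    client.update_table(table, ["schema"])

Repeating this for every new field, across every topic's code base, is what made the approach so costly to maintain.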

The need for a new solution
New topics are added almost every other day across the organisation's several Kafka clusters. Given that GOJEK has expanded its operations to several countries, managing the sprawl of individual scripts for each topic was a massive ordeal.
To deal with these scaling issues, we decided to write a new system from scratch, taking into account the learnings from our previous experience. Our solution was a system that could ingest all the data pushed to Kafka and write it to BigQuery.
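The sketch below shows the core of that idea in miniature (it is not Beast's actual implementation): a consumer that reads every message from a Kafka topic and streams it straight into BigQuery. The topic, broker, and table names are placeholders, and a production system would add batching, schema handling, and retries:

    import json
    from kafka import KafkaConsumer
    from google.cloud import bigquery

    client = bigquery.Client()
    consumer = KafkaConsumer(
        "booking-log",                           # hypothetical topic
        bootstrap_servers="kafka-broker:9092",   # hypothetical broker
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    # Stream each consumed record into the target table; insert errors
    # are returned per row rather than raised.
    for message in consumer:
        errors = client.insert_rows_json(
            "my-project.warehouse.booking_log", [message.value]
        )
        if errors:
            print("failed rows:", errors)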
