Posts

Showing posts from March, 2020

ETL and How it Changed Over Time

Modern data and its usage have changed drastically compared to a decade ago, and traditional ETL processes leave a gap when it comes to processing this data. Some of the main reasons for this are: Modern data processes often include real-time streaming data, and organizations need real-time insight into their processes. Systems need to perform ETL on data streams without resorting to batch processing, and they should scale to handle high data rates. Single-server databases are increasingly replaced by distributed data platforms (e.g., Cassandra, MongoDB, Elasticsearch, SaaS apps), message brokers (e.g., Kafka, ActiveMQ), and several other types of endpoints, so the system should be able to plug in additional sources or sinks on the go in a manageable way. Repeated data processing caused by ad hoc architecture has to be eliminated. Change data capture technologies used with traditional ETL have to be integrated to…

Building a Large-scale Distributed Storage System Based on Raft

In recent years, building a large-scale distributed storage system has become a hot topic. Distributed consensus algorithms like Paxos and Raft are the focus of many technical articles, but those articles tend to be introductory, describing the basics of the algorithm and log replication; they seldom cover how to build a large-scale distributed storage system on top of a distributed consensus algorithm. Since April 2015, we at PingCAP have been building TiKV, a large-scale open source distributed database based on Raft. It is the core storage component of TiDB, an open source distributed NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Earlier in 2019, we conducted an official Jepsen test on TiDB, and the Jepsen test report was published in June 2019. In July of the same year, we announced that TiDB 3.0 had reached general availability, delivering stability at scale and a performance boost. In this article, I’d like to share some of our firsthand exp…

Data Discovery for Data Scientists at Spotify

Diagnosing the problem: In 2016, as we started migrating to the Google Cloud Platform, we saw an explosion of dataset creation in BigQuery. At the same time, we drastically increased our hiring of insights specialists (data scientists, analysts, user researchers, etc.) at Spotify, resulting in more research and insights being produced across the company. However, research would often have only a localized impact in certain parts of the business, going unseen by others who might find it useful for their decision making. Datasets lacked clear ownership and documentation, making it difficult for data scientists to find them. We believed the crux of the problem was that we lacked a centralized catalog of these data and insights resources. In early 2017, we released Lexikon, a library for data and insights, as the solution to this problem. The first release allowed users to search and browse available BigQuery tables (i.e., datasets), as well as discover k…