Posts

Showing posts with the label Distributed Systems

Building a Large-scale Distributed Storage System Based on Raft

In recent years, building a large-scale distributed storage system has become a hot topic. Distributed consensus algorithms like Paxos and Raft are the focus of many technical articles, but those articles tend to be introductory, describing the basics of the algorithm and log replication. They seldom cover how to build a large-scale distributed storage system based on a distributed consensus algorithm. Since April 2015, we at PingCAP have been building TiKV, a large-scale open source distributed database based on Raft. It’s the core storage component of TiDB, an open source distributed NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Earlier in 2019, we conducted an official Jepsen test on TiDB, and the Jepsen test report was published in June 2019. In July the same year, we announced that TiDB 3.0 had reached general availability, delivering stability at scale and a performance boost. In this article, I’d like to share some of our firs...

Operating a Large, Distributed System in a Reliable Way

"The article is the collection of the practices I've found useful to reliably operate a large system at Uber, while working here. My experience is not unique - people working on similar sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of running on own data centers (like Uber mostly does) or on the cloud (where Uber sometimes scales to). However, the practices might be an overkill for smaller or less mission-critical systems." There's much ground to cover: Monitoring Oncall, Anomaly Detection & Alerting Outages & Incident Management Processes Postmortems, Incident Reviews & a Culture of Ongoing Improvements Failover Drills, Capacity Planning & Blackbox Testing SLOs, SLAs & Reporting on Them SRE as an Independent Team Reliability as an...

Distributed SQL System Review: Snowflake vs Splice Machine

After many years of Big Data, NoSQL, and Schema-on-Read detours, there is a clear return to SQL as the lingua franca for data operations. Developers need the comprehensive expressiveness that SQL provides. A world without SQL ignores more than 40 years of database research and results in hard-coded spaghetti code in applications to handle functionality that SQL handles extremely efficiently, such as joins, groupings, aggregations, and (most importantly) rollback when updates go wrong. Luckily, there is a modern architecture for SQL called Distributed SQL that no longer suffers from the challenges of traditional SQL systems (cost, scalability, performance, elasticity, and schema flexibility). The key attribute of Distributed SQL is that data is stored across many distributed storage locations and computation takes place across a cluster of networked servers. This yields unprecedented performance and scalability because it distributes work to each worker node in the cluster in parall...

The big interview with Martin Kleppmann: “Figuring out the future of distributed data systems”

Dr. Martin Kleppmann is a researcher in distributed systems at the University of Cambridge and the author of the highly acclaimed “Designing Data-Intensive Applications” (O’Reilly Media, 2017). Kevin Scott, CTO at Microsoft, once said: “This book should be required reading for software engineers. Designing Data-Intensive Applications is a rare resource that connects theory and practice to help developers make smart decisions as they design and implement data infrastructure and systems.” Martin’s main research interests include collaboration software, CRDTs, and formal verification of distributed algorithms. Previously he was a software engineer and an entrepreneur at several Internet companies, including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. Vadim Tsesko (@incubos) is a lead software engineer at Odnoklassniki who works on the Core Platform team. Vadim’s scientific and engineering interests include distributed systems, data warehouses, and verification o...
Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more. Useful articles: The AI Hierarchy of Needs; The Rise of the Data Engineer; The Downfall of the Data Engineer; A Beginner’s Guide to Data Engineering (Parts I, II, and III); Functional Data Engineering — a modern paradigm for batch data processing; How to become a Data Engineer (in Russian). Talks: Data Engineering Principles - Build Frameworks Not Pipelines by Gatis Seja; Functional Data Engineering - A Set of Best Practices by Maxime Beauchemin; Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin; Creating a Data Engineering Culture by Jesse Anderson. Algorithms & Data Structures: Algorithmic Toolbox (in Russian); Data Structures (in Russian); Data Structures & Algorithms Specialization on Coursera; Algorithms Specialization from Stanford on Coursera. SQL: Com...

Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (Martin Fowler)

Many enterprises are investing in their next-generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift away from the centralized paradigm of a lake, or its predecessor, the data warehouse. We need to shift to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product. Becoming a data-driven organization remains one of the top strategic goals of many companies I work with. My clients are well aware of the benefits of becoming intelligently empowered: providing the best customer experience based on data and hyper-personalization; reducing operational costs and time through data-driven optimi...

Serverless Computing: One Step Forward, Two Steps Back

Serverless computing offers the potential to program the cloud in an autoscaling, pay-as-you-go manner. In this paper we address critical gaps in first-generation serverless computing, which place its autoscaling potential at odds with dominant trends in modern computing: notably data-centric and distributed computing, but also open source and custom hardware. Put together, these gaps make current serverless offerings a bad fit for cloud innovation and particularly bad for data systems innovation. In addition to pinpointing some of the main shortfalls of current serverless architectures, we raise a set of challenges we believe must be met to unlock the radical potential that the cloud, with its exabytes of storage and millions of cores, should offer to innovative developers. Read full article >>>

Confluo: Millisecond-level Queries on Large-scale Live Data

Confluo is a system for real-time distributed analysis of multiple data streams. Confluo simultaneously supports high-throughput concurrent writes, online queries at millisecond timescales, and CPU-efficient ad-hoc queries via a combination of data structures carefully designed for the specialized case of multiple data streams and an end-to-end optimized system design. We are excited to release Confluo as an open-source C++ project, comprising: Confluo’s data structure library, which supports high-throughput ingestion of logs along with a wide range of online (live aggregates, conditional trigger executions, etc.) and offline (ad-hoc filters, aggregates, etc.) queries; and a Confluo server implementation, which encapsulates the data structures and exposes their operations via an RPC interface, along with client libraries in C++, Java, and Python. We have evaluated Confluo for several different application scenarios, including: A network monitoring and diagnosis framewor...

Ray: Application-level scheduling with custom resources

Ray intends to be a universal framework for a wide range of machine learning applications. This includes distributed training, machine learning inference, data processing, latency-sensitive applications, and throughput-oriented applications. Each of these applications has different, and at times conflicting, requirements for resource management. Ray intends to cater to all of them as the newly emerging microkernel for distributed machine learning. In order to achieve that kind of generality, Ray enables explicit developer control over task and actor placement by using custom resources. In this blog post we are going to talk about use cases and provide examples. This article is intended for readers already familiar with Ray. If you are new to Ray and are looking to easily and elegantly parallelize your Python code, please take a look at this tutorial. USE CASES: Load Balancing. In many cases, the preferred behavior is to distribute tasks across all...
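To make the custom resource mechanism concrete, here is a minimal sketch (not taken from the post) of constraining task placement with a user-defined resource; the resource name "special_hardware" and the quantities are illustrative assumptions.

```python
# Minimal sketch: constraining task placement with a custom resource in Ray.
import ray

# Declare a custom resource on this node; in a cluster, each node's resources
# are declared in its own startup configuration instead.
ray.init(resources={"special_hardware": 2})

@ray.remote(resources={"special_hardware": 1})
def constrained_task():
    # Scheduled only on nodes that advertise "special_hardware"; each running
    # copy holds one unit, so at most two copies run concurrently here.
    return "ran on a node with special_hardware"

print(ray.get([constrained_task.remote() for _ in range(4)]))
```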

Dremio 2.1 ships with many new features!

This is a major release that includes many new features, performance improvements, and hundreds of stability enhancements - see the highlights and more details below. • Elasticsearch 6. Dremio now supports the latest versions of Elasticsearch. Enjoy full SQL support, including JOINs, window functions, and accelerated analytics through any BI tool, including Tableau and Power BI. We also added support for compressing Elasticsearch responses to minimize network traffic. • Approximate count distinct acceleration. Dremio now supports accelerating count distinct queries with an approximation-based algorithm (HyperLogLog). This provides a faster and more memory-efficient way of computing distinct counts and is especially useful in high-cardinality scenarios with very large datasets. • Faster ORC performance. Data encoded in ORC is now significantly faster to access and more memory efficient for ORC managed in Hive sources. • Support for AWS GovClou...
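For context on the approximate count distinct feature, here is a generic HyperLogLog sketch in Python; it only illustrates the algorithm named above and is not Dremio's implementation (the register count, hash function, and bias constant are illustrative choices).

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                                   # index bits
        self.m = 1 << p                              # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction for large m

    def add(self, item):
        # 64-bit hash; the first p bits pick a register, the rest estimate rarity.
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)
        rest = x & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic mean of per-register estimates (range corrections omitted).
        return int(self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers))

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())   # roughly 100,000, typically within a few percent
```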

Translytical Data Platforms

Analytics at the speed of transactions has become an important agenda item for organizations. Translytical data platforms, an emerging technology, deliver faster access to business data to support various workloads and use cases. EA pros can use them to drive new business initiatives. Forrester identified the 12 most significant translytical vendors — Aerospike, DataStax, GigaSpaces, IBM, MemSQL, Microsoft, NuoDB, Oracle, Redis Labs, SAP, Splice Machine, and VoltDB — and researched, analyzed, and scored them against 25 criteria. Details >> (the link is provided by DataStax here). The Forrester Wave™: Translytical Data Platforms, Q4 2017

Unreliable clocks

Why time-of-day clocks are unsuitable for measuring elapsed time: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns
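A small illustration (not from the linked post) of the distinction in Python: time.time() follows the wall clock, which NTP adjustments or a leap second can step backwards, while time.monotonic() only ever moves forward and is the right tool for elapsed time.

```python
import time

start_wall = time.time()        # wall-clock time, subject to clock steps
start_mono = time.monotonic()   # monotonic clock, never goes backwards

time.sleep(0.1)                 # stand-in for the work being timed

elapsed_wall = time.time() - start_wall        # can be negative or wildly off if the clock is stepped
elapsed_mono = time.monotonic() - start_mono   # safe for measuring durations
print(elapsed_wall, elapsed_mono)
```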

Synchronizing Clocks In a Cassandra Cluster

The Problem (part 1): Cassandra is a highly distributable NoSQL database with tunable consistency. What makes it highly distributable also makes it, in part, vulnerable: the whole deployment must run on synchronized clocks. It’s quite surprising that, given how crucial this is, it is not covered sufficiently in the literature. And if it is, it simply refers to installing an NTP daemon on each node, which – if followed blindly – leads to really bad consequences. You will find blog posts by users who got burned by clock drift. The first installment of this two-part series covers how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. Details: https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/ Solutions (part 2): Some disadvantages of off-the-shelf NTP installations, and how to overcome them. Details: https://blog.rapid7.com/2014/03/17/synchronizing-clocks-in-a-cassandra-cluster...

Benchmarking and Latency

An article by Tyler Treat (bravenewgeek.com) explaining why you should be very conscious of your monitoring and benchmarking tools and the data they report. HdrHistogram is a tool which allows you to capture latency measurements while retaining high resolution. It also includes facilities for correcting coordinated omission and plotting latency distributions. The original version of HdrHistogram was written in Java, but there are versions for many other languages. Details: https://bravenewgeek.com/2015/12/
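As a minimal sketch, assuming the Python port of HdrHistogram (the hdrhistogram package); the value range, unit, and percentiles below are illustrative.

```python
import random
from hdrh.histogram import HdrHistogram

# Track latencies from 1 microsecond to 1 hour with 3 significant digits.
histogram = HdrHistogram(1, 60 * 60 * 1_000_000, 3)

for _ in range(100_000):
    latency_us = random.randint(100, 5_000)   # stand-in for measured latencies
    histogram.record_value(latency_us)

for pct in (50, 90, 99, 99.9):
    print(f"p{pct}: {histogram.get_value_at_percentile(pct)} us")
```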

The evolution of cluster scheduler architectures

Cluster schedulers are an important component of modern infrastructure, and have evolved significantly in the last few years. Their architecture has moved from monolithic designs to much more flexible, disaggregated and distributed designs. However, many current open-source offerings are either still monolithic, or otherwise lack key features. These features matter to real-world users, as they are required to achieve good utilization. Scheduling is an important topic because it directly affects the cost of operating a cluster: a poor scheduler results in low utilization, which costs money as expensive machines are left idle. High utilization, however, is not sufficient on its own: antagonistic workloads interfere with other workloads unless the decisions are made carefully. Details: http://www.firmament.io/blog/scheduler-architectures.html