Posts

Showing posts with the label Cassandra

Building and Scaling Data Lineage at Netflix

Netflix Data Landscape
Freedom & Responsibility (F&R) is the lynchpin of Netflix’s culture, empowering teams to move fast, deliver on innovation, and operate with the freedom to satisfy their mission. Central engineering teams provide paved paths (secure, vetted and supported options) and guard rails that reduce variance in the choice of tools and technologies and support the development of scalable technical architectures. Nonetheless, Netflix’s data landscape (see below) is complex, and many teams collaborate to share responsibility for managing our data systems. Therefore, building a complete and accurate data lineage system to map out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models) is a monumental task and requires a scalable architecture, robust design, a strong engineering team and above all, amazing cross-f...

What Open Source Software Do You Use?

To gather insights on the current and future state of open source software (OSS), we talked to 31 executives. This is nearly double the number we usually speak to for a research guide, and we believe it reiterates the popularity of, acceptance of, and demand for OSS. We began by asking, "What open source software do you use?" As you would expect, most respondents use several open source projects. Here's what they told us:
Apache
Apache Cassandra, Elassandra (ElasticSearch + Cassandra), Spark, and Kafka (as the core tech we provide through our managed service) are the big ones for us. We find that the governance arrangements and independence of the Apache Foundation make a great foundation for strong open source projects.
95% of what we do with big data is open source. We use Apache Hadoop and contribute back to grow skills and expertise.
We use so much that it would be impossible to list. The core of our software is based on Apache So...

Synchronizing Clocks In a Cassandra Cluster

The Problem (part 1)
Cassandra is a highly distributable NoSQL database with tunable consistency. What makes it highly distributable also makes it, in part, vulnerable: the whole deployment must run on synchronized clocks. Given how crucial this is, it is surprising that the literature does not cover it sufficiently. Where it is covered, the advice is simply to install an NTP daemon on each node, which, if followed blindly, leads to really bad consequences. You will find blog posts by users who got burned by clock drift. The first installment of this two-part series covers how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. Details: https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/
Solutions (part 2)
Some disadvantages of off-the-shelf NTP installations, and how to overcome them. Details: https://blog.rapid7.com/2014/03/17/synchronizing-clocks-in-a-cassandra-cluster...

New token allocation algorithm in Cassandra 3.0

The central idea of the algorithm is to generate candidate tokens, and figure out what would be the effect of adding each of them to the ring as part of the new node. The new token will become primary for part of the range of the next one in the ring, but it will also affect the replication of preceding ones. The algorithm is able to quickly assess the effects thanks to some observations which lead to a simplified but equivalent version of the replication topology: Replication is defined per datacentre and replicas for data for this datacentre are only picked from local nodes. That is, no matter how we change nodes in other datacentres, this cannot affect what replicates where in the local one. Therefore in analysing the effects of adding a new token to the ring, we can work with a local version of the ring that only contains the tokens belonging to local nodes. If there are no defined racks (or the datacentre is a single rack), data must be replicated in distinct nodes. If racks ...
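The candidate-and-evaluate idea can be sketched in a few lines of Python. This is a heavily simplified illustration, not Cassandra's implementation: it assumes a single datacentre, ignores replication and racks entirely (only primary ownership is scored), uses range midpoints as candidates, and scores each candidate by the standard deviation of per-node ownership. The names `RING_SIZE`, `ownership`, `candidate_tokens` and `pick_token` are hypothetical.

```python
from statistics import pstdev

RING_SIZE = 2**64  # assumed size of the token space, treated as a ring


def range_size(prev_token, token):
    """Size of the range (prev_token, token], wrapping around the ring."""
    return (token - prev_token) % RING_SIZE or RING_SIZE


def ownership(ring):
    """Return {node: fraction of the ring its tokens are primary for}.

    `ring` maps token -> node. Each token is primary for the range that
    starts after the previous token and ends at the token itself.
    """
    tokens = sorted(ring)
    owned = {}
    for i, token in enumerate(tokens):
        prev = tokens[i - 1]  # index -1 wraps to the last token
        share = range_size(prev, token) / RING_SIZE
        owned[ring[token]] = owned.get(ring[token], 0.0) + share
    return owned


def candidate_tokens(ring):
    """Generate one candidate per existing range: its midpoint (an assumption)."""
    tokens = sorted(ring)
    return [
        (tokens[i - 1] + range_size(tokens[i - 1], token) // 2) % RING_SIZE
        for i, token in enumerate(tokens)
    ]


def pick_token(ring, new_node):
    """Try each candidate on a copy of the ring and keep the most balanced one."""
    def imbalance(candidate):
        trial = dict(ring)
        trial[candidate] = new_node
        return pstdev(list(ownership(trial).values()))

    return min(candidate_tokens(ring), key=imbalance)


ring = {0: "a", 2**62: "b"}    # node "a" owns 75% of the ring, "b" owns 25%
token = pick_token(ring, "c")  # splits the large range owned by "a"
```

In this toy ring the chosen token lands in the middle of the oversized range, so the new node takes load from the most heavily loaded node. A faithful version would also account for how the new token changes replica placement for preceding ranges, which is the part the observations above are meant to make cheap.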

Last-Write-Wins conflict resolution in Cassandra

This approach is widely used in both multi-leader replication and leaderless databases such as Cassandra. Details: https://aphyr.com/posts/294-jepsen-cassandra
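As a minimal sketch of the idea, assuming each write carries a timestamp taken from the writer's clock: when two replicas hold different versions of a value, the version with the higher timestamp is kept. The `Write` class and `resolve` function are hypothetical names, and the tie-breaking rule (compare the values) is an assumption chosen to make the merge deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: int  # e.g. microseconds, taken from the writer's clock
    value: str


def resolve(a: Write, b: Write) -> Write:
    """Last-write-wins: keep the write with the higher timestamp.

    Ties are broken by comparing values (an assumption), so every
    replica converges on the same winner regardless of merge order.
    """
    return max(a, b, key=lambda w: (w.timestamp, w.value))


# Normal case: the later write wins.
winner = resolve(Write(1000, "old"), Write(2000, "new"))

# Clock skew: "second" was written later in real time, but by a node
# whose clock runs behind, so it carries a lower timestamp and is dropped.
skewed = resolve(Write(2000, "first"), Write(1500, "second"))
```

The second case shows why this scheme depends on synchronized clocks: the resolution is only as good as the timestamps, and a write from a node with a slow clock can be silently discarded.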

8 ways to replace HDFS

Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it’s not without flaws. Ironically, one of Hadoop’s biggest shortcomings now is also one of its biggest strengths going forward: the Hadoop Distributed File System. Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it’s probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility as a storage system even for non-MapReduce applications. But if the growing number of options for replacing HDFS signifies anything, it’s that HDFS isn’t quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren’t keen on its direct-attached storage (DAS) architecture. Concerns arou...