Posts

Showing posts from 2015

Who Coined "Cloud Computing"?

Cloud Computing as a term is not very well defined but is often associated with multi-tenant datacentres, commodity computers connected with an IP network, elastic/on-demand resource allocation, and metered billing. Antonio Regalado, a senior editor for biomedicine for MIT Technology Review decided to find out where it all began, https://www.technologyreview.com/s/425970/who-coined-cloud-computing/

Phi Accrual Failure Detector

Rather than using configured constant timeouts, systems can continually measure response times and their variability, and automatically adjust timeouts according to the observed response time distribution. This can be done with a Phi Accrual failure detector, which is used by Cassandra and Akka. http://ternarysearch.blogspot.ca/2013/08/phi-accrual-failure-detector.html

Building Reliable Systems from Unreliable Components

Sounds like Hadoop, right? 2006? No! The idea is over 60 years old. 1956 - J. von Neumann. Probabilistic logics and synthesis of reliable organisms from unreliable components: https://ece.uwaterloo.ca/~ssundara/courses/prob_logics.pdf

Not ideal case for MongoDB

Great article by Sarah Mei explaining when and how using a document model can lead to complex application code and bad performance. " When you’re picking a data store, the most important thing to understand is where in your data — and where in its connections — the business value lies. If you don’t know yet, which is perfectly reasonable, then choose something that won’t paint you into a corner. Pushing arbitrary JSON into your database sounds flexible, but true flexibility is easily adding the features your business needs. " Why You Should Never Use MongoDB: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Time on multi-core, multi-socket servers

In Distributed Computing the notion of "when-ness" is fundamental; Lamport's " Time, Clocks, and the. Ordering of Events in a Distributed System"  paper is considered one of the foundational pieces of work. But what about locally? The conclusion is that except for the special case of using nanoTime() in micro benchmarks, you may as well stick to currentTimeMillis() —knowing that it may sporadically jump forwards or backwards. Because if you switched to nanoTime(), you don't get any monotonicity guarantees, it doesn't relate to human time any more —and may be more likely to lead you into writing code which assumes a fast call with consistent, monotonic results. Details: http://steveloughran.blogspot.ca/2015/09/time-on-multi-core-multi-socket-servers.html 

HDFS Erasure Coding in Apache Hadoop

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space. Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel ...

3D Computer Chips

Image
A new method of designing and building computer chips could lead to blisteringly quick processing at least 1,000 times faster than the best existing chips are capable of, researchers say. The new method, which relies on materials called carbon nanotubes, allows scientists to build the chip in three dimensions. The 3D design enables scientists to interweave memory, which stores data, and the number-crunching processors in the same tiny space, said Max Shulaker, one of the designers of the chip, and a doctoral candidate in electrical engineering at Stanford University in California. Details: https://www.livescience.com/52207-faster-3d-computer-chip.html

Debunking Myths About the VoltDB In-Memory Database

Myth #1: “VoltDB requires stored procedures.” This was true for 1.0, but no one seems to notice it’s been false since we shipped 1.1 in 2010. VoltDB supports unforeseen SQL without any stored procedure use. We have users in production who have never used a single stored procedure. Myth #2: “VoltDB doesn’t support ad-hoc SQL.” This is just a rephrasing of Myth #1 and is still false. Myth #3: “VoltDB is slow unless I use stored procedures.” Well, no. VoltDB can run faster with stored procedures, but it’s still fast if they are not used. In our internal benchmarks on pretty cheap single-socket hardware, we can run about 50k write statements per second, per host with full durability. Myth #4: “I have to know Java to use VoltDB.” As of VoltDB 3.0, released over a year ago, (we’re on V4.2 today), a user can build VoltDB apps and run the server without ever directly interacting with the Java CLI tools or any Java code. Myth #5: “VoltDB has garbage collection problems because it is wri...

Last-Write-Wins conflict resolution in Cassandra

This approach is widely used in both multi-leader replication and leaderless databases such as Cassandra. Details: https://aphyr.com/posts/294-jepsen-cassandra

Weak transaction isolation

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They may cause customer data to be corrupted. Many popular relational databases, which are usually considered ACID use weak isolation, so they would not necessarily have prevented these bugs from occurring. What exactly the Isolation guarantee in the SQL standard means based on what they call “read phenomena”. There are three types of phenomena: Dirty reads – If another transaction writes, but does not commit, during your transaction, is it possible that you will see their data? Non-repeatable reads – If you read the same row twice, is it possible that you might get different data the second time? Phantom reads – If you read a collection of rows twice, is it possible that different rows will be returned the second time? In the SQL standard, there are four levels of transactional isolation based on which of these phenomena they prevent (from weakest to strongest): Read Uncommitted – A tra...

Data loss in replicated systems

Image
What happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration? Here is the problem that losing disk state really induces in an ensemble of ZooKeeper servers: https://fpj.me/2015/05/28/dude-wheres-my-metadata/

To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data, if you have no idea what lies under the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write. What is Schema on Write? Schema on write has been the standard for many years in relational databases. Before any data is written in the database, the structure of that data is strictly defined, and that metadata stored and tracked. Irrelevant data is discarded, data types, lengths and...

Apache Zookeeper's poison packet

Image
The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don’t. Four different bugs resulting in random cluster-wide lockups. Two of those bugs laid in ZooKeeper, and the other two were lurking in the Linux kernel. This is the story. https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/

Are SSDs that solid?

Here’s the run down of the firmware issues Etsy had over 5 or so years. Intel and Samsung turned out to be the most reliable. https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/    The issue raised by Algolia is due to a Linux kernel error https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

Open-Source Service Discovery tools

Service discovery is a key component of most distributed systems and service oriented architectures. The problem seems simple at first: How do clients determine the IP and port for a service that exist on multiple hosts? This problem has been addressed in many different ways and is continuing to evolve. Let's look at some open-source or openly-discussed solutions to this problem to understand how they work. Specifically, let's look at how each solution uses strong or weakly consistent storage, runtime dependencies, client integration options and what the tradeoffs of those features might be.  We’ll start with some strongly consistent projects such as Zookeeper , Doozer and Etcd which are typically used as coordination services but are also used for service registries as well. We’ll then look at some interesting solutions specifically designed for service registration and discovery. We’ll examine Airbnb’s SmartStack , Netflix’s Eureka , Bitly’s NSQ , Serf , Spotify and DNS an...

Immutability, MVCC, and Garbage Collection

Interesting article about Datomic and its immutability with regard to MVCC databases. https://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/

Apache Flink's Engine

Joins are prevalent operations in many data processing applications. Most data processing systems feature APIs that make joining data sets very easy. However, the internal algorithms for join processing are much more involved – especially if large data sets need to be efficiently handled. Therefore, join processing serves as a good example to discuss the salient design points and implementation details of a data processing system. In this blog post, we cut through Apache Flink’s layered architecture and take a look at its internals with a focus on how it handles joins. Specifically, I will: show how easy it is to join data sets using Flink’s fluent APIs, discuss basic distributed join strategies, Flink’s join implementations, and its memory management, talk about Flink’s optimizer that automatically chooses join strategies, show some performance numbers for joining data sets of different sizes, and finally briefly discuss joining of co-located and pre-sorted data sets. Details:...

HIVE 0.14 Cost Based Optimizer (CBO)

Analysts and data scientists⎯not to mention business executives⎯want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system. The good news is that Apache™ Hive 0.14—the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets—is now powered by Apache Calcite. Calcite is an open source, enterprise-grade Cost-Based Logical Optimizer (CBO) and query execution framework. The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. Calcite has an efficient plan pruner that can select the cheapest query plan. All SQL queries are converted by Hive to a physical operator tree, optimized and converted to Tez/MapReduce jobs, then executed ...

Adopting Microservices at Netflix: Lessons for Architectural Design

In some recent blog posts, we’ve explained why we believe it’s crucial to adopt a four‑tier application architecture in which applications are developed and deployed as sets of microservices . It’s becoming increasingly clear that if you keep using development processes and application architectures that worked just fine ten years ago, you simply can’t move fast enough to capture and hold the interest of mobile users who can choose from an ever‑growing number of apps. Switching to a microservices architecture creates exciting opportunities in the marketplace for companies. For system architects and developers, it promises an unprecedented level of control and speed as they deliver innovative new web experiences to customers. But at such a breathless pace, it can feel like there’s not a lot of room for error. In the real world, you can’t stop developing and deploying your apps as you retool the processes for doing so. You know that your future success depends on transitio...

8 ways to replace HDFS

Hadoop is on its way to becoming the de facto platform for the next-generation of data-based applications, but it’s not without flaws. Ironically, one of Hadoop’s biggest shortcomings now is also one of its biggest strengths going forward — the Hadoop Distributed File System. Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it’s probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility as storage system even for non-MapReduce applications. But if the growing number of options for replacing HDFS signifies anything, it’s that HDFS isn’t quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren’t keen of its direct-attached storage (DAS) architecture. Concerns arou...

HBase Bulk Loading

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly. This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples: https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/

Apache Flink: API, runtime, and project roadmap

Detailed presentation on Apache Flink: https://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap