
Showing posts with the label performance

Swarm64: Open source PostgreSQL on steroids

PostgreSQL is a big deal. The most common SQL open source database that you have never heard of, as ZDNet's own Tony Baer called it. Besides being the framework on which a number of commercial offerings were built, PostgreSQL has a user base of its own. According to DB-Engines, PostgreSQL is the 4th most popular database in the world. Swarm64, on the other hand, is a small vendor. So small, actually, that we shared the stage with CEO Thomas Richter at a local Berlin meetup a few years back. Back then, Richter was not CEO, and Swarm64 was even smaller. But its value proposition sounded attractive even then: boost PostgreSQL's performance for free. Swarm64 is an acceleration layer for PostgreSQL. There's no such thing as a free lunch of course, so the "for free" part is a figure of speech. Swarm64 is a commercial vendor. Until recently, however, the real gotcha was hardware: Swarm64 Database Acceleration (DA) required a specialized chip called an FPGA to be able ...

Distributed SQL System Review: Snowflake vs Splice Machine

After many years of Big Data, NoSQL, and Schema-on-Read detours, there is a clear return to SQL as the lingua franca for data operations. Developers need the comprehensive expressiveness that SQL provides. A world without SQL ignores more than 40 years of database research and results in hard-coded spaghetti code in applications to handle functionality that SQL handles extremely efficiently, such as joins, groupings, aggregations, and (most importantly) rollback when updates go wrong. Luckily, there is a modern architecture for SQL called Distributed SQL that no longer suffers from the challenges of traditional SQL systems (cost, scalability, performance, elasticity, and schema flexibility). The key attribute of Distributed SQL is that data is stored across many distributed storage locations and computation takes place across a cluster of networked servers. This yields unprecedented performance and scalability because it distributes work on each worker node in the cluster in parall...
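
To make the point concrete, here is a minimal sketch, using Python's built-in sqlite3 module as a stand-in for any SQL engine, of the three things called out above: a join, a grouping with aggregation, and a rollback when an update goes wrong. The table and column names are made up for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 25.0), (3, 2, 5.0);
    """)

    # A join, a grouping, and an aggregation in one declarative statement.
    query = """
        SELECT c.name, SUM(o.amount)
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """
    for name, total in conn.execute(query):
        print(name, total)

    # Rollback when an update goes wrong: the failed transaction leaves no trace.
    try:
        with conn:
            conn.execute("UPDATE orders SET amount = amount * 1.1")
            raise RuntimeError("something went wrong mid-transaction")
    except RuntimeError:
        pass
    print(conn.execute("SELECT SUM(amount) FROM orders").fetchone())  # still (40.0,)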

Understanding Apache Spark Failures and Bottlenecks

Apache Spark is a powerful open-source distributed computing framework for scalable and efficient analysis of big data on commodity compute clusters. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance while hiding the underlying complexities of distributed systems. Spark has seen a massive spike in adoption by enterprises across a wide swath of verticals, applications, and use cases. Spark provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components (write apps in R, Python, Scala, and Java) via unified high-level APIs. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. Spark is also replacing MapReduce as the processing engine component of Hadoop. Spark applications are easy to write and easy to understa...
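
As a small illustration of those unified high-level APIs, here is a PySpark sketch (the input file name is a placeholder): the same SparkSession drives both DataFrame transformations and plain SQL over the same data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Batch read, then a DataFrame aggregation ...
    df = spark.read.json("events.json")  # placeholder input path
    df.groupBy("user_id").agg(F.count("*").alias("events")).show()

    # ... and the same data queried with SQL through the same session.
    df.createOrReplaceTempView("events")
    spark.sql("SELECT user_id, COUNT(*) AS events FROM events GROUP BY user_id").show()

    spark.stop()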

Tuning Snowflake Performance Using the Query Cache

In terms of performance tuning in Snowflake, there are very few options available. However, it is worth understanding how the Snowflake architecture includes various levels of caching to help speed up your queries. This article provides an overview of the techniques used, and some best practice tips on how to maximise system performance using caching. Snowflake Database Architecture: Before starting, it’s worth considering the underlying Snowflake architecture and explaining when Snowflake caches data. The diagram below illustrates the overall architecture, which consists of three layers:
- Service layer: accepts SQL requests from users, coordinates queries, and manages transactions and results. Logically, this can be assumed to hold the result cache – a cached copy of the results of every query executed.
- Compute layer: actually does the heavy lifting. ...
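
As a rough illustration of the result cache in practice, here is a sketch assuming the snowflake-connector-python package, placeholder credentials, and a placeholder table: the same query runs twice, and with USE_CACHED_RESULT enabled the second run may be answered from the service-layer result cache instead of being re-executed on a warehouse.

    import time
    import snowflake.connector

    conn = snowflake.connector.connect(user="...", password="...", account="...")
    cur = conn.cursor()

    # Leave the result cache on (the default); set to FALSE when benchmarking raw compute.
    cur.execute("ALTER SESSION SET USE_CACHED_RESULT = TRUE")

    for attempt in ("first run", "second run"):
        start = time.monotonic()
        cur.execute("SELECT COUNT(*) FROM my_db.my_schema.my_table")  # placeholder table
        cur.fetchone()
        print(attempt, round(time.monotonic() - start, 3), "seconds")

    conn.close()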

Effective Spark DataFrames With Alluxio

Many organizations deploy Alluxio together with Spark for performance gains and data manageability benefits. Qunar recently deployed Alluxio in production, and their Spark streaming jobs sped up by 15x on average and up to 300x during peak times. They noticed that some Spark jobs would slow down or would not finish, but with Alluxio, those jobs could finish quickly. In this blog post, we investigate how Alluxio helps Spark be more effective. Alluxio increases performance of Spark jobs, helps Spark jobs perform more predictably, and enables multiple Spark jobs to share the same data from memory. Previously, we investigated how Alluxio is used for Spark RDDs. In this article, we investigate how to effectively use Spark DataFrames with Alluxio. Alluxio and Spark Cache: Storing Spark DataFrames in Alluxio memory is very simple; it only requires saving the DataFrame as a file to Alluxio, which takes a single call to the Spark DataFrame write API. DataFrames are commonly written as parquet fi...
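
A minimal PySpark sketch of that pattern, assuming an Alluxio master at alluxio-master:19998 (a placeholder) and the Alluxio client jar on Spark's classpath: the DataFrame is written to an alluxio:// path as Parquet, and any later job can read it back through Alluxio instead of recomputing it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-dataframes").getOrCreate()

    df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

    # Persist the DataFrame to Alluxio simply by writing it as a Parquet file.
    df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/data/df.parquet")

    # A separate job (or the same one) reads it back through Alluxio.
    df2 = spark.read.parquet("alluxio://alluxio-master:19998/data/df.parquet")
    print(df2.count())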

Actian VectorH architecture and Amazon Redshift benchmark papers

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. The paper describes the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimizat...

Machine Learning algorithms and libraries overview

A nice, brief overview of some machine learning algorithms, highlighting their strengths and weaknesses. It covers the big 3 machine learning tasks, which are by far the most common: regression, classification, and clustering (a short sketch of all three follows below). Details: https://elitedatascience.com/machine-learning-algorithms

Here are also some observations on the top five characteristics of ML libraries that developers should consider when deciding which library to use:

Programming paradigm
- Symbolic: Spark MLlib, MMLSpark, BigDL, CNTK, H2O.ai, Keras, Caffe2
- Imperative: scikit-learn, auto sklearn, TPOT, PyTorch
- Hybrid: MXNet, TensorFlow

Machine learning algorithms
- Supervised and unsupervised: Spark MLlib, scikit-learn, H2O.ai, MMLSpark, Mahout
- Deep learning: TensorFlow, PyTorch, Caffe2 (image), Keras, MXNet, CNTK, BigDL, MMLSpark (image and text), H2O.ai (via the deepwater plugin)
- Recommendation system: Spark MLlib, H2O.ai (via the sparkling-water plugin), Mah...
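
For the three tasks above, here is a tiny scikit-learn sketch on synthetic data; it is only meant to show the shape of each task, not a recommendation of these particular estimators.

    from sklearn.datasets import make_regression, make_classification, make_blobs
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.cluster import KMeans

    # Regression: predict a continuous target.
    X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
    print("regression R^2:", LinearRegression().fit(X, y).score(X, y))

    # Classification: predict a discrete label.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    print("classification accuracy:", LogisticRegression().fit(X, y).score(X, y))

    # Clustering: group unlabeled points.
    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    print("cluster labels:", KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)[:10])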

Tableau 10.5 with Hyper and server on Linux

Excited about the new Tableau 10.5, with Hyper added as a data engine and Tableau Server now available on Linux. New features: https://www.tableau.com/products/new-features Hyper: https://www.tableau.com/products/technology

HDFS scalability: the limits to growth

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now Senior Staff Software Engineer at LinkedIn) concerning the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...

Unreliable clocks

Why time-of-day clocks are unsuitable for measuring elapsed time: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns
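
A small Python illustration of the distinction: time.time() reads the wall clock, which NTP corrections or leap-second handling can step backwards, so a naive end-minus-start can even come out negative, while time.monotonic() is the clock meant for measuring elapsed time.

    import time

    start_wall = time.time()        # wall-clock time; can be stepped by NTP or leap seconds
    start_mono = time.monotonic()   # guaranteed never to go backwards

    sum(range(1_000_000))           # some work to time

    elapsed_wall = time.time() - start_wall        # may be wrong if the clock was adjusted
    elapsed_mono = time.monotonic() - start_mono   # safe for measuring durations
    print(elapsed_wall, elapsed_mono)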

Random Data Generator by Hortonworks

A useful tool for generating large amounts of random data for demos, exploring new tools, and performance benchmark testing. https://github.com/jgalilee/data Install the data package with go get github.com/jgalilee/data. There are two sub-packages included with this: the transactions package and the points package.

Benchmarking and Latency

An article by Tyler Treat (bravenewgeek.com) explaining why you should be very conscious of your monitoring and benchmarking tools and the data they report. HdrHistogram is a tool which allows you to capture latency measurements and retain high resolution. It also includes facilities for correcting coordinated omission and plotting latency distributions. The original version of HdrHistogram was written in Java, but there are versions for many other languages. Details: https://bravenewgeek.com/2015/12/
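
The sketch below is not HdrHistogram itself, just plain Python illustrating the coordinated omission problem the article discusses: if a benchmark issues requests on a fixed schedule but only records each request's own service time, one long stall hides the queueing delay of every request that should have been sent during it; correcting means measuring from the intended send time.

    interval = 0.01                # requests are scheduled every 10 ms
    service_times = [0.001] * 100  # most requests take 1 ms ...
    service_times[50] = 1.0        # ... but one stalls the client for a full second

    naive, corrected = [], []
    now = 0.0
    for i, service in enumerate(service_times):
        intended = i * interval            # when the request was supposed to start
        actual = max(now, intended)        # it cannot start before the client is free
        now = actual + service
        naive.append(service)              # what a coordinating benchmark records
        corrected.append(now - intended)   # latency as a steadily arriving client sees it

    for name, data in (("naive", naive), ("corrected", corrected)):
        data = sorted(data)
        print(name, "p50 =", round(data[50], 3), "p90 =", round(data[90], 3))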

HDFS Erasure Coding in Apache Hadoop

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space. Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel ...
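
Back-of-the-envelope arithmetic behind those numbers, using Reed-Solomon with 6 data and 3 parity cells (one common HDFS erasure coding policy) as the example scheme:

    logical_gb = 600                                          # user data to store

    replication_factor = 3
    replicated_raw = logical_gb * replication_factor          # 1800 GB on disk
    replication_overhead = replicated_raw / logical_gb - 1    # 2.0 -> 200% overhead

    data_units, parity_units = 6, 3
    ec_raw = logical_gb * (data_units + parity_units) / data_units   # 900 GB on disk
    ec_overhead = ec_raw / logical_gb - 1                            # 0.5 -> 50% overhead

    print(f"3x replication: {replicated_raw} GB raw, {replication_overhead:.0%} overhead")
    print(f"RS(6,3) erasure coding: {ec_raw:.0f} GB raw, {ec_overhead:.0%} overhead")
    print(f"storage saved vs replication: {1 - ec_raw / replicated_raw:.0%}")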

Debunking Myths About the VoltDB In-Memory Database

Myth #1: “VoltDB requires stored procedures.” This was true for 1.0, but no one seems to notice it’s been false since we shipped 1.1 in 2010. VoltDB supports unforeseen SQL without any stored procedure use. We have users in production who have never used a single stored procedure.
Myth #2: “VoltDB doesn’t support ad-hoc SQL.” This is just a rephrasing of Myth #1 and is still false.
Myth #3: “VoltDB is slow unless I use stored procedures.” Well, no. VoltDB can run faster with stored procedures, but it’s still fast if they are not used. In our internal benchmarks on pretty cheap single-socket hardware, we can run about 50k write statements per second, per host with full durability.
Myth #4: “I have to know Java to use VoltDB.” As of VoltDB 3.0, released over a year ago (we’re on V4.2 today), a user can build VoltDB apps and run the server without ever directly interacting with the Java CLI tools or any Java code.
Myth #5: “VoltDB has garbage collection problems because it is wri...

Are SSDs that solid?

Here’s a rundown of the firmware issues Etsy had over 5 or so years. Intel and Samsung turned out to be the most reliable. https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/ The issue raised by Algolia is due to a Linux kernel error: https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

Immutability, MVCC, and Garbage Collection

Interesting article about Datomic and its immutability with regard to MVCC databases. https://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/

Apache Flink's Engine

Joins are prevalent operations in many data processing applications. Most data processing systems feature APIs that make joining data sets very easy. However, the internal algorithms for join processing are much more involved – especially if large data sets need to be efficiently handled. Therefore, join processing serves as a good example to discuss the salient design points and implementation details of a data processing system. In this blog post, we cut through Apache Flink’s layered architecture and take a look at its internals with a focus on how it handles joins. Specifically, I will: show how easy it is to join data sets using Flink’s fluent APIs, discuss basic distributed join strategies, Flink’s join implementations, and its memory management, talk about Flink’s optimizer that automatically chooses join strategies, show some performance numbers for joining data sets of different sizes, and finally briefly discuss joining of co-located and pre-sorted data sets. Details:...
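
The snippet below is not Flink code, just a plain-Python sketch of the hash-join building block that distributed strategies such as broadcast-forward and repartition joins are built on: build a hash table on the (ideally smaller) side, then stream the other side against it.

    def hash_join(build_side, probe_side, build_key, probe_key):
        """Join two iterables of records (dicts) on the given key functions."""
        table = {}
        for row in build_side:                        # build phase
            table.setdefault(build_key(row), []).append(row)
        for row in probe_side:                        # probe phase
            for match in table.get(probe_key(row), []):
                yield {**match, **row}

    users = [{"user_id": 1, "name": "Alice"}, {"user_id": 2, "name": "Bob"}]
    clicks = [{"user_id": 1, "url": "/a"}, {"user_id": 1, "url": "/b"}, {"user_id": 2, "url": "/c"}]

    for joined in hash_join(users, clicks, lambda r: r["user_id"], lambda r: r["user_id"]):
        print(joined)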

HIVE 0.14 Cost Based Optimizer (CBO)

Analysts and data scientists, not to mention business executives, want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system. The good news is that Apache™ Hive 0.14—the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets—is now powered by Apache Calcite. Calcite is an open source, enterprise-grade Cost-Based Logical Optimizer (CBO) and query execution framework. The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. Calcite has an efficient plan pruner that can select the cheapest query plan. All SQL queries are converted by Hive to a physical operator tree, optimized and converted to Tez/MapReduce jobs, then executed ...

Adopting Microservices at Netflix: Lessons for Architectural Design

In some recent blog posts, we’ve explained why we believe it’s crucial to adopt a four‑tier application architecture in which applications are developed and deployed as sets of microservices. It’s becoming increasingly clear that if you keep using development processes and application architectures that worked just fine ten years ago, you simply can’t move fast enough to capture and hold the interest of mobile users who can choose from an ever‑growing number of apps. Switching to a microservices architecture creates exciting opportunities in the marketplace for companies. For system architects and developers, it promises an unprecedented level of control and speed as they deliver innovative new web experiences to customers. But at such a breathless pace, it can feel like there’s not a lot of room for error. In the real world, you can’t stop developing and deploying your apps as you retool the processes for doing so. You know that your future success depends on transitio...

8 ways to replace HDFS

Hadoop is on its way to becoming the de facto platform for the next-generation of data-based applications, but it’s not without flaws. Ironically, one of Hadoop’s biggest shortcomings now is also one of its biggest strengths going forward — the Hadoop Distributed File System. Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it’s probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility as a storage system even for non-MapReduce applications. But if the growing number of options for replacing HDFS signifies anything, it’s that HDFS isn’t quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren’t keen on its direct-attached storage (DAS) architecture. Concerns arou...