
Showing posts with the label HDFS

Deep dive into how Uber uses Spark

Apache Spark is a foundational piece of Uber’s Big Data infrastructure that powers many critical aspects of our business. We currently run more than one hundred thousand Spark applications per day, across multiple compute environments. Spark’s versatility, which allows us to build applications and run them everywhere we need, makes this scale possible. However, our ever-growing infrastructure means that these environments are constantly changing, making it increasingly difficult for both new and existing users to give their applications reliable access to data sources, compute resources, and supporting tools. Also, as the number of users grows, it becomes more challenging for the data team to communicate these environmental changes to users, and for us to understand exactly how Spark is being used. We built the Uber Spark Compute Service (uSCS) to help manage the complexities of running Spark at this scale. This Spark-as-a-service solution leverages Apache Livy, cu...
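The excerpt doesn't show uSCS internals, but since uSCS builds on Apache Livy, a minimal sketch of a programmatic Spark submission through Livy's REST API gives a feel for the building block involved. The Livy endpoint, jar path, class name, and Spark settings below are placeholders for illustration, not Uber's actual configuration.

```python
# Minimal sketch: submitting a Spark batch application through Apache Livy's REST API.
# All names below (host, jar, class, conf values) are hypothetical placeholders.
import requests

LIVY_URL = "http://livy.example.com:8998"  # hypothetical Livy endpoint

batch = {
    "file": "hdfs:///apps/example/my-spark-app.jar",  # application jar on HDFS (placeholder)
    "className": "com.example.MySparkApp",            # main class (placeholder)
    "args": ["--date", "2019-01-01"],
    "conf": {"spark.executor.memory": "4g", "spark.executor.instances": "10"},
}

# POST /batches creates a batch session; Livy forwards the application to the cluster.
resp = requests.post(f"{LIVY_URL}/batches", json=batch)
resp.raise_for_status()
batch_id = resp.json()["id"]

# GET /batches/{id}/state reports the application's current state (e.g. running, success).
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(f"batch {batch_id} state: {state}")
```

A service layer such as uSCS can sit in front of an endpoint like this and fill in environment-specific configuration on behalf of the user, which is the kind of complexity the post describes.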

Dremio 2.1 is shipped with many new features!

This is a major release that includes many new features, performance improvements, and hundreds of stability enhancements - see the highlights and more details below.

• Elasticsearch 6. Dremio now supports the latest versions of Elasticsearch. Enjoy full SQL support, including JOINs, window functions, and accelerated analytics through any BI tool, including Tableau and Power BI. We also added support for compressing Elasticsearch responses to minimize network traffic.
• Approximate count distinct acceleration. Dremio now supports accelerating count distinct queries based on an approximation-based algorithm (HyperLogLog). This provides a faster and more memory-efficient way of providing distinct counts and is especially useful in high-cardinality scenarios with very large datasets (a toy sketch of the idea follows below).
• Faster ORC performance. Data encoded in ORC is now significantly faster to access and more memory efficient for ORC managed in Hive sources.
• Support for AWS GovClou...
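As a toy illustration of the HyperLogLog idea behind approximate count distinct (not Dremio's implementation), the sketch below hashes each value, uses the first bits of the hash to pick a register, records the longest run of leading zeros seen in the remaining bits, and combines the registers with a harmonic mean:

```python
# Toy HyperLogLog sketch: illustrative only, omits the small/large-range corrections
# that production implementations apply.
import hashlib

class HyperLogLog:
    def __init__(self, p: int = 14):
        self.p = p                    # index bits
        self.m = 1 << p               # number of registers (16384 for p=14)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction for m >= 128

    @staticmethod
    def _hash64(value: str) -> int:
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def add(self, value: str) -> None:
        h = self._hash64(value)
        idx = h >> (64 - self.p)                      # first p bits choose the register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # harmonic mean of 2^register values, scaled by alpha * m^2
        return self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))   # roughly 1,000,000
```

With p = 14 the sketch uses 16,384 small registers (a few kilobytes) no matter how many distinct values it has seen, and the relative error is typically around 1%, which is why the approach works well for very high cardinalities.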

Actian VectorH architecture and Amazon Redshift benchmark papers

Actian Vector in Hadoop (VectorH for short) is a new SQL-on-Hadoop system built on top of the fast Vectorwise analytical database system. VectorH achieves fault tolerance and storage scalability by relying on HDFS, and extends the state-of-the-art in SQL-on-Hadoop systems by instrumenting the HDFS replication policy to optimize read locality. VectorH integrates with YARN for workload management, achieving a high degree of elasticity. Even though HDFS is an append-only filesystem, and VectorH supports (update-averse) ordered tables, trickle updates are possible thanks to Positional Delta Trees (PDTs), a differential update structure that can be queried efficiently. The paper describes the changes made to single-server Vectorwise to turn it into a Hadoop-based MPP system, encompassing workload management, parallel query optimizat...
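The paper's Positional Delta Trees are considerably more sophisticated, but the general idea of layering trickle updates over an append-only base can be shown with a toy sketch (purely illustrative, not the actual PDT structure): keep the base column immutable, and merge a small set of position-keyed deltas into it on the fly during a scan.

```python
# Toy illustration of merging positional deltas into an immutable base column scan.
# This is NOT VectorH's PDT implementation; it only shows why position-keyed deltas
# let an append-only store answer queries that reflect updates, deletes, and inserts.
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Delta:
    pos: int                   # base-column position the delta applies to
    kind: str                  # "update", "delete", or "insert_before"
    value: Optional[int] = None

def scan_with_deltas(base: list, deltas: list) -> Iterator:
    """Yield the logical column: immutable base data with deltas applied on the fly."""
    by_pos = {}
    for d in deltas:
        by_pos.setdefault(d.pos, []).append(d)

    for pos, value in enumerate(base):
        here = by_pos.get(pos, [])
        for d in here:                          # rows inserted before this position
            if d.kind == "insert_before":
                yield d.value
        if any(d.kind == "delete" for d in here):
            continue                            # row logically deleted
        updates = [d for d in here if d.kind == "update"]
        yield updates[-1].value if updates else value

base = [10, 20, 30, 40]                         # stored append-only, never rewritten
deltas = [Delta(1, "update", 21), Delta(2, "delete"), Delta(3, "insert_before", 35)]
print(list(scan_with_deltas(base, deltas)))     # [10, 21, 35, 40]
```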

Machine Learning algorithms and libraries overview

Nice brief overview of some Machine Learning algorithms highlighting their strengths and weaknesses. The big 3 machine learning tasks, which are by far the most common ones, are:

• Regression
• Classification
• Clustering

Details: https://elitedatascience.com/machine-learning-algorithms

Here are also some observations on the top five characteristics of ML libraries that developers should consider when deciding which library to use:

Programming paradigm
• Symbolic: Spark MLlib, MMLSpark, BigDL, CNTK, H2O.ai, Keras, Caffe2
• Imperative: scikit-learn, auto-sklearn, TPOT, PyTorch
• Hybrid: MXNet, TensorFlow

Machine learning algorithms
• Supervised and unsupervised: Spark MLlib, scikit-learn, H2O.ai, MMLSpark, Mahout
• Deep learning: TensorFlow, PyTorch, Caffe2 (image), Keras, MXNet, CNTK, BigDL, MMLSpark (image and text), H2O.ai (via the deepwater plugin)
• Recommendation system: Spark MLlib, H2O.ai (via the sparkling-water plugin), Mah...
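For readers new to the area, here is a minimal example of one of the big 3 tasks (classification) using scikit-learn, one of the libraries listed above; the dataset and model are arbitrary illustrative choices:

```python
# Minimal classification example with scikit-learn (illustrative choices throughout).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                        # small built-in toy dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)                 # simple linear classifier
model.fit(X_train, y_train)                               # supervised training

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```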

HDFS scalability: the limits to growth

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now Senior Staff Software Engineer at LinkedIn) concerning the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...

HDFS Erasure Coding in Apache Hadoop

HDFS by default replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios. It also eases scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. However, replication is expensive: the default 3x replication scheme incurs a 200% overhead in storage space and other resources (e.g., network bandwidth when writing the data). For datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of storage space. Therefore, a natural improvement is to use erasure coding (EC) in place of replication, which uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication. Motivated by this substantial cost saving opportunity, engineers from Cloudera and Intel ...
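The quoted overhead figures are easy to check with a quick back-of-the-envelope comparison; RS(6,3) and RS(10,4) below are the commonly discussed Reed-Solomon layouts for HDFS erasure coding, used here purely for illustration:

```python
# Back-of-the-envelope storage overhead: 3x replication vs. Reed-Solomon erasure coding.

def overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage as a fraction of the raw data size."""
    return redundant_units / data_units

schemes = {
    "3x replication": (1, 2),   # 1 data copy + 2 extra replicas
    "RS(6,3)":        (6, 3),   # 6 data blocks + 3 parity blocks
    "RS(10,4)":       (10, 4),  # 10 data blocks + 4 parity blocks
}

for name, (data, redundant) in schemes.items():
    print(f"{name:>15}: {overhead(data, redundant):.0%} overhead, "
          f"tolerates {redundant} lost block(s) per group")
# 3x replication: 200%, RS(6,3): 50%, RS(10,4): 40%,
# matching the ~200% vs ~50% figures quoted above.
```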

8 ways to replace HDFS

Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it’s not without flaws. Ironically, one of Hadoop’s biggest shortcomings now is also one of its biggest strengths going forward: the Hadoop Distributed File System. Within the Apache Software Foundation, HDFS is always improving in terms of performance and availability. Honestly, it’s probably fine for the majority of Hadoop workloads that are running in pilot projects, skunkworks projects or generally non-demanding environments. And technologies such as HBase that are built atop HDFS speak to its versatility as a storage system even for non-MapReduce applications. But if the growing number of options for replacing HDFS signifies anything, it’s that HDFS isn’t quite where it needs to be. Some Hadoop users have strict demands around performance, availability and enterprise-grade features, while others aren’t keen on its direct-attached storage (DAS) architecture. Concerns arou...

HBase Bulk Loading

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly. This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples: https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/

Apache Flink: API, runtime, and project roadmap

Detailed presentation on Apache Flink: https://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap