
Showing posts with the label scalability

Distributed SQL System Review: Snowflake vs Splice Machine

After many years of Big Data, NoSQL, and Schema-on-Read detours, there is a clear return to SQL as the lingua franca for data operations. Developers need the comprehensive expressiveness that SQL provides. A world without SQL ignores more than 40 years of database research and results in hard-coded spaghetti code in applications to handle functionality that SQL handles extremely efficiently, such as joins, groupings, aggregations, and (most importantly) rollback when updates go wrong. Luckily, there is a modern architecture for SQL, called Distributed SQL, that no longer suffers from the challenges of traditional SQL systems (cost, scalability, performance, elasticity, and schema flexibility). The key attribute of Distributed SQL is that data is stored across many distributed storage locations and computation takes place across a cluster of networked servers. This yields unprecedented performance and scalability because it distributes work to each worker node in the cluster in parall...
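To make the excerpt's point concrete, here is a minimal sketch using Python's standard-library sqlite3. The table names and data are invented for illustration; it shows the functionality the excerpt singles out: a join with grouping and aggregation in one declarative statement, and a transaction that rolls back when an update goes wrong.

```python
import sqlite3

# In-memory database; schema and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL CHECK (amount >= 0)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Join + grouping + aggregation in one declarative statement,
# instead of hand-written loops in application code.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(totals)  # [('Ada', 15.0), ('Grace', 7.5)]

# Rollback when an update goes wrong: the CHECK constraint fires
# and the transaction is undone, leaving the data untouched.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE orders SET amount = amount - 100 WHERE id = 3")
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT amount FROM orders WHERE id = 3").fetchone()[0])  # 7.5
```

A Distributed SQL engine presents this same declarative interface while executing the joins and aggregations across many nodes.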

Improving Netflix’s Operational Visibility with Real-Time Insight Tools

For Netflix to be successful, we have to be vigilant in supporting the tens of millions of connected devices that are used by our 40+ million members throughout 40+ countries. These members consume more than one billion hours of content every month and account for nearly a third of the downstream Internet traffic in North America during peak hours. From an operational perspective, our system environments at Netflix are large, complex, and highly distributed. And at our scale, humans cannot continuously monitor the status of all of our systems. To maintain high availability across such a complicated system, and to help us continuously improve the experience for our customers, it is critical for us to have exceptional tools coupled with intelligent analysis to proactively detect and communicate system faults and identify areas of improvement. In this post, we will talk about our plans to build a new set of insight tools and systems that create greater visibility int...

The Forrester Wave Big Data Fabric, Q2 2018

Key takeaways:

- Talend, Denodo Technologies, Oracle, IBM, and Paxata lead the pack. Forrester's research uncovered a market in which Talend, Denodo Technologies, Oracle, IBM, and Paxata are Leaders; Hortonworks, Cambridge Semantics, SAP, Trifacta, Cloudera, and Syncsort are Strong Performers; and Podium Data, TIBCO Software, Informatica, and Hitachi Vantara are Contenders.
- EA pros are looking to support multiple use cases with big data fabric. The big data fabric market is growing because more EA pros see big data fabric as critical for their enterprise big data strategy.
- Scale, performance, AI/machine learning, and use-case support are key differentiators. The Leaders we identified support a broader set of use cases, offer enhanced AI and machine learning capabilities, and provide good scalability features. ...

Machine Learning algorithms and libraries overview

A nice brief overview of some machine learning algorithms, highlighting their strengths and weaknesses. The big three machine learning tasks, which are by far the most common ones, are:

- Regression
- Classification
- Clustering

Details: https://elitedatascience.com/machine-learning-algorithms

Here are also some observations on the top five characteristics of ML libraries that developers should consider when deciding which library to use:

Programming paradigm
- Symbolic: Spark MLlib, MMLSpark, BigDL, CNTK, H2O.ai, Keras, Caffe2
- Imperative: scikit-learn, auto-sklearn, TPOT, PyTorch
- Hybrid: MXNet, TensorFlow

Machine learning algorithms
- Supervised and unsupervised: Spark MLlib, scikit-learn, H2O.ai, MMLSpark, Mahout
- Deep learning: TensorFlow, PyTorch, Caffe2 (image), Keras, MXNet, CNTK, BigDL, MMLSpark (image and text), H2O.ai (via the deepwater plugin)
- Recommendation system: Spark MLlib, H2O.ai (via the sparkling-water plugin), Mah...
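To give one of the big three tasks a concrete shape, here is a toy clustering example: a minimal 1-D k-means written with only the Python standard library. The data, cluster count, and iteration budget are invented for illustration; real work would use a library such as scikit-learn from the list above.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal 1-D k-means sketch; illustrative only, not library-grade."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups around 1 and 10; the data is invented for illustration.
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans(data, k=2))  # centers converge near [1.0, 10.0]
```

Regression and classification follow the same fit/predict rhythm, which is why the symbolic vs imperative paradigm distinction above matters mostly for deep learning workloads.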

HDFS scalability: the limits to growth

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now Senior Staff Software Engineer at LinkedIn) concerning the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...
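The read/write asymmetry Shvachko describes can be sketched with back-of-envelope arithmetic. The latencies and per-client rate below are assumed round numbers for illustration, not measurements from the article; the point is only that a memory-only getBlockLocations is orders of magnitude cheaper than a create that must journal to the local disk, so far fewer writers are needed to saturate a single name-node.

```python
# Back-of-envelope model of single name-node saturation.
# All numbers are illustrative assumptions, not measurements.

READ_LATENCY_S = 20e-6   # get block locations: memory-only lookup (~tens of microseconds)
WRITE_LATENCY_S = 5e-3   # create: journaled to the local disk (~milliseconds)

def max_ops_per_sec(latency_s):
    """Upper bound on ops/sec if the name-node serves one request at a time."""
    return 1.0 / latency_s

reads = max_ops_per_sec(READ_LATENCY_S)    # ~50,000 reads/sec
writes = max_ops_per_sec(WRITE_LATENCY_S)  # ~200 creates/sec

# If each client issues, say, 0.5 metadata ops/sec (assumed), how many
# clients does it take to saturate the name-node?
CLIENT_RATE = 0.5
print(int(reads / CLIENT_RATE))   # readers supported before saturation
print(int(writes / CLIENT_RATE))  # writers needed to saturate it
print(f"a write costs about {WRITE_LATENCY_S / READ_LATENCY_S:.0f}x a read")
```

Under these assumed numbers the name-node supports roughly 250x more readers than writers, which is the shape of the gap the article attributes to journaling being bounded by local disk performance.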