Posts

Showing posts from 2017

Gartner 2017 Market Guide for Data Preparation

Data preparation, the most time-consuming task in analytics and BI, is evolving from a self-service activity to an enterprise imperative. We profile 28 data preparation tools for data and analytics leaders to consider to accelerate agile data preparation for a range of distributed content authors.

Overview: Key Findings

    The market for data preparation has now evolved from tools supporting only self-service use cases into platforms that enable data and analytics teams to build agile and searchable datasets at an enterprise scale for distributed content authors.
    Most vendor offerings support data profiling, data exploration, transformation, modeling and curation, and metadata support.
    More than 80% of the vendors surveyed embed some data cataloging features and offer varying degrees of machine-learning capabilities.
    The market is crowded with a range of choices, from stand-alone specialists to vendors that embed data preparation as a capability into analyti...

How to choose algorithms for Microsoft Azure Machine Learning

The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right machine learning algorithm for your predictive analytics solutions from the Microsoft Azure Machine Learning library of algorithms. This article walks you through how to use it. Details:  https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

Translytical Data Platforms

Analytics at the speed of transactions has become an important agenda item for organizations. Translytical data platforms, an emerging technology, deliver faster access to business data to support various workloads and use cases. EA pros can use them to drive new business initiatives. Forrester identified the 12 most significant translytical vendors (Aerospike, DataStax, GigaSpaces, IBM, MemSQL, Microsoft, NuoDB, Oracle, Redis Labs, SAP, Splice Machine, and VoltDB) and researched, analyzed, and scored them against 25 criteria.

Details: The Forrester Wave™: Translytical Data Platforms, Q4 2017 (the link is provided by DataStax)

Software 2.0

(by Andrej Karpathy, Director of AI at Tesla) Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0. The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer is identifying a specific point in program space with some desirable behavior. In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried). Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a pr...
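As a toy illustration of the idea in the excerpt above (this is not code from Karpathy's article), the sketch below specifies the desired behavior only as a dataset of input/output pairs and lets gradient descent search a very small "program space": the two weights of a linear function. The function name `train`, the dataset, and the hyperparameters are all assumptions chosen for this example.

```python
def train(pairs, steps=2000, lr=0.01):
    """Search for weights (w, b) so that w * x + b matches the example pairs."""
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        # Step the weights against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# The "specification" is only a dataset; the rule y = 2x + 1 is never
# written into train() itself.
pairs = [(x, 2 * x + 1) for x in range(-5, 6)]
w, b = train(pairs)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Nothing in `train` encodes the target rule; the behavior ends up stored in the learned weights, which is the contrast with explicitly written instructions that the excerpt draws.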

Gartner Top 10 Strategic Technology Trends for 2018

Details:  https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2018/

Perils of Network Partitions

    Hey I just met you
    The network's laggy
    But here's my data
    So store it maybe

(Kyle Kingsbury, Carly Rae Jepsen and the Perils of Network Partitions)

HDFS scalability: the limits to growth

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now Senior Staff Software Engineer at LinkedIn) concerning the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...

Recommendation System Algorithms

An overview of the main existing recommendation system algorithms:

    Collaborative filtering
    Matrix decomposition
    Clustering
    Deep learning approach

Details:  https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3

Apache Livy - A REST Interface for Apache Spark

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.

Livy provides the following features:

    Interactive Scala, Python, and R shells
    Batch submissions in Scala, Java, Python
    Multiple users can share the same server (impersonation support)
    Can be used for submitting jobs from anywhere with REST
    Does not require any code change to your programs
    Supports Spark 1 and Spark 2, and Scala 2.10 and 2.11, within one build

Architecture

Core Functionalities

Livy offers three modes to run Spark jobs:

    Using programmatic API
    Running interactive statements through REST API
    Submitting batch applications with REST API

In the following sections, I will provide the details of these 3 modes. Details:  https://hortonworks.c...
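To make the two REST modes more concrete, here is a minimal sketch using only the Python standard library. It builds requests for Livy's `/sessions`, `/sessions/{id}/statements`, and `/batches` endpoints. The server address (`localhost` on port 8998), the helper names, and the example file path are assumptions made for illustration; `send` is the only function that touches the network and needs a running Livy server.

```python
import json
import urllib.request

# Assumption: a Livy server reachable locally on port 8998.
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Interactive mode, step 1: ask Livy to start a session of the given kind."""
    return livy_request("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Interactive mode, step 2: run a code snippet inside an existing session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(file, class_name=None, args=None):
    """Batch mode: submit an application file (for example a jar or .py file)."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    return livy_request("/batches", payload)


def send(request):
    """Send a prepared request and decode the JSON reply (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage against a running server:
#   session = send(create_session("pyspark"))
#   send(run_statement(session["id"], "sc.parallelize(range(10)).sum()"))
#   send(submit_batch("/path/to/app.py", args=["10"]))
```

Keeping request construction separate from `send` makes the payloads easy to inspect without a cluster. In practice a client would also poll the session state before posting statements and then poll each statement for its result.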

2017 Gartner Magic Quadrant for Data Management Solutions for Analytics

Details >> (shared by MemSQL here)

2017 Gartner Magic Quadrant for Business Intelligence and Analytics

Details >> (provided by Tableau here)

Unreliable clocks

Why time-of-day clocks are unsuitable for measuring elapsed time: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns
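The point can be sketched in a few lines of Python. This is an illustration of the general problem, not code from the linked article; the helper names and the simulated clock readings are assumptions made up for the example. A time-of-day clock can be stepped backwards (for instance by NTP or around a leap second), so the difference between two readings may come out negative, whereas a monotonic clock never goes backwards.

```python
import time


def elapsed_from_readings(before, after):
    """Elapsed time computed naively from two time-of-day readings."""
    return after - before


def measure_elapsed(fn):
    """Run fn and measure its duration with the monotonic clock."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


# Simulated time-of-day readings (made-up values): the clock was stepped
# back by half a second between the two reads.
negative = elapsed_from_readings(before=1000.0, after=999.5)
print("naive elapsed:", negative)

# The monotonic clock is meant for exactly this kind of measurement.
_, duration = measure_elapsed(lambda: sum(range(100_000)))
print("monotonic elapsed is non-negative:", duration >= 0)

# Python reports which of its clocks are guaranteed not to go backwards.
print("time.time monotonic?     ", time.get_clock_info("time").monotonic)
print("time.monotonic monotonic?", time.get_clock_info("monotonic").monotonic)
```

Code that feeds an elapsed time into something requiring a non-negative value (a timeout, a random range, a rate calculation) is where a negative duration tends to cause real failures.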