Notes about Cutting-Edge Technologies and Everything

Posts

Showing posts from 2017

Gartner 2017 Market Guide for Data Preparation

- December 28, 2017

Data preparation, the most time-consuming task in analytics and BI, is evolving from a self-service activity to an enterprise imperative. We profile 28 data preparation tools for data and analytics leaders to consider to accelerate agile data preparation for a range of distributed content authors.

Overview

Key Findings

- The market for data preparation has now evolved from tools supporting only self-service use cases into platforms that enable data and analytics teams to build agile and searchable datasets at an enterprise scale for distributed content authors.
- Most vendor offerings support data profiling, data exploration, transformation, modeling and curation, and metadata support.
- More than 80% of the vendors surveyed embed some data cataloging features and offer varying degrees of machine-learning capabilities.
- The market is crowded with a range of choices, from stand-alone specialists to vendors that embed data preparation as a capability into analyti...


How to choose algorithms for Microsoft Azure Machine Learning

- December 23, 2017

The Microsoft Azure Machine Learning Algorithm Cheat Sheet helps you choose the right machine learning algorithm for your predictive analytics solutions from the Microsoft Azure Machine Learning library of algorithms. This article walks you through how to use it. Details: https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice


Translytical Data Platforms

- November 30, 2017

Analytics at the speed of transactions has become an important agenda item for organizations. Translytical data platforms, an emerging technology, deliver faster access to business data to support various workloads and use cases. EA pros can use them to drive new business initiatives. Forrester identified the 12 most significant translytical vendors (Aerospike, DataStax, GigaSpaces, IBM, MemSQL, Microsoft, NuoDB, Oracle, Redis Labs, SAP, Splice Machine, and VoltDB) and researched, analyzed, and scored them against 25 criteria. Details: The Forrester Wave™: Translytical Data Platforms, Q4 2017 (link provided by DataStax).


Software 2.0

- November 15, 2017

(by Andrej Karpathy, Director of AI at Tesla) Neural networks are not just another classifier, they represent the beginning of a fundamental shift in how we write software. They are Software 2.0. The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer is identifying a specific point in program space with some desirable behavior. In contrast, Software 2.0 is written in neural network weights. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried). Instead, we specify some constraints on the behavior of a desirable program (e.g., a dataset of input output pairs of examples) and use the computational resources at our disposal to search the program space for a pr...
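As a toy illustration of that idea (not taken from the essay), the sketch below never writes the rule `y = 2x + 1` as code. It only supplies input/output pairs and lets gradient descent search for two weights that reproduce them. The function name `train`, the learning rate, and the step count are illustrative assumptions.

```python
def train(pairs, steps=2000, lr=0.05):
    """Search for weights (w, b) so that w * x + b matches the example pairs.

    Plain gradient descent on mean squared error; a stand-in for the much
    larger weight search that the excerpt describes.
    """
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# The "specification" is a dataset of desired behavior, not explicit logic.
examples = [(x, 2 * x + 1) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = train(examples)
print(f"learned w={w:.3f}, b={b:.3f}")
```

The learned weights land close to 2 and 1, yet no line of the program states that relationship directly.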


Gartner Top 10 Strategic Technology Trends for 2018

- October 07, 2017

Details: https://www.gartner.com/smarterwithgartner/gartner-top-10-strategic-technology-trends-for-2018/


Perils of Network Partitions

- September 21, 2017

Hey I just met you
The network's laggy
But here's my data
So store it maybe

(Kyle Kingsbury, "Carly Rae Jepsen and the Perils of Network Partitions")


HDFS scalability: the limits to growth

- September 08, 2017

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now Senior Staff Software Engineer at LinkedIn) on the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...


Recommendation System Algorithms

- June 10, 2017

An overview of the main existing recommendation system algorithms:

- Collaborative filtering
- Matrix decomposition
- Clustering
- Deep learning approach

Details: https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3
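To make the first approach concrete, here is a minimal sketch of user-based collaborative filtering. It is not taken from the linked article: the ratings, the user and item names, and the choice of cosine similarity over co-rated items are all illustrative assumptions.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        similarity_sum += sim
    # None signals that no other user has rated this item.
    return weighted_sum / similarity_sum if similarity_sum else None


print(predict("alice", "D"))
```

Because alice's ratings match bob's more closely than carol's, the prediction for item "D" sits nearer to bob's 5 than to carol's 2.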


Apache Livy - A REST Interface for Apache Spark

- May 13, 2017

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN. Livy provides the following features:

- Interactive Scala, Python, and R shells
- Batch submissions in Scala, Java, and Python
- Multiple users can share the same server (impersonation support)
- Can be used for submitting jobs from anywhere with REST
- Does not require any code change to your programs
- Supports Spark 1 and Spark 2, and Scala 2.10 and 2.11, within one build

Architecture

Core Functionalities

Livy offers three modes to run Spark jobs:

- Using the programmatic API
- Running interactive statements through the REST API
- Submitting batch applications with the REST API

In the following sections, I will provide the details of these 3 modes. Details: https://hortonworks.c...
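The two REST modes can be sketched as plain HTTP requests. This is a minimal illustration, not taken from the post: it assumes a Livy server at `http://localhost:8998` (Livy's default port), and the helper names `create_session`, `run_statement`, and `submit_batch` are hypothetical. The functions only build the requests, so the sketch runs without a server; passing a request to `urllib.request.urlopen` would send it.

```python
import json
from typing import List, Optional
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # assumption: local server on the default port


def _post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for the Livy server."""
    return Request(
        url=f"{LIVY_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind: str = "pyspark") -> Request:
    """Interactive mode, step 1: start a session (e.g. spark, pyspark, sparkr)."""
    return _post("/sessions", {"kind": kind})


def run_statement(session_id: int, code: str) -> Request:
    """Interactive mode, step 2: run a code snippet inside an existing session."""
    return _post(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(
    file: str,
    class_name: Optional[str] = None,
    args: Optional[List[str]] = None,
) -> Request:
    """Batch mode: submit a packaged application, much like spark-submit."""
    payload: dict = {"file": file}
    if class_name is not None:
        payload["className"] = class_name
    if args is not None:
        payload["args"] = args
    return _post("/batches", payload)


# Placeholder values; a real session id comes from the create_session response.
request = run_statement(0, "sc.parallelize(range(100)).sum()")
print(request.get_method(), request.full_url)
```

Statements run asynchronously, so a real client would then poll `GET /sessions/{id}/statements/{statementId}` until the result is available.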
