Posts

Recommendation System Algorithms

An overview of the main existing recommendation system algorithms: collaborative filtering, matrix decomposition, clustering, and the deep learning approach.

Details: https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3
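To make the first item in that list concrete, here is a minimal sketch of user-based collaborative filtering. It is not taken from the linked article: the toy ratings, the function names, and the choice of cosine similarity over co-rated items are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 5, "b": 3, "d": 5},
    "carol": {"a": 1, "b": 5, "d": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        weight_total += abs(sim)
    # None signals that no other user has rated this item
    return weighted_sum / weight_total if weight_total else None


prediction = predict("alice", "d")
```

Here bob rates the shared items exactly as alice does, so his rating for "d" gets more weight than carol's in the prediction.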

Apache Livy - A REST Interface for Apache Spark

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.

Livy provides the following features:

    Interactive Scala, Python, and R shells
    Batch submissions in Scala, Java, Python
    Multiple users can share the same server (impersonation support)
    Can be used for submitting jobs from anywhere with REST
    Does not require any code change to your programs
    Support Spark 1/Spark 2, Scala 2.10/2.11 within one build

Architecture

Core Functionalities

Livy offers three modes to run Spark jobs:

    Using programmatic API
    Running interactive statements through REST API
    Submitting batch applications with REST API

In the following sections, I will provide the details of these 3 modes.

Details: https://hortonworks.c...
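As a rough sketch of the two REST modes mentioned above, the snippet below builds (but does not send) the JSON requests for an interactive session, a statement, and a batch submission. It assumes a Livy server on localhost at port 8998 (Livy's default); the session id 0, the jar path, and the class name are hypothetical placeholders, and `livy_request` and `send` are helper names invented for this example.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumption: local server on Livy's default port


def livy_request(path, payload):
    """Build (without sending) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Interactive mode: create a PySpark session, then run a statement in it.
create_session = livy_request("/sessions", {"kind": "pyspark"})
run_statement = livy_request(
    "/sessions/0/statements",  # 0 is a hypothetical session id
    {"code": "sc.parallelize(range(100)).count()"},
)

# Batch mode: submit a packaged application (path and class are placeholders).
submit_batch = livy_request(
    "/batches",
    {"file": "hdfs:///path/to/app.jar", "className": "com.example.App"},
)


def send(request):
    """Send a prepared request to a running Livy server and decode the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(create_session)` would return the new session's description, including the id to use in the statements URL.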

2017 Gartner Magic Quadrant for Data Management Solutions for Analytics

Details >> (shared by MemSQL here)

2017 Gartner Magic Quadrant for Business Intelligence and Analytics

Details >> (provided by Tableau here)

Unreliable clocks

Why time-of-day clocks are unsuitable for measuring elapsed time: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns
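A small sketch of the practical takeaway: measure durations with a monotonic clock rather than the time-of-day clock. The time-of-day clock can be adjusted (for example by NTP or a leap second), so a difference of two readings can come out wrong or even negative. The `timed` helper is a name made up for this example.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) using a monotonic clock."""
    start = time.monotonic()  # never goes backwards, unaffected by clock adjustments
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(sum, range(1_000_000))

# The runtime reports whether each clock is guaranteed to be monotonic.
monotonic_info = time.get_clock_info("monotonic")
wall_clock_info = time.get_clock_info("time")
```

`time.time()` is still the right call when you need a timestamp; it is only the subtraction of two such readings that is unreliable.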

The Forrester Big Data Fabric, Q4 2016

The Forrester Wave™: Big Data Fabric, Q4 2016

Details >> (the link provided by Informatica here)

File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks from Owen O’Malley presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:

    1. NYC Taxi Data (every taxi cab ride in NYC from 2009)
    2. GitHub Logs (all actions on GitHub public repositories)
    3. Sales (generated data)

3 compression mechanisms:

    1. None
    2. Snappy
    3. Zlib (Gzip)

3 use cases:

    1. Full Table Scans
    2. Column Projection
    3. Predicate Pushdown

Compression results

Use cases results

Full Table Scans

Column Projection

    Dataset  format   compression  us/row  projection  Percent time
    github   orc      zlib         21.319  0.185       0.87%
    github   parquet  zlib         72.494  0.585       0.81%
    sales    orc      zlib         1.866   0.056       3.00%
    sales    parquet  zlib         12.893  0.329       2.55%
    taxi     orc      zlib         2.766   0.063       2.28%
    taxi     parquet  zlib         3.496   ...
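For the column projection rows that are fully shown above, the "Percent time" figure appears to be the projection time divided by the us/row time. That reading is an inference from the numbers, not something the excerpt states; the short check below reproduces the column under that assumption, using only the complete rows.

```python
# (dataset, format, us_per_row, projection_us_per_row) copied from the complete rows
rows = [
    ("github", "orc", 21.319, 0.185),
    ("github", "parquet", 72.494, 0.585),
    ("sales", "orc", 1.866, 0.056),
    ("sales", "parquet", 12.893, 0.329),
    ("taxi", "orc", 2.766, 0.063),
]


def percent_time(us_per_row, projection):
    """Projection time as a percentage of the us/row time (assumed definition)."""
    return round(projection / us_per_row * 100, 2)


percentages = {
    (dataset, fmt): percent_time(us_per_row, projection)
    for dataset, fmt, us_per_row, projection in rows
}
```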