Notes about Cutting-Edge Technologies and Everything

Posts

Recommendation System Algorithms

- June 10, 2017

An overview of the main existing recommendation system algorithms:

- Collaborative filtering
- Matrix decomposition
- Clustering
- Deep learning approach

Details: https://blog.statsbot.co/recommendation-system-algorithms-ba67f39ac9a3

See post »

Apache Livy - A REST Interface for Apache Spark

- May 13, 2017

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.

Livy provides the following features:

- Interactive Scala, Python, and R shells
- Batch submissions in Scala, Java, and Python
- Multiple users can share the same server (impersonation support)
- Jobs can be submitted from anywhere over REST
- No code changes are required in your programs
- Support for Spark 1 and Spark 2, and Scala 2.10 and 2.11, within one build

Architecture

Core Functionalities

Livy offers three modes to run Spark jobs:

- Using the programmatic API
- Running interactive statements through the REST API
- Submitting batch applications with the REST API

In the following sections, I will provide the details of these three modes.

Details: https://hortonworks.c...
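As a rough illustration of the two REST modes listed above, here is a minimal sketch that only builds the HTTP requests and does not send them. The endpoint paths (`/sessions`, `/sessions/{id}/statements`, `/batches`) and JSON fields (`kind`, `code`, `file`, `className`, `args`) follow Livy's REST API as I understand it; the server address, the helper names, and the example jar path are assumptions made for illustration only.

```python
import json
from urllib.request import Request

# Assumption: a Livy server on localhost, port 8998.
LIVY_URL = "http://localhost:8998"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        LIVY_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark"):
    """Interactive mode, step 1: ask Livy to start an interactive session."""
    return livy_request("/sessions", {"kind": kind})


def run_statement(session_id, code):
    """Interactive mode, step 2: run a code snippet in an existing session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})


def submit_batch(file, class_name=None, args=None):
    """Batch mode: submit a packaged application (hypothetical jar path)."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    return livy_request("/batches", payload)


if __name__ == "__main__":
    req = run_statement(0, "sc.parallelize(range(10)).sum()")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually talk to a running server, each request object could be passed to `urllib.request.urlopen`; the response is JSON describing the session, statement, or batch state.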

See post »

2017 Gartner Magic Quadrant for Data Management Solutions for Analytics

- February 25, 2017

Details >> (shared by MemSQL here)

See post »

2017 Gartner Magic Quadrant for Business Intelligence and Analytics

- February 19, 2017

Details >> (provided by Tableau here)

See post »

Unreliable clocks

- January 05, 2017

Why time-of-day clocks are unsuitable for measuring elapsed time: https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns
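A minimal sketch of the point, assuming Python: the time-of-day clock (`time.time()`) can be stepped forwards or backwards by NTP corrections or leap-second handling, so subtracting two of its readings can give a wrong or even negative duration. A monotonic clock (`time.monotonic()`) never goes backwards and is the safer choice for elapsed time. The helper name `timed` is made up for this example.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    Uses the monotonic clock, which is unaffected by system clock
    adjustments, so the elapsed value is never negative.
    """
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed


if __name__ == "__main__":
    total, seconds = timed(sum, range(1_000_000))
    print(f"sum={total}, took {seconds:.6f}s")
```

Time-of-day readings remain the right tool for timestamps that must be compared with a calendar or with other machines; they are just not suitable as a stopwatch.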

See post »

The Forrester Big Data Fabric, Q4 2016

- November 29, 2016

The Forrester Wave™: Big Data Fabric, Q4 2016

Details >> (link provided by Informatica here)

See post »

File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

- October 12, 2016

Benchmarks from Owen O’Malley presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:

1. NYC Taxi Data (every taxi cab ride in NYC from 2009)
2. Github Logs (all actions on Github public repositories)
3. Sales (generated data)

3 compression mechanisms:

1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:

1. Full Table Scans
2. Column Projection
3. Predicate Pushdown

Compression results

Use cases results

Full Table Scans

Column Projection

Dataset  Format   Compression  us/row  Projection  Percent time
github   orc      zlib         21.319  0.185       0.87%
github   parquet  zlib         72.494  0.585       0.81%
sales    orc      zlib         1.866   0.056       3.00%
sales    parquet  zlib         12.893  0.329       2.55%
taxi     orc      zlib         2.766   0.063       2.28%
taxi     parquet  zlib         3.496   ...
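The "Percent time" column in the excerpt appears to be the projection time divided by the per-row time in the "us/row" column. That reading is an assumption on my part, not something the excerpt states, but the complete rows quoted above are consistent with it. A small sketch that recomputes the column (the truncated taxi/parquet row is left out because its values are not available):

```python
# Complete rows copied from the excerpt:
# (dataset, format, us/row, projection us/row, reported percent time)
ROWS = [
    ("github", "orc", 21.319, 0.185, 0.87),
    ("github", "parquet", 72.494, 0.585, 0.81),
    ("sales", "orc", 1.866, 0.056, 3.00),
    ("sales", "parquet", 12.893, 0.329, 2.55),
    ("taxi", "orc", 2.766, 0.063, 2.28),
]


def percent_time(us_per_row, projection_us):
    """Projection time as a percentage of the us/row figure (assumed formula)."""
    return round(100.0 * projection_us / us_per_row, 2)


if __name__ == "__main__":
    for dataset, fmt, us_per_row, projection, reported in ROWS:
        computed = percent_time(us_per_row, projection)
        print(f"{dataset:7s} {fmt:8s} computed={computed:.2f}% reported={reported:.2f}%")
```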

See post »