Posts

Showing posts from May, 2018

An Introduction to Deep Learning for Tabular Data

There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff Dean), Pinterest, and Instacart, yet many people don’t even realize it is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables. Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my twitter poll and comments for more discussion).

[Figure: from the Pinterest blog post 'Applying deep learning to Related Pins']

Tabular data is the most commonly used type of data in industry, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing.
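To make the idea of an embedding for a categorical variable concrete, here is a minimal sketch in plain Python. It is not taken from the post: the helper names (`build_vocab`, `init_embeddings`, `embed`) and the example column are hypothetical, and the vectors are random and untrained. In a real model the table would be a trainable weight matrix learned together with the rest of the network.

```python
import random

def build_vocab(values):
    """Map each distinct category to an integer index; index 0 is reserved for unseen values."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """Create a lookup table with one small random vector per index (plus row 0 for unknowns)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size + 1)]

def embed(value, vocab, table):
    """Replace a category with its dense vector; unknown categories fall back to row 0."""
    return table[vocab.get(value, 0)]

# Hypothetical categorical column, e.g. a day-of-week field from a DataFrame.
days = ["Mon", "Tue", "Mon", "Sun"]
vocab = build_vocab(days)
table = init_embeddings(len(vocab), dim=4)
embedded_rows = [embed(day, vocab, table) for day in days]
```

Each row of the column is now a 4-dimensional vector rather than a string, and identical categories share the same vector, which is what lets a network learn similarities between categories.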

Data’s Inferno: 7 Circles of Data Testing Hell with Airflow

Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data.

[Original image courtesy of Columbia Spectator ( http://spc.columbiaspectator.com/spectrum/2016/03/31/nine-circles-columbia-hell )]

The 7 layers are:

1. DAG Integrity Tests: have your CI (Continuous Integration) check if your DAG is an actual DAG
2. Split your ingestion from your deployment: keep the logic you use to ingest data separate from the logic that deploys your application
3. Data Tests: check if your logic is outputting what you’d expect
4. Alerting: get Slack alerts from your data pipelines when they blow up
5. Git Enforcing: always make sure you’re running your latest verified code
6. Mock Pipeline Tests: create
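As an illustration of what a DAG integrity check verifies, the sketch below tests whether a set of task dependencies is acyclic, using Kahn's algorithm. It is an assumption-laden stand-in, not the ING team's test and not the Airflow API: the `is_acyclic` function and the task names are hypothetical, and a pipeline is represented as a plain dictionary mapping each task to its downstream tasks.

```python
from collections import deque

def is_acyclic(dependencies):
    """Return True if the task graph has no cycles (Kahn's algorithm).

    `dependencies` maps a task name to the list of tasks that run after it.
    """
    # Collect every task, including ones that only appear as downstream targets.
    nodes = set(dependencies)
    for downstream in dependencies.values():
        nodes.update(downstream)

    # Count incoming edges for each task.
    indegree = {node: 0 for node in nodes}
    for downstream in dependencies.values():
        for node in downstream:
            indegree[node] += 1

    # Repeatedly remove tasks that have no remaining upstream tasks.
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in dependencies.get(node, ()):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # If a cycle exists, its tasks never reach indegree 0 and are never visited.
    return visited == len(nodes)

# Hypothetical pipelines: one valid, one with a circular dependency.
valid_pipeline = {"ingest": ["clean"], "clean": ["report"]}
broken_pipeline = {"ingest": ["clean"], "clean": ["ingest"]}
valid_ok = is_acyclic(valid_pipeline)
broken_ok = is_acyclic(broken_pipeline)
```

Run in CI, a check like this would fail the build on `broken_pipeline` before the pipeline is ever deployed.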

Xeno.graphics - a collection of unusual charts and maps

Xeno.graphics is a collection of unusual charts and maps, managed by Maarten Lambrechts. Its objective is to create a repository of novel, innovative and experimental visualizations to inspire you, to fight xenographphobia and popularize new chart types. The xenographics collection will keep on growing. If you know of one that isn’t here already, please submit it. You can also expect some posts about certain topics around xenographics.

Announcing Great Expectations v0.4 (We have SQL…!)

Based on feedback from the past month, we’ve revised, improved, and extended Great Expectations. 284 commits, 103 files changed, and 7 new contributors later, we’ve just released v0.4! Here’s what’s new.

#1 Native SQL

By far the most common request we received was the ability to run expectations natively in SQL. This was always on the roadmap. The community response made it our top priority. We’ve introduced a new class called SQLAlchemyDataset. It contains all* the same expectations as the original PandasDataset class, but instead of executing them against a DataFrame in local memory, it executes them against a database table using the SQLAlchemy core API. This gets us several wins, all at once:

- Since SQLAlchemy binds to most popular databases, we get immediate integration with all of those systems. We’ve already heard from teams developing against PostgreSQL, Presto/Hive, and SQL Server. We expect to see lots more adoption on this front soon.
- Since the
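To illustrate the general idea of evaluating an expectation inside the database instead of in local memory, here is a small sketch. It deliberately uses Python's built-in `sqlite3` module rather than SQLAlchemy, and it is not the Great Expectations API: the function `check_not_null_in_sql`, the shape of its result, and the `users` table are all hypothetical.

```python
import sqlite3

def check_not_null_in_sql(conn, table, column):
    """Check that a column contains no NULLs by letting the database do the counting."""
    # Identifiers cannot be bound as query parameters, so validate them first.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    nulls = nulls or 0  # SUM returns NULL when the table is empty
    return {"success": nulls == 0, "row_count": total, "null_count": nulls}

# Hypothetical in-memory table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com")],
)
result = check_not_null_in_sql(conn, "users", "email")
```

Only the two aggregate counts leave the database, so the check scales with the table without pulling rows into a DataFrame.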