Posts

Showing posts with the label code

Data’s Inferno: 7 Circles of Data Testing Hell with Airflow

Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data.

Original image courtesy of Columbia Spectator ( http://spc.columbiaspectator.com/spectrum/2016/03/31/nine-circles-columbia-hell )

The 7 layers are:

1. DAG Integrity Tests; have your CI (Continuous Integration) check if your DAG is an actual DAG
2. Split your ingestion from your deployment; keep the logic you use to ingest data separate from the logic that deploys your application
3. Data Tests; check if your logic is outputting what you’d expect
4. Alerting; get Slack alerts from your data pipelines when they blow up
5. Git Enforcing; always make sure you’re running your latest verified code
6. Mock Pipeline Tests; create ...
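
To make the first layer concrete: a DAG integrity test asks whether the task dependencies really form a directed acyclic graph. The sketch below is not the team's implementation and does not use Airflow at all; it only illustrates the underlying acyclicity check with Python's standard library `graphlib` module (Python 3.9+). The helper `find_cycle` and the task names `ingest`, `transform` and `publish` are hypothetical, chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return the tasks forming a cycle, or None if the graph is acyclic.

    `dependencies` maps each task name to the set of tasks it depends on.
    (Hypothetical helper for illustration, not an Airflow API.)
    """
    try:
        # Consuming static_order() raises CycleError when no valid
        # execution order exists, i.e. when the graph contains a cycle.
        tuple(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib stores the detected cycle as the second exception argument.
        return err.args[1]
    return None


# A valid pipeline: ingest, then transform, then publish.
valid = {"ingest": set(), "transform": {"ingest"}, "publish": {"transform"}}

# A broken pipeline: ingest waits on publish, which closes a loop.
broken = {"ingest": {"publish"}, "transform": {"ingest"}, "publish": {"transform"}}

assert find_cycle(valid) is None
assert find_cycle(broken) is not None
```

In a CI job, a check like this would run on every commit and fail the build as soon as a dependency loop is introduced, before the pipeline is ever scheduled.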