Data’s Inferno: 7 Circles of Data Testing Hell with Airflow

Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data. The 7 layers are:
Original image courtesy of Columbia Spectator (http://spc.columbiaspectator.com/spectrum/2016/03/31/nine-circles-columbia-hell)
  1. DAG Integrity Tests; have your CI (Continuous Integration) check if you DAG is an actual DAG
  2. Split your ingestion from your deployment; keep the logic you use to ingest data separate from the logic that deploys your application
  3. Data Tests; check if your logic is outputting what you’d expect
  4. Alerting; get slack alerts from your data pipelines when they blow up
  5. Git Enforcing; always make sure you’re running your latest verified code
  6. Mock Pipeline Tests; create fake data in your CI so you know exactly what to expect when testing your logic
  7. DTAP; split your data into four different environments, Development is really small, just to see if it runs, Test to take a representative sample of your data to do first sanity checks, Acceptance is a carbon copy of Production, allowing you to test performance and have a Product Owner do checks before releasing to Production
We have ordered the 7 layers in order of complexity of implementing them, where Circle 1 is relatively easy to implement, and Circle 7 is more complex. Examples of 5 of these circles can be found at: https://github.com/danielvdende/data-testing-with-airflow.
We cannot make all 7 public, 4 and 5 are missing, as that would allow everyone to push to our git repository, and post to our Slack channel :-).

Details >>

Comments