Data’s Inferno: 7 Circles of Data Testing Hell with Airflow
Real data
behaves in many unexpected ways that can break even the most
well-engineered data pipelines. To catch as much of this weird behaviour
as possible before users are affected, the ING Wholesale Banking
Advanced Analytics team has created 7 layers of data testing that they
use in their CI setup and Apache Airflow pipelines to stay in control of
their data. The 7 layers are:
- DAG Integrity Tests; have your CI (Continuous Integration) check if your DAG is an actual DAG
- Split your ingestion from your deployment; keep the logic you use to ingest data separate from the logic that deploys your application
- Data Tests; check if your logic is outputting what you’d expect
- Alerting; get Slack alerts from your data pipelines when they blow up
- Git Enforcing; always make sure you’re running your latest verified code
- Mock Pipeline Tests; create fake data in your CI so you know exactly what to expect when testing your logic
- DTAP; split your data into four different environments: Development is really small, just to see if it runs; Test takes a representative sample of your data for first sanity checks; Acceptance is a carbon copy of Production, allowing you to test performance and have a Product Owner do checks before releasing to Production
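As an illustration of the idea behind Circle 1, the sketch below checks that a set of task dependencies contains no cycle. It deliberately avoids Airflow and uses only Python's standard library (`graphlib`, Python 3.9+); the task names and the `find_cycle` helper are hypothetical. In an Airflow project the CI check would more likely load the DAG files themselves (for example through `airflow.models.DagBag`) and fail on any import error, so treat this as a minimal sketch of the principle rather than the team's implementation.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(dependencies):
    """Return a list of tasks forming a cycle, or None if the graph is acyclic.

    `dependencies` maps each task name to the set of tasks it depends on.
    """
    try:
        # prepare() raises CycleError if the graph cannot be topologically sorted.
        TopologicalSorter(dependencies).prepare()
    except CycleError as err:
        # The second element of args holds the cycle that was detected.
        return list(err.args[1])
    return None


# Hypothetical task graphs, for illustration only.
good_pipeline = {"ingest": set(), "transform": {"ingest"}, "report": {"transform"}}
bad_pipeline = {"ingest": {"report"}, "transform": {"ingest"}, "report": {"transform"}}

assert find_cycle(good_pipeline) is None      # a valid DAG
assert find_cycle(bad_pipeline) is not None   # ingest -> transform -> report -> ingest
```

Running a check like this in CI means a broken dependency graph is rejected before it ever reaches the scheduler.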
We
have ordered the 7 layers by how complex they are to implement:
Circle 1 is relatively easy to implement, and Circle 7 is more
complex. Examples of 5 of these circles can be found at: https://github.com/danielvdende/data-testing-with-airflow.
We
cannot make all 7 public: Circles 4 and 5 are missing, as publishing
them would allow everyone to push to our git repository and post to our
Slack channel :-).
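To give a flavour of Circles 3 and 6 without any Airflow dependency, here is a minimal sketch that builds a tiny fake dataset, runs a pipeline step on it, and then applies a few data tests to the output. Everything in it is an assumption made for illustration: the `transactions` and `account_totals` tables, the aggregation logic, and the function names are hypothetical and are not taken from the repository above. It uses an in-memory SQLite database from Python's standard library.

```python
import sqlite3


def make_fake_transactions(conn):
    """Create a small, fully known input table (the 'mock pipeline' idea)."""
    conn.execute("CREATE TABLE transactions (account_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [("A", 10.0), ("A", 5.0), ("B", 7.5)],
    )


def build_account_totals(conn):
    """Hypothetical pipeline step: aggregate transactions per account."""
    conn.execute("DROP TABLE IF EXISTS account_totals")
    conn.execute(
        """
        CREATE TABLE account_totals AS
        SELECT account_id, SUM(amount) AS total, COUNT(*) AS n_transactions
        FROM transactions
        GROUP BY account_id
        """
    )


def run_data_tests(conn):
    """Return a list of failure messages; an empty list means all tests passed."""
    failures = []

    # Test 1: every output row has an account id.
    n_null = conn.execute(
        "SELECT COUNT(*) FROM account_totals WHERE account_id IS NULL"
    ).fetchone()[0]
    if n_null:
        failures.append(f"{n_null} rows without account_id")

    # Test 2: exactly one output row per account.
    n_rows, n_distinct = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT account_id) FROM account_totals"
    ).fetchone()
    if n_rows != n_distinct:
        failures.append("duplicate account_id in account_totals")

    # Test 3: the output totals reconcile with the input amounts.
    source_sum = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions"
    ).fetchone()[0]
    output_sum = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM account_totals"
    ).fetchone()[0]
    if abs(source_sum - output_sum) > 1e-9:
        failures.append(f"totals differ: source={source_sum}, output={output_sum}")

    return failures


conn = sqlite3.connect(":memory:")
make_fake_transactions(conn)
build_account_totals(conn)
assert run_data_tests(conn) == []
```

Because the fake input is fully known, the CI run can also assert the exact output rows, while the same `run_data_tests` checks could be scheduled after the real pipeline step, where exact values are not known in advance.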