Test data quality at scale with AWS Deequ
You generally write unit tests for your code, but do you also test your data? Incorrect or malformed data can have a large impact on production systems. Examples of data quality issues include the following. Missing values can lead to failures in production systems that require non-null values (for example, a NullPointerException). Changes in the distribution of data can lead to unexpected outputs of machine learning models. Aggregations of incorrect data can lead to wrong business decisions.

In this blog post, we introduce Deequ, an open source tool developed and used at Amazon. Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ also helps by suggesting checks. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (th...