Posts

Showing posts with the label development

Model governance and model operations: building and deploying robust, production-ready machine learning models

O'Reilly's surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts. With the s...

Data’s Inferno: 7 Circles of Data Testing Hell with Airflow

Real data behaves in many unexpected ways that can break even the most well-engineered data pipelines. To catch as much of this weird behaviour as possible before users are affected, the ING Wholesale Banking Advanced Analytics team has created 7 layers of data testing that they use in their CI setup and Apache Airflow pipelines to stay in control of their data. (Original image courtesy of Columbia Spectator: http://spc.columbiaspectator.com/spectrum/2016/03/31/nine-circles-columbia-hell ) The 7 layers are:

- DAG Integrity Tests: have your CI (Continuous Integration) check that your DAG is an actual DAG (see the sketch after this list)
- Split your ingestion from your deployment: keep the logic you use to ingest data separate from the logic that deploys your application
- Data Tests: check that your logic is outputting what you'd expect
- Alerting: get Slack alerts from your data pipelines when they blow up
- Git Enforcing: always make sure you're running your latest verified code
- Mock Pipeline Tests: create ...
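A DAG integrity test of this kind is usually a small pytest that loads the DAG folder through Airflow's DagBag and fails the build on any import error. A minimal sketch, assuming pytest and a hypothetical dags/ folder (DagBag collects import failures, including cycle errors, into import_errors rather than raising them):

```python
# test_dag_integrity.py - run in CI to check that every DAG file parses cleanly.
from airflow.models import DagBag

def test_dags_load_without_errors():
    # DagBag imports every file in the folder; syntax errors, missing
    # modules, and cycle errors end up in import_errors.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors, f"DAG import failures: {dag_bag.import_errors}"

def test_dags_have_tasks():
    # A DAG that parses but contains no tasks is almost certainly a mistake.
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    for dag_id, dag in dag_bag.dags.items():
        assert dag.tasks, f"{dag_id} has no tasks"
```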

Announcing Great Expectations v0.4 (We have SQL…!)

Based on feedback from the past month, we've revised, improved, and extended Great Expectations. 284 commits, 103 files changed, and 7 new contributors later, we've just released v0.4! Here's what's new.

#1 Native SQL

By far the most common request we received was the ability to run expectations natively in SQL. This was always on the roadmap. The community response made it our top priority.

We've introduced a new class called SQLAlchemyDataset. It contains all* the same expectations as the original PandasDataset class, but instead of executing them against a DataFrame in local memory, it executes them against a database table using the SQLAlchemy core API. This gets us several wins, all at once:

- Since SQLAlchemy binds to most popular databases, we get immediate integration with all of those systems. We've already heard from teams developing against PostgreSQL, Presto/Hive, and SQL Server. We expect to see lots more adoption on this front soon.
- Since ...
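To give a feel for the interface: a minimal sketch, assuming a SQLAlchemyDataset can be constructed from a table name plus a SQLAlchemy engine (the constructor arguments, connection string, and table name here are illustrative, not confirmed against the v0.4 API):

```python
import sqlalchemy as sa
from great_expectations.dataset import SQLAlchemyDataset

# Hypothetical connection string and table name, for illustration only.
engine = sa.create_engine("postgresql://user:password@localhost:5432/warehouse")

# Wrap an existing table. Expectations are compiled to SQL via the
# SQLAlchemy core API and run inside the database, not on a local DataFrame.
orders = SQLAlchemyDataset(table_name="orders", engine=engine)

# The expectation methods mirror PandasDataset:
result = orders.expect_column_values_to_not_be_null("order_id")
print(result["success"])
```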