Getting started with Apache Airflow



In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb.

Earlier I discussed writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL pipelines, but the world is not all about writing ETL pipelines to automate things. There are other use cases in which you have to perform tasks in a certain order, once or periodically. For instance:

  • Monitoring cron jobs.
  • Transferring data from one place to another.
  • Automating your DevOps operations.
  • Periodically fetching data from websites and updating the database for your awesome price comparison system.
  • Data processing for recommendation-based systems.
  • Machine learning pipelines.

Possibilities are endless.

Before we move on to implementing Airflow in our systems, let's discuss what Airflow actually is and its terminology.

What is Airflow?


From the Website:

Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Basically, it helps you automate scripts in order to perform tasks. Airflow is Python-based, but you can execute programs written in any language. For instance, the first stage of your workflow might run a C++ program to perform image analysis, and the next stage a Python program to transfer that information to S3. Possibilities are endless.
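
To make that example concrete, here is a minimal sketch of such a two-stage workflow written as an Airflow DAG. The binary path, task ids, schedule, and the `upload_to_s3` callable are placeholders of my own, and the imports follow the classic Airflow 1.x module layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def upload_to_s3():
    # Placeholder: push the analysis output to S3 (e.g. with boto3).
    pass


default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 1, 1),
}

dag = DAG(
    dag_id="image_analysis_pipeline",
    default_args=default_args,
    schedule_interval="@daily",
)

# Stage 1: run the (hypothetical) C++ image-analysis binary via a shell command.
analyze_images = BashOperator(
    task_id="analyze_images",
    bash_command="/usr/local/bin/image_analysis",  # hypothetical path
    dag=dag,
)

# Stage 2: push the results to S3 with a Python callable.
push_results = PythonOperator(
    task_id="push_results_to_s3",
    python_callable=upload_to_s3,
    dag=dag,
)

# Declare the dependency: the analysis must finish before the upload runs.
analyze_images >> push_results
```

The `>>` operator is how Airflow expresses the "directed" part of the DAG: the scheduler will only queue `push_results_to_s3` once `analyze_images` has succeeded.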

This post is part of the Data Engineering Series.
