Getting started with Apache Airflow
In this post, I am going to discuss Apache Airflow, a workflow management system developed by Airbnb.
Earlier I had discussed
writing basic ETL pipelines in Bonobo. Bonobo is cool for writing ETL
pipelines, but the world is not all about writing ETL pipelines to
automate things. There are other use cases in which you have to perform
tasks in a certain order, either once or periodically. For instance:
- Monitoring cron jobs.
- Transferring data from one place to another.
- Automating your DevOps operations.
- Periodically fetching data from websites and updating the database for your awesome price comparison system.
- Data processing for recommendation-based systems.
- Machine learning pipelines.
Possibilities are endless.
Before we move further into implementing Airflow in our systems, let's discuss what Airflow actually is and its terminology.
What is Airflow?
From the Website:
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Basically, it helps to automate scripts in order
to perform tasks. Airflow is Python-based, but you can execute a program
in any language. For instance, the first stage of your
workflow might have to execute a C++-based program to perform image analysis,
followed by a Python-based program that transfers the results to S3.
Possibilities are endless.
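To make that concrete, here is a minimal sketch of such a two-step workflow written as an Airflow DAG. The binary path, S3 bucket, and schedule are placeholder assumptions, not part of any real project; the point is that each task can wrap any executable through a BashOperator, and the `>>` operator tells the scheduler in which order to run them.

```python
# A minimal sketch: run a (hypothetical) C++ binary, then push its output to S3.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="image_analysis_pipeline",       # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Step 1: run the compiled C++ image-analysis program (path is a placeholder).
analyze_images = BashOperator(
    task_id="analyze_images",
    bash_command="/opt/tools/analyze_images --input /data/raw --output /data/results",
    dag=dag,
)

# Step 2: copy the results to S3 (bucket name is a placeholder).
upload_to_s3 = BashOperator(
    task_id="upload_to_s3",
    bash_command="aws s3 cp /data/results s3://my-bucket/results --recursive",
    dag=dag,
)

# Declare the dependency: the upload runs only after the analysis succeeds.
analyze_images >> upload_to_s3
```

Because the tasks simply shell out to external programs, the same pattern works for any language; Airflow only takes care of the scheduling, retries, and dependencies.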
This post is part of the Data Engineering Series.