I have been looking for good workflow management software and found Apache Airflow to be superior to other solutions.
I’ve taken some time to write a pretty detailed blog post on using Airflow for development of ETL pipelines.
Airflow is a great tool that lets you:
- centrally manage and track the execution of all your ETL jobs using a web UI
- manage shared connections to databases
- implement complex dependencies between tasks in the form of a directed acyclic graph (DAG)
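To make the last point concrete, here is a minimal sketch of what "dependencies as a directed acyclic graph" means. It uses only Python's standard-library `graphlib` module, not Airflow's own API, and the task names are hypothetical placeholders rather than the ones from the blog post. Each task lists the tasks it depends on, and a topological sort yields an order in which every task runs after its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not taken from the blog post).
# Each key maps to the set of tasks that must finish before it can start.
dependencies = {
    "extract_from_s3": set(),
    "load_to_staging": {"extract_from_s3"},
    "upsert_into_target": {"load_to_staging"},
    "archive_source_files": {"load_to_staging"},
    "refresh_report": {"upsert_into_target"},
}


def execution_order(deps):
    """Return one valid run order in which each task follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is exactly what the "acyclic" requirement rules out.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A scheduler such as Airflow works from the same kind of graph, but it also handles scheduling, retries, and logging; this sketch only shows the ordering idea.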
In the blog post I walk through a detailed implementation of two pipelines: one from Amazon S3 to Redshift, and the other from one table in S3 to another table using an upsert. I also show how Airflow is used to administer tasks and track logs, among other things.
You can read the full blog post at this link.
A small preview: