Apache Airflow for data pipelines and ETL management

I have been looking for a good workflow management tool and found Apache Airflow to be the best fit among the options I evaluated.

I’ve taken some time to write a fairly detailed blog post on using Airflow to develop ETL pipelines.

Airflow is a great tool that allows you to:

  • centrally manage and track the execution of all your ETL jobs using a web UI
  • manage shared connections to databases
  • implement complex dependencies between tasks in the form of a directed acyclic graph (DAG), as sketched in the example below
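
To make the DAG idea concrete, here is a minimal sketch of how tasks and their dependencies are declared (assuming Airflow 2.x; the DAG id, task names, and callables are hypothetical and not taken from the blog post):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        # Placeholder extract step; a real task would pull data from a source system.
        print("extracting data")


    def load():
        # Placeholder load step; a real task would write to the target database,
        # typically through a hook that reads a shared connection defined in the web UI.
        print("loading data")


    with DAG(
        dag_id="example_etl",              # hypothetical name
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # The >> operator declares an edge of the DAG: extract must finish before load runs.
        extract_task >> load_task

The dependency graph built this way is exactly what the web UI visualises and tracks run by run.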

In the blog post I cover a detailed implementation of two pipelines: one that copies data from Amazon S3 into Redshift, and another that moves data from one table to another using an upsert (a rough sketch of both patterns follows below). I also show how Airflow handles task administration and log tracking, among other things.
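
As an illustration of what such a pipeline can look like (not the exact code from the post), the sketch below runs a Redshift COPY from S3 into a staging table and then a common delete-and-insert upsert into the target table. It assumes the apache-airflow-providers-postgres package is installed; the "redshift" connection id, bucket, IAM role, and table names are all made up:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    # Load raw files from S3 into a staging table (bucket and IAM role are placeholders).
    COPY_SQL = """
    COPY staging_events
    FROM 's3://my-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
    """

    # A common Redshift upsert pattern: delete the rows that will be replaced,
    # then insert everything from the staging table, inside one transaction.
    UPSERT_SQL = """
    BEGIN;
    DELETE FROM events USING staging_events
    WHERE events.event_id = staging_events.event_id;
    INSERT INTO events SELECT * FROM staging_events;
    COMMIT;
    """

    with DAG(
        dag_id="s3_to_redshift_upsert",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        copy_to_staging = PostgresOperator(
            task_id="copy_s3_to_staging",
            postgres_conn_id="redshift",  # shared connection managed centrally in Airflow
            sql=COPY_SQL,
        )
        upsert_into_target = PostgresOperator(
            task_id="upsert_into_target",
            postgres_conn_id="redshift",
            sql=UPSERT_SQL,
        )

        copy_to_staging >> upsert_into_target

Keeping the COPY and the upsert as separate tasks means the web UI shows each step's state and logs independently, and a failed upsert can be retried without re-copying the files from S3.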

You can read the full blog post at this link.

A small preview:

[Screenshot preview]
