What is Apache Airflow?
Apache Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. Airflow uses workflows made of directed acyclic graphs (DAGs) of tasks.
Note: Airflow is currently in incubator status. Software in the Apache Incubator has not yet been fully endorsed by the Apache Software Foundation.
A DAG is a construct of nodes and connectors (also called “edges”) where the connectors have direction and the graph contains no cycles: following the connectors from node to node, you can never return to a node you have already visited. Trees are a special case of DAGs.
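The no-cycles property is what lets a scheduler compute a valid execution order for the tasks. As a rough illustration in plain Python (using the standard library's graphlib, not Airflow itself; the task names are hypothetical), a topological sort succeeds exactly when the graph is acyclic:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency graph: each key depends on the nodes in its set.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']

# A cycle makes the graph invalid as a DAG; the sort refuses it.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("cycle detected")
```

The scheduler's job is conceptually similar: run a task only once everything it depends on has finished.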
Airflow workflows are made of tasks whose output is another task’s input, so an ETL process is itself a kind of DAG: each step’s output feeds the next step, and you cannot loop back to a previous step.
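A minimal sketch of that forward-only flow in plain Python (the extract/transform/load functions here are hypothetical stand-ins, not Airflow APIs):

```python
def extract():
    # Pretend this pulls rows from a source system.
    return [1, 2, 3]

def transform(rows):
    # Each value is derived only from the previous step's output.
    return [row * 10 for row in rows]

def load(rows):
    # Pretend this writes to a target; here we just total the rows.
    return sum(rows)

result = load(transform(extract()))
print(result)  # 60
```

Data moves in one direction through the steps; nothing in the chain feeds back into an earlier step.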
Defining workflows in code provides easier maintenance, testing and versioning.
How is Apache Airflow different?
Airflow is not a data streaming platform. Tasks represent data movement, but they do not move data themselves; thus, it is not an interactive ETL tool.
An Airflow workflow is a Python script that defines an Airflow DAG object. This object can then be used in Python to code the ETL process. Airflow uses Jinja templating (Jinja is a templating language for Python, modeled after Django templates), which provides built-in parameters and macros.
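A minimal DAG file might look like the following sketch. It assumes an Airflow installation (the import paths shown are the classic ones; they vary across Airflow versions), and the dag_id, task_id, and schedule values are illustrative. The `{{ ds }}` macro is one of the built-in Jinja template parameters: it expands to the execution date.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG object ties tasks together and tells the scheduler when to run them.
dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Jinja templating in action: {{ ds }} is rendered at runtime
# into the execution date of the task instance.
print_date = BashOperator(
    task_id="print_date",
    bash_command="echo {{ ds }}",
    dag=dag,
)
```

Because the workflow is just Python, tasks and dependencies can be generated programmatically, which is where the "workflows as code" benefits of maintenance, testing, and versioning come from.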
Apache Airflow is a generic data toolbox that supports custom plugins. These plugins can add features, interact effectively with different data storage platforms (e.g., Amazon Redshift, MySQL), and handle more complex interactions with data and metadata.
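As a hedged sketch of the plugin mechanism (assumes an Airflow installation; the class and plugin names here are hypothetical): a plugin subclasses AirflowPlugin and lists the operators or hooks it contributes, and Airflow picks the file up from its plugins folder.

```python
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin


class RedshiftAuditOperator(BaseOperator):
    """Hypothetical custom operator; real logic would talk to Redshift."""

    def execute(self, context):
        # Placeholder for the actual data/metadata interaction.
        pass


class MyCompanyPlugin(AirflowPlugin):
    # The name under which the plugin's components are registered.
    name = "my_company_plugin"
    operators = [RedshiftAuditOperator]
```

Once registered, the custom operator can be used in DAG files like any built-in operator.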
Airflow integrates with Amazon Web Services (AWS) and Google Cloud Platform (GCP), including BigQuery: hooks and operators for both platforms are built in, and additional integrations may become available as Airflow matures.