Airflow has many contenders now, but it continues to hold strong when it comes to orchestrating and scheduling tasks in the Data Engineering field.
If you are running self-hosted Airflow in your organization, it helps to
understand what the system is made of, so that when it breaks you can fix
it fast. Let's look into its architecture.
Airflow is composed of several microservices that work together. Here are the components:
𝗦𝗰𝗵𝗲𝗱𝘂𝗹𝗲𝗿.
➡️ The central piece of the Airflow architecture.
➡️ Triggers scheduled workflows.
➡️ Submits tasks to the Executor.
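To make this concrete, here is a minimal sketch of a DAG file the Scheduler would parse and trigger. The DAG id, schedule, and command are illustrative; the `schedule` parameter assumes Airflow 2.4+ (older 2.x versions call it `schedule_interval`):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_daily_job",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule="@daily",            # the Scheduler triggers one run per day
    catchup=False,                # skip backfilling of missed intervals
) as dag:
    # A single task; the Scheduler hands it to the Executor when the run fires.
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from airflow'",
    )
```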
𝗘𝘅𝗲𝗰𝘂𝘁𝗼𝗿.
➡️ Runs as part of the Scheduler process.
➡️ Handles task execution.
➡️ In production workloads, it pushes tasks out to Workers.
➡️ Can be configured to execute against different systems (Celery, Kubernetes, etc.), as shown below.
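Which executor is used is just configuration. A sketch using the standard environment-variable form (equivalent to editing `airflow.cfg`):

```bash
# Same as setting "executor" under [core] in airflow.cfg.
# SequentialExecutor is the low-end default; CeleryExecutor and
# KubernetesExecutor fan tasks out to separate workers.
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
```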
𝗪𝗼𝗿𝗸𝗲𝗿.
➡️ The unit that actually performs the work.
➡️ In production setups, it usually picks up tasks from a queue that sits between the Workers and the Executor.
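With the CeleryExecutor, for example, each Worker is a separate process pulling tasks from the broker queue; you start one per worker machine:

```bash
# Consumes Airflow tasks from the Celery queue; assumes a broker
# (e.g. Redis or RabbitMQ) is already configured.
airflow celery worker
```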
𝗠𝗲𝘁𝗮𝗱𝗮𝘁𝗮 𝗗𝗮𝘁𝗮𝗯𝗮𝘀𝗲.
➡️ The database the Scheduler, Executor, and Webserver use to store their state.
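A sketch of wiring the services to a shared metadata DB. The Postgres host and credentials are illustrative, and the `DATABASE` config section assumes Airflow 2.3+ (older versions used `[core]`):

```bash
# Every Airflow service must point at the same metadata DB.
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN="postgresql+psycopg2://airflow:airflow@postgres:5432/airflow"
# Create or upgrade the schema before starting the services.
airflow db upgrade
```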
𝗗𝗔𝗚 𝗱𝗶𝗿𝗲𝗰𝘁𝗼𝗿𝗶𝗲𝘀.
➡️ Airflow DAGs are defined in Python code.
➡️ This is where you store the DAG code and where you configure Airflow to look for DAGs.
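The lookup location is a single configured folder; the path below is illustrative:

```bash
# Airflow parses the Python files under dags_folder looking for DAG objects.
export AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
```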
𝗪𝗲𝗯𝘀𝗲𝗿𝘃𝗲𝗿.
➡️ A Flask application that lets users explore, debug, and partially manipulate Airflow DAGs, users, and configuration.
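In Airflow 2.x it runs as its own process (8080 is the default port):

```bash
# Serves the UI and the REST API.
airflow webserver --port 8080
```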
❗️ The two most important parts are the Scheduler and the Metadata DB.
❗️ Even if the Webserver is down, tasks will still be executed as long as the Scheduler is healthy.
❗️ Transaction locks on the Metadata DB can cause problems for the other services.
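A handy corollary for on-call debugging: the Webserver exposes a /health endpoint that reports the status of both critical pieces in one call (assuming the Webserver listens on localhost:8080):

```bash
curl -s http://localhost:8080/health
# Returns JSON with the metadata DB status and the latest Scheduler
# heartbeat, e.g. {"metadatabase": {"status": "healthy"},
#                  "scheduler": {"status": "healthy", ...}}
```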