Apache Airflow is one of the most powerful workflow automation tools used in data engineering and ETL pipelines. But for beginners, setting it up on an AWS EC2 instance for the first time can feel confusing.

This guide explains EVERY step, from server setup to installing dependencies, understanding why each dependency is required, and finally creating a simple ETL DAG.

This is written so you can repeat this installation easily every time.

1. Launch an Ubuntu EC2 Instance

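Choose:

- AMI: Ubuntu 22.04
- Instance type: t2.micro or t3.micro (Free-tier)

Add a security group rule:
- Type: Custom TCP
- Port: 8080
- Source: your IP (preferred) or 0.0.0.0/0

Restricting the source to your own IP keeps the Airflow login page off the public internet; 0.0.0.0/0 lets anyone reach port 8080. Also keep an SSH rule (port 22) so you can connect.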
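
As a small illustration of the Source field, the sketch below turns a single address into the CIDR form a security group expects and flags a wide-open range. The helper names and the example address 203.0.113.7 (a documentation-only IP) are assumptions for illustration, not part of the original setup.

```python
import ipaddress


def source_cidr_for(ip: str) -> str:
    """Return a single-host CIDR (for example 203.0.113.7/32) for one address."""
    addr = ipaddress.ip_address(ip)
    prefix = 32 if addr.version == 4 else 128
    return f"{addr}/{prefix}"


def is_open_to_world(cidr: str) -> bool:
    """True when the CIDR matches every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


# 203.0.113.7 is a placeholder; substitute your own public IP.
print(source_cidr_for("203.0.113.7"))
print(is_open_to_world("0.0.0.0/0"))
```

Pasting the single-host CIDR into the Source field limits the web UI to your machine only.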

SSH into the instance:

ssh -i your-key.pem ubuntu@<public-ip>

2. Update System Packages

sudo apt update
sudo apt upgrade -y

Why?

To ensure your machine has the latest secure and stable packages before installing Airflow.

3. Install Required System Libraries

sudo apt install -y python3-pip python3-venv libmysqlclient-dev libssl-dev libffi-dev

Let's look at why each dependency is required.

Why these dependencies?

1. `python3-pip`

Airflow is a Python framework, so you need pip to install it.

2. `python3-venv`

Airflow should be installed in a virtual environment to avoid conflicts with the system Python.
This keeps Airflow and its many dependencies isolated from system packages.

3. `libmysqlclient-dev`

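Provides the MySQL client headers and the mysql_config tool.
You need it only if you install Airflow's MySQL support (for example the mysqlclient driver), which compiles against this library. A plain Airflow install on the default SQLite database does not use it, but installing it now avoids a rebuild later.

Without it, building the MySQL driver can fail with an error such as:

mysql_config not found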

4. `libssl-dev`

Provides the OpenSSL development headers.
Python packages that handle encrypted connections and authentication link against OpenSSL when they are built from source.

Without it, building packages such as cryptography from source can fail.

5. `libffi-dev`

Required for low-level Python C extensions used in cryptography and secure connections.

Missing this causes:

ffi.h not found
Failed building wheel for cryptography
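
Before moving on, a quick check like the one below can confirm that some of these packages landed. It is a rough sketch under two assumptions: that on Ubuntu python3-venv is what supplies the ensurepip module, and that libmysqlclient-dev puts mysql_config on the PATH. It does not verify the libssl-dev or libffi-dev headers.

```python
import importlib.util
import shutil


def check_prerequisites() -> dict:
    """Report which of the expected tools and modules are visible."""
    return {
        # pip3 executable from python3-pip
        "pip3": shutil.which("pip3") is not None,
        # ensurepip is needed for `python3 -m venv` to bootstrap pip
        "ensurepip": importlib.util.find_spec("ensurepip") is not None,
        # mysql_config is only relevant if you plan to build a MySQL driver
        "mysql_config": shutil.which("mysql_config") is not None,
    }


for name, present in check_prerequisites().items():
    print(f"{name}: {'ok' if present else 'missing'}")
```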

4. Create Airflow Installation Folder

mkdir airflow
cd airflow

5. Create a Virtual Environment

python3 -m venv venv
source venv/bin/activate

Your prompt should now show:

(venv) ubuntu@ip-xx

6. Install Apache Airflow

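Airflow should be installed with a constraints file to avoid dependency conflicts. Set PYTHON_VERSION to the major.minor version of the interpreter in your virtual environment (Ubuntu 22.04 ships Python 3.10).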

AIRFLOW_VERSION=2.10.2
PYTHON_VERSION=3.10
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

Why use constraint files?

Airflow has hundreds of dependencies.
Without constraints, pip may install incompatible versions, and the installation fails or Airflow breaks at runtime.
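
To avoid hard-coding the Python version, the constraint URL can be derived from the interpreter that is actually running. The sketch below follows the URL pattern used above; the function name `constraint_url` is an assumption made for illustration.

```python
import sys
from typing import Optional


def constraint_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Build the constraints URL for an Airflow version and Python major.minor."""
    if python_version is None:
        # Default to the interpreter running this script, e.g. "3.10".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraint_url("2.10.2"))
```

Run it with the virtual environment's python so the detected version matches the one pip installs into.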

7. Initialize Airflow Database

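airflow db migrate

Creates metadata tables for DAGs, tasks, logs, variables, connections, etc. With the default configuration this is a SQLite database at ~/airflow/airflow.db. (`airflow db init` still works in Airflow 2.10 but is deprecated in favor of `airflow db migrate`.)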
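
To see what the database step created, you can list the tables directly. This sketch assumes the default SQLite backend, stored at $AIRFLOW_HOME/airflow.db (~/airflow/airflow.db unless AIRFLOW_HOME is set); `list_tables` is a hypothetical helper name.

```python
import os
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path) -> list:
    """Return the table names in a SQLite database, sorted alphabetically."""
    with closing(sqlite3.connect(str(db_path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Assumed default location of the metadata database.
airflow_home = Path(os.environ.get("AIRFLOW_HOME", Path.home() / "airflow"))
db_file = airflow_home / "airflow.db"
if db_file.exists():
    print(list_tables(db_file))
else:
    print(f"No database found at {db_file}")
```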

8. Create Admin User

airflow users create \
  --username admin \
  --firstname dipak \
  --lastname mali \
  --role Admin \
  --email dipak@example.com \
  --password admin123

9. Start Airflow Webserver (Window 1)

airflow webserver -p 8080

This opens the UI on:

http://<your-ec2-ip>:8080

Keep this window open.

10. Start Airflow Scheduler (Window 2)

Open a new SSH window.

cd ~/airflow
source venv/bin/activate
airflow scheduler

The scheduler:

- Detects DAG changes
- Runs tasks in order
- Monitors DAG runs

Both windows must stay open.
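
Once both processes are running, the webserver's /health endpoint reports whether the metadata database and the scheduler are healthy. The sketch below is one way to read it with only the standard library; the function names and the sample payload values are illustrative assumptions.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
        return json.load(resp)


def component_status(health: dict) -> dict:
    """Map each reported component to its status string (or None)."""
    return {
        name: info.get("status")
        for name, info in health.items()
        if isinstance(info, dict)
    }


# Illustrative payload shaped like a healthy response (values are made up).
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {
        "status": "healthy",
        "latest_scheduler_heartbeat": "2024-01-01T00:00:00+00:00",
    },
}
print(component_status(sample))
```

On the instance itself, `component_status(fetch_health("http://localhost:8080"))` would query the live webserver. A scheduler status other than "healthy" usually means the second window is not running.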

11. Install Pandas for ETL (Optional but required for example DAG)

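cd ~/airflow
source venv/bin/activate
pip install pandas

Both existing windows are busy, so run this in a third SSH window. pandas must be installed in the same virtual environment that Airflow runs in.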

12. Create Your First ETL DAG

Create DAGs folder:

mkdir -p ~/airflow/dags

Create example DAG:

nano ~/airflow/dags/etl_example.py

Paste:

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

# All input and output files live here; employees.csv must already exist.
DATA_DIR = '/home/ubuntu/airflow/data'

def extract():
    # Read the raw source file and stage a copy for the next task.
    df = pd.read_csv(f'{DATA_DIR}/employees.csv')
    df.to_csv(f'{DATA_DIR}/extracted.csv', index=False)

def transform():
    # Apply a 10% raise to the salary column.
    df = pd.read_csv(f'{DATA_DIR}/extracted.csv')
    df['salary'] = df['salary'] * 1.10
    df.to_csv(f'{DATA_DIR}/transformed.csv', index=False)

def load():
    # Write the final output; a real pipeline would load into a database here.
    df = pd.read_csv(f'{DATA_DIR}/transformed.csv')
    df.to_csv(f'{DATA_DIR}/loaded.csv', index=False)

with DAG(
    dag_id='etl_example',
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: runs only when triggered manually
    catchup=False,
) as dag:

    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Run the three tasks strictly in order.
    extract_task >> transform_task >> load_task

Save and exit nano: CTRL+O, Enter, then CTRL+X.
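
The DAG reads /home/ubuntu/airflow/data/employees.csv, so that file has to exist before the first run. The sketch below writes a small sample file. Only the `salary` column is required by the DAG; the `name` column and the three rows are made-up sample data.

```python
import csv
from pathlib import Path


def write_sample_employees(data_dir) -> Path:
    """Create data_dir if needed and write a small employees.csv into it."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    csv_path = data_dir / "employees.csv"
    # Hypothetical sample rows; the DAG only relies on the salary column.
    rows = [
        {"name": "Asha", "salary": 50000},
        {"name": "Ravi", "salary": 62000},
        {"name": "Meera", "salary": 71000},
    ]
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "salary"])
        writer.writeheader()
        writer.writerows(rows)
    return csv_path


# On the EC2 instance, uncomment the next line to create the input file:
# write_sample_employees("/home/ubuntu/airflow/data")
```

Run it once with the virtual environment's python; the extract task will then find its input.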

13. Trigger the DAG

Go to the Airflow UI:

http://<your-ec2-ip>:8080

- Find etl_example in the DAG list (a new DAG file may take a few minutes to appear)
- Switch the toggle ON to unpause it
- Click Trigger DAG

You will see green checkmarks for:

extract
transform
load

14. Verify Output Files

cat ~/airflow/data/extracted.csv
cat ~/airflow/data/transformed.csv
cat ~/airflow/data/loaded.csv

transformed.csv and loaded.csv will show salaries increased by 10%.
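
Instead of comparing the files by eye, a short script can confirm the 10% raise. This is a sketch using only the standard library; the helper names are assumptions, and the comparison is tolerant of floating-point rounding.

```python
import csv
import math


def read_salaries(csv_path) -> list:
    """Return the salary column of a CSV file as floats."""
    with open(csv_path, newline="") as f:
        return [float(row["salary"]) for row in csv.DictReader(f)]


def raise_applied(before_path, after_path, factor: float = 1.10) -> bool:
    """True when every salary in after_path equals factor times the one before."""
    before = read_salaries(before_path)
    after = read_salaries(after_path)
    return len(before) == len(after) and all(
        math.isclose(b * factor, a, rel_tol=1e-9) for b, a in zip(before, after)
    )


# On the EC2 instance, uncomment to check the real output files:
# print(raise_applied("/home/ubuntu/airflow/data/extracted.csv",
#                     "/home/ubuntu/airflow/data/transformed.csv"))
```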

Congratulations!

You have successfully:

- Installed Airflow on AWS EC2
- Understood every dependency
- Learned Airflow components
- Created your first real ETL pipeline

Next Steps (Optional Enhancements)

- Run Airflow as a background systemd service
- Use PostgreSQL instead of SQLite
- Schedule daily ETL jobs
- Fetch data from APIs
- Upload ETL output to S3
- Build a complete data pipeline
