Installing Apache Airflow on AWS EC2 (Ubuntu)
A Complete Beginner-Friendly Guide to Apache Airflow

Apache Airflow is one of the most powerful workflow automation tools used in data engineering and ETL pipelines. But for beginners, setting it up on an AWS EC2 instance for the first time can feel confusing.
This guide explains EVERY step, from server setup to installing dependencies, understanding why each dependency is required, and finally creating a simple ETL DAG.
This is written so you can repeat this installation easily every time.
1. Launch an Ubuntu EC2 Instance
Choose:
Ubuntu 22.04
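t2.micro or t3.micro (free-tier eligible; both have only 1 GB of RAM, which is tight for running the webserver and scheduler together, so consider t3.small or larger if the UI becomes unresponsive)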
Add a security group rule:
Type: Custom TCP
Port: 8080
Source: your IP (preferred), or 0.0.0.0/0 to allow any address (this exposes the Airflow UI to the whole internet)
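If you restrict the rule to your own address, the Source field expects CIDR notation, where a single IPv4 address is written with a /32 suffix. Here is a minimal Python sketch of that conversion. The function name single_host_cidr and the sample address (taken from the 203.0.113.0/24 documentation range) are illustrative assumptions, not part of the AWS console.

```python
import ipaddress


def single_host_cidr(ip: str) -> str:
    """Return the CIDR block that matches exactly one IP address."""
    address = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    # A single host is /32 for IPv4 and /128 for IPv6.
    return str(ipaddress.ip_network(f"{address}/{address.max_prefixlen}"))


if __name__ == "__main__":
    # Placeholder address; substitute the public IP of your own machine.
    print(single_host_cidr("203.0.113.7"))
```

Paste the resulting value into the Source field instead of 0.0.0.0/0.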
SSH into the instance:
ssh -i your-key.pem ubuntu@<public-ip>
2. Update System Packages
sudo apt update
sudo apt upgrade -y
Why?
To ensure your machine has the latest secure and stable packages before installing Airflow.
3. Install Required System Libraries
sudo apt install -y python3-pip python3-venv libmysqlclient-dev libssl-dev libffi-dev
Let's look at why each dependency is required.
Why these dependencies?
1. python3-pip
Airflow is a Python framework, so you need pip to install it.
2. python3-venv
Airflow should be installed in a virtual environment to avoid conflicts with the system Python packages.
This keeps Airflow's dependencies isolated from the rest of the machine.
3. libmysqlclient-dev
This package is needed to build mysqlclient, the MySQL driver that Airflow's MySQL provider depends on.
If you never install the MySQL extra you can skip it, but installing it up front avoids a build failure if you add MySQL support later.
Without it, building the driver fails with:
mysql_config not found
4. libssl-dev
Provides the OpenSSL development headers.
Required for encrypted connections, authentication, and many Airflow Python dependencies.
Without it, packages such as cryptography can fail when they have to be built from source.
5. libffi-dev
Required for low-level Python C extensions used in cryptography and secure connections.
Missing this causes:
ffi.h not found
Failed building wheel for cryptography
4. Create Airflow Installation Folder
mkdir -p ~/airflow
cd ~/airflow
5. Create a Virtual Environment
python3 -m venv venv
source venv/bin/activate
Your prompt should now show:
(venv) ubuntu@ip-xx
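If the prompt prefix is missing or you are unsure which Python is active, you can ask the interpreter itself. This is a small sketch using only the standard library; the function name in_virtualenv is an illustrative choice.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs from a venv rather than the system Python."""
    # Inside a venv, sys.prefix points at the venv directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

When the environment is active, the interpreter path should sit inside the venv folder you created above.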
6. Install Apache Airflow
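The Airflow project recommends installing with a constraints file to avoid dependency conflicts. Set PYTHON_VERSION to the major.minor version of the Python inside your virtual environment (Ubuntu 22.04 ships Python 3.10).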
AIRFLOW_VERSION=2.10.2
PYTHON_VERSION=3.10
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
Why use constraint files?
Airflow has hundreds of dependencies.
Without constraints, pip may install incompatible versions, and the installation can fail or break at runtime.
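The constraint URL is just the Airflow version and the Python major.minor version slotted into a fixed pattern. The sketch below builds the same URL as the shell variables above and, if you do not pass a Python version, reads it from the running interpreter. The function name constraint_url is an illustrative assumption.

```python
import sys
from typing import Optional


def constraint_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Build the constraints-file URL for an Airflow and Python version pair."""
    if python_version is None:
        # Default to the interpreter running this script, e.g. "3.10".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


if __name__ == "__main__":
    print(constraint_url("2.10.2"))
```

Running it inside the virtual environment is a quick way to confirm that PYTHON_VERSION matches the Python you are actually installing into.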
7. Initialize Airflow Database
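airflow db migrate
Creates the metadata tables for DAGs, tasks, logs, variables, connections, and so on. On Airflow 2.7 and later, airflow db migrate replaces the deprecated airflow db init, which still works in 2.10 but prints a deprecation warning.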
8. Create Admin User
airflow users create \
--username admin \
--firstname dipak \
--lastname mali \
--role Admin \
--email dipak@example.com \
--password admin123
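The password admin123 is fine for a throwaway test, but if port 8080 is reachable from the internet you will want something harder to guess. Here is one possible way to generate a random password with Python's standard library; the function name generate_password and the default length are assumptions for illustration.

```python
import secrets


def generate_password(num_bytes: int = 16) -> str:
    """Return a random URL-safe string built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_password())
```

Pass the printed value to --password in place of admin123 and store it somewhere safe.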
9. Start Airflow Webserver (Window 1)
airflow webserver -p 8080
This opens the UI on:
http://<your-ec2-ip>:8080
Keep this window open.
10. Start Airflow Scheduler (Window 2)
Open a new SSH window.
cd ~/airflow
source venv/bin/activate
airflow scheduler
The scheduler:
Detects DAG changes
Runs tasks in order
Monitors DAG runs
Both windows must stay open.
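To confirm that both processes are actually talking to the metadata database, you can query the webserver's /health endpoint, which in Airflow 2 returns a JSON document with a status per component. The sketch below assumes that response shape; the helper names component_status and fetch_health are illustrative, not part of Airflow.

```python
import json
from urllib.request import urlopen


def component_status(health: dict, component: str) -> str:
    """Return the reported status of one component, or 'unknown' if absent."""
    return (health.get(component) or {}).get("status") or "unknown"


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as response:
        return json.load(response)
```

From a third SSH window on the instance, calling fetch_health("http://localhost:8080") and then component_status(result, "scheduler") should report healthy once the scheduler has sent its first heartbeat.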
11. Install Pandas for ETL (Required for the Example DAG)
pip install pandas
12. Create Your First ETL DAG
Create DAGs folder:
mkdir -p ~/airflow/dags
Create example DAG:
nano ~/airflow/dags/etl_example.py
Paste:
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Read the source file and stage a copy for the next task.
    df = pd.read_csv('/home/ubuntu/airflow/data/employees.csv')
    df.to_csv('/home/ubuntu/airflow/data/extracted.csv', index=False)


def transform():
    # Apply a 10% raise to the salary column.
    df = pd.read_csv('/home/ubuntu/airflow/data/extracted.csv')
    df['salary'] = df['salary'] * 1.10
    df.to_csv('/home/ubuntu/airflow/data/transformed.csv', index=False)


def load():
    # Write the final output file.
    df = pd.read_csv('/home/ubuntu/airflow/data/transformed.csv')
    df.to_csv('/home/ubuntu/airflow/data/loaded.csv', index=False)


with DAG(
    dag_id='etl_example',
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: the DAG runs only when triggered manually
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Run the three tasks in sequence.
    extract_task >> transform_task >> load_task
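Save and exit: CTRL+O, Enter, CTRL+X
The DAG reads /home/ubuntu/airflow/data/employees.csv, so create the data folder and a CSV file with a numeric salary column before you trigger the run.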
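The extract task expects an input file that the steps so far have not created. Here is one possible way to produce a small test file using only the standard library. The rows and the id and name columns are hypothetical sample data; the only column the DAG actually needs is salary.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; only the salary column is required by the DAG.
SAMPLE_ROWS = [
    {"id": 1, "name": "Asha", "salary": 50000},
    {"id": 2, "name": "Ravi", "salary": 60000},
    {"id": 3, "name": "Meera", "salary": 72000},
]


def write_sample_employees(data_dir: str) -> Path:
    """Create data_dir if needed and write employees.csv into it."""
    directory = Path(data_dir).expanduser()
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "employees.csv"
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "name", "salary"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return target
```

On the instance, calling write_sample_employees("/home/ubuntu/airflow/data") places the file where the DAG looks for it.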
13. Trigger the DAG
Go to the Airflow UI:
http://<your-ec2-ip>:8080
Find etl_example in the DAG list and switch its toggle ON.
Click Trigger DAG.
You will see green checkmarks for:
extract
transform
load
14. Verify Output Files
cat ~/airflow/data/extracted.csv
cat ~/airflow/data/transformed.csv
cat ~/airflow/data/loaded.csv
transformed.csv and loaded.csv will show salaries increased by 10%.
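Instead of comparing the numbers by eye, you can check the 10% increase programmatically. The sketch below compares the salary column of two CSV files row by row. The function names read_salaries and raise_applied are illustrative, and it assumes both files keep their rows in the same order, as the example DAG does.

```python
import csv
import math


def read_salaries(path: str) -> list:
    """Return the salary column of a CSV file as a list of floats."""
    with open(path, newline="") as handle:
        return [float(row["salary"]) for row in csv.DictReader(handle)]


def raise_applied(source_path: str, transformed_path: str, factor: float = 1.10) -> bool:
    """True when every transformed salary equals the source salary times factor."""
    before = read_salaries(source_path)
    after = read_salaries(transformed_path)
    if len(before) != len(after):
        return False
    # Compare with a tolerance, since 1.10 is not exactly representable as a float.
    return all(
        math.isclose(new, old * factor, rel_tol=1e-9)
        for old, new in zip(before, after)
    )
```

On the instance, raise_applied("/home/ubuntu/airflow/data/employees.csv", "/home/ubuntu/airflow/data/transformed.csv") should return True after a successful run.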
Congratulations!
You have successfully:
Installed Airflow on AWS EC2
Understood every dependency
Learned Airflow components
Created your first real ETL pipeline
Next Steps (Optional Enhancements)
Run Airflow as a background systemd service
Use PostgreSQL instead of SQLite
Schedule daily ETL jobs
Fetch data from APIs
Upload ETL output to S3
Build a complete data pipeline


