Installing Apache Airflow on AWS EC2 (Ubuntu)

A Complete Beginner-Friendly Guide to Apache Airflow


Apache Airflow is one of the most powerful workflow automation tools used in data engineering and ETL pipelines. But for beginners, setting it up on an AWS EC2 instance for the first time can feel confusing.

This guide explains EVERY step, from server setup to installing dependencies, understanding why each dependency is required, and finally creating a simple ETL DAG.

It is written so that you can repeat the installation whenever you need to.


1. Launch an Ubuntu EC2 Instance

Choose:

  • Ubuntu 22.04

  • t2.micro or t3.micro (Free-tier)

  • Add a security group rule:

    • Type: Custom TCP

    • Port: 8080

    • Source: your IP (preferred), or 0.0.0.0/0 to allow access from anywhere

SSH into the instance:

ssh -i your-key.pem ubuntu@<public-ip>

2. Update System Packages

sudo apt update
sudo apt upgrade -y

Why?

To ensure your machine has the latest secure and stable packages before installing Airflow.


3. Install Required System Libraries

sudo apt install -y python3-pip python3-venv libmysqlclient-dev libssl-dev libffi-dev

Let's understand why each dependency is required.


Why these dependencies?

1. python3-pip

Airflow is a Python framework, so you need pip to install it.


2. python3-venv

Airflow should be installed in a virtual environment to avoid conflicts with the system Python packages.
This keeps Airflow and its many dependencies isolated from the rest of the system.


3. libmysqlclient-dev

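Provides the MySQL client headers needed to compile the mysqlclient Python driver, which Airflow's MySQL provider uses.
A base Airflow install with the default SQLite database does not need it, but installing it now avoids a build failure if you add MySQL support later.

Without it, building the MySQL driver typically fails with an error such as:

mysql_config not found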

4. libssl-dev

Provides the OpenSSL development headers.
Required for encrypted connections, authentication, and many Airflow Python dependencies.

Without it, cryptography packages that have to be compiled from source can fail to build.


5. libffi-dev

Required for low-level Python C extensions used in cryptography and secure connections.

Missing this causes:

ffi.h not found
Failed building wheel for cryptography
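
Before moving on, it can help to confirm that all five packages really are installed. The following is a minimal sketch, assuming a Debian/Ubuntu system where dpkg-query is available; the function names is_installed and missing_packages are hypothetical helpers, not part of Airflow.

```python
import subprocess

# Packages installed in step 3 of this guide.
REQUIRED = [
    "python3-pip",
    "python3-venv",
    "libmysqlclient-dev",
    "libssl-dev",
    "libffi-dev",
]


def is_installed(package):
    """Return True if dpkg reports the package as installed."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", package],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # dpkg-query is missing, so this is not a Debian/Ubuntu system.
        return False
    return result.returncode == 0 and "install ok installed" in result.stdout


def missing_packages(packages, checker=is_installed):
    """Return the packages the checker reports as not installed, in order."""
    return [name for name in packages if not checker(name)]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If the script prints "Missing: none", the system libraries from this step are in place.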

4. Create Airflow Installation Folder

mkdir -p ~/airflow
cd ~/airflow

5. Create a Virtual Environment

python3 -m venv venv
source venv/bin/activate

Your prompt should now show:

(venv) ubuntu@ip-xx
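
If the prompt is ambiguous, Python itself can tell you whether the virtual environment is active. This is a small sketch using only the standard library; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when Python runs inside a venv created by `python3 -m venv`."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Run it with python3 after activating the environment; the prefix should point inside ~/airflow/venv.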

6. Install Apache Airflow

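The Airflow project recommends installing with a constraints file to avoid dependency conflicts. Set PYTHON_VERSION to the major.minor version printed by python3 --version (3.10 on a stock Ubuntu 22.04).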

AIRFLOW_VERSION=2.10.2
PYTHON_VERSION=3.10
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

Why use constraint files?

Airflow has hundreds of dependencies.
Without constraints, pip may install incompatible versions and the installation can fail or break at runtime.
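
The constraint URL depends on both the Airflow version and the Python version, so a mismatch is an easy mistake to make. The sketch below builds the same URL pattern used in the install command, but reads the Python version from the running interpreter; the function name constraint_url is a hypothetical helper.

```python
import sys


def constraint_url(airflow_version, python_version=None):
    """Build the constraints-file URL for an Airflow/Python version pair."""
    if python_version is None:
        # Default to the interpreter running this script, e.g. "3.10".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraint_url("2.10.2"))
```

Running it inside the virtual environment prints the URL that matches that environment's Python, which you can then pass to pip with --constraint.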


7. Initialize Airflow Database

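airflow db migrate

Creates the metadata tables for DAGs, tasks, logs, variables, connections, etc. With the default configuration this is a SQLite file at ~/airflow/airflow.db. On Airflow 2.7 and later, airflow db migrate replaces airflow db init, which still works but prints a deprecation warning.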
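
To see what the database initialization step actually created, you can list the tables in the metadata database. This is a minimal sketch, assuming the default SQLite setup with the database file at ~/airflow/airflow.db; list_tables is a hypothetical helper name.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the sorted table names found in a SQLite database file."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumed default location when AIRFLOW_HOME is not set.
    default_db = Path.home() / "airflow" / "airflow.db"
    if default_db.exists():
        for table in list_tables(default_db):
            print(table)
    else:
        print("No database found at", default_db)
```

The existence check matters because sqlite3.connect would otherwise create an empty file at that path.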


8. Create Admin User

airflow users create \
  --username admin \
  --firstname dipak \
  --lastname mali \
  --role Admin \
  --email dipak@example.com \
  --password <choose-a-strong-password>

9. Start Airflow Webserver (Window 1)

airflow webserver -p 8080

This opens the UI on:

http://<your-ec2-ip>:8080

Keep this window open.


10. Start Airflow Scheduler (Window 2)

Open a new SSH window.

cd ~/airflow
source venv/bin/activate
airflow scheduler

The scheduler:

  • Detects DAG changes

  • Runs tasks in order

  • Monitors DAG runs

Both windows must stay open.
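
Once both processes are running, the webserver's /health endpoint reports whether the metadata database and the scheduler are healthy. The sketch below queries it with the standard library; the function names fetch_health and component_statuses are hypothetical, and the base URL is an assumption you should replace with your own address.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url, timeout=5):
    """Download and decode the JSON document served at <base_url>/health."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


def component_statuses(payload):
    """Map each component in the health payload to its reported status."""
    return {name: info.get("status") for name, info in payload.items()}


# Example call (run on the EC2 instance while the webserver is up):
#   print(component_statuses(fetch_health("http://localhost:8080")))
```

If the scheduler entry is not reported as healthy, check that the second SSH window is still running airflow scheduler.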


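11. Install Pandas for ETL (Required for the Example DAG)

pip install pandas

Run this with the virtual environment activated. Both existing windows are busy, so use a third SSH window (cd ~/airflow, then source venv/bin/activate).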

12. Create Your First ETL DAG

Create DAGs folder:

mkdir -p ~/airflow/dags

Create example DAG:

nano ~/airflow/dags/etl_example.py

Paste:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import pandas as pd

def extract():
    # Read the source file and save a working copy for the next task.
    df = pd.read_csv('/home/ubuntu/airflow/data/employees.csv')
    df.to_csv('/home/ubuntu/airflow/data/extracted.csv', index=False)

def transform():
    # Raise every salary by 10%.
    df = pd.read_csv('/home/ubuntu/airflow/data/extracted.csv')
    df['salary'] = df['salary'] * 1.10
    df.to_csv('/home/ubuntu/airflow/data/transformed.csv', index=False)

def load():
    # Write the final output; a real pipeline would load into a database here.
    df = pd.read_csv('/home/ubuntu/airflow/data/transformed.csv')
    df.to_csv('/home/ubuntu/airflow/data/loaded.csv', index=False)

with DAG(
    dag_id='etl_example',
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: the DAG runs only when triggered manually
    catchup=False,
) as dag:

    extract_task = PythonOperator(task_id='extract', python_callable=extract)
    transform_task = PythonOperator(task_id='transform', python_callable=transform)
    load_task = PythonOperator(task_id='load', python_callable=load)

    # Run the tasks in sequence: extract, then transform, then load.
    extract_task >> transform_task >> load_task

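Save and exit nano with CTRL+O, Enter, CTRL+X.

The DAG expects an input file at /home/ubuntu/airflow/data/employees.csv with a numeric salary column. Create it before triggering the DAG, otherwise the extract task fails.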
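
A minimal sketch for creating the input file the DAG reads, /home/ubuntu/airflow/data/employees.csv. The sample rows and the name column are made up for illustration; the only column the DAG actually requires is salary. The helper name write_sample_employees is hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; only the `salary` column is used by the DAG.
SAMPLE_ROWS = [
    {"name": "Asha", "salary": 50000},
    {"name": "Ravi", "salary": 62000},
    {"name": "Meera", "salary": 71000},
]


def write_sample_employees(data_dir):
    """Create data_dir if needed and write employees.csv into it."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "employees.csv"
    with target.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "salary"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return target


# Example call on the EC2 instance:
#   write_sample_employees("/home/ubuntu/airflow/data")
```

Calling the function with /home/ubuntu/airflow/data also creates the data folder, which the earlier steps do not.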


13. Trigger the DAG

Go to the Airflow UI:

http://<your-ec2-ip>:8080

Log in with the admin user you created, then:

  • Find etl_example

  • Switch the DAG toggle ON

  • Click Trigger DAG

You will see green checkmarks for:

  • extract

  • transform

  • load


14. Verify Output Files

cat ~/airflow/data/extracted.csv
cat ~/airflow/data/transformed.csv
cat ~/airflow/data/loaded.csv

transformed.csv and loaded.csv will show salaries increased by 10%.
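
Instead of comparing the files by eye, you can check the 10% increase programmatically. This is a sketch using only the standard library, assuming both files have a salary column as in the example DAG; the function names read_salaries and salaries_raised are hypothetical.

```python
import csv
import math


def read_salaries(path):
    """Return the salary column of a CSV file as a list of floats."""
    with open(path, newline="") as handle:
        return [float(row["salary"]) for row in csv.DictReader(handle)]


def salaries_raised(original_path, transformed_path, factor=1.10):
    """Return True if every transformed salary equals original * factor."""
    before = read_salaries(original_path)
    after = read_salaries(transformed_path)
    # math.isclose tolerates the tiny rounding errors of float multiplication.
    return len(before) == len(after) and all(
        math.isclose(old * factor, new, rel_tol=1e-9)
        for old, new in zip(before, after)
    )


# Example call on the EC2 instance:
#   salaries_raised("/home/ubuntu/airflow/data/extracted.csv",
#                   "/home/ubuntu/airflow/data/transformed.csv")
```

The comparison uses a tolerance because multiplying by 1.10 in floating point can produce values such as 55000.00000000001.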


Congratulations!

You have successfully:

  • Installed Airflow on AWS EC2

  • Understood every dependency

  • Learned Airflow components

  • Created your first real ETL pipeline


Next Steps (Optional Enhancements)

  • Run Airflow as a background systemd service

  • Use PostgreSQL instead of SQLite

  • Schedule daily ETL jobs

  • Fetch data from APIs

  • Upload ETL output to S3

  • Build a complete data pipeline