This content originally appeared on DEV Community and was authored by Chandrashekhar Kachawa
Data pipelines are the backbone of any modern data platform — but building them is only half the battle.
Keeping them efficient, observable, and trustworthy is where real engineering comes in.
In this post, we’ll build a complete, observable data pipeline using:
- Apache Airflow — for orchestration
- PostgreSQL — as our database
- Polar — for continuous profiling and observability
- Docker — to tie it all together
By the end, you’ll have a running system that not only moves data but also monitors itself in real time.
Prerequisites
Make sure you have the following installed before starting:
- Docker
- Docker Compose
Project Structure
Let’s start with a clean, scalable structure for our Airflow project:
.
├── dags/
│   └── simple_etl_dag.py
├── docker-compose.yml
└── .env
- dags/ → Your Airflow DAGs live here.
- docker-compose.yml → Defines and connects your services.
- .env → Keeps environment variables separate from code.
Orchestrating with Docker Compose
Let’s define our infrastructure.
We’ll spin up PostgreSQL, Airflow, and Polar in one command.
Step 1: Environment file
Create a .env file:
AIRFLOW_UID=50000
This sets the UID the Airflow containers run as, so files written to the mounted dags/ folder get sensible ownership. 50000 is the image's default airflow user; you can also set it to your own host UID.
Step 2: Docker Compose setup
Here’s a minimal setup, simplified for this guide. A complete stack would also include a scheduler service and a one-off database initialization step (the official Airflow docker-compose.yaml shows both); with the LocalExecutor, triggered runs won’t actually execute until a scheduler is running:
version: '3'

services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5

  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 3

  polar-agent:
    image: polar-agent:latest
    command:
      - "agent"
      - "--config-file=/etc/polar/agent.yaml"
    volumes:
      - ./polar-agent-config.yaml:/etc/polar/agent.yaml
    depends_on:
      - airflow-webserver
Polar setup here is conceptual; note that the compose file above expects a polar-agent-config.yaml next to it. Always refer to the official Polar docs for the latest integration method (usually via a sidecar or host-level agent).
Creating a Simple Airflow DAG
Time to build our first ETL pipeline.
This DAG will:
- Create a customers table in Postgres.
- Insert a sample record.
Create dags/simple_etl_dag.py. The tasks use Airflow's postgres_default connection ID, so make sure that connection points at the postgres service (for example, by adding AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow to the Airflow service's environment):
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(
    dag_id="simple_postgres_etl",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["etl", "postgres"],
)
def simple_postgres_etl():
    @task
    def create_customers_table():
        # Create the target table if it doesn't exist yet.
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            CREATE TABLE IF NOT EXISTS customers (
                customer_id SERIAL PRIMARY KEY,
                name VARCHAR NOT NULL,
                signup_date DATE
            );
        """)

    @task
    def insert_new_customer():
        # Insert a single sample record.
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            INSERT INTO customers (name, signup_date)
            VALUES ('John Doe', '2025-09-26');
        """)

    # Run the table creation before the insert.
    create_customers_table() >> insert_new_customer()


simple_postgres_etl()
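While iterating on a DAG, it can help to run it end-to-end without waiting on the scheduler. The sketch below is an optional addition, not part of the original setup: it assumes Airflow 2.5+ (where dag.test() is available), an initialized metadata database, and that you run the file inside the Airflow container, replacing the bare simple_postgres_etl() call at the bottom of the file:

# Optional debugging variant (assumes Airflow 2.5+ and an initialized metadata DB).
# Replace the bare simple_postgres_etl() call at the end of the file with this,
# then run it inside the Airflow container:
#   python /opt/airflow/dags/simple_etl_dag.py
dag = simple_postgres_etl()

if __name__ == "__main__":
    # Executes the full DAG in-process, without the scheduler or webserver.
    dag.test()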
Now, run your stack:
docker-compose up
Head to http://localhost:8080 — you’ll find your DAG there, ready to trigger.
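Once a run has completed, you can sanity-check the load from your host machine. Here's a minimal verification sketch, assuming you have psycopg2-binary installed locally and kept the 5432:5432 port mapping and airflow/airflow credentials from the compose file above:

import psycopg2

# Connection details match the postgres service defined in docker-compose.yml.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="airflow",
    password="airflow",
    dbname="airflow",
)

with conn, conn.cursor() as cur:
    # The DAG should have created the table and inserted one sample row.
    cur.execute("SELECT customer_id, name, signup_date FROM customers;")
    for row in cur.fetchall():
        print(row)

conn.close()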
Observability with Polar
Once the pipeline runs, Polar starts profiling automatically.
Here’s what you can do in the Polar UI:
- Filter by Service – Focus on airflow-webserver or scheduler.
- Analyze CPU & Memory – Spot heavy tasks and resource spikes.
- Identify Bottlenecks – Catch inefficiencies before they cause downtime.
This is where orchestration meets observability — you’re not just scheduling jobs, you’re understanding their runtime behavior.
Wrapping Up
You’ve built a small but powerful foundation for observable data engineering:
- Airflow orchestrates
- Postgres stores
- Polar profiles
- Docker glues it all together
This setup takes you from reactive debugging to proactive optimization.
When your data pipelines tell you what’s happening under the hood — you’re no longer guessing; you’re engineering.
If you enjoyed this, consider following for more hands-on data engineering guides like this one. Got questions? Drop them below or ping me on LinkedIn.