Orchestrating and Observing Data Pipelines with Airflow, PostgreSQL, and Polar



This content originally appeared on DEV Community and was authored by Chandrashekhar Kachawa

Data pipelines are the backbone of any modern data platform — but building them is only half the battle.

Keeping them efficient, observable, and trustworthy is where real engineering comes in.

In this post, we’ll build a complete, observable data pipeline using:

  • 🌀 Apache Airflow — for orchestration
  • 🐘 PostgreSQL — as our database
  • 🧊 Polar — for continuous profiling and observability
  • 🐳 Docker — to tie it all together

By the end, you’ll have a running system that not only moves data but also monitors itself in real time.

🧩 Prerequisites

Make sure you have the following installed before starting:

  • Docker
  • Docker Compose

🗂 Project Structure

Let’s start with a clean, scalable structure for our Airflow project:

.
├── dags/
│   └── simple_etl_dag.py
├── docker-compose.yml
└── .env
  • dags/ → Your Airflow DAGs live here.
  • docker-compose.yml → Defines and connects your services.
  • .env → Keeps environment variables separate from code.
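If you're starting from scratch, you can scaffold this layout with a couple of shell commands (the project folder name here is just a placeholder):

mkdir -p airflow-polar-pipeline/dags
cd airflow-polar-pipeline
touch docker-compose.yml .env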

⚙ Orchestrating with Docker Compose

Let’s define our infrastructure.

We’ll spin up PostgreSQL, Airflow, and Polar in one command.

Step 1: Environment file

Create a .env file:

AIRFLOW_UID=50000

This ensures the Airflow user inside the containers runs with consistent permissions on the files it shares with your host, such as the mounted dags/ folder.
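On Linux, the value usually mirrors your host user ID so that files Airflow writes into ./dags aren't owned by root; the official Airflow Docker guide suggests generating it like this:

echo "AIRFLOW_UID=$(id -u)" > .env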

Step 2: Docker Compose setup

Here’s a minimal working setup (simplified for this guide):

version: '3'
services:
  postgres:
    image: postgres:13
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5

  airflow-webserver:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
    volumes:
      - ./dags:/opt/airflow/dags
    ports:
      - "8080:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "curl --fail http://localhost:8080/health"]
      interval: 30s
      timeout: 30s
      retries: 3

  # The scheduler is what actually picks up and runs your DAGs with LocalExecutor
  airflow-scheduler:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    environment:
      - AIRFLOW__CORE__EXECUTOR=LocalExecutor
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__LOAD_EXAMPLES=false
    volumes:
      - ./dags:/opt/airflow/dags
    command: scheduler

  polar-agent:
    image: polar-agent:latest
    command:
      - "agent"
      - "--config-file=/etc/polar/agent.yaml"
    volumes:
      - ./polar-agent-config.yaml:/etc/polar/agent.yaml
    depends_on:
      - airflow-webserver
💡 Polar setup here is conceptual — always refer to the official Polar docs for the latest integration method (usually via a sidecar or host-level agent).
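One detail the simplified Compose file glosses over: Airflow's metadata database needs to be migrated and a login user created before the webserver will let you in. A minimal, one-off init service could look like the sketch below (the admin/admin credentials are placeholders for local use only):

  airflow-init:
    image: apache/airflow:2.8.1
    depends_on:
      - postgres
    environment:
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
    entrypoint: /bin/bash
    command:
      - -c
      - "airflow db migrate && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email admin@example.com"

Run it once with docker-compose up airflow-init before starting the rest of the stack, then log in at http://localhost:8080 with those credentials.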

🧠 Creating a Simple Airflow DAG

Time to build our first ETL pipeline.

This DAG will:

  1. Create a customers table in Postgres.
  2. Insert a sample record.

Create dags/simple_etl_dag.py:

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime

@dag(
    dag_id="simple_postgres_etl",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
    tags=["etl", "postgres"],
)
def simple_postgres_etl():
    @task
    def create_customers_table():
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            CREATE TABLE IF NOT EXISTS customers (
                customer_id SERIAL PRIMARY KEY,
                name VARCHAR NOT NULL,
                signup_date DATE
            );
        """)

    @task
    def insert_new_customer():
        pg_hook = PostgresHook(postgres_conn_id="postgres_default")
        pg_hook.run("""
            INSERT INTO customers (name, signup_date)
            VALUES ('John Doe', '2025-09-26');
        """)

    create_customers_table() >> insert_new_customer()

simple_postgres_etl()
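
A quick note on postgres_conn_id="postgres_default": the hook needs an Airflow connection with that ID pointing at our Postgres container. If that connection isn't already defined in your environment, one convenient option is Airflow's AIRFLOW_CONN_<CONN_ID> environment-variable convention. For example, add this line (using the credentials from the Compose file above) to the environment: section of the webserver and scheduler services:

      - AIRFLOW_CONN_POSTGRES_DEFAULT=postgres://airflow:airflow@postgres:5432/airflow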

Now, run your stack:

docker-compose up

Head to http://localhost:8080 — you’ll find your DAG there, ready to trigger.
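
Once the run goes green, it's worth confirming the data actually landed. A quick check from your host, using the service name and credentials from the Compose file:

docker-compose exec postgres psql -U airflow -d airflow -c "SELECT * FROM customers;"

You should see the John Doe row inserted by the DAG.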

🔍 Observability with Polar

Once the pipeline runs, Polar starts profiling automatically.

Here’s what you can do in the Polar UI:

  1. Filter by Service – Focus on airflow-webserver or scheduler.
  2. Analyze CPU & Memory – Spot heavy tasks and resource spikes.
  3. Identify Bottlenecks – Catch inefficiencies before they cause downtime.

🎯 This is where orchestration meets observability — you’re not just scheduling jobs, you’re understanding their runtime behavior.

🏁 Wrapping Up

You’ve built a small but powerful foundation for observable data engineering:

✅ Airflow orchestrates

✅ Postgres stores

✅ Polar profiles

✅ Docker glues it all together

This setup takes you from reactive debugging to proactive optimization.

When your data pipelines tell you what’s happening under the hood — you’re no longer guessing; you’re engineering.

💬 If you enjoyed this, consider following for more hands-on data engineering guides like this one. Got questions? Drop them below or ping me on LinkedIn.

