Building My First Production-Ready ELT Pipeline: A Student’s Journey with Docker, PostgreSQL, dbt, and Airflow



This content originally appeared on DEV Community and was authored by ashraf el houfi

How I built an end-to-end data pipeline from scratch using modern data engineering tools

Introduction

From Student to Data Engineer: My First Pipeline

As a student diving into the world of data engineering, I embarked on building my first complete ELT (Extract, Load, Transform) pipeline. This project taught me the fundamentals of modern data architecture and gave me hands-on experience with industry-standard tools.

What you’ll learn from this article:

  • How to design and implement an ELT pipeline from scratch
  • Docker containerization for data services
  • Data transformation with dbt (data build tool)
  • Workflow orchestration with Apache Airflow
  • Real-world best practices and lessons learned

Tech Stack:
🐳 Docker & Docker Compose
🐘 PostgreSQL (Source & Destination)
🔧 dbt (Data Build Tool)
✈ Apache Airflow
🐍 Python

Section 1: Architecture Overview

The Architecture

My pipeline follows the modern ELT pattern:

  1. Extract & Load: A Python script extracts data from the source PostgreSQL database and loads it into the destination
  2. Transform: dbt handles data transformations, testing, and documentation
  3. Orchestrate: Airflow manages the entire workflow

Why ELT over ETL?

  • Scalability: Transforming after loading leverages the destination database's compute (see the sketch below)
  • Flexibility: Raw data available for ad-hoc analysis
  • Modern Approach: Aligns with cloud data warehouse patterns
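
To make the scalability point concrete: in ELT, the "T" is plain SQL executed inside the destination database itself. A dbt table materialization, for example, boils down to roughly the statement below (the schema and table names here are illustrative, not the project's real ones).

-- Illustrative only: roughly what a dbt "table" materialization runs in the destination
CREATE TABLE analytics.films_enriched AS
SELECT
    film_id,
    title,
    user_rating
FROM raw.films;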

Section 2: Implementation Deep Dive

Building the Pipeline Step by Step

Step 1: Containerized Database Setup

# docker-compose.yaml excerpt
services:
  source_postgres:
    image: postgres:15
    ports:
      - "5433:5432"
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    volumes:
      - ./source_db_init/init.sql:/docker-entrypoint-initdb.d/init.sql
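
The mounted init.sql seeds source_db the first time the container starts. The full file isn't shown here, but a minimal sketch, using the columns the dbt models reference later, could look like this:

-- source_db_init/init.sql (illustrative excerpt; the real schema may differ)
CREATE TABLE films (
    film_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    user_rating NUMERIC(3, 1),
    release_date DATE
);

INSERT INTO films (title, user_rating, release_date) VALUES
    ('Inception', 4.8, '2010-07-16'),
    ('Shawshank', 4.9, '1994-09-23');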

Step 2: The ELT Script

# elt_script.py - The heart of data movement (minimal working sketch)
import os
import subprocess
import time

def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    """Poll the database with pg_isready until it accepts connections."""
    for attempt in range(1, max_retries + 1):
        if subprocess.run(['pg_isready', '-h', host], capture_output=True).returncode == 0:
            return
        print(f"{host} not ready (attempt {attempt}/{max_retries}), retrying...")
        time.sleep(delay_seconds)
    raise RuntimeError(f"Could not connect to Postgres at {host}")

def extract_and_load():
    # pg_dump for extraction
    dump_command = [
        'pg_dump',
        '-h', 'source_postgres',
        '-U', 'postgres',
        '-d', 'source_db',
        '--clean',
        '--if-exists',
    ]

    # psql for loading
    load_command = [
        'psql',
        '-h', 'destination_postgres',
        '-U', 'postgres',
        '-d', 'destination_db',
    ]

    # Stream the dump straight into psql so no intermediate file is written;
    # PGPASSWORD matches the secret defined in docker-compose
    env = dict(os.environ, PGPASSWORD='secret')
    dump = subprocess.run(dump_command, env=env, capture_output=True, check=True)
    subprocess.run(load_command, input=dump.stdout, env=env, check=True)

if __name__ == '__main__':
    wait_for_postgres('source_postgres')
    wait_for_postgres('destination_postgres')
    extract_and_load()

Section 3: Data Transformation with dbt

Smart Transformations with dbt

Custom Macro for Rating Classification

-- macros/classify_ratings.sql
{% macro classify_ratings(rating_column) %}
  CASE
    WHEN {{ rating_column }} >= 4.5 THEN 'Excellent'
    WHEN {{ rating_column }} >= 4.0 THEN 'Good'
    WHEN {{ rating_column }} >= 3.0 THEN 'Average'
    WHEN {{ rating_column }} >= 2.0 THEN 'Poor'
    ELSE 'Very Poor'
  END
{% endmacro %}

Transformation Models

-- models/film_classification.sql
SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    release_date
FROM {{ ref('films') }}
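
When dbt compiles this model, the Jinja disappears: the macro call expands into the CASE expression and ref('films') resolves to a concrete relation. The compiled SQL in target/compiled/ looks roughly like this (the database and schema names depend on your profiles.yml):

-- Compiled output (approximate)
SELECT
    film_id,
    title,
    user_rating,
    CASE
      WHEN user_rating >= 4.5 THEN 'Excellent'
      WHEN user_rating >= 4.0 THEN 'Good'
      WHEN user_rating >= 3.0 THEN 'Average'
      WHEN user_rating >= 2.0 THEN 'Poor'
      ELSE 'Very Poor'
    END as rating_category,
    release_date
FROM "destination_db"."public"."films"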

Why dbt?

  • Version Control: SQL transformations live in Git
  • Testing: Built-in data quality tests (see the example below)
  • Documentation: Auto-generated docs and lineage
  • Modularity: Reusable macros and models
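
dbt's generic tests (unique, not_null, accepted_values, relationships) are declared in schema.yml, but it also accepts singular tests: any SQL file under tests/ that is expected to return zero rows. A small sketch of one (the file name is my own) that guards the macro's output:

-- tests/assert_rating_category_not_null.sql
-- A singular dbt test: it fails if this query returns any rows
SELECT *
FROM {{ ref('film_classification') }}
WHERE rating_category IS NULL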

Section 4: Orchestration with Airflow

Orchestrating with Apache Airflow

The DAG Structure

# airflow/dags/elt_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Typical defaults shared by all tasks; adjust to taste
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'elt_pipeline',
    default_args=default_args,
    description='Extract, Load, and Transform pipeline using dbt',
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=1),
    catchup=False,
    tags=['elt', 'postgres', 'dbt'],
)

# The bash_command paths below are placeholders; point them at your own script and dbt project
elt_task = BashOperator(task_id='extract_and_load', bash_command='python /opt/elt/elt_script.py', dag=dag)
dbt_task = BashOperator(task_id='dbt_run', bash_command='dbt run --project-dir /opt/dbt', dag=dag)
quality_check = BashOperator(task_id='dbt_test', bash_command='dbt test --project-dir /opt/dbt', dag=dag)

# Task dependencies
elt_task >> dbt_task >> quality_check


Section 5: Results & Monitoring

Results & What I Learned

Pipeline Performance

  • Data Volume: 20 films, 20 actors, 39 categories, 14 users (spot-checked with the query below)
  • Execution Time: ~30 seconds end-to-end
  • Success Rate: 100% after debugging
  • Tests Passed: 20/20 dbt data quality tests
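
The row counts can be verified directly in the destination database (the table names here assume they match the seeded source schema):

-- Sanity check on destination_db (table names assumed)
SELECT 'films' AS table_name, COUNT(*) AS row_count FROM films
UNION ALL SELECT 'actors', COUNT(*) FROM actors
UNION ALL SELECT 'categories', COUNT(*) FROM categories
UNION ALL SELECT 'users', COUNT(*) FROM users;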


Sample Output

-- Transformed data example
SELECT film_id, title, rating_category, actors 
FROM film_rating 
LIMIT 3;

| film_id | title      | rating_category | actors           |
|---------|------------|-----------------|------------------|
| 1       | Inception  | Excellent       | Leonardo DiCaprio|
| 2       | Shawshank  | Excellent       | Tim Robbins      |
| 3       | Godfather  | Excellent       | Marlon Brando    |

Section 6: Lessons Learned

Student Insights & Lessons Learned

What Went Well

  • Containerization: Docker made development environment consistent
  • Version Control: Everything in Git from day one
  • Incremental Development: Built piece by piece, tested at each step
  • Documentation: Commented code saved me hours of debugging

Challenges & Solutions

  1. PostgreSQL Version Mismatch
    • Problem: The pg_dump client version didn't match the server version
    • Solution: Standardized every container on PostgreSQL 15 images
  2. Airflow 3.0.3 Breaking Changes
    • Problem: The schedule_interval parameter was removed
    • Solution: Switched to the schedule parameter
  3. dbt Schema Validation
    • Problem: Column names in the schema.yml tests didn't match the actual model output
    • Solution: Kept the test definitions in sync with the models and reran dbt test after every change

If I Started Over

  • Use Infrastructure as Code (Terraform) for cloud deployment
  • Implement CI/CD pipeline with GitHub Actions
  • Add data lineage tracking
  • Include more comprehensive logging

Section 7: Next Steps & Future Improvements

Short-term Improvements

  • Add data quality alerts
  • Implement incremental loading (see the sketch after this list)
  • Create Slack notifications for failures
  • Add more sophisticated dbt tests
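
For the incremental loading item, the dbt side can start as a simple incremental materialization with a timestamp watermark. A hedged sketch, assuming the source rows carry an updated_at column (which the current schema may not have):

-- models/film_classification_incremental.sql (sketch; assumes an updated_at column)
{{ config(materialized='incremental', unique_key='film_id') }}

SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    updated_at
FROM {{ ref('films') }}

{% if is_incremental() %}
  -- Only process rows that changed since the last successful run of this model
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}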

Long-term Goals

  • Deploy to AWS/GCP with managed services
  • Implement streaming with Kafka
  • Add ML pipeline integration
  • Scale to handle GB+ datasets

For Fellow Students

If you’re starting your data engineering journey:

  1. Start Small: Begin with simple transformations
  2. Practice Regularly: Build something every week
  3. Join Communities: dbt Slack, Airflow forums
  4. Document Everything: Your future self will thank you
  5. Share Your Work: Teaching others reinforces learning

Conclusion

Building this ELT pipeline taught me that data engineering is equal parts technical skill and problem-solving mindset. Every error message was a learning opportunity, and every successful run was a small victory.

Key Takeaways:

  • Modern data tools are powerful but require careful integration
  • Container orchestration simplifies complex deployments
  • Data quality testing is non-negotiable
  • Good documentation saves more time than you think


Want to try this yourself?
The full source code and setup instructions are on GitHub: https://github.com/el-houfi-achraf/elt-pipeline

