Building My First Production-Ready ELT Pipeline: A Student’s Journey with Docker, PostgreSQL, dbt, and Airflow



This content originally appeared on DEV Community and was authored by ashraf el houfi

How I built an end-to-end data pipeline from scratch using modern data engineering tools

Introduction

From Student to Data Engineer: My First Pipeline

As a student diving into the world of data engineering, I embarked on building my first complete ELT (Extract, Load, Transform) pipeline. This project taught me the fundamentals of modern data architecture and gave me hands-on experience with industry-standard tools.

What you’ll learn from this article:

  • How to design and implement an ELT pipeline from scratch
  • Docker containerization for data services
  • Data transformation with dbt (data build tool)
  • Workflow orchestration with Apache Airflow
  • Real-world best practices and lessons learned

Tech Stack:
🐳 Docker & Docker Compose
🐘 PostgreSQL (Source & Destination)
🔧 dbt (Data Build Tool)
✈ Apache Airflow
🐍 Python

Section 1: Architecture Overview

The Architecture

My pipeline follows the modern ELT pattern:

  1. Extract & Load: A Python script extracts data from the source PostgreSQL database and loads it into the destination
  2. Transform: dbt handles data transformations, testing, and documentation
  3. Orchestrate: Airflow manages the entire workflow

Why ELT over ETL?

  • Scalability: Transforming after loading leverages the destination database's compute (see the sketch below)
  • Flexibility: Raw data available for ad-hoc analysis
  • Modern Approach: Aligns with cloud data warehouse patterns
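
To make the scalability point concrete: in ELT, the "T" is plain SQL executed inside the destination database itself. A dbt table materialization, for example, boils down to roughly the statement below (the schema and table names here are illustrative, not the project's real ones).

-- Illustrative only: roughly what a dbt "table" materialization runs in the destination
CREATE TABLE analytics.films_enriched AS
SELECT
    film_id,
    title,
    user_rating
FROM raw.films;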

Section 2: Implementation Deep Dive

Building the Pipeline Step by Step

Step 1: Containerized Database Setup

# docker-compose.yaml excerpt
services:
  source_postgres:
    image: postgres:15
    ports:
      - "5433:5432"
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    volumes:
      - ./source_db_init/init.sql:/docker-entrypoint-initdb.d/init.sql
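
The mounted init.sql seeds source_db the first time the container starts. The full file isn't shown here, but a minimal sketch, using the columns the dbt models reference later, could look like this:

-- source_db_init/init.sql (illustrative excerpt; the real schema may differ)
CREATE TABLE films (
    film_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    user_rating NUMERIC(3, 1),
    release_date DATE
);

INSERT INTO films (title, user_rating, release_date) VALUES
    ('Inception', 4.8, '2010-07-16'),
    ('Shawshank', 4.9, '1994-09-23');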

Step 2: The ELT Script

# elt_script.py - The heart of data movement (minimal working sketch)
import os
import subprocess
import time

def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    """Poll the database with pg_isready until it accepts connections."""
    for attempt in range(1, max_retries + 1):
        if subprocess.run(['pg_isready', '-h', host], capture_output=True).returncode == 0:
            return
        print(f"{host} not ready (attempt {attempt}/{max_retries}), retrying...")
        time.sleep(delay_seconds)
    raise RuntimeError(f"Could not connect to Postgres at {host}")

def extract_and_load():
    # pg_dump for extraction
    dump_command = [
        'pg_dump',
        '-h', 'source_postgres',
        '-U', 'postgres',
        '-d', 'source_db',
        '--clean',
        '--if-exists',
    ]

    # psql for loading
    load_command = [
        'psql',
        '-h', 'destination_postgres',
        '-U', 'postgres',
        '-d', 'destination_db',
    ]

    # Stream the dump straight into psql so no intermediate file is written;
    # PGPASSWORD matches the secret defined in docker-compose
    env = dict(os.environ, PGPASSWORD='secret')
    dump = subprocess.run(dump_command, env=env, capture_output=True, check=True)
    subprocess.run(load_command, input=dump.stdout, env=env, check=True)

if __name__ == '__main__':
    wait_for_postgres('source_postgres')
    wait_for_postgres('destination_postgres')
    extract_and_load()

Section 3: Data Transformation with dbt

Smart Transformations with dbt

Custom Macro for Rating Classification

-- macros/classify_ratings.sql
{% macro classify_ratings(rating_column) %}
  CASE
    WHEN {{ rating_column }} >= 4.5 THEN 'Excellent'
    WHEN {{ rating_column }} >= 4.0 THEN 'Good'
    WHEN {{ rating_column }} >= 3.0 THEN 'Average'
    WHEN {{ rating_column }} >= 2.0 THEN 'Poor'
    ELSE 'Very Poor'
  END
{% endmacro %}

Transformation Models

-- models/film_classification.sql
SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    release_date
FROM {{ ref('films') }}
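
When dbt compiles this model, the Jinja disappears: the macro call expands into the CASE expression and ref('films') resolves to a concrete relation. The compiled SQL in target/compiled/ looks roughly like this (the database and schema names depend on your profiles.yml):

-- Compiled output (approximate)
SELECT
    film_id,
    title,
    user_rating,
    CASE
      WHEN user_rating >= 4.5 THEN 'Excellent'
      WHEN user_rating >= 4.0 THEN 'Good'
      WHEN user_rating >= 3.0 THEN 'Average'
      WHEN user_rating >= 2.0 THEN 'Poor'
      ELSE 'Very Poor'
    END as rating_category,
    release_date
FROM "destination_db"."public"."films"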

Why dbt?

  • Version Control: SQL transformations live in Git
  • Testing: Built-in data quality tests (see the example below)
  • Documentation: Auto-generated docs and lineage
  • Modularity: Reusable macros and models
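
dbt's generic tests (unique, not_null, accepted_values, relationships) are declared in schema.yml, but it also accepts singular tests: any SQL file under tests/ that is expected to return zero rows. A small sketch of one (the file name is my own) that guards the macro's output:

-- tests/assert_rating_category_not_null.sql
-- A singular dbt test: it fails if this query returns any rows
SELECT *
FROM {{ ref('film_classification') }}
WHERE rating_category IS NULL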

Section 4: Orchestration with Airflow

Orchestrating with Apache Airflow

The DAG Structure

# airflow/dags/elt_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Typical defaults shared by all tasks; adjust to taste
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'elt_pipeline',
    default_args=default_args,
    description='Extract, Load, and Transform pipeline using dbt',
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=1),
    catchup=False,
    tags=['elt', 'postgres', 'dbt'],
)

# The bash_command paths below are placeholders; point them at your own script and dbt project
elt_task = BashOperator(task_id='extract_and_load', bash_command='python /opt/elt/elt_script.py', dag=dag)
dbt_task = BashOperator(task_id='dbt_run', bash_command='dbt run --project-dir /opt/dbt', dag=dag)
quality_check = BashOperator(task_id='dbt_test', bash_command='dbt test --project-dir /opt/dbt', dag=dag)

# Task dependencies
elt_task >> dbt_task >> quality_check


Section 5: Results & Monitoring

Results & What I Learned

Pipeline Performance

  • Data Volume: 20 films, 20 actors, 39 categories, 14 users (spot-checked with the query below)
  • Execution Time: ~30 seconds end-to-end
  • Success Rate: 100% after debugging
  • Tests Passed: 20/20 dbt data quality tests
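
The row counts can be verified directly in the destination database (the table names here assume they match the seeded source schema):

-- Sanity check on destination_db (table names assumed)
SELECT 'films' AS table_name, COUNT(*) AS row_count FROM films
UNION ALL SELECT 'actors', COUNT(*) FROM actors
UNION ALL SELECT 'categories', COUNT(*) FROM categories
UNION ALL SELECT 'users', COUNT(*) FROM users;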


Sample Output

-- Transformed data example
SELECT film_id, title, rating_category, actors 
FROM film_rating 
LIMIT 3;

| film_id | title      | rating_category | actors           |
|---------|------------|-----------------|------------------|
| 1       | Inception  | Excellent       | Leonardo DiCaprio|
| 2       | Shawshank  | Excellent       | Tim Robbins      |
| 3       | Godfather  | Excellent       | Marlon Brando    |

Section 6: Lessons Learned

Student Insights & Lessons Learned

What Went Well

  • Containerization: Docker made development environment consistent
  • Version Control: Everything in Git from day one
  • Incremental Development: Built piece by piece, tested at each step
  • Documentation: Commented code saved me hours of debugging

Challenges & Solutions

  1. PostgreSQL Version Mismatch
    • Problem: The pg_dump client version didn't match the server version
    • Solution: Standardized every container on PostgreSQL 15 images
  2. Airflow 3.0.3 Breaking Changes
    • Problem: The schedule_interval parameter was removed
    • Solution: Switched to the schedule parameter
  3. dbt Schema Validation
    • Problem: Column names in the schema.yml tests didn't match the actual model output
    • Solution: Kept the test definitions in sync with the models and reran dbt test after every change

If I Started Over

  • Use Infrastructure as Code (Terraform) for cloud deployment
  • Implement CI/CD pipeline with GitHub Actions
  • Add data lineage tracking
  • Include more comprehensive logging

Section 7: Next Steps & Future Improvements

Short-term Improvements

  • Add data quality alerts
  • Implement incremental loading (see the sketch after this list)
  • Create Slack notifications for failures
  • Add more sophisticated dbt tests
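
For the incremental loading item, the dbt side can start as a simple incremental materialization with a timestamp watermark. A hedged sketch, assuming the source rows carry an updated_at column (which the current schema may not have):

-- models/film_classification_incremental.sql (sketch; assumes an updated_at column)
{{ config(materialized='incremental', unique_key='film_id') }}

SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    updated_at
FROM {{ ref('films') }}

{% if is_incremental() %}
  -- Only process rows that changed since the last successful run of this model
  WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}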

Long-term Goals

  • Deploy to AWS/GCP with managed services
  • Implement streaming with Kafka
  • Add ML pipeline integration
  • Scale to handle GB+ datasets

For Fellow Students

If you’re starting your data engineering journey:

  1. Start Small: Begin with simple transformations
  2. Practice Regularly: Build something every week
  3. Join Communities: dbt Slack, Airflow forums
  4. Document Everything: Your future self will thank you
  5. Share Your Work: Teaching others reinforces learning

Conclusion

Building this ELT pipeline taught me that data engineering is equal parts technical skill and problem-solving mindset. Every error message was a learning opportunity, and every successful run was a small victory.

Key Takeaways:

  • Modern data tools are powerful but require careful integration
  • Container orchestration simplifies complex deployments
  • Data quality testing is non-negotiable
  • Good documentation saves more time than you think


Want to try this yourself?
The full source code and setup instructions are on GitHub: https://github.com/el-houfi-achraf/elt-pipeline

