Zero-Downtime Database Migration: The Definitive Guide



This content originally appeared on DEV Community and was authored by Arijit Ghosh

Migrating a production database is like replacing the engine of a plane mid-flight — users keep using your app while you swap the core infrastructure beneath it. If you mess up, downtime, data corruption, or revenue loss are almost guaranteed.

The gold standard: zero downtime migration (ZDM).

This post covers all major migration scenarios:

  • SQL → NoSQL
  • NoSQL → SQL
  • SQL → SQL
  • On-Premises → Cloud
  • Homogeneous Upgrades
  • Database Consolidation

We’ll break down:

  1. Why you’d migrate
  2. Phased strategy
  3. Tools, patterns, and failure handling
  4. Code + config examples
  5. Operational playbook (monitoring, rollback, validation)

🔑 Phases of a Zero-Downtime Migration

Every migration (regardless of DB type) follows the same 5-phase flow:

  1. Preparation & Planning
  • Schema mapping (source vs target).
  • Choosing tools (CDC, replication, ETL).
  • Rollback plan.
  • SLA definition: how much replication lag (and added read/write latency) is acceptable during the migration?
  2. Bulk Load (Historical Data)
  • Snapshot existing data into the target DB.
  • Use ETL tools (pgloader, Spark, Dataflow).
  • Validate counts & checksums.
  3. Change Data Capture (CDC)
  • Stream live changes (inserts, updates, deletes) to the new DB.
  • Keep old and new in sync.
  • Tools: Debezium, AWS DMS, GoldenGate, Kafka Connect.
  4. Dual Writes (Optional but safer)
  • Application writes to both old & new DBs.
  • Protects against CDC lag/missed events.
  • Usually controlled via feature flags (see the sketch after this list).
    def insert_user(user):
        # Old DB (psycopg2): execute on a cursor, then commit
        with pg_conn.cursor() as cur:
            cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user.id, user.name))
        pg_conn.commit()

        # New DB (Cassandra driver session): simple statements use %s placeholders, not ?
        cassandra_conn.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user.id, user.name))
  5. Cutover & Verification
  • Switch read traffic to new DB.
  • Run validation jobs (row counts, checksums, business queries).
  • Keep old DB in sync until rollback window closes.
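To make phases 4 and 5 concrete, here is a minimal sketch of a data-access layer that dual-writes behind a flag and shifts reads by percentage. It assumes an open psycopg2 connection (`pg_conn`), a Cassandra driver session (`cassandra_conn`), and a hypothetical in-memory `flags` store; in practice the flag values would come from your feature-flag service.

```python
import hashlib
import logging

# Hypothetical flag store; swap in LaunchDarkly, Unleash, a config table, etc.
flags = {"dual_write": True, "read_new_pct": 5}

def write_user(user):
    # The old DB stays the source of truth until cutover completes.
    with pg_conn.cursor() as cur:
        cur.execute("INSERT INTO users (id, name) VALUES (%s, %s)", (user.id, user.name))
    pg_conn.commit()

    if flags["dual_write"]:
        try:
            cassandra_conn.execute(
                "INSERT INTO users (id, name) VALUES (%s, %s)", (user.id, user.name)
            )
        except Exception:
            # Never fail the request because the *new* DB write failed;
            # log it and let CDC / a reconciliation job repair the gap.
            logging.exception("dual write to Cassandra failed for user %s", user.id)

def read_user(user_id):
    # Deterministic bucketing so the same user always hits the same backend.
    bucket = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
    if bucket < flags["read_new_pct"]:
        return cassandra_conn.execute(
            "SELECT id, name FROM users WHERE id = %s", (user_id,)
        ).one()
    with pg_conn.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()
```

Raising `read_new_pct` from 5 toward 100 is the cutover; dropping it back to 0 is the rollback.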

SQL → NoSQL (PostgreSQL → Cassandra)

Why

  • 🚀 Scalability → Handle billions of records across commodity servers.
  • ⚡ High write throughput → IoT sensors, clickstreams, logging.
  • 🧩 Flexible schema → User profiles, metadata that evolves.

Migration Architecture

  1. Bulk Load
  • Export from PostgreSQL → load into Cassandra.
  • Example: Spark job reading from JDBC + writing to Cassandra.
    from pyspark.sql import SparkSession

    # Requires the Spark Cassandra Connector package on the Spark classpath
    spark = SparkSession.builder.appName("pg2cassandra").getOrCreate()

    # Read the users table from PostgreSQL over JDBC
    pg_df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://pg-host/mydb") \
        .option("dbtable", "users") \
        .option("user", "postgres") \
        .option("password", "secret") \
        .load()

    # Write the snapshot into the Cassandra keyspace "app"
    pg_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .options(table="users", keyspace="app") \
        .save()
  2. CDC with Debezium + Kafka → Cassandra Sink
  • PostgreSQL WAL → Debezium → Kafka topic → Cassandra Sink Connector (a connector-registration sketch follows this list).
  • Ensures all updates flow in near real time.
  3. Dual Writes
  • New API writes to both databases.
  • Read traffic still on PostgreSQL.
  4. Read Cutover
  • Gradually shift reads to Cassandra behind a feature flag.
  • Use A/B rollout → 5% traffic, 20%, 50%, 100%.
  5. Validation
  • Count check:

      SELECT COUNT(*) FROM users; -- PostgreSQL
      SELECT COUNT(*) FROM users; -- Cassandra (CQL)

  • Sampled data comparison with hash functions (important for Cassandra, where a full-table COUNT(*) can time out on large tables).
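To make step 2 less hand-wavy, here is a hedged sketch of registering a Debezium PostgreSQL source connector through the Kafka Connect REST API. Host names, credentials, and the connector name are placeholders; the property names follow the Debezium PostgreSQL connector docs, but verify them against the version you deploy (Debezium 1.x uses `database.server.name` where 2.x uses `topic.prefix`).

```python
import json
import requests

# Placeholder Kafka Connect endpoint; adjust to your deployment.
CONNECT_URL = "http://connect:8083/connectors"

connector = {
    "name": "pg-users-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # PostgreSQL logical decoding plugin
        "database.hostname": "pg-host",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "mydb",
        "table.include.list": "public.users",
        "topic.prefix": "pg",                 # Debezium 2.x naming
    },
}

resp = requests.post(CONNECT_URL, headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
print("Created connector:", resp.json()["name"])
```

The Cassandra sink connector is registered the same way with its own `config` block, so the whole pipeline can be stood up (and torn down) from a script.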

Failure Handling

  • If Cassandra lags (CDC not caught up), keep PostgreSQL as the source of truth until it catches up.
  • If data corruption is detected → roll back reads to PostgreSQL instantly (via feature flag).

NoSQL → SQL (MongoDB → MySQL)

Why

  • 💰 Transactions (ACID) → Payments, banking.
  • 🔍 Complex queries → JOINs, reporting, analytics.
  • 🛡 Data integrity → Strong schemas, constraints.

Migration Architecture

  1. Schema Design
  • MongoDB nested docs → multiple relational tables.
  • Example:

     {
       "user_id": 1,
       "name": "Alice",
       "orders": [
         { "id": 101, "amount": 50 },
         { "id": 102, "amount": 75 }
       ]
     }
    


      CREATE TABLE users (user_id INT PRIMARY KEY, name VARCHAR(255));
      CREATE TABLE orders (id INT PRIMARY KEY, user_id INT, amount DECIMAL(10,2),
                           FOREIGN KEY (user_id) REFERENCES users(user_id));
    
  2. Bulk Migration
  • Use mongoexport + mysqlimport, or a Spark ETL job (a sketch that also flattens the nested orders follows this list).
   mongoexport --db app --collection users --out users.json
    import json
    import pymysql

    conn = pymysql.connect(host="mysql", user="root", password="secret", db="app")

    with open("users.json") as f, conn.cursor() as cur:
        for line in f:  # mongoexport writes one JSON document per line
            doc = json.loads(line)
            cur.execute("INSERT INTO users (user_id, name) VALUES (%s, %s)",
                        (doc["user_id"], doc["name"]))

    conn.commit()
    conn.close()
  3. Dual Writes
  • Application writes to MongoDB and MySQL.
  • Ensure transaction ordering is preserved.
  4. Read Cutover & Validation
  • Gradually switch analytics/reporting → MySQL.
  • Eventually shift entire app.
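The bulk script above only copies the top-level user fields. Below is a sketch of the flattening step that also explodes the embedded `orders` array into the `orders` table; credentials and field names mirror the earlier example and are placeholders.

```python
import json
import pymysql

conn = pymysql.connect(host="mysql", user="root", password="secret", db="app")

with open("users.json") as f, conn.cursor() as cur:
    for line in f:
        doc = json.loads(line)

        # Parent row: one users row per document.
        cur.execute("INSERT INTO users (user_id, name) VALUES (%s, %s)",
                    (doc["user_id"], doc["name"]))

        # Child rows: one orders row per element of the embedded array.
        for order in doc.get("orders", []):
            cur.execute("INSERT INTO orders (id, user_id, amount) VALUES (%s, %s, %s)",
                        (order["id"], doc["user_id"], order["amount"]))

conn.commit()
conn.close()
```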

Failure Handling

  • Expect Mongo → SQL mismatches: nulls, missing keys, inconsistent types.
  • Use strict ETL validation: enforce the target schema during migration (see the sketch below).
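A sketch of what that strict validation can look like: check each exported document against the relational schema's expectations before it reaches the insert path, and divert bad rows to a dead-letter file for review. The required-field rules here are illustrative assumptions derived from the earlier `users` table.

```python
import json

# Expectations derived from the target MySQL schema (illustrative).
REQUIRED_FIELDS = {"user_id": int, "name": str}

def validate(doc):
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if doc.get(field) is None:
            errors.append(f"missing or null field: {field}")
        elif not isinstance(doc[field], expected_type):
            errors.append(f"wrong type for {field}: {type(doc[field]).__name__}")
    return errors

valid_docs = []
with open("users.json") as src, open("users.rejected.jsonl", "w") as dead_letter:
    for line in src:
        doc = json.loads(line)
        problems = validate(doc)
        if problems:
            # Divert the bad document instead of silently loading garbage.
            dead_letter.write(json.dumps({"doc": doc, "errors": problems}) + "\n")
        else:
            valid_docs.append(doc)  # feed these into the insert path shown earlier
```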

SQL → SQL (Oracle → PostgreSQL)

Why

  • 💸 Cost reduction → Oracle licenses are $$$.
  • ⚡ Modern features → JSONB, extensions in PostgreSQL.
  • 🔓 Vendor freedom → Avoid lock-in.

Migration Architecture

  1. Schema Conversion
  • Use AWS Schema Conversion Tool (SCT) or ora2pg.
   ora2pg -c config/ora2pg.conf -t TABLE -o schema.sql
  2. Bulk Migration
  • pgloader does not support Oracle as a source (it covers MySQL, SQLite, and MS SQL Server), so export the data with ora2pg (or use AWS DMS):

      ora2pg -c config/ora2pg.conf -t COPY -o data.sql
      psql -h pg-db -U user -d mydb -f data.sql
  3. CDC / Replication
  • Oracle GoldenGate → Kafka → PostgreSQL.
  4. Cutover
  • Switch DNS/application config once replication lag reaches zero, then validate (see the row-count sketch below).
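As a quick post-cutover sanity check, here is a sketch that compares row counts between the two systems using the python-oracledb and psycopg2 drivers; credentials and the table list are placeholders, and real validation should also cover checksums and business queries.

```python
import oracledb
import psycopg2

TABLES = ["users", "orders"]  # tables in scope for the migration

ora = oracledb.connect(user="app", password="secret", dsn="oracle-db/mydb")
pg = psycopg2.connect(host="pg-db", dbname="mydb", user="app", password="secret")

for table in TABLES:
    with ora.cursor() as oc:
        oc.execute(f"SELECT COUNT(*) FROM {table}")
        ora_count = oc.fetchone()[0]
    with pg.cursor() as pc:
        pc.execute(f"SELECT COUNT(*) FROM {table}")
        pg_count = pc.fetchone()[0]
    status = "OK" if ora_count == pg_count else "MISMATCH"
    print(f"{table}: oracle={ora_count} postgres={pg_count} -> {status}")
```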

Failure Handling

  • Schema incompatibility (e.g., Oracle NUMBER vs PostgreSQL NUMERIC).
  • Performance regressions (query plans differ).
  • Fix with query rewrites, indexes, or caching.

On-Prem → Cloud (Postgres → RDS)

Why

  • ☁ Elasticity → scale up/down instantly.
  • 🔧 Managed services → backups, patching handled by AWS/GCP.
  • 💰 Opex over Capex → Pay-as-you-go vs hardware.

Migration Architecture

  1. Snapshot Load
  • Dump on-prem DB:

     pg_dump -Fc mydb > mydb.dump
     aws s3 cp mydb.dump s3://migration-bucket/
    
  • Restore into RDS:

     pg_restore -h rds-host -U postgres -d mydb mydb.dump
    
  2. Logical Replication (CDC)
  • Use native logical replication (or a logical-decoding plugin such as wal2json) to stream ongoing changes into RDS.
  • Or use AWS DMS for ongoing replication.
  3. Cutover
  • Point the application's DB connection string to RDS once replication lag reaches zero (see the lag-check sketch below).
  • Use a DNS switch for near-instant failover.
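Before flipping DNS you want evidence that replication has caught up. A minimal sketch, assuming native logical replication and psycopg2, that polls pg_stat_replication on the on-prem primary (with AWS DMS you would watch the task's CDC latency metrics instead):

```python
import time
import psycopg2

# Connect to the on-prem primary; downstream subscribers show up in pg_stat_replication.
conn = psycopg2.connect(host="onprem-pg", dbname="mydb", user="postgres", password="secret")

def replication_lag_bytes():
    with conn.cursor() as cur:
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication
        """)
        return cur.fetchall()

# Poll until every subscriber has (approximately) caught up, then cut over.
while True:
    rows = replication_lag_bytes()
    print(rows)
    if rows and all(lag is not None and lag < 1024 for _, lag in rows):
        print("Replication caught up; safe to switch the connection string / DNS.")
        break
    time.sleep(5)
```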

Homogeneous Upgrade (MySQL 5.7 → 8.0)

Why

  • 🔒 Security patches.
  • 🚀 Performance improvements.
  • 🆕 New features (CTEs, window functions, JSON ops).

Migration Architecture

  1. In-Place Upgrade (if supported).
  2. Replication Upgrade (safer):
  • Configure MySQL 8.0 as replica.
  • Sync binlogs.
  • Promote the replica to be the new primary once it has fully caught up (see the lag-check sketch below).
    -- On newer MySQL 8.0 releases, the preferred syntax is
    -- CHANGE REPLICATION SOURCE TO ... / START REPLICA
    CHANGE MASTER TO
      MASTER_HOST='mysql57',
      MASTER_USER='repl',
      MASTER_PASSWORD='secret',
      MASTER_LOG_FILE='binlog.000001',
      MASTER_LOG_POS=154;
    START SLAVE;
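Before promoting the 8.0 replica, confirm it has applied everything from the 5.7 primary. A rough sketch using pymysql; host and credentials are placeholders, and on newer 8.0 releases the statement is SHOW REPLICA STATUS:

```python
import time
import pymysql

replica = pymysql.connect(host="mysql80", user="root", password="secret",
                          cursorclass=pymysql.cursors.DictCursor)

# Wait until the replica reports zero lag before stopping writes and promoting it.
while True:
    with replica.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")  # SHOW REPLICA STATUS on newer 8.0 releases
        status = cur.fetchone()
    lag = status["Seconds_Behind_Master"] if status else None
    print("Seconds_Behind_Master:", lag)
    if lag == 0:
        print("Replica caught up; stop writes on 5.7, then promote 8.0.")
        break
    time.sleep(5)
```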

Database Consolidation

Why

  • 🔄 Merge systems after acquisitions.
  • 📊 Single source of truth.
  • 🛠 Simplify operations.

Migration Architecture

  1. Canonical Schema Design
  • Define golden schema across merged DBs.
  2. Data Cleaning & Deduplication
  • ETL pipeline (Airflow, Spark, dbt).
  • Resolve conflicts (e.g., duplicate customer IDs); a toy deduplication sketch follows this list.
  3. Cutover
  • Redirect all applications to new consolidated DB.
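Deduplication is usually where consolidations get messy. A toy sketch of merging customer records from two source systems, keyed on a normalized email address, with a "latest update wins" rule; the field names and the rule itself are illustrative assumptions:

```python
# Toy conflict-resolution pass: merge customers from two source extracts,
# keyed on normalized email, keeping the most recently updated record
# while preserving the first canonical ID seen for that customer.
def merge_customers(source_a, source_b):
    merged = {}
    for record in list(source_a) + list(source_b):
        key = record["email"].strip().lower()
        existing = merged.get(key)
        if existing is None:
            merged[key] = dict(record, canonical_id=record["id"])
        elif record["updated_at"] > existing["updated_at"]:
            merged[key] = dict(record, canonical_id=existing["canonical_id"])
    return list(merged.values())

customers_a = [{"id": "A-1", "email": "Alice@example.com", "updated_at": "2024-01-05"}]
customers_b = [{"id": "B-9", "email": "alice@example.com", "updated_at": "2024-03-01"}]

for row in merge_customers(customers_a, customers_b):
    print(row["canonical_id"], row["email"], row["updated_at"])
```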

🔍 Observability & Validation

No migration succeeds without visibility. You need:

  • Lag monitoring → replication delay, CDC offsets.
  • Data validation jobs → nightly row counts, hash comparisons (see the sketch after this list).
  • Application metrics → error rates, query latency.
  • User-level canaries → test user accounts to verify correctness.
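As one concrete form of the validation-job bullet above, a sketch that samples primary keys and compares a column hash between the old and new databases; `old_lookup` and `new_lookup` are stand-in fetch functions for whatever drivers your stack uses:

```python
import hashlib
import random

def row_fingerprint(row):
    # Hash a normalized projection of the columns that must match.
    material = "|".join(str(row[col]) for col in ("id", "name", "email"))
    return hashlib.sha256(material.encode()).hexdigest()

def validate_sample(all_ids, old_lookup, new_lookup, sample_size=1000):
    mismatches = []
    for row_id in random.sample(all_ids, min(sample_size, len(all_ids))):
        old_row, new_row = old_lookup(row_id), new_lookup(row_id)
        if new_row is None or row_fingerprint(old_row) != row_fingerprint(new_row):
            mismatches.append(row_id)
    return mismatches

# Usage idea: run this nightly and alert if the mismatch rate exceeds a small threshold.
```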

🛡 Rollback Strategies

Always assume failure. Options:

  1. Read-Only Freeze → stop new writes, replay missed events, retry.
  2. Feature-Flag Flip → route reads/writes back to old DB instantly.
  3. Dual DB Window → keep old DB warm for X days before decommission.

🎯 Final Thoughts

Zero-downtime migration is not a one-night project. It’s an iterative engineering effort involving:

  • Bulk + incremental sync (ETL + CDC).
  • Dual writes for safety.
  • Feature flags for cutover.
  • Continuous validation.
  • A tested rollback plan.

The actual migration isn’t the hard part — the hard part is:

  • Handling edge cases (schema mismatch, nulls, encoding issues).
  • Keeping the business running while you operate two DBs at once.
  • Convincing management to invest in testing and observability.

But done right → your users won’t even notice that your backend just swapped its engine out mid-flight.

