“We’re moving the database” is the kind of sentence that makes a room go quiet. Ours was a self-managed MySQL instance that had become the single most expensive thing to reason about in the system — read load was climbing, and one bad query could nudge the whole portal toward the edge. We wanted Aurora for the read replicas and the operational sanity. We wanted it with no downtime and no lost writes. Here’s the playbook that got us there.
Step zero: make the database boring first
Before touching infrastructure, I spent two weeks reducing the blast radius. The hottest read path — SMS delivery status — got a Redis buffer in front of it, which alone cut direct database reads by around 60%. Migrating a calmer database is categorically easier than migrating one that’s already on fire.
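For readers who want the shape of that buffer, here is a minimal cache-aside sketch. It is not the production code: an in-memory dictionary stands in for Redis so the example runs without a server, and `TTLCache`, `get_delivery_status`, `fetch_status_from_db`, the key scheme, and the 30-second TTL are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis GET / SET-with-expiry (an assumption,
    used so the sketch runs without a Redis server)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (self._clock() + ttl_seconds, value)


def get_delivery_status(
    message_id: str,
    cache: TTLCache,
    fetch_status_from_db: Callable[[str], str],
    ttl_seconds: float = 30.0,  # hypothetical TTL
) -> str:
    key = f"sms:status:{message_id}"  # hypothetical key scheme
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: the database is not touched
    status = fetch_status_from_db(message_id)  # cache miss: one database read
    cache.set(key, status, ttl_seconds)
    return status
```

The trade-off is staleness: a status can lag the database by up to the TTL, which is usually acceptable for a polling-style read path but worth confirming for your own.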
Dual-write, then shadow-read
The core of a safe cutover is a window where both databases are live. We routed every write to both MySQL and Aurora, but kept serving all reads from MySQL. That gave us a running copy to validate against without betting any user-facing read on the new system yet.
async def write_sim_event(event):
    await mysql.insert(event)  # source of truth (for now)
    try:
        await aurora.insert(event)  # shadow target
    except Exception:
        # Count the failure, but never block or fail the request.
        metrics.incr("aurora.write.fail")
The crucial detail: a failed Aurora write must never fail the request. During the shadow phase Aurora is a guest in your system, not a dependency. You watch aurora.write.fail like a hawk and fix divergence offline.
Reconcile, don’t assume
Dual-writing gets you eventual consistency, but “eventual” hides a lot of sins: a dropped write here, a race there. So a reconciliation job walked both datastores one time window at a time and compared row counts and checksums, surfacing drift before it could matter. Each window ended in one of three outcomes:
- Identical window → move on.
- Aurora missing rows → backfill from MySQL, alert if it recurs.
- Values differ → stop the line. This is a bug, not a blip.
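The three outcomes above can be sketched as a single comparison function. This is an illustration under assumptions, not the actual job: each window is modeled as a dictionary of row id to column values already fetched from each database, and `window_checksum`, `reconcile_window`, and the `"ok"` / `"backfill"` / `"stop"` labels are hypothetical names. Treating rows that exist only in Aurora as a stop condition is also my assumption.

```python
import hashlib
from typing import Dict, List, Tuple

Rows = Dict[int, tuple]  # row id -> column values for one time window


def window_checksum(rows: Rows) -> str:
    """Order-independent checksum over every (id, values) pair in a window."""
    digest = hashlib.sha256()
    for row_id in sorted(rows):
        digest.update(repr((row_id, rows[row_id])).encode("utf-8"))
    return digest.hexdigest()


def reconcile_window(mysql_rows: Rows, aurora_rows: Rows) -> Tuple[str, List[int]]:
    """Return (outcome, affected row ids) for one time window."""
    # Cheap check first: same row count and same checksum means identical.
    if len(mysql_rows) == len(aurora_rows) and (
        window_checksum(mysql_rows) == window_checksum(aurora_rows)
    ):
        return ("ok", [])

    shared = mysql_rows.keys() & aurora_rows.keys()
    differing = sorted(i for i in shared if mysql_rows[i] != aurora_rows[i])
    extra = sorted(aurora_rows.keys() - mysql_rows.keys())
    if differing or extra:
        # Differing values (or rows only Aurora has) point to a bug: stop the line.
        return ("stop", differing + extra)

    missing = sorted(mysql_rows.keys() - aurora_rows.keys())
    if missing:
        # Aurora is only missing rows: safe to backfill from MySQL.
        return ("backfill", missing)

    # Checksums disagreed but no row-level cause was found: be conservative.
    return ("stop", [])
```

Comparing counts and checksums first keeps the common, clean case cheap; the row-by-row diff only runs for windows that already look wrong.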
The flip — and the switch back
Once reconciliation had been clean for days, the cutover itself was almost anticlimactic: a feature flag flipped reads to Aurora, tenant by tenant. We started with internal tenants, watched the dashboards, and widened the blast radius slowly.
The thing I’m proudest of isn’t the migration — it’s that I never needed the part I built next. The flag could flip back to MySQL instantly, and because we kept dual-writing through the whole ramp, MySQL stayed a valid fallback the entire time. A rollback you’ve actually tested is worth more than a migration you’re confident about.
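A per-tenant read flag with an instant fallback can be as small as the sketch below. It assumes the flag state lives in memory; a real system would load it from a flag service or config store, and `ReadRoutingFlag`, `enable`, `rollback_all`, and the tenant ids are hypothetical names, not the ones we used.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ReadRoutingFlag:
    """Decides which database serves reads for a given tenant."""

    aurora_tenants: Set[str] = field(default_factory=set)
    kill_switch: bool = False  # True forces every tenant back to MySQL

    def enable(self, tenant_id: str) -> None:
        """Move one tenant's reads to Aurora."""
        self.aurora_tenants.add(tenant_id)

    def rollback_all(self) -> None:
        """Send all reads back to MySQL without touching the tenant list."""
        self.kill_switch = True

    def read_backend(self, tenant_id: str) -> str:
        if self.kill_switch:
            return "mysql"
        return "aurora" if tenant_id in self.aurora_tenants else "mysql"


flag = ReadRoutingFlag()
flag.enable("internal-qa")             # start with an internal tenant
print(flag.read_backend("internal-qa"))  # aurora
print(flag.read_backend("customer-1"))   # mysql
flag.rollback_all()                    # the switch back
print(flag.read_backend("internal-qa"))  # mysql
```

Keeping the kill switch separate from the tenant list means a rollback does not lose the ramp state, so the rollout can resume from where it stopped.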
What it cost, what it bought
It took about six weeks end to end, most of it spent not migrating — caching, reconciling, watching. The payoff: read load down 60% before we even cut over, real read replicas, and a database I no longer think about at 2 a.m. That last one doesn’t show up on a dashboard, but it’s the whole reason I did it.
Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.