What a 60% drop in DB read load actually taught me about caching

The read that wouldn’t quit

Our SMS status path was the most read thing in the system. Every delivery event from a carrier webhook triggered a write, and then every downstream service — billing, alerting, the customer dashboard — went back to the database to check current status. MySQL was handling somewhere between 800 and 1,200 reads per minute on that table alone during peak hours.

This wasn’t a disaster. It was more like a slow drain. Query times were fine until they weren’t, and the table was growing. We were already planning a MySQL to Aurora migration. Before touching the migration, I wanted to make the database less interesting first.

Redis as write-through cache

The design is not clever. That’s intentional.

When a carrier webhook fires, we write status to Redis before (or alongside) writing to the database. Any service that wants current status asks Redis first. If Redis misses, it falls back to the database and backfills the key. TTL is five minutes as a safety net — not for performance, just so stale keys don’t accumulate forever if something goes sideways.

The key format: sim:{tenant_id}:{iccid}:status. Tenant-scoped, device-scoped, nothing fancy.

async def write_status(tenant_id: str, iccid: str, status: str) -> None:
    key = f"sim:{tenant_id}:{iccid}:status"
    await redis.setex(key, TTL_SECONDS, status)
    await db.upsert_status(tenant_id, iccid, status)


async def get_status(tenant_id: str, iccid: str) -> str | None:
    key = f"sim:{tenant_id}:{iccid}:status"
    cached = await redis.get(key)
    if cached:
        return cached.decode()
    # cache miss — fall back to DB and backfill
    status = await db.get_status(tenant_id, iccid)
    if status:
        await redis.setex(key, TTL_SECONDS, status)
    return status

That’s the whole thing. After rollout, DB read load on the status table dropped roughly 60%. MySQL stopped being the bottleneck for that path. Mission accomplished, mostly.

The gotcha: cache invalidation during a live migration

Here’s where it got less clean.

The Redis work landed a few weeks before the MySQL to Aurora cutover. During the dual-write phase, we were writing to both databases simultaneously to validate Aurora’s data integrity before flipping the read path. Carrier webhooks wrote to MySQL first, then Aurora. Two writes. Both had to succeed.

Except sometimes the Aurora write dropped silently. Retry logic caught most of them, but there was a window — short, but real — where MySQL and Aurora held different values for the same row.

Redis didn’t know any of this. Redis had the value from MySQL. Aurora didn’t have it yet. Services reading from Redis got a value that was correct relative to MySQL but wrong relative to the database we were migrating to. The cache was coherent with the wrong system.

This isn’t a theoretical problem. We had alerting services that were supposed to fire on status transitions — transitions that Aurora hadn’t registered yet. Some alerts fired late. Some didn’t fire at all. The root cause traced back to Redis serving values that Aurora hadn’t confirmed.

The actual problem: Cache coherence isn't just about your cache and your database. It's about your cache and every database involved in a migration, including the one you haven't finished building trust in yet.

What we did about it

Two things, neither of them elegant.

First, shorter TTLs during the migration window. We dropped from 300 seconds to 30 seconds. A blunt instrument. It meant more DB reads, which partly undid the performance win, but it capped the blast radius of a stale key. If Aurora eventually got the write, the cache would expire and refill with a consistent value within 30 seconds.

Second, a reconciliation job. After the dual-write phase stabilized — meaning Aurora’s write success rate was close enough to 100% that we trusted it — we ran a job that scanned active Redis keys, compared the cached value against Aurora, and explicitly evicted anything that diverged.

async def reconcile_cache(tenant_id: str, iccid: str) -> None:
    key = f"sim:{tenant_id}:{iccid}:status"
    cached = await redis.get(key)
    if not cached:
        return
    aurora_status = await aurora_db.get_status(tenant_id, iccid)
    if aurora_status and cached.decode() != aurora_status:
        await redis.delete(key)
        cache_stale_hits.inc()  # prometheus counter

We also added a cache.stale_hit metric — a Prometheus counter that incremented whenever the reconciliation job found a divergence. During the migration window it peaked at around 40 divergent keys per hour. After the dual-write phase ended and we flipped fully to Aurora, it dropped to zero and stayed there.

The metric wasn’t actionable in real-time. But it told us when the migration was actually stable, which is a different kind of useful.

What 60% actually means

The number is real. DB read load on that table is down 60% compared to pre-cache baselines. That was the goal, and it happened.

But 60% is the easy part. Any reasonably implemented write-through cache on a hot read path will get you there. The interesting part is not the performance win — it’s what happens when the system underneath your cache changes shape while the cache is running.

Infrastructure migrations are not atomic. There’s always a window where two systems hold different truths, and a cache sitting in front of them will pick one and serve it confidently. The cache doesn’t know it’s wrong. It just has fast answers.

The work that actually mattered wasn’t the Redis implementation. It was the reconciliation job, the reduced TTLs during the danger window, and the stale-hit counter that let us know when the danger window was over.

A cache hit you can’t reason about is just a bug with good latency.

Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.

All posts Next: Moving MySQL → Aurora under li…