// reliability

Backoff, checkpoints & the art of the multi-hour sync job


Some jobs are quick. Others — like reconciling our entire SIM inventory against a carrier’s system of record — run for hours. And here is the only law of long-running jobs you need to internalize: they will fail in the middle. Not at the start, where it’s cheap. In the middle, at hour three, when restarting from zero means another three hours and a missed maintenance window.

A job is a sequence of resumable steps

The mental shift that fixed this for me was to stop thinking of a sync as one atomic thing and start thinking of it as a stream of small, committed steps. Each step does a bounded chunk of work and then writes down where it got to. If the process dies, the next run reads that bookmark and picks up from there.

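async def run_sync(session_id):
    # Resume from the last committed cursor, or start fresh on the first run.
    cursor = await load_checkpoint(session_id)
    if cursor is None:  # explicit check, so a falsy-but-valid cursor is not discarded
        cursor = START
    while cursor != DONE:
        batch = await fetch_page(cursor)  # one bounded chunk of work
        await apply(batch)
        cursor = batch.next_cursor
        await save_checkpoint(session_id, cursor)  # commit progress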

That save_checkpoint after every page is the whole trick. It turns a catastrophic restart into a shrug. The job that used to lose three hours of work now loses, at most, one page.
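
For illustration, here is one way the checkpoint helpers could be backed, as a minimal sketch that assumes a local JSON file per session. FileCheckpointStore, its file naming, and its synchronous load/save methods are assumptions made for this example, not the store the job actually uses. A real deployment might keep the cursor in a database row instead.

```python
import json
import os
import tempfile


class FileCheckpointStore:
    """Hypothetical checkpoint store: one small JSON file per sync session."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, session_id):
        return os.path.join(self.directory, f"{session_id}.checkpoint.json")

    def load(self, session_id):
        """Return the last saved cursor, or None if nothing was saved yet."""
        try:
            with open(self._path(session_id)) as f:
                return json.load(f)["cursor"]
        except FileNotFoundError:
            return None

    def save(self, session_id, cursor):
        """Persist the cursor atomically: write a temp file, then rename it."""
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump({"cursor": cursor}, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk before the rename
        os.replace(tmp_path, self._path(session_id))  # atomic swap on one filesystem
```

The write-then-rename step matters for the same reason the checkpoint does: a crash in the middle of save leaves the previous checkpoint intact rather than a half-written file. In an async job these blocking calls could be wrapped with asyncio.to_thread.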

Idempotency makes resumes safe

Resuming is only safe if re-applying a step is harmless. If a crash happens after apply but before save_checkpoint, the next run will replay that page. That has to be a no-op, not a double-charge. So every write is an upsert keyed on something stable, never a blind insert or an increment.

The pairing that matters: checkpoints let you resume; idempotency makes resuming safe. You need both — one without the other is a trap.
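
To make the replay case concrete, here is a small sketch using SQLite from the Python standard library. The sims table, its columns, and the sample ICCID values are hypothetical, and the ON CONFLICT upsert syntax assumes SQLite 3.24 or newer. The point is only that applying the same page twice leaves the table exactly as it was after the first pass.

```python
import sqlite3


def apply_batch(conn, batch):
    """Upsert each (iccid, status) pair; replaying the same batch changes nothing."""
    with conn:  # one transaction per batch: all rows land, or none do
        conn.executemany(
            """
            INSERT INTO sims (iccid, status) VALUES (?, ?)
            ON CONFLICT(iccid) DO UPDATE SET status = excluded.status
            """,
            batch,
        )


# Hypothetical schema: the ICCID is the stable key for each SIM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sims (iccid TEXT PRIMARY KEY, status TEXT NOT NULL)")

page = [
    ("8901000000000000001", "active"),
    ("8901000000000000002", "suspended"),
]
apply_batch(conn, page)
apply_batch(conn, page)  # simulated replay after a crash before save_checkpoint

rows = conn.execute("SELECT iccid, status FROM sims ORDER BY iccid").fetchall()
```

A blind INSERT would fail or duplicate on the second call, and a counter increment would silently double. Keying the write on the ICCID is what makes the replay a no-op.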

Backoff for the things you don’t control

Carriers throttle. Networks blink. Inside each step, transient failures are retried with exponential backoff and jitter: the delay doubles on every attempt until it hits a ceiling, so we never wait absurdly long. The difference between a job that “fails on a flaky network” and one that “rides out a flaky network” is about fifteen lines of retry logic.

from asyncio import sleep
from random import random
delay = min(base * 2 ** attempt, ceiling) + random() * base  # capped backoff, plus up to `base` of jitter
await sleep(delay)   # jitter avoids the retry stampede
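
Wrapped into a reusable helper, the delay formula above might look like the sketch below. TransientError, with_retries, the default values, and the injectable sleep parameter are all assumptions for illustration. Which exceptions actually count as transient depends on the carrier client in use.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


def backoff_delay(attempt, base=0.5, ceiling=30.0):
    """Capped exponential delay plus up to `base` seconds of random jitter."""
    return min(base * 2 ** attempt, ceiling) + random.random() * base


async def with_retries(op, max_attempts=6, base=0.5, ceiling=30.0, sleep=asyncio.sleep):
    """Await `op()` until it succeeds, backing off between transient failures."""
    for attempt in range(max_attempts):
        try:
            return await op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await sleep(backoff_delay(attempt, base, ceiling))
```

Inside the sync loop this could be used as batch = await with_retries(lambda: fetch_page(cursor)). Only TransientError is retried, so a genuine bug still fails fast instead of being retried for minutes.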

Make the job observable to a human

A multi-hour job that runs silently is a multi-hour anxiety. Each checkpoint also emits progress — pages done, ETA, last error — so an operator can glance at a dashboard and know whether to intervene or get coffee. Most of the time the answer is coffee, which is exactly the point.
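
As a rough sketch of that progress signal, the record below tracks pages done, an ETA, and the last error, and renders them as one JSON line per checkpoint. The Progress class and its field names are hypothetical. It also assumes the total page count is known up front, which may not hold for every carrier API; without a total, the ETA would have to be dropped or estimated some other way.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Progress:
    total_pages: int  # assumption: the total is known when the job starts
    started_at: float = field(default_factory=time.monotonic)
    pages_done: int = 0
    last_error: Optional[str] = None

    def eta_seconds(self, now=None):
        """Estimate time remaining from the average pace so far; None before the first page."""
        if self.pages_done == 0:
            return None
        now = time.monotonic() if now is None else now
        per_page = (now - self.started_at) / self.pages_done
        return per_page * (self.total_pages - self.pages_done)

    def snapshot(self, now=None):
        """Render one progress event as a JSON line for logs or a dashboard."""
        return json.dumps({
            "pages_done": self.pages_done,
            "total_pages": self.total_pages,
            "eta_seconds": self.eta_seconds(now),
            "last_error": self.last_error,
        })
```

Incrementing pages_done and emitting snapshot() right after each save_checkpoint call keeps the dashboard and the bookmark in step, so what the operator sees is never ahead of what a restart would actually resume from.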

The payoff

None of this is exotic. Checkpoints, idempotent writes, capped backoff with jitter, and a progress signal — four small habits. Together they turn “the sync failed, start it over and hope” into “the sync resumed itself and nobody noticed.” That quiet is the entire goal.

Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.