Some jobs are quick. Others — like reconciling our entire SIM inventory against a carrier’s system of record — run for hours. And here is the only law of long-running jobs you need to internalize: they will fail in the middle. Not at the start, where it’s cheap. In the middle, at hour three, when restarting from zero means another three hours and a missed maintenance window.
A job is a sequence of resumable steps
The mental shift that fixed this for me was to stop thinking of a sync as one atomic thing and start thinking of it as a stream of small, committed steps. Each step does a bounded chunk of work and then writes down where it got to. If the process dies, the next run reads that bookmark and picks up from there.
async def run_sync(session_id):
    cursor = await load_checkpoint(session_id) or START
    while cursor != DONE:
        batch = await fetch_page(cursor)
        await apply(batch)
        cursor = batch.next_cursor
        await save_checkpoint(session_id, cursor)  # commit progress
That save_checkpoint after every page is the whole trick. It turns a catastrophic restart into a shrug. The job that used to lose three hours of work now loses, at most, one page.
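To watch the resume happen, here is a minimal, self-contained sketch of the same loop. Everything around run_sync is a hypothetical stand-in: the PAGES dict plays the carrier API, the checkpoints dict plays the checkpoint store, and crash_after_pages exists only to simulate a failure partway through. A real job would persist the checkpoint somewhere durable.

```python
import asyncio

START, DONE = "page-0", "done"

# Hypothetical in-memory stand-ins for the carrier API and the checkpoint store.
PAGES = {
    "page-0": (["sim-a", "sim-b"], "page-1"),
    "page-1": (["sim-c", "sim-d"], "page-2"),
    "page-2": (["sim-e"], DONE),
}
checkpoints = {}  # session_id -> cursor of the next page to fetch
applied = []      # every record apply() has seen, in order


async def load_checkpoint(session_id):
    return checkpoints.get(session_id)


async def save_checkpoint(session_id, cursor):
    checkpoints[session_id] = cursor


async def fetch_page(cursor):
    return PAGES[cursor]


async def apply(records):
    applied.extend(records)


async def run_sync(session_id, crash_after_pages=None):
    cursor = await load_checkpoint(session_id) or START
    pages_done = 0
    while cursor != DONE:
        # Test hook only: pretend the process died after N committed pages.
        if crash_after_pages is not None and pages_done == crash_after_pages:
            raise RuntimeError("simulated crash")
        records, next_cursor = await fetch_page(cursor)
        await apply(records)
        cursor = next_cursor
        await save_checkpoint(session_id, cursor)  # commit progress
        pages_done += 1


async def demo():
    try:
        await run_sync("s1", crash_after_pages=1)  # dies before the second page
    except RuntimeError:
        pass
    bookmark = checkpoints["s1"]
    await run_sync("s1")  # second run picks up at the bookmark
    return bookmark


resumed_from = asyncio.run(demo())
```

In this sketch the second run starts at page-1 rather than page-0, so the first page is never fetched twice.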
Idempotency makes resumes safe
Resuming is only safe if re-applying a step is harmless. If a crash happens after apply but before save_checkpoint, the next run will replay that page. That has to be a no-op, not a double-charge. So every write is an upsert keyed on something stable, never a blind insert or an increment.
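As an illustration of what such a write can look like, here is a small sketch using SQLite from the Python standard library. The sims table, its columns, and the choice of ICCID as the stable key are assumptions made for the example, not a description of our actual schema. The ON CONFLICT ... DO UPDATE clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per SIM, keyed on its ICCID.
conn.execute("CREATE TABLE sims (iccid TEXT PRIMARY KEY, status TEXT NOT NULL)")


def apply_page(conn, records):
    """Upsert each record by its stable key, so replaying a page changes nothing."""
    with conn:  # one transaction per page: all of it lands, or none of it
        conn.executemany(
            "INSERT INTO sims (iccid, status) VALUES (?, ?) "
            "ON CONFLICT(iccid) DO UPDATE SET status = excluded.status",
            records,
        )


page = [
    ("8901000000000000001", "active"),
    ("8901000000000000002", "suspended"),
]
apply_page(conn, page)
apply_page(conn, page)  # simulated replay after a crash before the checkpoint

rows = conn.execute("SELECT iccid, status FROM sims ORDER BY iccid").fetchall()
```

After the replay the table still holds exactly two rows. A blind INSERT would have failed on the primary key or, without one, silently doubled the data.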
Backoff for the things you don’t control
Carriers throttle. Networks blink. Inside each step, transient failures get exponential backoff with jitter, capped so we never wait absurdly long. The difference between a job that “fails on a flaky network” and one that “rides out a flaky network” is about fifteen lines of retry logic.
delay = min(base * 2 ** attempt, ceiling) + random() * base
await sleep(delay) # jitter avoids the retry stampede
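Wrapped into a complete helper, that delay formula might look like the sketch below. TransientError, with_backoff, and the default values for base, ceiling, and max_attempts are all hypothetical names and numbers picked for illustration. The sleep parameter is injectable only so the usage example can run without actually waiting.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


async def with_backoff(op, *, base=0.5, ceiling=30.0, max_attempts=6,
                       sleep=asyncio.sleep):
    """Run op(), retrying transient failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return await op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the step fail loudly
            # Same formula as above: capped exponential term plus jitter.
            delay = min(base * 2 ** attempt, ceiling) + random.random() * base
            await sleep(delay)


# Usage: an operation that is throttled twice, then succeeds.
state = {"calls": 0}
delays = []


async def flaky_fetch():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("throttled")
    return "page-data"


async def fake_sleep(delay):
    delays.append(delay)  # record the delay instead of waiting


result = asyncio.run(with_backoff(flaky_fetch, sleep=fake_sleep))
```

Because the jitter is added after the cap, the longest possible wait in this sketch is ceiling plus base, not ceiling exactly. Only TransientError is retried; anything else propagates immediately, so real bugs are not hidden behind retries.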
Make the job observable to a human
A multi-hour job that runs silently is a multi-hour anxiety. Each checkpoint also emits progress — pages done, ETA, last error — so an operator can glance at a dashboard and know whether to intervene or get coffee. Most of the time the answer is coffee, which is exactly the point.
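One way to produce that signal is a small progress record updated at each checkpoint. The sketch below is an assumption about shape, not our actual dashboard payload: Progress and its field names are hypothetical, the total page count is assumed to be known or estimated up front, and the ETA is a plain linear extrapolation from the average page time so far.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Progress:
    total_pages: int  # assumption: known or estimated before the job starts
    started_at: float = field(default_factory=time.monotonic)
    pages_done: int = 0
    last_error: Optional[str] = None

    def snapshot(self, now=None):
        """Return the values an operator would want to see at a glance."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.started_at
        remaining = self.total_pages - self.pages_done
        # No ETA until at least one page has finished.
        eta = elapsed / self.pages_done * remaining if self.pages_done else None
        return {
            "pages_done": self.pages_done,
            "total_pages": self.total_pages,
            "eta_seconds": eta,
            "last_error": self.last_error,
        }


# Usage with fixed clock values so the arithmetic is easy to follow:
# 25 of 100 pages in 50 seconds is 2 seconds per page, 75 pages to go.
progress = Progress(total_pages=100, started_at=0.0)
progress.pages_done = 25
progress.last_error = "throttled by carrier"
snap = progress.snapshot(now=50.0)
```

Emitting a snapshot like this right after save_checkpoint keeps the dashboard in step with the progress that has actually been committed.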
The payoff
None of this is exotic. Checkpoints, idempotent writes, capped backoff with jitter, and a progress signal — four small habits. Together they turn “the sync failed, start it over and hope” into “the sync resumed itself and nobody noticed.” That quiet is the entire goal.
Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.