Surviving 100k SIMs: lessons from a multi-tenant IoT portal

When I joined the team, the SIM portal was a single Django monolith talking straight to a MySQL box, refreshing SIM status on page load by calling carrier APIs synchronously. It worked — for a few thousand SIMs. At a hundred thousand, every dashboard render was a small act of faith. This post is about the handful of decisions that turned that into something I could actually sleep next to.

Tenant isolation is a data-model problem, not a deployment one

The first instinct with multi-tenancy is to reach for separate databases or schemas per customer. We didn’t have the operational budget for that, and honestly we didn’t need it. What we needed was for a query to be physically incapable of returning another tenant’s rows. So every table got a tenant_id, every index led with it, and every query went through a session-scoped helper that refused to run without one.

# the only way to touch SIM data — tenant_id is non-optional
def scoped(session, tenant_id):
    if not tenant_id:
        raise ScopeError("refusing unscoped query")
    return session.query(Sim).filter(Sim.tenant_id == tenant_id)

It looks almost too simple. But “the dangerous query literally cannot be expressed” beats “we remembered to add a WHERE clause” every single time. The helper paid for itself in the first month, judging by the would-be cross-tenant leaks it caught in code review.
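To make the "every index led with it" part concrete, here is a minimal runnable sketch of the same idea. It is not our production code: an in-memory SQLite database stands in for MySQL, and the table layout, the index name, and the scoped_sims helper are assumptions for illustration.

```python
import sqlite3


class ScopeError(Exception):
    """Raised when a query is attempted without a tenant."""


def make_db():
    # In-memory SQLite stands in for the real database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sims ("
        "tenant_id INTEGER NOT NULL, iccid TEXT NOT NULL, status TEXT NOT NULL)"
    )
    # tenant_id is the leading column, so tenant-scoped lookups can use the index.
    conn.execute(
        "CREATE UNIQUE INDEX ix_sims_tenant_iccid ON sims (tenant_id, iccid)"
    )
    return conn


def scoped_sims(conn, tenant_id, status=None):
    """Return (iccid, status) rows for one tenant; refuse to run unscoped."""
    if tenant_id is None:
        raise ScopeError("refusing unscoped query")
    sql = "SELECT iccid, status FROM sims WHERE tenant_id = ?"
    params = [tenant_id]
    if status is not None:
        sql += " AND status = ?"
        params.append(status)
    return conn.execute(sql + " ORDER BY iccid", params).fetchall()
```

The point of the shape is that the tenant filter is baked into the only function callers get; optional filters are layered on top of it, never instead of it.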

Stop polling carriers on the read path

The synchronous refresh was the root of every performance fire. The fix was boring and correct: decouple reading SIM state from fetching it. A pool of workers reconciles SIM state against carrier APIs on a schedule and writes the result to our own store. The portal only ever reads our copy. Carrier latency stopped being the user’s problem.

Rule of thumb: if a user action triggers a third-party API call on the critical path, you've outsourced your p99 to someone who doesn't care about it.
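A rough sketch of that split, under loud assumptions: SimStore is an in-memory stand-in for our own store, fetch_status is a hypothetical callable wrapping a carrier API, and the scheduling of reconcile is left out entirely. What it shows is that the read path has no way to reach the carrier.

```python
import time


class SimStore:
    """Our own copy of SIM state; the portal reads only from here."""

    def __init__(self):
        self._rows = {}  # iccid -> (status, fetched_at)

    def upsert(self, iccid, status, fetched_at):
        self._rows[iccid] = (status, fetched_at)

    def get(self, iccid):
        return self._rows.get(iccid)


def reconcile(store, iccids, fetch_status, now=time.time):
    """Worker side: pull carrier state and write it to the store.

    A failed fetch is skipped so one bad SIM does not stall the run;
    the existing (stale) row stays in place until the next pass.
    Returns the number of SIMs refreshed.
    """
    refreshed = 0
    for iccid in iccids:
        try:
            status = fetch_status(iccid)
        except Exception:
            continue
        store.upsert(iccid, status, now())
        refreshed += 1
    return refreshed


def read_sim(store, iccid):
    """Portal side: never calls the carrier, only reads the local copy."""
    row = store.get(iccid)
    if row is None:
        return {"iccid": iccid, "status": "unknown", "fetched_at": None}
    status, fetched_at = row
    return {"iccid": iccid, "status": status, "fetched_at": fetched_at}
```

Storing fetched_at next to the status is the small detail that pays off later: the portal can show how old its answer is instead of pretending it is live.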

Bulk provisioning wants a queue, retries, and a spine

Provisioning 10,000 SIMs in one operation is not 10,000 little requests you fire and forget. Carriers rate-limit, time out, and occasionally just lie. The pipeline that survived was built around three ideas: every unit of work is idempotent, every failure is retried with exponential backoff and jitter, and the whole batch reports progress so an operator can watch it breathe.

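import asyncio
import random

async def provision_one(sim, attempt=0, max_attempts=8):
    try:
        await carrier.activate(sim.iccid)  # must be safe to repeat
    except TransientError:
        if attempt + 1 >= max_attempts:
            raise  # assumed retry cap reached; surface the failure
        # capped exponential backoff; jitter is added after the cap so it survives
        delay = min(2 ** attempt, 60) + random.random()
        await asyncio.sleep(delay)
        return await provision_one(sim, attempt + 1, max_attempts)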

The jitter matters more than people expect. Without it, every retry from a failed batch wakes up at the same instant and stampedes the carrier in lockstep — you’ve built a self-inflicted DDoS with a polite name.
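Here is a sketch of how the three ideas can fit together for a whole batch. It is an illustration, not the pipeline we ran: the activate callable, the already_done set used for idempotency, the on_progress callback, and all default values are assumptions. It also swaps in "full jitter" (a random delay between zero and the capped backoff) rather than a fixed backoff plus a small random term.

```python
import asyncio
import random


class TransientError(Exception):
    """Stand-in for a retryable carrier failure (hypothetical)."""


async def activate_with_retry(activate, iccid, max_attempts=5, base=1.0, cap=60.0):
    """Call activate(iccid), retrying transient failures with full jitter."""
    for attempt in range(max_attempts):
        try:
            return await activate(iccid)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of retries
            # Sleep a random time in [0, capped exponential backoff].
            await asyncio.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


async def provision_batch(activate, iccids, already_done=(), concurrency=10,
                          on_progress=None, **retry_opts):
    """Provision a batch with bounded concurrency; return (done, failed)."""
    done = set(already_done)
    failed = []
    total = len(iccids)
    sem = asyncio.Semaphore(concurrency)  # keeps us under carrier rate limits

    async def worker(iccid):
        if iccid in done:
            return  # idempotency: a re-run skips work that already succeeded
        async with sem:
            try:
                await activate_with_retry(activate, iccid, **retry_opts)
                done.add(iccid)
            except TransientError:
                failed.append(iccid)
        if on_progress is not None:
            on_progress(len(done), len(failed), total)

    await asyncio.gather(*(worker(iccid) for iccid in iccids))
    return done, failed
```

Re-running a batch is then just calling provision_batch again with the previous done set passed as already_done, which is what makes a half-failed batch boring instead of scary.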

Observability is the feature

The thing that actually changed my life wasn’t any single architectural move — it was wiring OpenTelemetry traces through the workers and putting SIM-state-age on a Grafana board. Once I could see, per tenant, how stale our view of the fleet was, every other decision got easier. You can’t tune what you can’t watch, and at 100k SIMs you are absolutely not eyeballing logs.
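For the curious, SIM-state-age is a simple number to compute. The sketch below is my own illustration and deliberately leaves out the OpenTelemetry and Grafana wiring: it assumes each stored row carries a tenant_id and a fetched_at timestamp in epoch seconds, and the function name and output shape are made up for the example.

```python
from collections import defaultdict


def state_age_by_tenant(rows, now):
    """Summarise how stale our view of each tenant's fleet is.

    rows: iterable of (tenant_id, fetched_at) pairs, fetched_at in epoch seconds.
    Returns {tenant_id: {"max_age": seconds, "mean_age": seconds, "sims": count}}.
    """
    ages = defaultdict(list)
    for tenant_id, fetched_at in rows:
        # Clamp at zero so clock skew never reports a negative age.
        ages[tenant_id].append(max(0.0, now - fetched_at))
    return {
        tenant_id: {
            "max_age": max(values),
            "mean_age": sum(values) / len(values),
            "sims": len(values),
        }
        for tenant_id, values in ages.items()
    }
```

The max is the one worth alerting on: a healthy mean can hide a single tenant's SIMs that have not been reconciled in hours.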

What I’d tell past me

Decouple reads from fetches on day one. Make the unsafe query unrepresentable. Treat every external call as something that will fail halfway through a batch, because it will. None of this is clever — it’s just the stuff that lets a small team run a six-figure fleet without a pager going off at dinner.

Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.
