Surviving 100k SIMs: lessons from a multi-tenant IoT portal
What I learned giving a fleet of 100,000+ active SIMs real-time visibility — tenant isolation, the bulk-provisioning pipeline, and why I stopped trusting synchronous anything.
Read post

ECS Fargate, zero-downtime deploys, and the lie of rolling updates
How rolling deployments on ECS can silently drop in-flight requests — and the connection-draining, health-check, and task-definition patterns that actually prevent it.
Read post

Moving MySQL → Aurora under live traffic (and sleeping fine)
A zero-drama migration playbook: dual-writes, shadow reads, a Redis buffer in front of the hot path, and the rollback switch I hoped I'd never flip.
Read post

What a 60% drop in DB read load actually taught me about caching
Notes on the Redis pipeline that replaced our synchronous SMS-status path — and the cache-invalidation bugs that nearly undid all of it.
Read post

Backoff, checkpoints & the art of the multi-hour sync job
Long-running jobs fail in the middle — always. Here's how I make carrier syncs that run for hours resume cleanly instead of starting over at 3 a.m.
Read post