ECS Fargate, zero-downtime deploys, and the lie of rolling updates

We migrated to ECS Fargate because managing EC2 instances at scale is, charitably, a full-time job for someone who has other full-time jobs. The pitch was reasonable: “Just containers, but AWS manages the metal.” What we got was containers, AWS managing the metal, and a first deploy that cut a handful of in-flight carrier sync sessions mid-execution. The old task got terminated while it was still holding open connections. AWS called it a rolling update. It wasn’t.

Here’s what actually happened and the three things you have to wire up yourself before “rolling update” means anything close to zero-downtime.

What actually happens during a rolling update

ECS’s rolling update sequence looks like this on paper:

1. Start new task
2. Wait for new task to pass health checks
3. Deregister old task from ALB target group
4. Send SIGTERM to old task
5. Wait stopTimeout seconds
6. Send SIGKILL

The problem is in steps 3 and 4. “Deregister from target group” is not the same as “ALB stops sending traffic.” The target group has a deregistration delay (the deregistration_delay.timeout_seconds attribute, called deregistrationDelay here; default: 300 seconds) during which the ALB drains existing connections to the old task. ECS doesn’t wait for that drain to finish. It fires SIGTERM after deregistration starts, not after the drain completes.

If the stopTimeout in your task definition is shorter than the ALB drain time (and the default stopTimeout is 30 seconds), ECS kills the process while the ALB is still routing requests to it. The load balancer keeps sending live traffic to a task that no longer exists until it catches up, and those requests fail.

The defaults don’t coordinate. You have to make them coordinate.
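To put a number on the mismatch, here is a small arithmetic sketch. It assumes the sequence described above: SIGTERM goes out when deregistration starts, and SIGKILL follows stopTimeout seconds later. The helper name exposure_window_seconds is made up for illustration and is not part of any AWS API.

```python
def exposure_window_seconds(deregistration_delay: int, stop_timeout: int) -> int:
    """Seconds during which the ALB may still be draining a task that has
    already been killed, under the sequence assumed in the text above."""
    return max(0, deregistration_delay - stop_timeout)


# Defaults quoted above: 300-second drain, 30-second stopTimeout.
print(exposure_window_seconds(300, 30))  # 270

# Coordinated values: 30-second drain, 35-second stopTimeout.
print(exposure_window_seconds(30, 35))   # 0
```

With the defaults, the drain outlives the process by four and a half minutes. That gap is what the three fixes below are meant to close.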

The three fixes

1. stopTimeout must be >= ALB drain time plus buffer

Set deregistrationDelay on the ALB target group to something reasonable for your workload. For our carrier sync service, 30 seconds was enough for long-running requests to complete. Then set stopTimeout on the container definition in the ECS task definition to at least that, plus a margin. We use 35 seconds with a 30-second drain. The fragment below shows only the relevant keys of that container definition.

{
  "stopTimeout": 35,
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8000/healthz || exit 1"],
    "startPeriod": 15
  }
}

Set startPeriod to give the application time to initialize before ECS starts counting failed health checks against it. Without it, a slow-starting container gets killed before it’s ready.
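A check like this is easy to run in CI before a task definition ever reaches ECS. The sketch below is one way to do it, not an official tool: check_container and its margin parameter are hypothetical names, and it assumes you pass in the container definition as a plain dict along with the drain time you configured on the target group. The 120-second ceiling is the Fargate limit on stopTimeout.

```python
FARGATE_MAX_STOP_TIMEOUT = 120  # Fargate caps stopTimeout at 120 seconds
ECS_DEFAULT_STOP_TIMEOUT = 30   # used by ECS when stopTimeout is not set


def check_container(container_def: dict, drain_seconds: int, margin: int = 5) -> list:
    """Return a list of problems; an empty list means the timings line up."""
    problems = []

    stop_timeout = container_def.get("stopTimeout", ECS_DEFAULT_STOP_TIMEOUT)
    if stop_timeout < drain_seconds + margin:
        problems.append(
            f"stopTimeout {stop_timeout}s is below drain {drain_seconds}s + margin {margin}s"
        )
    if stop_timeout > FARGATE_MAX_STOP_TIMEOUT:
        problems.append(f"stopTimeout {stop_timeout}s exceeds the Fargate maximum")

    # startPeriod is off unless it is set explicitly.
    if "startPeriod" not in container_def.get("healthCheck", {}):
        problems.append("healthCheck.startPeriod is not set")

    return problems


ours = {"stopTimeout": 35, "healthCheck": {"startPeriod": 15}}
print(check_container(ours, drain_seconds=30))  # []
print(check_container({}, drain_seconds=300))   # two problems
```

Note what the second call implies: a 300-second drain can never be covered by stopTimeout on Fargate, so the drain time is the value that has to come down.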

2. The health check must return 503 during shutdown

This is the one most people miss. If your /healthz always returns 200, the ALB keeps the dying task in rotation. It will send new requests to a container that’s in the middle of shutting down. The fix is a shutdown flag that the SIGTERM handler sets, and a health check endpoint that reads it.

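import signal
from fastapi import FastAPI, Response

app = FastAPI()
shutting_down = False

# Remember any handler that is already installed (for example the server's
# own) so setting the flag does not replace the server's shutdown logic.
previous_handler = signal.getsignal(signal.SIGTERM)

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True
    # Chain to the earlier handler so graceful shutdown still starts.
    if callable(previous_handler):
        previous_handler(signum, frame)

# Caveat: uvicorn also installs its own SIGTERM handler when it starts
# serving. Depending on the uvicorn version and import order, that can happen
# before or after this line runs, so confirm this handler fires in your setup.
signal.signal(signal.SIGTERM, handle_sigterm)

@app.get("/healthz")
async def healthz(response: Response):
    # Report 503 once shutdown has begun so the load balancer backs off.
    if shutting_down:
        response.status_code = 503
        return {"status": "shutting_down"}
    return {"status": "ok"}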

Once shutting_down is true, the ALB health check starts failing. After the configured number of consecutive failures, the target gets marked unhealthy and no new requests route to it. The ALB drain period then handles the connections already in flight.
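How quickly the ALB reacts to a failing check depends on the target group's health check interval and unhealthy threshold. The sketch below is a rough estimate only: it ignores the check timeout and where in the interval the shutdown lands, the function names are hypothetical, and the numbers are illustrative rather than recommended settings.

```python
def approx_seconds_until_unhealthy(interval: int, unhealthy_threshold: int) -> int:
    """Rough time for the ALB to see enough consecutive failed checks."""
    return interval * unhealthy_threshold


def detected_within_drain(interval: int, unhealthy_threshold: int, drain_seconds: int) -> bool:
    """True if the target is likely marked unhealthy before the drain ends."""
    return approx_seconds_until_unhealthy(interval, unhealthy_threshold) < drain_seconds


# Illustrative values: checks every 5 seconds, 2 failures to mark unhealthy.
print(approx_seconds_until_unhealthy(5, 2))            # 10
print(detected_within_drain(5, 2, drain_seconds=30))   # True

# Slow checks never catch the shutdown inside a 30-second drain.
print(detected_within_drain(30, 5, drain_seconds=30))  # False
```

If the estimate comes out longer than the drain window, the 503 arrives too late to matter, and the interval or threshold needs tightening.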

3. Drain open connections before exiting

Setting the flag is not enough. The application has to wait for in-flight requests to finish before the process exits. With FastAPI and uvicorn, you can handle this by letting the server’s graceful shutdown run its course after SIGTERM.

import asyncio
import uvicorn

async def main():
    config = uvicorn.Config(
        "app:app",
        host="0.0.0.0",
        port=8000,
        timeout_graceful_shutdown=30  # match your drain window
    )
    server = uvicorn.Server(config)
    await server.serve()

if __name__ == "__main__":
    asyncio.run(main())

timeout_graceful_shutdown tells uvicorn how long to wait for active connections to close after receiving a shutdown signal. Set it to match your deregistration delay, and keep it below stopTimeout so uvicorn finishes on its own before ECS sends SIGKILL.
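uvicorn handles the waiting for HTTP requests. For work it doesn't know about, such as a background sync session, the same idea can be sketched by hand: count what is in flight and wait for the count to reach zero, with a deadline. InFlightTracker and demo are hypothetical names for illustration, not part of uvicorn or FastAPI.

```python
import asyncio
from contextlib import asynccontextmanager


class InFlightTracker:
    """Counts in-progress units of work so shutdown can wait for them."""

    def __init__(self):
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing is running yet

    @asynccontextmanager
    async def track(self):
        self._count += 1
        self._idle.clear()
        try:
            yield
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()

    async def wait_idle(self, timeout: float) -> bool:
        """Return True if all tracked work finished within `timeout` seconds."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo():
    tracker = InFlightTracker()

    async def job(seconds: float):
        async with tracker.track():
            await asyncio.sleep(seconds)

    fast = asyncio.create_task(job(0.05))
    await asyncio.sleep(0)  # let the job start and register itself
    drained = await tracker.wait_idle(timeout=1.0)
    await fast

    slow = asyncio.create_task(job(1.0))
    await asyncio.sleep(0)
    gave_up = not await tracker.wait_idle(timeout=0.05)
    slow.cancel()  # the deadline passed, so stop waiting for it
    try:
        await slow
    except asyncio.CancelledError:
        pass
    return drained, gave_up


print(asyncio.run(demo()))
```

The deadline passed to wait_idle plays the same role as timeout_graceful_shutdown: it should fit inside the drain window, which in turn fits inside stopTimeout.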

The health check nobody reads

A /healthz that always returns 200 is a lie your load balancer believes. It will keep sending traffic to a container that hasn’t finished initializing, or one that’s about to die, or one whose database connection pool is exhausted. The load balancer trusts whatever you tell it.

A health check worth having does three things: returns 200 only when the application is genuinely ready to handle traffic, returns 503 during graceful shutdown, and has a startPeriod long enough that ECS doesn’t start checking before the app is ready.

That last point matters because ECS uses the health check to decide whether to kill a starting container. If your app takes 12 seconds to initialize and your startPeriod is 5, ECS might kill it before it’s ready and start the cycle over. We set startPeriod to 15 seconds for a service that takes 8-10 seconds to warm its connection pool. Comfortable margin, not guesswork.
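The readiness and shutdown rules above fit in one small, framework-free function. This is a sketch under assumptions: AppState and its pool_ready field are hypothetical stand-ins for whatever your service needs before it can take traffic, and the returned pair is meant to be mapped onto your web framework's response.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    pool_ready: bool = False     # set once the connection pool is warm
    shutting_down: bool = False  # set by the SIGTERM handler


def health_status(state: AppState):
    """Return (HTTP status code, body) for the health check endpoint."""
    if state.shutting_down:
        return 503, {"status": "shutting_down"}
    if not state.pool_ready:
        return 503, {"status": "starting"}
    return 200, {"status": "ok"}


state = AppState()
print(health_status(state))  # (503, {'status': 'starting'})

state.pool_ready = True
print(health_status(state))  # (200, {'status': 'ok'})

state.shutting_down = True
print(health_status(state))  # (503, {'status': 'shutting_down'})
```

Shutdown is checked first on purpose: a task that is both warm and dying should still report 503.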

The part that surprised me

None of this is undocumented. AWS has pages on stopTimeout, deregistrationDelay, health check configuration, and connection draining. The problem is that those pages don’t talk to each other. You read the ECS task definition reference and it tells you stopTimeout exists. You read the ALB docs and they tell you deregistrationDelay exists. Nowhere does it say: these two values need to be coordinated, and here is the sequence of events during a deploy and why they interact.

You figure it out by reading three docs pages, drawing the sequence on paper, deploying under synthetic load, and watching what breaks. “Managed infrastructure” means AWS manages the physical machines. It doesn’t mean AWS manages the interaction between its own services on your behalf.

That’s not a complaint. The pieces are there. But “managed” has a narrower meaning than the word implies, and realizing that earlier would have saved an afternoon of tracing dropped connections.

Zero-downtime is a contract you write yourself. The platform gives you the tools. Putting them in the right order is your job.

Written by Nischal Shrestha — backend engineer, occasional optimist about distributed systems.
