Zero-downtime Kubernetes deploys — the details nobody tells you

Abdelfattah Hilmi · Jan 20, 2026

#Kubernetes #SRE #DevOps #Production

“We have readiness probes — we’re zero-downtime, right?”

Wrong, and I’ll prove it.

Run a load test against a Deployment doing a rolling update. Even with a perfect readinessProbe, you’ll see a small spike of 502s and connection resets. Why?

Because Kubernetes’ termination flow has at least four asynchronous things racing each other, and your application is one of them. The default behavior leaks in-flight requests on every deploy. This post walks through the timing diagram nobody draws, and the five tiny configuration changes that actually fix it.

The pod termination flow

When you kubectl rollout restart deploy/api, here’s what kube actually does for each pod:

1. The pod is marked Terminating in the API server (its deletionTimestamp is set).
2. In parallel:
   - The endpoints controller removes the pod’s IP from the Endpoints / EndpointSlice of every Service it backs.
   - The kubelet runs the preStop hook, if one is defined, and sends SIGTERM to PID 1 in the container once the hook returns (immediately, if there is no hook).
3. The kubelet waits up to terminationGracePeriodSeconds (default 30s), counted from the moment termination began.
4. If the container hasn’t exited, the kubelet sends SIGKILL.

The bug is in step 2: endpoint removal is eventually consistent and runs in parallel with the kubelet’s shutdown of the container. With no preStop hook, SIGTERM goes out immediately, and this is the flow that bites everyone:

t=0ms: pod marked Terminating, SIGTERM sent.
t=5ms: your app calls server.close() and stops accepting connections.
t=80ms: a kube-proxy or ingress controller somewhere still thinks the pod is healthy and routes a request to it → connection refused.
t=200ms: kube-proxy finally re-syncs iptables, stops routing.

The pod stopped accepting connections faster than the cluster could update its routing tables.
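To put a number on that race, here is a minimal Go sketch. The exposureWindow helper is made up for illustration and is not part of any Kubernetes API, and the inputs are just the illustrative timings from the timeline above.

```go
package main

import (
    "fmt"
    "time"
)

// exposureWindow returns how long the pod refuses connections while some
// proxy may still route to it: from the moment the app stops listening until
// routing has converged. It is zero when routing converges first.
func exposureWindow(stoppedListeningAt, routingSyncedAt time.Duration) time.Duration {
    if routingSyncedAt <= stoppedListeningAt {
        return 0
    }
    return routingSyncedAt - stoppedListeningAt
}

func main() {
    // Timeline values from the text: app closes at t=5ms, kube-proxy re-syncs at t=200ms.
    fmt.Println(exposureWindow(5*time.Millisecond, 200*time.Millisecond)) // 195ms

    // If the app kept listening until long after the re-sync, the window closes.
    fmt.Println(exposureWindow(15*time.Second, 200*time.Millisecond)) // 0s
}
```

Every request that lands inside that window is a candidate for a connection reset, so the goal of everything that follows is to drive it to zero.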

Fix 1: a `preStop` sleep

This is the single biggest win:

spec:
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]

preStop runs before SIGTERM is sent, and the kubelet holds SIGTERM back until the hook returns. (The exec form needs a sleep binary in the image, so it won’t work on distroless or scratch images.) During those 15 seconds:

- The endpoints controller has time to propagate the removal across every node, every kube-proxy, every ingress controller.
- Your pod is still healthy and serving traffic while new traffic is steered away from it.
- After 15s, SIGTERM arrives and your pod can shut down cleanly with no new connections incoming.

Then bump the grace period so your app has time to drain in-flight work after SIGTERM:

spec:
  terminationGracePeriodSeconds: 60   # 15s preStop + up to 45s for in-flight requests

The terminationGracePeriodSeconds clock includes the preStop window. If your preStop sleeps 15s and your grace period is the default 30s, your app only gets 15s to drain.
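The budget arithmetic is easy to get wrong, so a quick sketch in Go may help. drainBudget and fits are hypothetical helper names; the only rule they encode is the one above, that preStop time comes out of the grace period.

```go
package main

import (
    "fmt"
    "time"
)

// drainBudget returns the time left for the app between SIGTERM and SIGKILL,
// given that the grace period clock also covers the preStop hook.
func drainBudget(gracePeriod, preStop time.Duration) time.Duration {
    if preStop >= gracePeriod {
        return 0
    }
    return gracePeriod - preStop
}

// fits reports whether the app's own shutdown timeout fits inside the budget.
func fits(gracePeriod, preStop, shutdownTimeout time.Duration) bool {
    return shutdownTimeout <= drainBudget(gracePeriod, preStop)
}

func main() {
    fmt.Println(drainBudget(60*time.Second, 15*time.Second)) // 45s
    fmt.Println(drainBudget(30*time.Second, 15*time.Second)) // 15s

    // A 45s in-app shutdown timeout does not fit the default 30s grace period.
    fmt.Println(fits(30*time.Second, 15*time.Second, 45*time.Second)) // false
}
```

A check like this could live in a CI lint step, so a change to one of the three numbers can’t silently starve the other two.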

Fix 2: handle SIGTERM in your app

Most web frameworks won’t do this for you. The pattern in Go:

package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    mux := http.NewServeMux() // register your handlers on mux
    srv := &http.Server{Addr: ":8080", Handler: mux}

    // Serve in the background so main can block on the signal below.
    go func() {
        if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            log.Fatal(err)
        }
    }()

    stop := make(chan os.Signal, 1)
    signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
    <-stop

    // Stop accepting new conns, wait for in-flight requests to finish.
    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf("graceful shutdown failed: %v", err)
    }
}

Node.js with Express:

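const server = app.listen(8080);
process.on("SIGTERM", () => {
  // Stops accepting new conns; the callback fires once open connections are gone.
  server.close(() => process.exit(0));
  // Idle keep-alive sockets can hold close() open on older Node versions.
  // closeIdleConnections exists from Node 18.2, hence the optional call.
  server.closeIdleConnections?.();
  setTimeout(() => process.exit(1), 45_000).unref(); // hard cap
});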

Python with gunicorn: gunicorn already handles SIGTERM gracefully — but only if your workers can finish in time. Set --graceful-timeout 45.

If your app ignores SIGTERM, you’re back to square one: SIGKILL drops everything.
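If you want to convince yourself that Shutdown really lets in-flight requests finish, a small local experiment is enough. This is a sketch under assumptions: drainDemo is a made-up helper, the server listens on a random loopback port, and the sleeping handler stands in for real work.

```go
package main

import (
    "context"
    "fmt"
    "net"
    "net/http"
    "time"
)

// drainDemo starts a server with a slow handler, begins a request, calls
// Shutdown while that request is still running, and returns the status code
// the client finally sees.
func drainDemo(handlerDelay time.Duration) (int, error) {
    ln, err := net.Listen("tcp", "127.0.0.1:0") // random free loopback port
    if err != nil {
        return 0, err
    }
    started := make(chan struct{})
    srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        close(started)           // the single request is now in flight
        time.Sleep(handlerDelay) // simulate slow work
        fmt.Fprint(w, "done")
    })}
    go srv.Serve(ln) // returns http.ErrServerClosed after Shutdown

    type result struct {
        status int
        err    error
    }
    resc := make(chan result, 1)
    go func() {
        resp, err := http.Get("http://" + ln.Addr().String())
        if err != nil {
            resc <- result{0, err}
            return
        }
        resp.Body.Close()
        resc <- result{resp.StatusCode, nil}
    }()

    <-started // wait until the handler is running, then shut down under it
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        return 0, err
    }
    r := <-resc
    return r.status, r.err
}

func main() {
    // A 200 with a nil error means the in-flight request survived Shutdown.
    status, err := drainDemo(200 * time.Millisecond)
    fmt.Println(status, err)
}
```

Swap Shutdown for Close in the same experiment and the client should see an error instead, which is roughly what SIGKILL does to your users.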

Fix 3: use a PodDisruptionBudget

A rolling update isn’t the only way pods die. Node drain, eviction, autoscaler scale-down: all of these terminate pods. Without a PodDisruptionBudget, nothing stops a drain from evicting every replica on the affected nodes at the same time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # or: maxUnavailable: 25%
  selector:
    matchLabels: { app: api }

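The cluster autoscaler and kubectl drain go through the eviction API, which honors the budget: an eviction that would drop the app below minAvailable is refused and retried until a replacement is Ready. PDBs only cover voluntary disruptions, and the rolling update itself is governed by the Deployment strategy, not the PDB. Without a PDB, your zero-downtime deploy breaks the moment ops runs kubectl drain.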
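The arithmetic behind a minAvailable budget is simple enough to sketch in Go. disruptionsAllowed is a hypothetical helper that mirrors the idea, not the disruption controller’s actual code.

```go
package main

import "fmt"

// disruptionsAllowed mirrors the arithmetic behind a minAvailable budget:
// evictions are permitted only while healthy pods exceed the floor.
func disruptionsAllowed(healthy, minAvailable int) int {
    if healthy <= minAvailable {
        return 0
    }
    return healthy - minAvailable
}

func main() {
    fmt.Println(disruptionsAllowed(6, 2)) // 4: all six replicas healthy
    fmt.Println(disruptionsAllowed(2, 2)) // 0: the next eviction is refused
}
```

With six replicas, minAvailable: 2 still lets a drain take four pods at once, so pick the floor based on the capacity you need to keep serving, not on the replica count.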

Fix 4: tune the rollout itself

The defaults are reasonable but rarely optimal:

spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%         # spin up 25% extra during rollout
      maxUnavailable: 0     # never go below desired count
  minReadySeconds: 10       # wait 10s after Ready before counting a pod as available

maxUnavailable: 0 is the safe choice for capacity-sensitive workloads (the actual default is 25%): you’ll always have full capacity during rollouts, at the cost of needing room for replicas + maxSurge pods.

minReadySeconds is the underrated knob. A pod that goes Ready then immediately falls over (bad config, slow JVM warmup) will still pass a single Ready check. minReadySeconds: 10 makes the rollout wait 10s of continuous readiness before promoting the new pod — catches the “Ready then crashloop” class of bug.
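Percentages hide some rounding: Kubernetes rounds maxSurge up and maxUnavailable down. A small Go sketch of what that means for the six-replica example, with rolloutBounds as a made-up helper name:

```go
package main

import (
    "fmt"
    "math"
)

// rolloutBounds resolves percentage-based maxSurge and maxUnavailable for a
// given replica count. maxSurge rounds up, maxUnavailable rounds down.
func rolloutBounds(replicas int, surgePct, unavailablePct float64) (maxPods, minAvailable int) {
    surge := int(math.Ceil(float64(replicas) * surgePct / 100))
    unavailable := int(math.Floor(float64(replicas) * unavailablePct / 100))
    return replicas + surge, replicas - unavailable
}

func main() {
    // The config above: maxSurge 25%, maxUnavailable 0.
    maxPods, minAvailable := rolloutBounds(6, 25, 0)
    fmt.Println(maxPods, minAvailable) // 8 6

    // The Deployment defaults: 25% for both.
    maxPods, minAvailable = rolloutBounds(6, 25, 25)
    fmt.Println(maxPods, minAvailable) // 8 5
}
```

So "25% extra" on six replicas really means room for two more pods, which matters when you size node headroom or namespace quotas.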

Fix 5: probe configuration that doesn’t lie

The default readinessProbe example most teams copy:

readinessProbe:
  httpGet: { path: /healthz, port: 8080 }

This is broken in two ways:

- /healthz returns 200 too eagerly. It should check that the downstream dependencies the app actually needs (DB, Redis, message broker) are reachable. If the DB is down and your readiness probe only asks “is the process alive?”, you’ll happily route traffic to a pod that can’t serve it.
- No startupProbe for slow-start apps. A JVM app that takes 40s to warm up fails its probes for the whole boot. Readiness failures only keep the pod out of rotation, but a livenessProbe on default settings (periodSeconds: 10, failureThreshold: 3) gives up after roughly 30s and restarts the container before it ever finishes starting. Use a startupProbe with a long failureThreshold to give it room:

startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5         # 30 × 5s = 150s to start
readinessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2      # take pod out of rotation after 10s of unhealth

Until the startupProbe succeeds, the readiness and liveness probes are disabled. Once it passes, they take over.
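The probe numbers are worth sanity-checking against your real boot time. Here is a rough Go sketch: failureWindow and survivesBoot are hypothetical helpers, the 40s boot time is the JVM example from above, and the liveness case assumes Kubernetes’ documented probe defaults (periodSeconds: 10, failureThreshold: 3) with no startupProbe.

```go
package main

import (
    "fmt"
    "time"
)

// failureWindow is roughly how long a probe tolerates consecutive failures
// before Kubernetes acts: failureThreshold attempts, periodSeconds apart.
func failureWindow(failureThreshold, periodSeconds int) time.Duration {
    return time.Duration(failureThreshold*periodSeconds) * time.Second
}

// survivesBoot reports whether an app with the given boot time comes up
// before the probe's failure window runs out.
func survivesBoot(boot, window time.Duration) bool {
    return boot <= window
}

func main() {
    boot := 40 * time.Second // the slow JVM from the text

    startup := failureWindow(30, 5)         // startupProbe above
    readiness := failureWindow(2, 5)        // readinessProbe above
    defaultLiveness := failureWindow(3, 10) // assumed defaults, no startupProbe

    fmt.Println(startup, survivesBoot(boot, startup))                 // 2m30s true
    fmt.Println(readiness)                                            // 10s
    fmt.Println(defaultLiveness, survivesBoot(boot, defaultLiveness)) // 30s false
}
```

The windows are approximations (they ignore initialDelaySeconds and probe timeouts), but they are close enough to catch a startup budget that is obviously too tight.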

The full picture

A production-grade Deployment for a stateless HTTP service:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
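  minReadySeconds: 10
  selector:
    matchLabels: { app: api }   # required in apps/v1; must match the template labels
  template:
    metadata:
      labels: { app: api }      # the PDB below selects on this label too
    spec: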
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: ghcr.io/acme/api:v1.42.0
          ports: [{ containerPort: 8080 }]
          startupProbe:
            httpGet: { path: /healthz, port: 8080 }
            failureThreshold: 30
            periodSeconds: 5
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet: { path: /livez, port: 8080 }
            periodSeconds: 30
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: api }

Re-run the load test against this, with the app handling SIGTERM as in Fix 2, and the 502s disappear.

A note on `livenessProbe`

Most teams set livenessProbe == readinessProbe. Don’t.

readinessProbe failing → pod removed from Service endpoints, no restart. Good for transient backpressure.
livenessProbe failing → pod killed and restarted. Should only happen if the pod is genuinely wedged.

If your liveness probe checks the database, a 5-second DB blip restarts every pod in your fleet simultaneously, which then all start cold and hammer the DB further. This is a classic outage amplifier. Liveness should answer “is this process deadlocked?”, which usually just means “is the event loop responsive?”.
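In code, the split might look like the Go sketch below. Everything named here is an assumption for illustration: depCheck stands in for whatever dependency ping your app has, dbDown simulates an outage, and probe is a small in-process helper so the example runs without a cluster.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "net/http"
    "net/http/httptest"
    "time"
)

// depCheck is a hypothetical dependency check, e.g. a DB ping.
type depCheck func(ctx context.Context) error

// readyHandler backs /healthz: it fails while a required dependency is down,
// which takes the pod out of rotation without restarting it.
func readyHandler(check depCheck) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := check(ctx); err != nil {
            http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

// liveHandler backs /livez: it only proves the process can still answer HTTP,
// so a dependency outage never triggers a restart.
func liveHandler() http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    }
}

// dbDown simulates an unreachable database.
func dbDown(ctx context.Context) error { return errors.New("db unreachable") }

// probe calls a handler in-process and returns the status code a probe would see.
func probe(h http.Handler, path string) int {
    rec := httptest.NewRecorder()
    h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
    return rec.Code
}

func main() {
    mux := http.NewServeMux()
    mux.Handle("/healthz", readyHandler(dbDown))
    mux.Handle("/livez", liveHandler())

    fmt.Println(probe(mux, "/healthz")) // 503: pod leaves rotation
    fmt.Println(probe(mux, "/livez"))   // 200: no restart
}
```

During the simulated outage the pod drops out of the Service but stays alive, so it can resume serving the moment the database comes back.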

Closing thought

Kubernetes will not give you zero-downtime deploys by default. It gives you the primitives to build them. The five fixes (a preStop sleep with a matching grace period, SIGTERM handling, a PDB, a tuned rollout strategy, and honest probes) work together. Skip any one and you’ll occasionally see 502s during deploys and never quite understand why.

— Abdel