#Kubernetes
Wrong, and I’ll prove it.
Run a load test against a Deployment doing a rolling update. Even with a perfect readinessProbe, you’ll see a small spike of 502s and connection resets. Why?
Because Kubernetes’ termination flow has at least four asynchronous things racing each other, and your application is one of them. The default behavior leaks in-flight requests on every deploy. This post walks through the timing diagram nobody draws, and the five tiny configuration changes that actually fix it.
When you kubectl rollout restart deploy/api, here’s what kube actually does for each pod:
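1. Marks the pod as Terminating in the API server.
2. Starts removing the pod from the Endpoints / EndpointSlice of every Service it backs.
3. Runs the preStop hook, if one is defined.
4. Sends SIGTERM to PID 1 in the container.
5. Waits for the container to exit, up to terminationGracePeriodSeconds (default 30s, counted from step 1).
6. Sends SIGKILL.

The bug is in step 2: endpoint removal is eventually consistent and runs in parallel with the preStop hook and SIGTERM. The flow that bites everyone: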
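1. The pod is marked Terminating and, with no preStop hook, receives SIGTERM almost immediately.
2. The app calls server.close() and stops accepting connections.
3. Proxies and load balancers that have not yet seen the endpoint update keep sending new connections to the pod, and those connections are refused or reset.

The pod was killed faster than the cluster could update its routing tables.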
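To put a number on that race, here is a minimal sketch of the arithmetic. The 5s propagation figure is an assumption for illustration, not a measured value, and exposedMillis is a hypothetical helper name.

```go
package main

import "fmt"

// exposedMillis returns how long (in ms) traffic can still be routed to a pod
// that has already stopped accepting connections. propagationMs is an assumed
// time for the endpoint removal to reach every proxy; delayMs is how long the
// pod keeps serving after termination starts (0 when SIGTERM arrives at once).
func exposedMillis(propagationMs, delayMs int) int {
	if delayMs >= propagationMs {
		return 0
	}
	return propagationMs - delayMs
}

func main() {
	// Hypothetical cluster where routing updates take up to 5s to propagate.
	fmt.Println(exposedMillis(5000, 0))     // 5000: requests can fail for 5s
	fmt.Println(exposedMillis(5000, 15000)) // 0: pod outlives the stale routes
}
```

Any positive result is a window in which new connections land on a pod that no longer listens.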
preStop sleep

This is the single biggest win:
spec:
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
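preStop runs before SIGTERM is sent. During those 15 seconds the app keeps serving normally while the endpoint removal propagates to every proxy and load balancer, so by the time SIGTERM arrives no new traffic is being routed to the pod.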
Then bump the grace period so SIGTERM has time to drain in-flight work:
spec:
  terminationGracePeriodSeconds: 60 # 15s preStop + up to 45s for in-flight requests
The terminationGracePeriodSeconds clock includes the preStop window. If your preStop sleeps 15s and your grace period is the default 30s, your app only gets 15s to drain.
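A quick way to keep these numbers honest is to check them against each other. The sketch below is illustrative only: shutdownConfig and its fields are hypothetical names, and AppTimeoutSeconds stands for whatever shutdown timeout the application itself uses.

```go
package main

import "fmt"

// shutdownConfig holds the three timings that have to agree.
type shutdownConfig struct {
	GraceSeconds      int // terminationGracePeriodSeconds
	PreStopSeconds    int // length of the preStop sleep
	AppTimeoutSeconds int // how long the app waits for in-flight requests
}

// drainBudget is the time left for the app once the preStop hook returns.
func (c shutdownConfig) drainBudget() int {
	return c.GraceSeconds - c.PreStopSeconds
}

// fits reports whether the app's own timeout ends before SIGKILL would land.
func (c shutdownConfig) fits() bool {
	return c.AppTimeoutSeconds <= c.drainBudget()
}

func main() {
	bumped := shutdownConfig{GraceSeconds: 60, PreStopSeconds: 15, AppTimeoutSeconds: 45}
	defaults := shutdownConfig{GraceSeconds: 30, PreStopSeconds: 15, AppTimeoutSeconds: 45}
	fmt.Println(bumped.drainBudget(), bumped.fits())     // 45 true
	fmt.Println(defaults.drainBudget(), defaults.fits()) // 15 false
}
```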
Most web frameworks won't drain in-flight requests on SIGTERM for you. The pattern in Go:
// Imports: context, log, net/http, os, os/signal, syscall, time.
// mux is the application's http.Handler.
srv := &http.Server{Addr: ":8080", Handler: mux}
go func() {
	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}()

// Block until Kubernetes (SIGTERM) or a local Ctrl-C (SIGINT) asks us to stop.
stop := make(chan os.Signal, 1)
signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
<-stop

// Stop accepting new conns, wait for in-flight requests to finish.
ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
defer cancel()
if err := srv.Shutdown(ctx); err != nil {
	log.Printf("graceful shutdown failed: %v", err)
}
Node.js with Express:
const server = app.listen(8080);
process.on("SIGTERM", () => {
  server.close(() => process.exit(0)); // stops accepting new conns; finishes in-flight
  setTimeout(() => process.exit(1), 45_000).unref(); // hard cap
});
Python with gunicorn: gunicorn already handles SIGTERM gracefully — but only if your workers can finish in time. Set --graceful-timeout 45.
If your app ignores SIGTERM, you’re back to square one: SIGKILL drops everything.
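If you want to convince yourself that a graceful shutdown really lets an in-flight request finish, a small local experiment helps. This is a sketch, not production code: drainDemo, the /slow path, and the timings are all made up for the demonstration, and it assumes the loopback interface is available.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// drainDemo starts a local server whose handler takes handlerMs to respond,
// sends one request, and calls Shutdown while that request is in flight.
// It returns the status the client saw and any request or shutdown error.
func drainDemo(handlerMs, shutdownTimeoutMs int) (int, error) {
	started := make(chan struct{})
	mux := http.NewServeMux()
	mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		close(started) // signal that the request is now in flight
		time.Sleep(time.Duration(handlerMs) * time.Millisecond)
		fmt.Fprint(w, "done")
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0") // any free local port
	if err != nil {
		return 0, err
	}
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called

	type result struct {
		status int
		err    error
	}
	done := make(chan result, 1)
	go func() {
		resp, err := http.Get("http://" + ln.Addr().String() + "/slow")
		if err != nil {
			done <- result{0, err}
			return
		}
		defer resp.Body.Close()
		io.Copy(io.Discard, resp.Body)
		done <- result{resp.StatusCode, nil}
	}()

	select {
	case <-started: // the handler is running; shut down underneath it
	case r := <-done: // the request failed before reaching the handler
		return r.status, r.err
	}

	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(shutdownTimeoutMs)*time.Millisecond)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return 0, err
	}
	r := <-done
	return r.status, r.err
}

func main() {
	status, err := drainDemo(200, 2000)
	fmt.Println(status, err)
}
```

With a 200ms handler and a 2s shutdown timeout, the client should still see a 200, because Shutdown waits for the active connection to go idle before returning.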
A rolling update isn’t the only way pods die. Node drain, eviction, autoscaler scale-down — all of these terminate pods. Without a PodDisruptionBudget, a node drain can kill every replica simultaneously:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2 # or: maxUnavailable: 25%
  selector:
    matchLabels: { app: api }
The cluster autoscaler and kubectl drain will then respect this: an eviction that would take the service below the budget is refused and retried until enough replacement replicas are healthy. Without a PDB, you've inherited a zero-downtime deploy that breaks the moment ops runs kubectl drain.
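It is worth working out how much slack a given budget actually leaves. The sketch below covers only the integer minAvailable case; allowedDisruptions is a hypothetical helper, and the replica count of 6 is an assumption for the example.

```go
package main

import "fmt"

// allowedDisruptions mirrors the arithmetic behind a PDB with an integer
// minAvailable: evictions are permitted only while the number of healthy
// pods exceeds that floor.
func allowedDisruptions(healthy, minAvailable int) int {
	if healthy <= minAvailable {
		return 0
	}
	return healthy - minAvailable
}

func main() {
	fmt.Println(allowedDisruptions(6, 2)) // 4: a drain may evict four pods at once
	fmt.Println(allowedDisruptions(3, 2)) // 1
	fmt.Println(allowedDisruptions(2, 2)) // 0: further evictions are refused
}
```

With six replicas, minAvailable: 2 still lets two thirds of the fleet go at once, so pick the floor based on the capacity you need rather than the bare minimum that keeps the service up.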
The defaults are reasonable but rarely optimal:
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25% # spin up 25% extra during rollout
      maxUnavailable: 0 # never go below desired count
  minReadySeconds: 10 # wait 10s after Ready before counting a pod as available
maxUnavailable: 0 is the safe choice for stateful or capacity-sensitive workloads: you'll always have full capacity during rollouts, at the cost of needing room for replicas + maxSurge pods.
minReadySeconds is the underrated knob. A pod that goes Ready then immediately falls over (bad config, slow JVM warmup) will still pass a single Ready check. minReadySeconds: 10 makes the rollout wait 10s of continuous readiness before promoting the new pod — catches the “Ready then crashloop” class of bug.
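The percentages hide some rounding that matters for capacity planning: a Deployment rounds a maxSurge percentage up and a maxUnavailable percentage down. A small sketch of that arithmetic, with hypothetical helper names:

```go
package main

import "fmt"

// surgePods converts a maxSurge percentage into a pod count, rounding up.
func surgePods(replicas, percent int) int {
	return (replicas*percent + 99) / 100
}

// unavailablePods converts a maxUnavailable percentage into a pod count,
// rounding down.
func unavailablePods(replicas, percent int) int {
	return replicas * percent / 100
}

func main() {
	replicas := 6
	surge := surgePods(replicas, 25)
	fmt.Println(surge)                         // 2: 25% of 6 is 1.5, rounded up
	fmt.Println(replicas + surge)              // 8: pods that need room at the peak
	fmt.Println(unavailablePods(replicas, 25)) // 1: 1.5 rounded down
}
```

So a six-replica service with maxSurge: 25% needs headroom for eight pods during a rollout, not seven and a half.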
The default readinessProbe example most teams copy:
readinessProbe:
  httpGet: { path: /healthz, port: 8080 }
This is broken in two ways:
1. /healthz returns 200 too eagerly. It should check that the downstream dependencies the app actually needs (DB, redis, message broker) are reachable. If the DB is down and your readiness probe only asks "is the process alive?", you'll happily route traffic to a pod that can't serve it.
2. There is no startupProbe for slow-start apps. A JVM app that takes 40s to warm up will fail a 5s readiness probe about 8 times during boot and sit Unready; if a liveness probe is watching the same endpoint, the kubelet restarts the container before it finishes warming up. Use a startupProbe with a long failureThreshold to give it room:

startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5 # 30 × 5s = 150s to start
readinessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2 # take pod out of rotation after 10s of failed checks
While startupProbe is running, the readinessProbe is paused. After startup passes, readiness takes over.
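On the application side, a readiness endpoint that checks dependencies can be quite small. The following is a sketch under assumptions: the check type, readyHandler, and the alwaysUp / alwaysDown stand-ins are invented for illustration, and a real service would plug in its own DB or cache pings with short timeouts.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// check reports whether one dependency is usable; nil means healthy.
type check func() error

// Stand-ins for real dependency pings (hypothetical).
func alwaysUp() error   { return nil }
func alwaysDown() error { return errors.New("connection refused") }

// readyHandler answers 200 only when every dependency check passes, and 503
// otherwise so the pod is taken out of rotation.
func readyHandler(checks map[string]check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for name, c := range checks {
			if err := c(); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "ok")
	}
}

// probe calls the handler the way an HTTP probe would and returns the status.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	healthy := map[string]check{"db": alwaysUp, "redis": alwaysUp}
	degraded := map[string]check{"db": alwaysDown, "redis": alwaysUp}
	fmt.Println(probe(readyHandler(healthy)))  // 200
	fmt.Println(probe(readyHandler(degraded))) // 503
}
```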
A production-grade Deployment for a stateless HTTP service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector: # required by apps/v1; must match the template labels
    matchLabels: { app: api }
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  minReadySeconds: 10
  template:
    metadata:
      labels: { app: api } # also what the PDB below selects on
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: api
          image: ghcr.io/acme/api:v1.42.0
          ports: [{ containerPort: 8080 }]
          startupProbe:
            httpGet: { path: /healthz, port: 8080 }
            failureThreshold: 30
            periodSeconds: 5
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet: { path: /livez, port: 8080 }
            periodSeconds: 30
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: api }
Re-run the load test against this and the 502s disappear.
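If you don't have a load-testing tool handy, even a crude request counter is enough to see the difference. The sketch below is a minimal sequential check rather than a real load generator; countResponses and demo are hypothetical names, and demo points at a local stand-in server that fakes a 502 on every third request so the example can run anywhere.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countResponses sends n sequential GETs and tallies 200s versus everything
// else (non-200 statuses and transport errors such as connection resets).
func countResponses(client *http.Client, url string, n int) (ok, failed int) {
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			failed++
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			ok++
		} else {
			failed++
		}
	}
	return ok, failed
}

// demo runs the counter against a stand-in server that answers every third
// request with a 502, imitating a deploy that drops some traffic.
func demo() (ok, failed int) {
	var hits atomic.Int64
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if hits.Add(1)%3 == 0 {
			w.WriteHeader(http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer ts.Close()
	return countResponses(ts.Client(), ts.URL, 9)
}

func main() {
	ok, failed := demo()
	fmt.Printf("ok=%d failed=%d\n", ok, failed) // ok=6 failed=3
}
```

Against a real cluster you would call countResponses with your Service URL in a loop while the rollout runs, and expect failed to stay at zero.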
livenessProbe

Most teams set livenessProbe == readinessProbe. Don't.

- readinessProbe failing → pod removed from Service endpoints, no restart. Good for transient backpressure.
- livenessProbe failing → pod killed and restarted. Should only happen if the pod is genuinely wedged.

If your liveness probe checks the database, a 5-second DB blip restarts every pod in your fleet simultaneously, which then all start cold and hammer the DB further. This is a classic outage amplifier. Liveness should answer “is this process deadlocked?”, which usually just means “is the event loop responsive?”.
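One way to express "is this process deadlocked?" in a Go service is a heartbeat that the main work loop updates and the liveness endpoint reads. This is a sketch of that idea, not the only way to do it: heartbeat, liveHandler, liveStatus, and the 10s threshold are all assumptions for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// heartbeat records the last time the main work loop made progress.
type heartbeat struct{ last atomic.Int64 }

func (h *heartbeat) beat(now time.Time) { h.last.Store(now.UnixNano()) }

// liveHandler fails only when the loop has not beaten within maxAge. It
// deliberately checks no external dependency.
func liveHandler(h *heartbeat, maxAge time.Duration, now func() time.Time) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		age := now().Sub(time.Unix(0, h.last.Load()))
		if age > maxAge {
			http.Error(w, "work loop stalled", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "alive")
	}
}

// liveStatus simulates a probe arriving silenceSeconds after the last beat
// and returns the HTTP status the probe would see.
func liveStatus(silenceSeconds, maxAgeSeconds int) int {
	start := time.Unix(1000, 0) // fixed clock so the example is deterministic
	hb := &heartbeat{}
	hb.beat(start)
	now := func() time.Time {
		return start.Add(time.Duration(silenceSeconds) * time.Second)
	}
	rec := httptest.NewRecorder()
	h := liveHandler(hb, time.Duration(maxAgeSeconds)*time.Second, now)
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/livez", nil))
	return rec.Code
}

func main() {
	fmt.Println(liveStatus(1, 10))  // 200: the loop beat one second ago
	fmt.Println(liveStatus(30, 10)) // 503: no progress for 30s, restart is justified
}
```

A database outage never changes the answer here, which is exactly the point: only a stalled process fails the probe.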
Kubernetes will not give you zero-downtime deploys by default. It gives you the primitives to build them. The five knobs — preStop sleep, grace period, PDB, rollout strategy, and proper probes — work together. Skip any one and you’ll occasionally see 502s during deploys and never quite understand why.
— Abdel