Kubernetes Probes are Awesome (until they aren't)

Many teams get excited about adding Kubernetes probes to their deployments. However, misconfigured probes can take a bad situation and make it disastrous! In this post, I share some hard-learned lessons you won't find in the docs.


Kubernetes offers three types of probes: Startup (checks whether the app has finished starting up), Readiness (continuously checks whether it's ready to receive traffic), and Liveness (continuously checks whether it's still alive).

I'll skip the basics and jump straight into hard-learned lessons from using probes, and how they can tank the whole system when used incorrectly.

Lesson 1: Understand the falling dominoes problem

Let's say you have a deployment with 10 pods serving a combined 1000 requests/second, so each pod handles about 100 req/s.

For some reason, one pod fails. In-flight requests immediately return 502s, but Kubernetes restarts the pod right away.

Each time this happens, Kubernetes waits exponentially longer before restarting the pod again (capped at 5 minutes). While the pod is out (especially when it's out for the full 5 minutes):

  • The remaining 9 pods pick up the slack → ~111 req/s each.
  • If another pod fails, now 8 pods must handle ~125 req/s.
  • Fail one more, and 7 pods are carrying ~143 req/s.

The problem: each failing pod leaves the survivors handling more load, which makes it more likely that another one fails or gets OOMKilled, which piles EVEN MORE stress on the remaining survivors. It's a positive feedback loop.

This is how a single recurring pod crash, left unchecked, snowballs into what I call the falling dominoes problem.

Even though we started with 10 pods, over time we end up with zero. This scenario usually plays out over a few days if the deployment isn't autoscaled or redeployed in the meantime (for example, during holidays or company shutdowns).
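
As an aside, the "autoscaled" escape hatch mentioned above usually means something like a HorizontalPodAutoscaler. Here's a minimal sketch, assuming a CPU-bound workload; the deployment name and thresholds are hypothetical:

# Minimal HorizontalPodAutoscaler sketch (deployment name and thresholds are hypothetical)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-api
  minReplicas: 10
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out as survivors pick up the extra load

With something like this in place, survivors picking up extra load would trigger a scale-out instead of the pool silently shrinking to zero.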

Automatic Recovery is NOT Possible

Occasionally, one of the failed pods restarts, and suddenly all 1000 reqs/second flock its way, making it fail almost instantly. Then another failed pod restarts, the full 1000 reqs/second flock to it, and it fails too, and so on! This is an oversimplification, but it's valid nonetheless.

If downstream services (like frontends) have automatic retries, the problem compounds even further.

It might seem like a rare problem, but you'd be surprised how easy it is to cause accidentally with misconfigured probes; your team may even have it without realizing 😄

Lesson 2: It's [kinda] okay to use the same healthcheck endpoint for Readiness and Liveness

Despite common internet wisdom, it's okay [in my opinion] as long as you understand what you're doing and what the consequences are in your particular situation. However, YOLOing it is not a good idea.

When configuring probes, you get to define several parameters that affect how lenient or stringent the probe is: initialDelaySeconds, periodSeconds, failureThreshold, successThreshold, and timeoutSeconds.

You can use the same check for readiness and liveness in most scenarios but tune the parameters to make each probe more lenient or stringent (see the sketch below). I'll share more ahead.
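
As a quick reference, here's a sketch of those knobs on a single probe; the endpoint and the specific values are just placeholders, and the comments describe what each parameter controls:

# What each tuning parameter controls (endpoint and values are placeholders)
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10   # wait this long after the container starts before the first check
  periodSeconds: 10         # how often the check runs
  timeoutSeconds: 2         # a check taking longer than this counts as a failure
  failureThreshold: 3       # consecutive failures needed before the probe is considered failed
  successThreshold: 1       # consecutive passes needed before it's considered passing again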

Lesson 3: Liveness probes should be as lenient as possible

Let's say we have stringent liveness probes and one pod starts having performance problems; the liveness probe will kill it.

When this happens repeatedly, restart delays increase, traffic gets rerouted to the surviving pods for longer stretches, and that makes it more likely that one of the survivors develops performance problems too.

This usually starts super slow: an occasional restart here, another one 12 hours later, and it picks up speed as the restart delays gradually increase. It's the falling dominoes problem described earlier, and it will cause all your pods (even the ones that were fine to begin with) to fall like dominoes one after the other.

# Example of lenient liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before the first check
  periodSeconds: 5
  timeoutSeconds: 30        # tolerate slow responses instead of counting them as failures
  failureThreshold: 5       # restart only after 5 consecutive failures
  successThreshold: 1

Being lenient here ensures you don't restart pods that are temporarily struggling, only ones that are completely out of service.

Bonus Tip: Liveness probes are non-negotiable. Network outages and code deadlocks happen all the time, especially in 3rd-party libraries; have a backup plan :)

Lesson 4: Readiness probes should be as stringent as possible

Readiness probes are responsible for shifting traffic to/from healthy/unhealthy pods.

If we detect that a pod is having performance problems, we want to shift traffic away from it ASAP for reliability. Over time the pod would either:

  • recover: it starts receiving traffic again without ever having been restarted
  • not recover: it eventually gets killed by the liveness probe

If readiness probes were lenient, we'd continue sending traffic to overloaded pods. This would:

  • prevent them from recovering, which worsens reliability with slow response times and 504s
  • overload them to the point of getting OOMKilled, which yields 502s and accelerates the falling dominoes problem into a complete outage

# Example of stringent readiness probe
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 2      # a slow response counts as a failure
  failureThreshold: 1    # pull the pod out of rotation after a single failure
  successThreshold: 4    # require 4 consecutive passes before traffic returns

Having stringent readiness probes ensures we act before the problem grows.

Lesson 5: Don't use readiness probes if you don't have to

At Buffer, we use AWS Application Load Balancers (ALBs) to serve all traffic to our pods (both internal and external).

ALBs already run health checks against their targets, so in theory, Kubernetes readiness probes are simply redundant. In practice, we found that they're also harmful!

Here's why ...

ALB Health Checks (Cheap)

  • If a pod fails the check, it’s marked unhealthy almost instantly.
  • It's drained and traffic immediately shifts to healthy pods.
  • Once the pod passes again, it’s back in action almost instantly.
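
For reference, these ALB health checks are typically tuned through Ingress annotations if you're running the AWS Load Balancer Controller. Here's a rough sketch; the service name, port, and values are placeholders, and the exact annotations depend on your controller version:

# Sketch: tuning ALB target-group health checks via Ingress annotations
# (assumes the AWS Load Balancer Controller; names and values are placeholders)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-api
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "2"
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-api
                port:
                  number: 8080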

Kubernetes readiness probes (expensive with ALBs)

  • When a pod fails, Kubernetes removes it from the ALB target group.
  • ALB starts draining in-flight connections. This is super slow, especially if the pod was already struggling.
  • Even if the pod recovers quickly, we found that it cannot start receiving traffic again right away.
    • The pod must FIRST re-register and re-pass health checks before getting traffic again.
    • Meanwhile, even though the pod is healthy again, it does not receive any traffic until this process finishes; the surviving pods pick up the extra traffic, increasing their risk of failure.

That’s how a recurring blip in a readiness probe, when paired with ALBs, can cascade into the falling dominoes problem.

Need Another Reason?

If you misconfigure the health check endpoint or over-complicate it [leading to failures] and use it for both ALB health checks and readiness probes, the health check will fail for all pods [because it's misconfigured].

  • By default, the ALB target group would "fail open": if it can't find any healthy targets, it just routes requests to all targets anyway.
  • This helps dampen the effect of the misconfigured health check until the issue is addressed.
  • However, if a readiness probe is in place as well, it will forcibly deregister all the targets from the ALB, causing 503s even though the underlying app is fine.

This turns a misconfigured health check from an internal issue into a customer-facing full outage.

If you don't use ALBs

It's still important to understand how health checks work in whatever routing system you use and to fine-tune them. I believe the lesson here (don't use readiness probes if you don't have to) still applies, though.

Closing

Kubernetes probes are powerful, but dangerous when misused. With the right balance, they’ll protect your system instead of taking it down.

However, misconfigure them and they will turn a bad situation into a disastrous one; they'll wake you up at night or interrupt that sweet long vacation you're enjoying.

That’s my take. I'd love to hear what you think 😄

Have you seen the falling dominoes problem in your systems or other probe-related issues? Tell me about your battle scars!