Kubernetes is famous for self-healing. Pods crash? They restart. Nodes fail? Workloads reschedule. It's one of the platform's biggest selling points.
But when you're running 100+ Mautic instances on spot infrastructure and pushing a million emails through the system, "eventually self-healing" isn't good enough. We learned this the hard way - and built custom recovery mechanisms that cut our recovery time from hours to minutes.
The Gap Between Theory and Production
Kubernetes self-healing works well for normal operations. A pod crashes, the controller restarts it. A health check fails, traffic is routed away. These mechanisms handle the majority of failure scenarios reliably.
But our environment isn't normal. We run a multi-tenant marketing automation platform on AWS spot instances. We process massive email campaigns that push infrastructure to its limits. And we discovered two specific scenarios where Kubernetes' default self-healing is far too slow.
Problem 1: Nodes That Won't Die
Our Kubernetes cluster uses Karpenter (a node autoscaler) to manage node lifecycle - provisioning nodes when workloads need them and removing them when they're no longer needed.
Here's the problem: when a spot instance is reclaimed by AWS, or a node enters an unhealthy state for any reason (out of memory, disk pressure, network issues), Karpenter's default behavior is to wait and observe. It's conservative by design - it doesn't want to prematurely remove a node that might recover.
How conservative? In our experience, Karpenter can take hours to remove an unhealthy node from the cluster. Hours during which:
- Pods that were running on that node can't be rescheduled to healthy nodes because Kubernetes still thinks the node exists.
- Cluster capacity is artificially reduced - the scheduler sees the node but can't place workloads on it.
- During email campaigns, this means reduced processing capacity when you need it most.
For a platform where email sending performance directly impacts the client's business, waiting hours for a node to be removed is unacceptable.
Our Solution: A 5-Minute Health Monitor
We built a custom Kubernetes CronJob - essentially a shell script running on a schedule - that checks node health every 5 minutes. The logic is straightforward:
- Query all nodes in the cluster and check their status.
- If a node has a healthy status - it's fine, move on.
- If a node reports Unknown or NotReady status and has been in that state for more than 8 minutes - remove it from the cluster.
That's it. No machine learning, no complex heuristics. Just a simple rule: if a node hasn't been healthy for 8 minutes, it's not coming back.
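The logic above can be sketched as a short bash script. This is a minimal illustration, not our production job: the jsonpath query, the guard, and the exact age calculation are assumptions, though the decision rule (non-Ready for 8+ minutes means removal) matches what the job does.

```shell
#!/usr/bin/env bash
# Sketch of the node health CronJob. Illustrative only - details differ
# from the production script.
set -euo pipefail

THRESHOLD_MINUTES=8

# Decide whether a node should be removed. "status" is the node's Ready
# condition ("True", "False", or "Unknown"); "age_minutes" is how long
# the node has been in that state.
should_remove_node() {
  local status="$1" age_minutes="$2"
  if [[ "$status" != "True" && "$age_minutes" -ge "$THRESHOLD_MINUTES" ]]; then
    echo "remove"
  else
    echo "keep"
  fi
}

# The sweep itself only runs when a cluster is reachable.
if command -v kubectl >/dev/null 2>&1 && kubectl get nodes >/dev/null 2>&1; then
  now=$(date +%s)
  # For each node: name, Ready status, and when that status last changed.
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' |
  while read -r node status transition; do
    age_minutes=$(( (now - $(date -d "$transition" +%s)) / 60 ))
    if [[ "$(should_remove_node "$status" "$age_minutes")" == "remove" ]]; then
      echo "Removing unhealthy node $node (Ready=$status for ${age_minutes}m)"
      kubectl delete node "$node"
    fi
  done
fi
```

Deleting the Node object is what signals Karpenter that capacity is gone - the underlying EC2 instance is already dead or dying, so nothing of value is lost.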
When our job removes the node, Karpenter immediately detects that the cluster has lost capacity and provisions a replacement. Instead of waiting hours for Karpenter to make the decision on its own, we force the issue in roughly 10 minutes.
Why 8 Minutes?
We chose this threshold carefully:
- Too short (1-2 minutes) and you risk removing nodes that are just temporarily slow - maybe they're pulling a large Docker image or running garbage collection.
- Too long (30+ minutes) and you defeat the purpose of the custom health monitor.
- 8 minutes gives the node a fair chance to recover while still being dramatically faster than Karpenter's default behavior.
In practice, if a node is Unknown for 8 minutes, it's not recovering. Spot reclamations and hard failures are immediate - the 8-minute window only catches nodes in ambiguous states.
Problem 2: Ghost Pods That Steal Resources
The second problem emerged during our most extreme stress test: sending 1 million emails in 25 minutes with approximately 52 parallel workers.
During this test, we pushed the cluster hard. Nodes were scaling up and down. Workers were consuming and producing messages at maximum throughput. And we discovered a subtle but impactful issue: ghost pods.
Here's what happens:
- A node starts shutting down (spot reclamation or scale-down).
- Pods on that node begin their shutdown sequence.
- But the node disappears before the pods complete their shutdown.
- The pods end up in an Unknown state - not running, not terminated, just... stuck.
The problem? Kubernetes' scheduler still counts these ghost pods against the cluster's capacity. Each ghost pod occupies a scheduling slot. If 5 ghost pods are left over from a terminated node, that's 5 fewer pods that can be scheduled elsewhere.
During high-load email sends, every pod slot matters. We need every available worker processing email batches.
Our Solution: The Pod Cleanup Job
We built another CronJob specifically for cleaning up stuck pods. This one is a bit more nuanced:
- Scan all pods in the cluster.
- Identify pods that have been in a suspicious status (Unknown or similar) for longer than a threshold period.
- Attempt graceful deletion first - send a delete request and give the pod a chance to terminate cleanly.
- If the pod doesn't respond to the graceful delete within a timeout - force-delete it.
The two-phase approach is important. A pod might be in Unknown status because of a transient network issue between the API server and the node. A graceful delete gives it a chance to respond properly. Force-deletion is the last resort for truly stuck pods.
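A sketch of the two-phase cleanup, in the same spirit as the node job. The graceful timeout, the phase filter, and the flags are illustrative assumptions; `--grace-period=0 --force` is the standard kubectl mechanism for force-deleting a pod.

```shell
#!/usr/bin/env bash
# Sketch of the stuck-pod cleanup CronJob. Illustrative only - the
# production job also tracks how long a pod has been in the bad state.
set -euo pipefail

GRACEFUL_TIMEOUT=60   # seconds to wait for a graceful delete (assumption)

# "Unknown" is the classic ghost-pod phase left behind when a node
# vanishes mid-shutdown.
is_suspicious_phase() {
  case "$1" in
    Unknown) return 0 ;;
    *) return 1 ;;
  esac
}

# The sweep only runs when a cluster is reachable.
if command -v kubectl >/dev/null 2>&1 && kubectl get nodes >/dev/null 2>&1; then
  kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
  while read -r ns pod phase; do
    if is_suspicious_phase "$phase"; then
      echo "Attempting graceful delete of $ns/$pod (phase=$phase)"
      # Phase 1: graceful delete, bounded by a timeout.
      if ! timeout "$GRACEFUL_TIMEOUT" kubectl delete pod "$pod" -n "$ns"; then
        # Phase 2: force-delete the truly stuck pod.
        echo "Force-deleting $ns/$pod"
        kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
      fi
    fi
  done
fi
```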
Why We Didn't Use an Existing Operator
You might wonder: aren't there existing Kubernetes operators for this? Node Problem Detector, Descheduler, various self-healing controllers?
Yes, they exist. But our requirements were specific:
- Speed over sophistication. We didn't need complex failure analysis. We needed fast, decisive action on obvious failures.
- Minimal overhead. A CronJob running a shell script uses negligible resources compared to a continuously running operator.
- Full control. We know exactly what our jobs do because we wrote them. No debugging someone else's reconciliation logic.
- Spot-instance awareness. Our jobs are tuned for the specific failure patterns we see with spot instances - which are different from what happens with on-demand nodes.
The jobs are simple bash scripts that call kubectl. They run every few minutes, check status, and take action. They're easy to understand, easy to modify, and easy to debug.
The Impact
Before these custom jobs:
- An unhealthy node could reduce cluster capacity for hours.
- Ghost pods from spot reclamation could block scheduling indefinitely.
- During high-load email sends, the team had to manually watch for these issues.
After:
- Unhealthy nodes are removed in roughly 10 minutes (the 8-minute threshold plus up to one 5-minute check interval).
- Ghost pods are cleaned up automatically.
- The 1-million email stress test ran to completion without manual intervention.
- The team sleeps better.
Lessons Learned
1. Kubernetes self-healing is a spectrum, not a binary.
Default self-healing handles maybe 90% of failure scenarios. But the remaining 10% - the edge cases that happen during high load, spot reclamation, or network partitions - are exactly the cases where you need automation most.
2. Simple scripts beat complex operators.
For well-defined failure patterns, a CronJob with a shell script is the right tool. You don't need a reconciliation loop or a custom resource definition. You need kubectl and a conditional.
3. Tune your thresholds to your workload.
Our 8-minute threshold for nodes works because we know our failure patterns. Your workload might need different thresholds. Run your system under load, observe the failure modes, and set thresholds accordingly.
4. Build self-healing before you need it.
We discovered the ghost pod problem during a stress test. If we'd discovered it during a real production campaign for a paying customer, the conversation would have been very different. Test under realistic load and build recovery mechanisms before they become urgent.
5. Monitor your self-healing.
Our cleanup jobs log every action they take. We track how often nodes are removed, how often pods are force-deleted, and how long recovery takes. If the cleanup jobs are firing frequently, it's a signal that something else needs attention.
The Bigger Picture
These custom recovery jobs are part of a broader self-healing architecture that includes:
- Graceful shutdown signals for Symfony Messenger consumers - when a spot instance is reclaimed, workers get 1-2 minutes to finish their current batch before being terminated.
- SQS visibility timeouts - if a worker dies without acknowledging a message, the message returns to the queue for reprocessing.
- Anti-affinity rules - Apache/FPM pods are spread across multiple nodes, so a single node failure never takes down an entire Mautic instance.
- Canary deployments - updates are rolled out in batches, with automatic stop on failure.
Self-healing isn't a single feature. It's a layer of mechanisms that, together, make the platform resilient to the messy reality of running production workloads on cloud infrastructure.
Ready to Build Resilient Kubernetes Infrastructure?
If you're running production workloads on Kubernetes and struggling with recovery times, especially on spot instances, we can help. We've learned these lessons through real production experience with a multi-tenant platform at scale.
[Book a free consultation](https://www.droptica.com/contact/) to discuss your Kubernetes reliability architecture.
Written by
Mautomic Team
The Mautomic team brings together experienced marketing automation specialists, developers, and consultants dedicated to helping businesses succeed with Mautic.