Running production workloads on AWS spot instances sounds like a terrible idea - until you look at the numbers. A 50-65% cost reduction on compute is hard to ignore when you're operating a multi-tenant platform designed to scale to hundreds of instances.
We run a marketing automation platform on Kubernetes with real production traffic flowing through spot instances every day. This isn't a dev/staging trick. It's production. And it works - because we designed every layer of the system to handle interruptions gracefully.
Here's how we made spot instances safe for production, and how you can too.
Why Most Teams Are Afraid of Spots (And Why That's Reasonable)
The fear is straightforward: AWS can reclaim a spot instance at any moment with only two minutes of warning. For teams running stateful services or monolithic applications, that's a nightmare scenario. Your application is humming along, processing requests, and suddenly the rug gets pulled out.
The risks are real:
- Lost processing progress. A worker halfway through a task loses everything when the node disappears.
- Messages in limbo. If your application is consuming from a queue (like SQS) and the process dies without closing its connections, messages end up in an unknown state - not processed, not returned to the queue, just stuck until a visibility timeout expires.
- Duplicate work. When those stuck messages eventually return to the queue, another worker picks them up. If the original worker partially completed the work, you might get duplicates - duplicate email sends, duplicate API calls, duplicate database writes.
- Cascading failures. In a cluster environment, losing a node can trigger a chain reaction of pod rescheduling, resource pressure on remaining nodes, and degraded performance.
These fears are valid. But they're all solvable - if you architect for interruption from day one.
Our Node Group Strategy: Three Groups, Three Purposes
The foundation of running spots safely is understanding that not all workloads are equally tolerant of interruption. We split our Kubernetes cluster into three distinct node groups, each with its own lifecycle rules.
Management Node Group: The Heart of the Cluster
This group runs on on-demand instances. Always. It never scales down, and because it's on-demand, AWS can't reclaim it the way it reclaims spots.
Why? Because it runs Karpenter - the autoscaler that manages all other node groups. If the autoscaler goes down because its underlying instance was reclaimed, the entire cluster loses its ability to react to changes. New nodes can't be provisioned. Unhealthy nodes can't be replaced. The cluster becomes static and fragile.
We think of the management node group as the heart of the cluster. Everything else can be interrupted and recovered, but the heart must keep beating. The cost of running a single on-demand management node is trivial compared to the chaos of losing your autoscaler.
Application Node Group: Web Traffic with High Availability
This group runs the web-facing pods - Apache and PHP-FPM containers that serve user requests. These pods are configured with hard anti-affinity rules, so Kubernetes always schedules the replicas onto different nodes.
With two replicas per tenant instance spread across separate nodes, losing one node still leaves the other replica handling traffic. The load balancer routes around the missing pod until Karpenter provisions a replacement node and Kubernetes schedules a new pod.
This group can run on spots, but only because the anti-affinity rules provide redundancy. Without them, a spot reclamation could take down an entire tenant's web interface.
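That guarantee comes from a required (hard) pod anti-affinity rule. A sketch of the relevant pod spec fragment - the `app: tenant-web` label is illustrative, not our actual manifest:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: tenant-web               # illustrative label for one tenant's web pods
        topologyKey: kubernetes.io/hostname  # never two replicas on the same node
```

Using `required...` rather than `preferred...` is what makes the redundancy a guarantee instead of a best effort.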
Worker Node Group: Where the Real Savings Happen
This is the group that handles the heavy lifting - cron jobs and Symfony Messenger consumers for email campaign dispatch, queue processing, webhook delivery, and async tasks.
These workers run on spot instances, and this is where the 50-65% savings materialize.
Why are workers ideal for spots? Because queue-based work is inherently interruptible. If a worker dies mid-task, the message it was processing returns to the queue (via SQS visibility timeout) and another worker picks it up. No data is lost. No work is duplicated (if you handle it correctly). The system just continues.
The worker nodes are also bigger and more powerful than application nodes - they need the CPU and memory for heavy batch processing. We use Kubernetes labels and taints to control scheduling, so worker pods land only on worker nodes and worker nodes accept only worker pods.
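In Kubernetes terms, this pairing is typically a node label plus a taint on the worker nodes, matched by a toleration and node selector on the worker pods. A sketch with illustrative key names:

```yaml
# On the worker nodes (set via the node group / NodePool definition):
#   label:  workload-type: worker
#   taint:  workload-type=worker:NoSchedule

# On the worker pod spec:
nodeSelector:
  workload-type: worker        # worker pods go only to worker nodes
tolerations:
  - key: workload-type         # and are the only pods that tolerate the taint
    operator: Equal
    value: worker
    effect: NoSchedule
```

The label keeps worker pods off other nodes; the taint keeps other pods off worker nodes.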
Graceful Shutdown: The Key to Making Spots Reliable
The difference between "spots work fine" and "spots cause random production incidents" comes down to one thing: how cleanly your workers shut down when interrupted.
Here's the sequence that executes when a spot instance is reclaimed (or a deployment triggers a pod restart):
Step 1: Termination signal arrives. AWS gives a two-minute warning before reclaiming a spot instance. The node is drained, and Kubernetes sends a SIGTERM signal to every pod on it.
Step 2: Consumer finishes its current batch. Our Symfony Messenger consumers are configured with a grace period of 1-2 minutes. When they receive SIGTERM, they stop pulling new messages from SQS and focus on finishing whatever they're currently processing.
Step 3: SQS connections are properly closed. This step is critical and often overlooked. Symfony Messenger opens persistent connections to SQS. If the consumer is force-killed without closing these connections, the messages it was processing stay "in flight" on SQS - invisible to other consumers, but not acknowledged as processed. They sit in this limbo state until the SQS visibility timeout expires (typically minutes), at which point they return to the queue.
Proper connection cleanup means the consumer tells SQS "I'm done" or "I didn't finish, put these back" before exiting. This ensures a clean handoff.
Step 4: Force-kill fallback. If the consumer doesn't shut down within the grace period, Kubernetes force-kills it. This is the safety net - it shouldn't happen during normal operations, but if a consumer is stuck (deadlocked, waiting on an unresponsive external service), the force-kill ensures the node can shut down.
Step 5: SQS visibility timeout handles the rest. Even in the worst case - a force-kill with no clean connection closure - the SQS visibility timeout eventually returns unprocessed messages to the queue. Another worker picks them up and processes them. The message might be processed twice, but our deduplication mechanisms catch that.
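The core of steps 1-3 can be sketched in a few lines. This is a minimal, queue-agnostic skeleton - the in-memory queue and handler are stand-ins for SQS and the Symfony Messenger handler, and all names are illustrative, not our actual implementation:

```python
import queue
import signal


class GracefulConsumer:
    """On SIGTERM: stop pulling new messages, finish the in-flight one,
    acknowledge it, then exit so nothing is left in limbo."""

    def __init__(self, work_queue, handler):
        self.queue = work_queue
        self.handler = handler
        self.running = True
        self.acknowledged = []

    def _on_sigterm(self, signum, frame):
        # Step 2: stop pulling new messages; the current one still completes.
        self.running = False

    def run(self):
        signal.signal(signal.SIGTERM, self._on_sigterm)
        while self.running:
            try:
                msg = self.queue.get(timeout=0.1)
            except queue.Empty:
                break
            self.handler(msg)              # finish the current message
            self.acknowledged.append(msg)  # ack (delete_message on real SQS)
        # Step 3: close the queue connection here so no message stays "in flight".
```

If SIGTERM arrives mid-message, the loop condition fails only after the current message is handled and acknowledged - exactly the clean handoff described above.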
Why SQS Connection Cleanup Deserves Its Own Section
We can't overstate how important proper connection cleanup is. Here's what happens without it:
- Consumer is processing a batch of 1,000 email recipients.
- Spot reclamation signal arrives.
- Consumer is force-killed mid-processing.
- SQS doesn't know the message is abandoned - it's still "in flight."
- For the next N minutes (visibility timeout), no other consumer can see this message.
- The visibility timeout expires. The message returns to the queue.
- Another consumer picks it up and processes the entire batch again.
If the original consumer had already sent 500 of the 1,000 emails before being killed, those 500 people now get duplicate emails. At scale - with hundreds of campaigns running simultaneously - this becomes a real deliverability problem.
With proper cleanup, when the consumer receives the termination signal, it either finishes the batch and acknowledges it, or it explicitly releases the message back to SQS. Either way, there's no ambiguity about the message's state.
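Even with clean shutdown, queue delivery is at-least-once, so a per-recipient idempotency check is the final safeguard against duplicate sends. A minimal sketch, assuming a set-backed key store - in production this would be Redis or a database table, and every name here is illustrative:

```python
import hashlib

sent_keys = set()   # stand-in for a persistent dedup store (Redis, DB table, ...)
deliveries = []     # stand-in for the actual email gateway


def send_once(campaign_id: str, recipient: str) -> bool:
    """Send only if this (campaign, recipient) pair hasn't been sent before;
    skip silently when the queue redelivers a partially processed batch."""
    key = hashlib.sha256(f"{campaign_id}:{recipient}".encode()).hexdigest()
    if key in sent_keys:
        return False                                 # duplicate redelivery: skip
    deliveries.append((campaign_id, recipient))      # real send would happen here
    sent_keys.add(key)
    return True
```

With this in place, a redelivered batch of 1,000 recipients re-sends only to the recipients the original worker never reached.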
The Numbers That Made Our CFO Happy
Here's the financial impact of our spot instance strategy:
- 50-65% savings on compute costs compared to on-demand pricing for worker nodes.
- Per-instance cost came in below our initial estimates - we had budgeted roughly $120/month per tenant instance; the actual figure is closer to $90/month. That delta, multiplied across hundreds of instances, is significant.
- Zero observed email loss from spot interruptions during our 1-million email stress test. The graceful shutdown and SQS retry mechanisms caught every edge case.
- Designed for 300-500 tenants - and the savings scale linearly. The more tenants on the platform, the more compute runs on spots, and the greater the absolute savings.
When to Use Spot vs. On-Demand: A Decision Framework
Not everything should run on spots. Here's our framework:
Always On-Demand
- Cluster management plane (autoscaler, monitoring, ingress controllers). If these go down, the cluster can't self-heal.
- Databases and stateful services. Spot reclamation of a database node is catastrophic unless you have synchronous replication and automatic failover.
- Single-replica critical services. If there's only one instance and no redundancy, on-demand is the only safe choice.
Safe for Spots (With Proper Design)
- Queue consumers and batch processors. These are the ideal spot workloads - interruptible, retryable, and stateless.
- Cron jobs. Most cron jobs are short-lived and can be retried if interrupted.
- Web pods with multiple replicas and anti-affinity. If you have 2+ replicas guaranteed to be on different nodes, losing one is a non-event.
The Grey Zone
- CI/CD runners. Spots are great for build agents, but a reclamation mid-build wastes time. Use spots for parallelizable builds where a retry is cheap.
- Development environments. Spots save money, but developers hate it when their environment disappears. Consider using spots with on-demand fallback.
Practical Advice for Teams Considering Spots
1. Design for interruption from day one. Don't bolt spot support onto an existing architecture. If your workers don't handle SIGTERM gracefully today, fix that before moving to spots.
2. Use multiple instance types. Configure your spot node group to accept a wide range of instance types. The more flexibility you give AWS, the less likely you are to face interruptions - AWS prefers to reclaim instances where demand is highest, so using less popular instance types reduces your exposure.
3. Separate your node groups by purpose. Don't mix management, application, and worker pods on the same nodes. Use labels, taints, and tolerations to ensure each workload lands on the right node group.
4. Monitor your spot interruption rate. Track how often spots are reclaimed and how your system recovers. If you're seeing frequent interruptions, adjust your instance type mix or region selection.
5. Test under load. A spot interruption during low traffic might be invisible. A spot interruption during a peak email campaign with 52 parallel workers is a very different story. Run stress tests with simulated spot reclamations.
6. Build custom self-healing. Kubernetes' default recovery is too slow for spot-heavy workloads. We built custom CronJobs that detect and remove unhealthy nodes within minutes, rather than waiting hours for the default Karpenter behavior.
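Point 2 - instance-type flexibility - can be expressed directly in the autoscaler configuration. A sketch assuming Karpenter's v1 NodePool API; the pool name and instance types are examples only:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: workers-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In            # the wider this list, the lower the
          values:                 # chance of a pool-wide capacity squeeze
            - m5.xlarge
            - m5a.xlarge
            - m6i.xlarge
            - r5.xlarge
```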
The Compound Effect of Savings at Scale
The real power of spots becomes apparent at scale. When you're running a handful of instances, saving 50% on a few nodes is nice but not transformative. When you're running hundreds of tenant instances, each with worker pods consuming batch processing capacity, the savings compound dramatically.
For our platform - designed to support 300-500 tenants - spot instances aren't a nice-to-have optimization. They're a fundamental part of the business model that makes the per-tenant economics work. The difference between on-demand and spot pricing at that scale is the difference between a viable product and an unprofitable one.
Key Takeaways
- Spot instances in production are viable - but only with the right architecture. Design for interruption at every layer.
- Not all workloads are equal. Protect your management plane with on-demand instances. Put batch processing on spots.
- Graceful shutdown is the single most important feature for spot reliability. Invest heavily in proper signal handling and connection cleanup.
- SQS (or any queue system) provides the safety net. If a worker dies, the message returns. Design your consumers to be idempotent.
- The savings are real: 50-65% on compute costs, with per-instance costs coming in below our initial estimates.
- Test with real load. Spot interruptions during idle periods tell you nothing. Test during peak throughput.
Ready to Optimize Your AWS Costs?
If you're running Kubernetes workloads on AWS and paying full on-demand prices for everything, there's likely significant money on the table. We've built production systems that leverage spots safely at scale - and we can help you do the same.
[Book a free consultation](https://www.droptica.com/contact/) to discuss your infrastructure cost optimization strategy.
Written by
Mautomic Team
The Mautomic team brings together experienced marketing automation specialists, developers, and consultants dedicated to helping businesses succeed with Mautic.