Imagine you're running a marketing automation platform with over a hundred independent Mautic instances - each with its own database, its own configuration, and its own production traffic. Now you need to push an update. To all of them. Without breaking anything.
That's the deployment challenge we solved using Argo CD, Helm charts, and a carefully orchestrated wave-based approach. In this post, we'll break down how we deploy across 100+ Mautic instances safely, predictably, and with the ability to stop the moment something goes wrong.
The Problem: "kubectl apply" Doesn't Cut It at Scale
When you have a single application, deployment is straightforward. When you have a hundred independent instances of that application - each a complete Mautic installation with its own database - a naive deployment strategy becomes dangerous.
Here's what makes it hard:
- Each instance has its own database schema. Deployments often include database migrations that must run before the new application code starts serving traffic.
- Cron jobs and queue consumers must be stopped during the migration. If a cron runs against a partially-migrated database, you risk data corruption.
- A bad deployment rolling out to all instances simultaneously could take down the entire platform. You need the ability to test on a few instances first and stop if something breaks.
- A deployment takes roughly 10 minutes per instance. At 100 instances deployed sequentially, that's almost 17 hours. You need parallelism - but controlled parallelism.
The Foundation: One Helm Chart to Rule Them All
Rather than maintaining separate Kubernetes YAML files for every Mautic instance (which would be insane at 100+ instances), we use a single Helm chart as the template for all of them.
The chart defines everything a Mautic instance needs: Apache/FPM pods, Messenger consumers, cron jobs, configuration secrets, health checks, and the deployment orchestration logic. Each tenant instance plugs in its own values - database credentials, SQS configuration, domain name - but the structure is identical.
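To make that concrete, a per-tenant values file can stay very small. The sketch below is illustrative - the field names are hypothetical, not our chart's actual schema:

```yaml
# values-tenant-example.yaml - hypothetical per-tenant overrides
domain: tenant-42.example.com
image:
  tag: "5.1.2"                 # the one knob that changes on deploy day
database:
  host: tenant-42-db.internal
  existingSecret: tenant-42-db-credentials
sqs:
  queueUrl: https://sqs.eu-west-1.amazonaws.com/000000000000/tenant-42-messenger
replicaCount:
  apache: 2
  fpm: 2
```

Everything structural - pod templates, jobs, health checks - lives in the chart; the values file only carries what genuinely differs per tenant.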
Argo CD reads this Helm chart and manages the desired state for every instance. In the Argo CD dashboard, each Mautic installation appears as an "application" with a visual map of its resources: you can see the Apache pods (green = healthy), the FPM pods, the Messenger consumer (which restarts itself hourly by design), and all the cron jobs lit up green from their last successful run.
Sync Waves: Why Order Matters
Argo CD has a feature called sync waves that lets you define the order in which Kubernetes resources are created or updated. This is critical for Mautic deployments because things must happen in a specific sequence.
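The mechanism itself is just an annotation on each resource: Argo CD syncs lower-numbered waves first and waits for them to become healthy before starting the next. A minimal sketch of the fragments involved:

```yaml
# Wave ordering is a single annotation per resource (fragments shown)
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"   # installation check Job
---
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # Apache / FPM Deployments
---
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "3"   # CronJobs and Messenger consumers
```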
Here's our wave structure:
Wave 1: Installation Check
Before anything else, a bash script job runs and performs three checks:
- Does the database exist for this instance?
- Does the admin user exist?
- Does the instance URL respond?
If all three checks pass, Mautic is already installed and the job exits successfully. If any check fails, the job triggers a full Mautic installation. This makes the entire process idempotent - you can re-run the sync as many times as you want and it will always do the right thing.
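A simplified sketch of such a check Job - the helper functions and image name are hypothetical, but the exit-code logic is what makes the sync idempotent:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mautic-install-check
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: check
          image: our-registry/mautic-tools:latest   # hypothetical image
          command: ["/bin/bash", "-c"]
          args:
            - |
              # db_exists / admin_exists / url_responds are hypothetical helpers
              if db_exists && admin_exists && url_responds; then
                echo "Mautic already installed - nothing to do"
                exit 0
              fi
              echo "Check failed - running full installation"
              php /var/www/html/bin/console mautic:install "$MAUTIC_URL"
```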
Wave 2: Application Pods
After the installation check passes, the actual application containers come up - Apache and FPM pods. These are configured with anti-affinity rules so they always land on different physical servers, ensuring high availability even during node failures.
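The anti-affinity part is standard Kubernetes scheduling; in the Deployment's pod template it looks roughly like this (label keys are illustrative):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: mautic-apache            # illustrative label
        topologyKey: kubernetes.io/hostname   # never two replicas on one node
```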
Wave 3: Cron Jobs and Workers
Only after the application pods are healthy do the cron jobs and Messenger consumers start. The crons are activated with an explicit "activate" parameter - they don't just run because they exist, they run because the deployment process told them it's safe to start.
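One natural way to wire such an activation flag - shown here with Kubernetes' built-in suspend field and a hypothetical cron.activate Helm value, not necessarily how our chart does it - looks like this:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mautic-segments-update
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  schedule: "*/5 * * * *"
  suspend: {{ not .Values.cron.activate }}   # stays inert unless explicitly enabled
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: segments
              image: mautic:5              # illustrative tag
              command: ["php", "bin/console", "mautic:segments:update"]
```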
The Deployment Dance: Scale Down, Migrate, Scale Up
When we push an update (say, a new Docker image with a Mautic patch), here's exactly what happens on each instance:
- Scale all pods to zero. Apache, FPM, Messenger consumers - everything goes down. This is necessary because Mautic's database migrations can't safely run while the application is serving traffic.
- Disable cron jobs. They're deactivated so they can't fire during the migration window.
- Run deployment commands. Database migrations, cache clearing, asset compilation - whatever the update requires. This is the "magic on the database" as our DevOps engineer puts it.
- Scale pods back up. Apache and FPM pods come back to their normal replica count (2 pods per instance with anti-affinity).
- Reactivate crons. The cron jobs get their "activate" flag back and resume their schedules.
- Health check. A dedicated /health route is pinged to confirm the instance is operational.
The entire process takes approximately 10 minutes per instance.
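The migration step maps naturally onto an Argo CD hook Job. A hedged sketch - the exact commands our chart runs differ, but these are standard Mautic/Symfony console commands:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mautic-deploy-commands
  annotations:
    argocd.argoproj.io/hook: PreSync                   # before new pods roll out
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: deploy
          image: mautic:5   # illustrative
          command: ["/bin/sh", "-c"]
          args:
            - |
              php bin/console doctrine:migrations:migrate --no-interaction
              php bin/console cache:clear
              php bin/console mautic:assets:generate
```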
Batch Deployments: Canary Groups and Controlled Rollout
Now for the real power: deploying across many instances at once. We built custom Jenkins jobs that let you choose your rollout strategy:
- One by one: Update instance #1, verify, update instance #2, verify... Safe but slow.
- Groups of 10: Update 10 instances in parallel, verify all 10, then move to the next batch.
- Custom batch sizes: Need to update 3 at a time? 20? Your call.
The critical feature is the canary group. The first batch of Mautic instances serves as the test group. If any instance in the canary group fails to reach a healthy state after deployment - if Argo CD doesn't show "Sync OK" with all jobs green - the rollout stops immediately. No further instances are touched until the issue is investigated and resolved.
Argo CD verifies health by checking that all sync waves completed successfully and all resources are in their expected state. If the installation job failed, or if pods didn't come up, or if the health check returned an error - the application status shows unhealthy, and the batch deployment won't proceed.
Handling Messenger Consumers: The Graceful Shutdown Problem
One of the trickier aspects of deploying Mautic in our architecture is the Symfony Messenger consumer. This worker processes SQS queues - email batches, webhook deliveries, async tasks. You can't just kill it mid-message.
When deployment starts and pods need to scale to zero, here's what happens:
- The consumer receives a termination signal (SIGTERM).
- It's given 1-2 minutes to finish processing its current message.
- It must properly close its SQS connections before exiting. This is critical - if the connection isn't closed cleanly, the message stays "in flight" on SQS in an unknown state. It would eventually return to the queue via the visibility timeout, but it might have already been partially processed.
- If the consumer doesn't shut down within the grace period, it's force-killed. The SQS visibility timeout ensures the message returns to the queue and will be reprocessed by another worker.
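On the Kubernetes side this is just the pod's termination budget combined with Symfony Messenger's own SIGTERM handling. A sketch - the transport name and time limit are illustrative, though the hourly --time-limit matches the by-design hourly restart mentioned earlier:

```yaml
spec:
  terminationGracePeriodSeconds: 120   # the "1-2 minutes" to drain in-flight work
  containers:
    - name: messenger-consumer
      image: mautic:5                  # illustrative
      # messenger:consume traps SIGTERM, finishes the current message,
      # and closes its transport before exiting
      command: ["php", "bin/console", "messenger:consume", "sqs", "--time-limit=3600"]
```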
This same graceful shutdown logic also protects us during spot instance reclamations - when AWS decides to take back a spot instance, our workers get a few minutes to wrap up cleanly.
What This Looks Like in Practice
On a typical deployment day, our lead developer changes the Docker image tag in Argo CD, clicks "Sync," and the deployment cascade begins. The canary group deploys first. Within 10 minutes, the first batch is verified healthy. Then the next batch rolls out. And the next.
For a platform with dozens of instances, the full rollout completes in under an hour with full confidence that every instance is healthy. When the platform scales to 300-500 instances, the same pattern holds - just more batches.
The team doesn't babysit deployments. They click sync, check the canary, and let the pipeline do its work. If something breaks, it breaks on the canary - not on production tenant #247.
Key Takeaways
- Sync waves are essential for applications that require ordered deployment steps (database migrations before application start, application before crons).
- Canary groups protect at scale. Never deploy to all instances at once. Always test on a small group first and automate the "stop if unhealthy" logic.
- Graceful shutdown is not optional for queue consumers. Invest time in proper signal handling and connection cleanup - it prevents the most insidious production issues.
- One Helm chart, many instances. Keep your deployment definition in one place. Per-instance differences should be configuration values, not separate templates.
- Health checks close the loop. Don't just deploy and hope. A /health route that's checked after every deployment gives you confidence that the instance is actually working.
Ready to Deploy at Scale?
If you're running multi-instance applications on Kubernetes and struggling with deployment orchestration, we'd love to help. We've been through the hard parts - sync wave ordering, graceful shutdown edge cases, canary deployment logic - and we can help you get there faster.
[Book a free consultation](https://www.droptica.com/contact/) to discuss your deployment architecture.
Written by
Mautomic Team
The Mautomic team brings together experienced marketing automation specialists, developers, and consultants dedicated to helping businesses succeed with Mautic.