Why We Chose API Polling Over Webhooks for Bounce Handling at Scale

Every email platform engineer has heard the conventional wisdom: use webhooks for real-time event processing. An email bounces, your ESP fires a webhook to your endpoint, you process it instantly. Simple, fast, elegant.

We ignored that advice. For our multi-tenant Mautic platform - where 300-400 organizations share a single Mailgun sending domain and push campaigns simultaneously - we built a crawler that polls the Mailgun API on a schedule. No webhook endpoints. No public-facing receivers. Just a background service fetching bounce records by time window.

Here's why polling beat webhooks for our architecture, and when it might beat them for yours.

The Multi-Tenant Bounce Problem

To understand why this decision mattered, you need to understand the architecture. Our platform runs hundreds of independent Mautic instances - each with its own database, its own contacts, its own campaigns. But all of them send email through a single Mailgun domain.

When an email bounces, Mailgun records the event. But Mailgun doesn't know which Mautic instance the email belongs to. From Mailgun's perspective, the bounce happened on the shared domain. Period.

Somehow, that bounce event needs to reach the correct tenant's Mautic instance so it can update the contact record, suppress future sends, and maintain accurate deliverability metrics.

This is the routing problem. And the approach you choose for ingesting bounce events - webhooks or polling - fundamentally shapes how you solve it.

The Webhook Approach (And Why We Walked Away)

The textbook solution looks like this: configure Mailgun to send webhooks to your platform. Set up an endpoint that receives all bounce events for the shared domain. Parse the event, identify the tenant, forward the bounce to the correct Mautic instance.

On paper, clean. In production, problematic.

Availability requirements are brutal. A webhook endpoint has to be reachable essentially all the time. If it is down when a bounce event fires, delivery depends entirely on Mailgun's retries, and those run for a limited window, not indefinitely. An outage that outlasts the retry schedule - extended maintenance, a failed deployment, a prolonged traffic spike - means lost events. Every missed bounce is a contact that keeps receiving emails they shouldn't, eroding your sender reputation.

Burst traffic is unpredictable. When 300 tenants launch campaigns simultaneously, bounces don't arrive in a steady stream. They come in bursts. Your webhook endpoint needs to handle sudden spikes of thousands of events per minute, then idle for hours. Sizing infrastructure for peak burst is expensive. Under-sizing means dropped events.

Ordering is chaotic. Webhooks arrive when they arrive. During high-volume sends, events for the same contact can arrive out of order. A soft bounce might arrive after a hard bounce for the same address, and a handler that simply applies the latest event it receives ends up with the wrong contact status. Handling this correctly means comparing event timestamps and deduplicating retried deliveries.

Debugging is painful. When a bounce doesn't reach the correct Mautic instance, was it because Mailgun didn't send the webhook? Because your endpoint was temporarily unreachable? Because the routing logic failed? Because the event was malformed? Diagnosing missing webhooks requires correlating logs across multiple systems with imperfect timestamps.

You need a public endpoint. Webhooks require exposing an endpoint to the internet. That endpoint needs authentication, rate limiting, input validation, DDoS protection. It's another attack surface on your infrastructure.

We also evaluated FrankenPHP as part of the bounce handling infrastructure but ultimately abandoned that direction. The complexity of running a reliable, high-availability webhook receiver for this volume wasn't justified by the benefits.

The Polling Approach: A Crawler That Never Misses

Instead of waiting for Mailgun to push events to us, we built a dedicated service that pulls them on a schedule.

Here's how it works:

Time-window fetching. The crawler runs on a regular interval. Each run, it queries the Mailgun API for bounce events that occurred since the last successful fetch. The time offset ensures continuity - every bounce that Mailgun recorded in that window is retrieved, regardless of what was happening on our side during that period.
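
A minimal sketch of how such a window could be computed. The names `next_window`, `events_query`, and `OVERLAP` are hypothetical, the five-minute overlap is an illustrative value, and the query parameters (`event`, `begin`, `end`, `ascending`, `limit`) follow Mailgun's Events API as we understand it, so check them against the current documentation before relying on them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical safety margin: re-read a few minutes before the checkpoint.
OVERLAP = timedelta(minutes=5)


def next_window(last_success_end, now, overlap=OVERLAP):
    """Return (begin, end) for the next poll, overlapping the previous window."""
    if now <= last_success_end:
        raise ValueError("checkpoint is not in the past")
    return last_success_end - overlap, now


def events_query(begin, end, limit=300):
    """Query parameters for one page of failure events between begin and end."""
    return {
        "event": "failed",                 # bounces are reported as failures
        "begin": int(begin.timestamp()),   # epoch seconds
        "end": int(end.timestamp()),
        "ascending": "yes",                # oldest first
        "limit": limit,
    }


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
    print(events_query(*next_window(last, now)))
```

The window deliberately starts before the previous checkpoint, so the same event can be returned twice. That is only safe when downstream processing tolerates duplicates.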

Tenant identification via custom headers. When our platform sends an email through Mailgun, it includes a custom header containing a tenant identifier - a prefix that maps to the originating Mautic instance. When the crawler retrieves a bounce record, it reads this header from the original message metadata and knows exactly which tenant the bounce belongs to.
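
The lookup can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the field name `tenant-ref`, the `prefix-rest` value format, the `TENANTS` registry, and the idea that the value shows up under the event's `user-variables` key. The actual header name and location depend on how the sending side attaches it.

```python
# Hypothetical registry: tenant prefix -> base URL of that tenant's Mautic instance.
TENANTS = {
    "acme": "https://acme.mautic.example",
    "globex": "https://globex.mautic.example",
}

TENANT_FIELD = "tenant-ref"  # hypothetical field name


def tenant_prefix(value, separator="-"):
    """Extract the tenant prefix from a value such as 'acme-42'."""
    prefix, sep, rest = value.partition(separator)
    if not sep or not prefix or not rest:
        return None
    return prefix.lower()


def resolve_tenant(event, tenants=TENANTS):
    """Return the Mautic base URL for an event, or None if it cannot be mapped."""
    variables = event.get("user-variables") or {}
    value = variables.get(TENANT_FIELD)
    if not isinstance(value, str):
        return None
    prefix = tenant_prefix(value)
    return tenants.get(prefix) if prefix else None


if __name__ == "__main__":
    print(resolve_tenant({"user-variables": {"tenant-ref": "acme-42"}}))
```

Returning `None` for unknown or missing identifiers keeps unroutable bounces visible instead of silently dropping them.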

Routing to the correct instance. Once the tenant is identified, the bounce data is forwarded to that tenant's Mautic instance. Mautic processes it normally - updating the contact record, marking the address as bounced, suppressing future sends.
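
One way to structure the routing step, as a sketch: group the fetched events by tenant and hand each batch to a delivery function. `route`, `resolve`, and `deliver` are hypothetical names; how a batch actually reaches a Mautic instance is left to the injected `deliver` callable, since the article does not specify it.

```python
from collections import defaultdict


def route(events, resolve, deliver):
    """Group events per tenant, deliver each batch, return unroutable events."""
    batches = defaultdict(list)
    unroutable = []
    for event in events:
        tenant = resolve(event)
        if tenant is None:
            unroutable.append(event)      # keep for inspection, do not drop
        else:
            batches[tenant].append(event)
    for tenant, batch in batches.items():
        deliver(tenant, batch)
    return unroutable


if __name__ == "__main__":
    events = [{"tenant": "acme", "id": 1}, {"tenant": None, "id": 2}]
    leftover = route(
        events,
        resolve=lambda e: e["tenant"],
        deliver=lambda tenant, batch: print(tenant, len(batch)),
    )
    print("unroutable:", leftover)
```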

Hard and soft bounce handling. The crawler distinguishes between permanent failures (hard bounces - invalid address, domain doesn't exist) and temporary failures (soft bounces - mailbox full, server temporarily unavailable). Each type triggers different behavior in Mautic: hard bounces immediately suppress the contact, soft bounces increment a counter that triggers suppression after repeated failures.
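
The distinction can be modeled in a few lines. This is a sketch of the behavior described above, not Mautic's code: it assumes failure events carry a `severity` of `permanent` or `temporary`, and `SOFT_BOUNCE_LIMIT` and the contact dictionary layout are made up for illustration.

```python
SOFT_BOUNCE_LIMIT = 3  # hypothetical threshold for repeated soft bounces


def classify(event):
    """Map an event's severity to 'hard', 'soft', or 'unknown'."""
    severity = event.get("severity")
    if severity == "permanent":
        return "hard"
    if severity == "temporary":
        return "soft"
    return "unknown"


def apply_bounce(contact, event, limit=SOFT_BOUNCE_LIMIT):
    """Update a contact record (a plain dict here) for one bounce event."""
    kind = classify(event)
    if kind == "hard":
        contact["suppressed"] = True
    elif kind == "soft":
        contact["soft_bounces"] = contact.get("soft_bounces", 0) + 1
        if contact["soft_bounces"] >= limit:
            contact["suppressed"] = True
    return contact


if __name__ == "__main__":
    contact = {"email": "user@example.com"}
    for _ in range(3):
        apply_bounce(contact, {"severity": "temporary"})
    print(contact)
```

Note that the soft-bounce branch counts every call, so the same event applied twice would be counted twice unless duplicates are filtered out first.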

Why Polling Wins for Multi-Tenant Architecture

The advantages compound at scale:

Resilience to downtime. If our crawler service is down for an hour - because of a deployment, a bug, or infrastructure issues - nothing is lost. When the crawler comes back up, it fetches all events from the last successful timestamp forward. The time-window approach is inherently self-healing. There's no event to "miss" because the data stays available through Mailgun's API until we fetch it, as long as the outage is shorter than the event retention period on our plan.
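
The self-healing property comes down to one rule: the checkpoint moves forward only after a run has fully succeeded. A minimal sketch, with `run_once` and its injected callables as hypothetical names:

```python
def run_once(load_checkpoint, save_checkpoint, fetch, process, now):
    """Fetch and process one window; advance the checkpoint only on success."""
    begin = load_checkpoint()
    events = fetch(begin, now)      # may raise if the API is unreachable
    for event in events:
        process(event)              # may raise if a tenant instance is down
    save_checkpoint(now)            # reached only when everything succeeded
    return len(events)


if __name__ == "__main__":
    state = {"checkpoint": 100}
    count = run_once(
        load_checkpoint=lambda: state["checkpoint"],
        save_checkpoint=lambda value: state.update(checkpoint=value),
        fetch=lambda begin, end: [{"id": "e1"}],
        process=print,
        now=200,
    )
    print(count, state)
```

If `fetch` or `process` raises, the checkpoint stays where it was and the next run covers the same window again plus whatever has accumulated since.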

Predictable load patterns. We control the polling frequency and batch size. Instead of handling unpredictable webhook bursts, we process bounces in controlled batches at a pace our infrastructure handles comfortably. Resource planning is straightforward.

Simple debugging. If we suspect a bounce wasn't processed, we can re-poll any time window and compare the results against our records. The Mailgun API is the source of truth, and we can query it at any time. No guessing about whether an event was sent or received.
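
Such a check can be as small as a set difference. In this sketch, `missing_event_ids` is a hypothetical helper, and it assumes each event carries a unique `id` and that we keep a record of the IDs we have already processed.

```python
def missing_event_ids(refetched_events, processed_ids):
    """Return IDs present in a re-polled window but absent from our records."""
    fetched = {event["id"] for event in refetched_events}
    return sorted(fetched - set(processed_ids))


if __name__ == "__main__":
    refetched = [{"id": "a1"}, {"id": "b2"}, {"id": "c3"}]
    print(missing_event_ids(refetched, processed_ids={"a1", "c3"}))
```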

No public endpoint required. The crawler is an internal service that makes outbound API calls. There's no webhook URL to expose, secure, or keep available. Our attack surface stays smaller.

Idempotent by design. Polling the same time window twice returns the same events, at least once the window is old enough that Mailgun has finished recording it. Processing the same bounce event twice doesn't cause issues because the bounce handler is idempotent - marking an already-bounced contact as bounced is a no-op. This means overlapping time windows (a safety measure we use) never cause problems.
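
One common way to make overlapping windows harmless is to deduplicate by event ID before handling. The sketch below illustrates that technique and is not necessarily how our handler is built: `process_once` is a hypothetical name, and the in-memory `seen_ids` set stands in for whatever durable store a real service would use.

```python
def process_once(event, seen_ids, handle):
    """Handle an event only if its ID has not been seen; return True if handled."""
    event_id = event["id"]
    if event_id in seen_ids:
        return False            # duplicate from an overlapping window
    handle(event)
    seen_ids.add(event_id)      # recorded only after handling succeeded
    return True


if __name__ == "__main__":
    seen = set()
    window_a = [{"id": "e1"}, {"id": "e2"}]
    window_b = [{"id": "e2"}, {"id": "e3"}]   # overlaps window_a on e2
    for event in window_a + window_b:
        process_once(event, seen, handle=lambda e: print("handled", e["id"]))
```

Deduplication matters most for state that accumulates, such as a soft-bounce counter, where applying the same event twice would otherwise change the result.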

Operationally simple. The crawler is a single service with a single responsibility. It runs on a schedule, fetches data, routes it. There are no complex retry mechanisms, no queue of incoming webhook payloads, no concerns about event ordering. When something goes wrong, there's one place to look.

The Trade-offs We Accepted

Polling isn't perfect. Here's what we gave up:

Latency. Bounces aren't processed the instant they happen. If we poll every few minutes, a bounce can sit unprocessed for up to one polling interval before it is picked up. For our use case, this is perfectly acceptable - bounce processing doesn't need sub-second latency. A contact that bounced will be suppressed before the next campaign send, which is what matters.

API rate limits. We need to respect Mailgun's API rate limits. Polling too aggressively can hit throttling. We tuned the polling frequency to balance freshness against rate limit headroom - with enough margin that spikes in bounce volume don't push us into throttling territory.
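
When throttling does happen, a crawler can back off and retry instead of failing the run. A sketch of exponential backoff with jitter, assuming throttling is signaled with HTTP 429; `call_with_backoff` is a hypothetical helper and `request` is any callable that returns a `(status, body)` pair.

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request() until it is not throttled, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = request()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            delay = base_delay * (2 ** attempt)
            # Jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("still throttled after %d attempts" % max_attempts)


if __name__ == "__main__":
    responses = [(429, None), (200, {"items": []})]
    print(call_with_backoff(lambda: responses.pop(0), sleep=lambda s: None))
```

Because the checkpoint only advances after a successful run, giving up after `max_attempts` loses nothing; the next scheduled run picks up the same window.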

API cost. Making API calls costs slightly more than receiving webhooks (which are free from Mailgun's side). At our scale, the difference is negligible - pennies compared to the operational cost of maintaining a highly available webhook infrastructure.

Dependency on API availability. If Mailgun's API is down, we can't fetch bounces. In practice, Mailgun's API has been more reliable than any webhook receiver we could build ourselves. And when it does have issues, the time-window approach means we catch up automatically once it's back.

When to Use Which Approach

Our experience suggests a clear decision framework:

Use webhooks when:

You're running a single tenant or a small number of tenants
Volume is low to moderate (thousands of events per day, not per minute)
You need real-time processing (transactional email confirmations, instant notifications)
You have robust infrastructure for a highly available endpoint

Use polling when:

You're running a multi-tenant platform with shared sending infrastructure
Volume is high and bursty (hundreds of clients sending simultaneously)
Reliability matters more than real-time latency
You want operational simplicity and fewer moving parts
Debugging and auditability are important requirements

Consider a hybrid approach when:

You need real-time processing AND guaranteed delivery
You have the engineering capacity to maintain both systems

In a hybrid setup, webhooks handle the happy path and polling runs as a safety net to catch anything that was missed. This is the most robust approach but also the most complex. For most teams, choosing one approach and building it well is better than building two systems poorly.

Bounces Affect Reputation - Process Them Reliably

There's a deeper reason why bounce handling architecture matters: bounces directly impact your sender reputation. Every email you send to an address that has previously hard-bounced is a signal to ISPs that you're not maintaining your lists. On a shared sending domain, one tenant's unprocessed bounces can degrade deliverability for every other tenant on the platform.

Reliable bounce processing isn't just a technical nicety. It's a business requirement. A missed bounce today means a reputation hit tomorrow, which means lower inbox placement rates for everyone on the platform next week.

That's why we chose the approach that prioritizes reliability over speed. A bounce processed three minutes late is infinitely better than a bounce that's lost forever because a webhook endpoint was briefly unavailable.

Need Help With Email Deliverability at Scale?

If you're building a multi-tenant email platform and wrestling with bounce handling, deliverability architecture, or sender reputation management, we've spent years solving these problems in production. We can help you choose the right approach for your architecture and avoid the pitfalls we've already navigated.

[Book a free consultation](https://www.droptica.com/contact/) to discuss your email deliverability architecture.


Mautomic Team
