When you're running a single application, monitoring is straightforward. CPU, memory, request latency, error rate - a handful of dashboards and you're covered. When you're running a multi-tenant platform with dozens of independent Mautic instances on Kubernetes, each with its own database, pods, cron jobs, and queue consumers, "a handful of dashboards" doesn't cut it.
We currently track 671 distinct metrics across our platform using Prometheus, Grafana, and Loki. That number sounds excessive until you understand what each category covers and why it matters. In this post, we'll break down the ten key metric categories that keep our platform observable, explain what specific metrics we watch within each, and share what we've learned about building monitoring that's actually useful - not just comprehensive.
The Observability Stack
Before diving into metrics, here's what we're working with:
- Prometheus collects and stores time-series metrics from every component in the cluster.
- Grafana provides dashboards, alerting, and visualization.
- Loki handles log aggregation, giving us searchable logs from all pods across all tenants.
- DynamoDB stores structured log data for specific processing pipelines.
We started with community-built Grafana dashboards - the ones maintained by the Kubernetes monitoring community - and they gave us a solid baseline. But as our platform grew, we found ourselves building dedicated dashboards focused on our specific multi-tenant concerns. Generic dashboards show you data. Purpose-built dashboards answer questions.
Category 1: Scrape Health - Can We Even Collect Data?
Before you can monitor anything, you need to know that your monitoring system itself is working. This is the meta-monitoring category.
Key metrics:
- `up` - A binary indicator (0 or 1) for each scrape target. Is Prometheus successfully reaching this target? If `up` is 0, every other metric for that target is missing, and you're flying blind.
- `scrape_duration_seconds` - How long each scrape takes. Rising values indicate either target-side performance issues or network problems between Prometheus and the target.
- `scrape_samples_scraped` - How many individual metric samples each target exposes. Useful for understanding your metrics footprint.
- `scrape_series_added` - Detects cardinality explosions early. If a target suddenly starts exposing thousands of new series, you want to know before Prometheus runs out of memory.
Why it matters: If your scrape targets are down or degraded, your entire monitoring pipeline is compromised. This category is the foundation everything else rests on.
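As a concrete sketch, meta-monitoring in this category can be expressed as a Prometheus alerting rule. The `for` duration and severity label below are illustrative choices, not values from our production config:

```yaml
groups:
  - name: scrape-health
    rules:
      - alert: TargetDown
        # up == 0 means Prometheus failed its most recent scrape of this target
        expr: up == 0
        for: 5m        # tolerate brief network blips before paging
        labels:
          severity: critical
        annotations:
          summary: "Scrape target {{ $labels.instance }} in job {{ $labels.job }} is down"
```

The `for: 5m` clause is what keeps a single missed scrape from waking anyone up; only sustained unreachability fires the alert.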
Category 2: Kubernetes Control Plane - Is the Cluster Healthy?
The Kubernetes API server is the brain of the cluster. Every operation - pod scheduling, scaling, configuration changes, health checks - goes through it.
Key metric:
- `apiserver_request_total` - Total API server requests, broken down by HTTP status code, verb, and resource type. This tells you the volume of cluster operations and, critically, the error rate.
A spike in 5xx responses from the API server means the cluster itself is struggling. If the API server is unhealthy, everything downstream - deployments, scaling, self-healing - stops working correctly.
Why it matters: On a multi-tenant platform, API server health affects all tenants simultaneously. A control plane issue is a platform-wide issue.
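One way to turn that 5xx spike into an alert is to compute the error ratio over a rolling window. This is a sketch; the 5% threshold and the time windows are illustrative assumptions, and the right values depend on your baseline traffic:

```yaml
- alert: ApiServerHighErrorRate
  # fraction of API server requests returning 5xx over the last 5 minutes
  expr: |
    sum(rate(apiserver_request_total{code=~"5.."}[5m]))
      /
    sum(rate(apiserver_request_total[5m])) > 0.05
  for: 10m
  labels:
    severity: critical
```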
Category 3: Pod Health and Lifecycle - What's Running and What's Struggling?
With dozens of tenant instances, each running multiple pods, knowing the health state of every pod is essential.
Key metrics:
- `kube_pod_status_phase` - Categorizes every pod as Pending, Running, Succeeded, Failed, or Unknown. A pod stuck in Pending means scheduling failed. A pod in Failed means something crashed.
- `kube_pod_status_ready` - Is the pod actually serving traffic? A pod can be Running but not Ready (still initializing, failing health checks). This is your basic service-level indicator.
- `kube_pod_container_status_restarts_total` - The restart counter. A rising value signals CrashLoopBackOff - the pod starts, crashes, restarts, crashes, restarts. This is the single most important early warning for application issues.
- `kube_pod_container_status_waiting_reason` - When a pod is stuck waiting, this tells you why: ImagePullBackOff (can't pull the Docker image), resource limits (not enough CPU/memory to schedule), configuration errors.
- `kube_pod_status_unschedulable` - The cluster doesn't have room for this pod. Either resource requests are too high, node taints are misconfigured, or the cluster needs to scale up.
Why it matters: On a multi-tenant platform, a pod issue for one tenant shouldn't cascade to others. These metrics let us isolate exactly which tenant's pods are struggling and why - without logging into individual instances.
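For the restart counter specifically, alerting on the rate of change works better than alerting on the raw cumulative value. A hedged sketch - the "4 restarts in 10 minutes" threshold is an illustrative assumption:

```yaml
- alert: PodCrashLooping
  # more than 4 container restarts within a 10-minute window
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 4
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

Because the expression carries the `namespace` and `pod` labels through, the alert pinpoints the affected tenant without any extra lookup.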
Category 4: Workload Controllers - Are Deployments Succeeding?
Kubernetes controllers (Deployments, StatefulSets, DaemonSets) manage the desired state of your workloads. When a deployment specifies 2 replicas, the controller ensures 2 pods are running.
Key metrics:
- `kube_deployment_spec_replicas` vs. `kube_deployment_status_replicas_ready` - The gap between desired and actual replicas. If spec says 2 but ready says 1, something is wrong.
- Deployment conditions (`Progressing` and `Available`) - A deployment stuck in "Progressing" for too long means a rollout is stalled. Maybe the new image is failing health checks. Maybe resource limits are preventing scheduling.
- `kube_job_failed` and `kube_job_complete` - For our cron jobs and one-off deployment tasks, these tell us whether the job finished successfully or failed.
Why it matters: With wave-based deployments across many tenant instances, knowing that each deployment reached its desired state is critical. A stuck rollout on one instance shouldn't go unnoticed while we're deploying the next batch.
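The desired-vs-ready gap can be checked directly in PromQL. A sketch, assuming both series come from the same kube-state-metrics job so their labels match; the 15-minute grace period is an illustrative choice that should exceed your longest normal rollout:

```yaml
- alert: DeploymentReplicasMismatch
  # desired replicas != ready replicas for longer than a rollout should take
  expr: |
    kube_deployment_spec_replicas
      !=
    kube_deployment_status_replicas_ready
  for: 15m
  labels:
    severity: warning
```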
Category 5: Node Health - What's Happening to the Machines?
Our platform runs on a mix of on-demand and spot instances. Nodes come and go - that's by design. But we need to know when they do.
Key metrics:
- `kube_node_status_condition` - Is the node Ready or NotReady? On a spot-instance-heavy cluster, nodes transitioning to NotReady is expected but must be tracked to ensure workloads are rescheduled promptly.
- `kube_node_spec_unschedulable` - Detects when a node has been cordoned (marked as unschedulable), either manually or by the cluster autoscaler during scale-down.
- `kube_node_status_capacity` vs. `kube_node_status_allocatable` - How much headroom does the cluster have? If allocatable resources are consistently close to capacity, the cluster needs to scale before it hits a wall.
Why it matters: Spot instances save us 50-65% on compute costs, but they require vigilant node monitoring. A node reclaimed by AWS must result in pods being rescheduled - not workloads being silently lost.
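Headroom can be watched with a ratio of requested to allocatable resources. A sketch under stated assumptions: the 85% threshold is illustrative, and `kube_pod_container_resource_requests` is the kube-state-metrics v2 metric name (older releases expose per-resource variants instead):

```yaml
- alert: ClusterLowCpuHeadroom
  # total CPU requested by all pods vs. total allocatable CPU across nodes
  expr: |
    sum(kube_pod_container_resource_requests{resource="cpu"})
      /
    sum(kube_node_status_allocatable{resource="cpu"}) > 0.85
  for: 30m
```

Alerting well before 100% matters on a spot-heavy cluster, since a reclaimed node temporarily removes allocatable capacity while its pods look for a new home.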
Category 6: Container Resources - Who's Using What?
Right-sizing resource requests and limits is an ongoing process. These metrics guide that process.
Key metrics:
- `container_cpu_usage_seconds_total` - Actual CPU consumption per container. Compare this against requests and limits to identify over-provisioned or under-provisioned workloads.
- `container_cpu_cfs_throttled_seconds_total` - CPU throttling. If this value is high, the container's CPU limit is too low, and the application is being slowed down by the scheduler. This is a silent performance killer.
- `container_memory_working_set_bytes` - Real memory usage (not including cache). This is the metric that predicts OOM (Out of Memory) kills.
- `container_oom_events_total` - Actual OOM kills. When a container exceeds its memory limit, Kubernetes terminates it. This metric counts those events.
Why it matters: On a multi-tenant platform, resource contention between tenants is the "noisy neighbor" problem. These metrics ensure each tenant's resource allocation is appropriate and that no single tenant is starving others.
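One hedged way to surface the "silent performance killer" is to compare throttled time against used time. A sketch - the 0.25 ratio (a quarter-second throttled per second of CPU used) is an illustrative threshold, not a universal rule:

```yaml
- alert: ContainerCpuThrottled
  # seconds spent throttled per second of CPU actually consumed
  expr: |
    rate(container_cpu_cfs_throttled_seconds_total[5m])
      /
    rate(container_cpu_usage_seconds_total[5m]) > 0.25
  for: 15m
```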
Category 7: Ingress and Traffic - What's Coming in the Front Door?
Traefik serves as our ingress controller, routing traffic to the correct tenant instance based on the hostname.
Key metrics:
- `traefik_entrypoint_requests_total` - Total request volume across the platform. Broken down by entrypoint, this shows traffic patterns and peaks.
- `traefik_entrypoint_request_duration_seconds` - Request latency at the ingress level. We watch p95 and p99 values - if these climb, something downstream is degrading.
- `traefik_open_connections` - Connection pressure. A spike in open connections could indicate a slow backend, a DDoS attempt, or a misconfigured client.
- `traefik_config_last_reload_success` - After every configuration change (new tenant added, routing updated), did Traefik successfully reload? A failed reload means new tenants aren't reachable.
Why it matters: Traefik is the front door to the entire platform. Latency or errors here affect every single tenant. This is where we detect platform-wide issues fastest.
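The p99 latency we watch can be derived from the duration histogram with `histogram_quantile`. A sketch; the 2-second threshold is an illustrative assumption:

```yaml
- alert: IngressP99LatencyHigh
  # 99th percentile request latency at the ingress, per entrypoint
  expr: |
    histogram_quantile(0.99,
      sum by (le, entrypoint) (
        rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))) > 2
  for: 10m
```

Keeping `entrypoint` in the aggregation means a latency regression on one entrypoint isn't averaged away by healthy traffic on the others.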
Category 8: GitOps and Deployments - Is Argo CD Keeping Up?
Argo CD manages the desired state for every tenant instance, syncing Kubernetes resources from our Git repository.
Key metrics:
- `argocd_app_info` - Deployment state for every application. Which version is deployed where? Is each app synced, out-of-sync, or in an error state?
- `argocd_app_orphaned_resources_count` - Drift detection. Orphaned resources mean something exists in the cluster that isn't defined in Git - a sign of manual intervention or configuration drift.
- `argocd_app_reconcile_*` - Reconciliation performance. How quickly is Argo CD processing changes? With dozens of applications to manage, reconciliation latency matters. If Argo falls behind, deployments stall.
Why it matters: In a GitOps-driven platform, Argo CD is the deployment engine. If it's slow or erroring, updates don't reach tenants.
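A minimal sketch of a drift alert built on `argocd_app_info`, which exposes each app's state via its `sync_status` label. The 30-minute grace period is an illustrative choice that leaves room for in-progress syncs:

```yaml
- alert: ArgoAppOutOfSync
  # any application whose live state has drifted from Git for 30+ minutes
  expr: argocd_app_info{sync_status!="Synced"} == 1
  for: 30m
  annotations:
    summary: "Argo CD app {{ $labels.name }} is {{ $labels.sync_status }}"
```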
Category 9: Queue Processing - Are Messages Being Handled?
Our platform processes email sends, webhook deliveries, and async tasks through SQS queues consumed by Symfony Messenger workers.
Key metrics:
- `workqueue_depth` - The number of items waiting in the queue. A growing queue means consumers can't keep up with producers. This is an early warning for email delivery delays.
- `workqueue_queue_duration_seconds` - How long items sit in the queue before processing. Rising values mean increased latency for email sends.
- `workqueue_unfinished_work_seconds` - Time spent on items currently being processed. Unusually high values may indicate stuck or very slow processing.
- `workqueue_retries_total` - Retry volume. A spike in retries means errors are occurring - failed API calls, transient service issues, or data problems.
Why it matters: For a marketing automation platform, email delivery speed is the product. Queue metrics are directly tied to the core value proposition.
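Since a deep-but-draining queue is fine while a deep-and-growing one is not, the backlog alert can combine an absolute depth with a trend check. A sketch - both the depth of 100 and the `deriv` window are illustrative assumptions:

```yaml
- alert: QueueBacklogGrowing
  # queue is non-trivially deep AND its depth is still trending upward
  expr: workqueue_depth > 100 and deriv(workqueue_depth[15m]) > 0
  for: 10m
  labels:
    severity: warning
```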
Category 10: Storage - Are Volumes Healthy?
After our migration from EFS to FSx (a story worth its own blog post), storage monitoring became non-negotiable.
Key metrics:
- `kube_persistentvolumeclaim_status_phase` - Is each PVC in `Bound` state? A PVC stuck in `Pending` means a tenant's storage isn't available - and their instance is likely broken.
- `csi_operations_seconds` - Latency of storage driver operations. If the CSI driver is slow, every file operation across the platform is slow. This was exactly the metric that would have caught our EFS performance issues earlier.
Why it matters: Storage is the foundation. If it's degraded, everything built on top of it suffers - and the symptoms are diffuse and hard to diagnose without specific storage metrics.
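The PVC check is one of the simplest alerts on the platform, which is fitting for how fundamental it is. A sketch; the 10-minute grace period is an illustrative allowance for slow dynamic provisioning:

```yaml
- alert: PvcStuckPending
  # a tenant's volume claim has not bound to a volume
  expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
  for: 10m
  annotations:
    summary: "PVC {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is stuck Pending"
```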
Building Dashboards That Answer Questions
Having 671 metrics is meaningless if you can't extract actionable information from them. Here's our approach to dashboard design:
Every dashboard answers a specific question. Not "show me everything about pods" but "which tenant pods are unhealthy right now, and why?" The question determines which metrics appear and how they're visualized.
Start with community dashboards, then customize. The Kubernetes community maintains excellent Grafana dashboards that cover most of these categories. We used them as our baseline and then built focused views for our multi-tenant-specific concerns: per-tenant resource usage, cross-tenant queue performance, tenant deployment status.
Alert on trends, not thresholds. A single pod restart isn't an emergency. Five pod restarts in ten minutes is. We build alerts around rate-of-change patterns rather than static thresholds wherever possible.
What 671 Metrics Taught Us
Monitoring at this scale taught us that observability isn't about collecting more data - it's about collecting the right data and making it accessible. Every metric category we've described serves a specific purpose in our operational workflow. Remove any one category, and we'd have a blind spot that would eventually cause an incident.
The investment in monitoring infrastructure pays for itself the first time you catch a stuck deployment, a storage degradation, or a queue backup before it affects tenants. For a multi-tenant platform, observability isn't a nice-to-have - it's a core feature of the product.
[Contact us](https://www.droptica.com/contact/) if you're building a multi-tenant platform and need help designing an observability strategy that scales with your tenant count.
Written by
Mautomic Team
The Mautomic team brings together experienced marketing automation specialists, developers, and consultants dedicated to helping businesses succeed with Mautic.