Prometheus has become the de facto standard for metrics collection in cloud-native environments. Its pull-based model, powerful query language, and deep Kubernetes integration make it an obvious choice for platform teams. But as organizations scale — more services, more replicas, more labels — Prometheus starts showing cracks. Queries slow down, memory usage balloons, and what was once a reliable monitoring backbone becomes an operational liability. This article examines exactly why that happens and what you can do about it, from quick tactical fixes to full architectural overhauls.
The Cardinality Problem: Why It Kills Prometheus
Cardinality is the single most important concept to understand when troubleshooting Prometheus scalability. In the context of time series databases, cardinality refers to the total number of unique label combinations that exist across all your metrics. Every unique combination creates a distinct time series, and Prometheus must store, index, and query each of them independently.
Consider a simple HTTP request counter: http_requests_total. If you label it with method (GET, POST, PUT, DELETE), status_code (200, 201, 400, 404, 500, 503), and endpoint (50 distinct API paths), you already have 4 × 6 × 50 = 1,200 time series from a single metric. Now add a customer_id label with 10,000 distinct values. You have just created 12 million time series from one counter.
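You can measure this multiplication directly in PromQL before it becomes a problem. A quick sketch, using the metric and label names from the example above:

```promql
# How many active series does this one metric contribute?
count({__name__="http_requests_total"})

# Which label is the multiplier? Count the distinct values of each suspect label.
count(count by (customer_id) (http_requests_total))
```

Running the second query for each label on a metric tells you exactly which dimension is driving the product.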
This is the cardinality explosion pattern, and it is the most common cause of Prometheus degradation in production. The problem is compounded by labels that have unbounded or high-entropy values:
- User IDs or session tokens embedded in labels
- Request IDs or trace IDs (effectively infinite cardinality)
- Pod names without proper aggregation, especially in autoscaling environments
- Free-form error messages or SQL query strings
- IP addresses, particularly in environments with high churn
The relationship between cardinality and memory consumption is roughly linear in series count, but each series carries significant fixed overhead in the in-memory indexing structures. Prometheus stores its head block (the most recent data) entirely in memory. Each time series in the head block requires approximately 3–4 KB of RAM for the series itself plus index entries. A Prometheus instance with 1 million active time series will typically consume 4–6 GB of RAM for the head block alone, before accounting for query processing overhead.
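As a rough sanity check, you can compare that rule of thumb against a live instance using its own self-scraped metrics (the 4 KB figure is the approximation from above, not an exact constant):

```promql
# Estimated head-block memory: active series × ~4 KB
prometheus_tsdb_head_series * 4096

# Compare against the process's actual resident memory
process_resident_memory_bytes{job="prometheus"}
```

If resident memory is several times the estimate, query load or churn — not raw series count — is likely the dominant consumer.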
Memory Explosion Patterns and Real Symptoms
Memory issues in Prometheus rarely announce themselves cleanly. Instead, they manifest through a cascade of symptoms that are easy to misdiagnose. Understanding the failure modes helps you identify the root cause faster and apply the right remedy.
The Head Block Growth Pattern
Prometheus keeps roughly a two-hour window of data in memory as the head block before compacting it to disk. If your series count grows continuously — which happens when pod churn creates new series faster than old ones go stale — the head block never shrinks. You can monitor this directly with prometheus_tsdb_head_series (and prometheus_tsdb_head_chunks for the chunk count). A healthy instance shows the series count plateauing. A cardinality problem shows it growing monotonically until the process is OOM-killed.
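One way to catch the "never plateaus" pattern early is to watch net series growth rather than the absolute count; a sketch:

```promql
# Net new series over the last hour — should hover near zero on a healthy instance
increase(prometheus_tsdb_head_series_created_total[1h])
  - increase(prometheus_tsdb_head_series_removed_total[1h])
```

A sustained positive value here means the head block is growing faster than old series are expiring.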
Query Timeout Cascades
As series count grows, even well-written PromQL queries that worked fine at 100k series become unbearably slow at 1M. Grafana dashboards start timing out, alert evaluation lags behind schedule, and Alertmanager begins receiving delayed or duplicated firing alerts. The prometheus_rule_evaluation_duration_seconds metric is a reliable early warning — when p99 evaluation time for your recording rules exceeds your evaluation interval, you have a problem.
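A minimal alerting rule for that early warning might look like this (the 30-second threshold assumes a 30-second evaluation interval; adjust it to yours):

```yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: RuleEvaluationSlowerThanInterval
        # p99 evaluation time exceeding the evaluation interval
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule evaluation is slower than the evaluation interval"
```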
Scrape Failures Under Memory Pressure
When Prometheus is under heavy memory pressure, its Go garbage collector spends more time collecting, which introduces latency into the scrape loop. Scrapes begin timing out, causing gaps in your data. This creates a deceptive situation where you have gaps in metrics precisely when your system is under stress — exactly when you need monitoring most. Watch for drops in the up metric and for scrape_duration_seconds creeping toward your configured scrape timeout.
Compaction Pressure
High cardinality also stresses the TSDB compaction process. Prometheus compacts head block data into persistent blocks every two hours. With millions of series, compaction can take tens of seconds to minutes, during which write performance degrades. prometheus_tsdb_compaction_duration_seconds rising above 30 seconds is a warning sign. Compaction failures leave orphaned blocks on disk, gradually consuming storage and potentially corrupting the TSDB if left unaddressed.
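Alongside the duration metric, failed compactions are worth alerting on directly; a sketch:

```promql
# Compaction runs roughly every two hours, so look back over a few cycles
increase(prometheus_tsdb_compactions_failed_total[6h]) > 0
```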
Short-Term Fixes: Tactical Remediation
When you are dealing with a Prometheus instance under active stress, you need immediate relief before you can implement architectural changes. These techniques can be applied quickly and provide meaningful headroom while longer-term solutions are planned.
Recording Rules: Pre-Computing Aggregations
Recording rules are the most underutilized tool in the Prometheus toolbox. They allow you to pre-compute expensive PromQL expressions and store the results as new time series. The key benefit for scalability is that you can aggregate away high-cardinality dimensions, dramatically reducing the number of series that dashboards and alerts need to query at runtime.
Consider an example where you have per-pod HTTP request rates with labels for pod, namespace, service, method, and status_code. Your dashboards mostly need service-level aggregations, not per-pod breakdowns. A recording rule can produce that aggregation once per evaluation interval:
groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: |
          sum by (job, namespace, method, status_code) (
            rate(http_requests_total[5m])
          )
      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, namespace, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, status_code) (
            rate(http_requests_total[5m])
          )

Notice that the pod label is dropped in all three rules. If you had 500 pods, you have just reduced the cardinality of these series by a factor of 500. Dashboards querying job:http_requests_total:rate5m instead of computing rate(http_requests_total[5m]) on the fly will return results orders of magnitude faster.
The naming convention level:metric:operations is the Prometheus community standard. Following it consistently makes recording rules self-documenting and helps teams understand the aggregation level at a glance.
Metric Dropping via Relabeling
Relabeling gives you surgical control over what metrics Prometheus actually ingests. There are two stages where relabeling applies: relabel_configs (applied before scraping, based on target metadata) and metric_relabel_configs (applied after scraping, based on scraped metric names and labels). For cardinality control, metric_relabel_configs is your primary tool.
Dropping entire metric families that you do not use is the most impactful change you can make. Many exporters emit dozens of metrics that are irrelevant for most use cases:
scrape_configs:
  - job_name: kubernetes-pods
    metric_relabel_configs:
      # Drop metrics we never query
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*|process_.*'
        action: drop
      # Blank out a high-cardinality label while keeping the metric
      - source_labels: [__name__, pod]
        regex: 'http_requests_total;.*'
        target_label: pod
        replacement: ''
      # Drop unwanted histogram buckets. Note: an `action: keep` here would
      # drop every OTHER metric too, so enumerate the buckets to remove
      # and drop them instead (bucket values are illustrative).
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(5|10|25|50)'
        action: drop
      # Replace high-cardinality endpoint paths with normalized versions
      - source_labels: [endpoint]
        regex: '/api/v1/users/[0-9]+'
        target_label: endpoint
        replacement: '/api/v1/users/:id'

Be careful with metric_relabel_configs — they are applied to every scraped sample, so computationally expensive regex patterns across high-frequency scrapes add measurable CPU overhead. Test your patterns and prefer simple, anchored expressions.
Cardinality Limits as a Safety Net
Prometheus 2.x introduced per-scrape sample limits as a defensive mechanism. These do not solve cardinality problems but prevent a single misbehaving exporter from taking down your entire Prometheus instance:
global:
  # Default limit applied to all scrape jobs (0 = no limit)
  sample_limit: 0

scrape_configs:
  - job_name: application-pods
    # Reject scrapes that return more than 50k samples
    sample_limit: 50000
    # Max number of labels allowed per scraped sample
    label_limit: 64
    # Limit label name and value lengths
    label_name_length_limit: 256
    label_value_length_limit: 1024
    kubernetes_sd_configs:
      - role: pod

When a scrape exceeds sample_limit, Prometheus rejects the entire scrape and marks the target as down — its up metric goes to 0. This is a hard circuit breaker, not graceful degradation. Set limits conservatively above your expected maximum to avoid false positives, and alert on growth in prometheus_target_scrapes_exceeded_sample_limit_total.
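Since that metric is a counter and only ever increases, alert on recent growth rather than the raw value; for example:

```yaml
- alert: ScrapeSampleLimitExceeded
  # Fires while any target has recently blown through its sample_limit
  expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0
  labels:
    severity: warning
  annotations:
    summary: "A target exceeded its sample_limit and its scrapes are being rejected"
```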
Architectural Solutions: Federation and Remote Write
Once you have exhausted tactical optimizations or when your scale genuinely exceeds what a single Prometheus instance can handle, architectural changes become necessary. Prometheus offers two built-in mechanisms for scaling horizontally: federation and remote_write.
Federation: Hierarchical Scraping
Prometheus federation allows one Prometheus instance to scrape aggregated metrics from other Prometheus instances via the /federate endpoint. In a typical setup, leaf-level Prometheus instances collect raw metrics from targets, while a global Prometheus instance federates pre-aggregated recording rule results from the leaves.
# Global Prometheus configuration federating from regional instances
scrape_configs:
  - job_name: federate-regional
    scrape_interval: 15s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Only federate pre-aggregated recording rule metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        - '{__name__=~"cluster:.*"}'
        # Federate key infrastructure health series
        - 'up{job="kubernetes-apiservers"}'
    static_configs:
      - targets:
          - prometheus-eu-west.monitoring.svc:9090
          - prometheus-us-east.monitoring.svc:9090
          - prometheus-ap-south.monitoring.svc:9090

Federation works well for multi-region global dashboards and cross-cluster alerting on aggregated signals. Its limitations are significant, though: each /federate scrape is a point-in-time snapshot of the leaf's current values, so the global instance only accumulates history from the moment federation starts, at its own scrape resolution. It also creates a single point of failure at the global layer and does not provide true long-term storage. For those requirements, remote_write is the better path.
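For federation to work cleanly, each leaf instance needs distinguishing external_labels so aggregated series from different regions do not collide at the global layer. A sketch of the leaf-side configuration (label names are illustrative):

```yaml
# Leaf Prometheus (e.g. eu-west): identify the origin of every series
global:
  external_labels:
    region: eu-west
    cluster: prod-eu-west-1
```

Combined with honor_labels: true on the federating side, these labels survive into the global instance.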
Remote Write: Streaming to Durable Storage
Remote write allows Prometheus to stream all ingested samples to an external storage backend in real time. The external backend handles long-term retention, multi-tenancy, and global query federation. Prometheus itself becomes a stateless collection agent that maintains only a short local retention window for resilience against network outages.
remote_write:
  - url: https://thanos-receive.monitoring.svc:19291/api/v1/receive
    # Authentication for the remote endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/remote-write-password
    # Tune the write queue for throughput vs. latency
    queue_config:
      # Number of shards (parallel write connections)
      max_shards: 200
      min_shards: 1
      # Samples to batch before flushing
      max_samples_per_send: 500
      # Time to wait before flushing an incomplete batch
      batch_send_deadline: 5s
      # In-memory buffer capacity per shard
      capacity: 2500
      # Retry backoff for failed writes
      min_backoff: 30ms
      max_backoff: 5s
    # Metadata configuration
    metadata_config:
      send: true
      send_interval: 1m
    # Filter what gets remote-written (reduce egress)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop

The queue_config tuning is critical and frequently misunderstood. Each shard maintains its own connection to the remote endpoint and its own in-memory queue. Increasing max_shards raises parallelism and throughput but also increases memory consumption and load on the remote endpoint. The right values depend heavily on your sample ingestion rate and network latency to the remote endpoint. Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds against prometheus_remote_storage_highest_timestamp_in_seconds — the gap between them tells you how far behind your remote write queue is.
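That lag check can be collapsed into a single expression; this sketch assumes one remote endpoint per instance (with several endpoints you would aggregate per remote_name instead of taking the max):

```promql
# Seconds of remote-write lag; alert when it exceeds, say, two minutes
(
  max(prometheus_remote_storage_highest_timestamp_in_seconds)
  - max(prometheus_remote_storage_queue_highest_sent_timestamp_seconds)
) > 120
```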
Long-Term Solutions: Thanos vs Grafana Mimir vs VictoriaMetrics
For production systems that need long-term storage, global query capability, high availability, and genuine horizontal scalability, purpose-built solutions are the right answer. Three projects dominate this space: Thanos, Grafana Mimir, and VictoriaMetrics. They share similar goals but differ significantly in architecture, operational complexity, and trade-offs.
| Criterion | Thanos | Grafana Mimir | VictoriaMetrics |
|---|---|---|---|
| Architecture | Sidecar + object store; modular components | Fully distributed; Cortex-derived microservices | Single binary or cluster mode |
| Storage backend | Any S3-compatible object store | Any S3-compatible object store | Own TSDB format on local or object store |
| PromQL compatibility | Full PromQL; own query engine | Full PromQL; Mimir-specific extensions | MetricsQL (PromQL superset) |
| Operational complexity | Medium — multiple components, each simple | High — many microservices with complex config | Low — minimal components, simple config |
| Ingest scalability | Scales via Thanos Receive fan-out | Horizontally scalable distributors + ingesters | Excellent; handles millions of samples/sec per node |
| Query performance | Good; Store Gateway caches object store data | Good; query sharding and caching built in | Excellent; highly optimized query engine |
| Multi-tenancy | Limited; tenant isolation via external labels | Native; per-tenant limits and isolation | Tenant IDs in cluster mode; per-tenant limits are enterprise |
| Deduplication | Built-in; replica dedup at query time | Built-in; ingest-time and query-time dedup | Built-in; dedup with downsampling |
| Downsampling | Yes; Thanos Compactor handles it | Yes; configurable per tenant | Enterprise feature only |
| License | Apache 2.0 (fully open source) | AGPL-3.0 (open source) + enterprise tier | Apache 2.0 (community); proprietary enterprise |
| Best fit | Teams already running Prometheus wanting minimal disruption | Large orgs needing multi-tenant SaaS-grade monitoring | Teams prioritizing simplicity and raw performance |
Thanos: The Incremental Path
Thanos integrates with existing Prometheus deployments through a sidecar process that runs alongside each Prometheus pod. The sidecar uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API that Thanos Query uses to federate queries across all Prometheus instances plus historical data in the object store. This makes Thanos the lowest-friction path for teams with existing Prometheus infrastructure.
Thanos Receive is an alternative ingest path that accepts remote_write directly, which is useful when you want to decouple Prometheus instances from the query layer or implement active-active HA without relying on Prometheus replication. Thanos Compactor handles block compaction and downsampling on the object store, creating 5-minute and 1-hour resolution downsamples automatically for efficient long-range queries.
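To complete the picture, Thanos Query fans out over the sidecars' Store APIs at query time; a minimal sketch of its flags (the service names and the replica label value are assumptions for illustration):

```yaml
# Thanos Query container args (illustrative)
args:
  - query
  # Discover sidecars via DNS SRV records
  - --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc
  # Also query historical blocks through the Store Gateway
  - --store=thanos-store-gateway.monitoring.svc:10901
  # Deduplicate series from HA Prometheus pairs by this external label
  - --query.replica-label=replica
```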
Grafana Mimir: Enterprise-Grade Multi-Tenancy
Mimir is a fork of Cortex, rewritten by Grafana Labs to address operational complexity issues in Cortex’s architecture. It follows the same microservices pattern — Distributor, Ingester, Querier, Query Frontend, Store Gateway, Compactor, Ruler — but with significantly improved defaults and a monolithic deployment mode that simplifies small-scale deployments. Mimir’s headline feature is native multi-tenancy with per-tenant cardinality limits, query limits, and ingestion rate limits enforced at the distributor layer.
Mimir is the right choice when you need to run monitoring as an internal platform service for multiple teams or business units, each with independent resource quotas and data isolation. The operational overhead is substantial, but for large organizations it is justified by the isolation and governance capabilities.
VictoriaMetrics: Simplicity and Raw Performance
VictoriaMetrics takes a fundamentally different approach: rather than building on top of Prometheus’s TSDB format, it implements its own highly optimized storage engine. The result is dramatically better compression (often 5–10x better than Prometheus TSDB) and query performance that consistently outperforms Thanos and Mimir in benchmarks, particularly for high-cardinality workloads and large time ranges. The single-node binary handles workloads that would require a full Thanos cluster, and the cluster version adds horizontal scalability with fewer moving parts than Thanos or Mimir.
VictoriaMetrics also supports MetricsQL, a superset of PromQL that adds useful functions such as outliersk(), limit_offset(), and improved histogram handling. Grafana datasource compatibility is maintained through a PromQL-compatible API, so existing dashboards work without modification.
Practical Guide: Choosing Your Scaling Approach
The right solution depends on your current scale, team capacity, and trajectory. This is not a one-size-fits-all decision. Here is a pragmatic framework for matching the solution to the problem.
Stage 1: Under 1 Million Active Series
A single Prometheus instance with proper tuning should handle this comfortably. Focus on recording rules to eliminate expensive dashboard queries, implement metric_relabel_configs to drop unused metrics, and set sample_limit guards. Increase Prometheus memory limits to give it adequate headroom (at minimum 8 GB, ideally 16 GB for instances approaching 1M series). Set --storage.tsdb.retention.time to the minimum that satisfies your compliance and debugging needs — 15 days is often enough if you have remote_write configured to a longer-term store.
Stage 2: 1–5 Million Active Series
At this scale, a single instance is viable but requires vertical scaling and aggressive optimization. Consider sharding your Prometheus deployment by functional area: one instance for infrastructure metrics, one for application metrics, one for business metrics. This is horizontal scaling via functional decomposition, not true distributed architecture. Add remote_write to object storage for long-term retention. If you are running Kubernetes, the Prometheus Operator with multiple Prometheus custom resources per namespace group is a clean implementation of this pattern.
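Beyond functional splits, a single large job can also be sharded across Prometheus replicas with hashmod relabeling; a sketch in which this instance keeps shard 0 of 3:

```yaml
scrape_configs:
  - job_name: sharded-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 3 buckets
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_shard
        action: hashmod
      # This replica keeps only bucket 0; replicas 1 and 2 keep theirs
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```

Each replica runs the same config with a different regex value, so every target lands on exactly one instance.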
This is also the stage where VictoriaMetrics single-node becomes compelling — it can handle this range comfortably with far less RAM than Prometheus and simpler operations than a full distributed system.
Stage 3: 5 Million+ Active Series or Global Requirements
At this scale, a distributed architecture is necessary. Your choice among Thanos, Mimir, and VictoriaMetrics Cluster depends primarily on:
- Existing Prometheus investment + incremental migration: Thanos Sidecar is the path of least resistance. Your existing Prometheus instances keep working; you add sidecars and deploy Thanos query components.
- Multi-tenant platform with governance requirements: Grafana Mimir, accepting the operational complexity in exchange for native tenant isolation and limits.
- Maximum performance with minimal operational burden: VictoriaMetrics Cluster, replacing Prometheus entirely or alongside it via remote_write, with dramatically simpler operations than Thanos or Mimir.
- Multi-region, cross-cloud global monitoring: Thanos or Mimir; both have mature multi-region architectures. VictoriaMetrics Enterprise has similar capabilities but is not open source.
Complementary Configuration: Thanos Sidecar Example
For teams adopting Thanos, the sidecar configuration alongside a Prometheus deployment looks like this in a Kubernetes environment:
# Thanos sidecar configuration (as part of the Prometheus pod spec)
containers:
  - name: prometheus
    image: prom/prometheus:v2.48.0
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Keep 2h locally; Thanos handles long-term storage
      - --storage.tsdb.retention.time=2h
      # Thanos sidecar requires min-block-duration = max-block-duration
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.32.0
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      # Object store configuration
      - --objstore.config-file=/etc/thanos/objstore.yml
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
      - name: thanos-objstore-config
        mountPath: /etc/thanos
---
# Object store configuration (S3-compatible)
# /etc/thanos/objstore.yml
type: S3
config:
  bucket: my-thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
  # Use an IAM role, or provide credentials via environment variables
  access_key: ""
  secret_key: ""

VictoriaMetrics as Remote Write Target
If you choose VictoriaMetrics as your remote storage backend, the integration with existing Prometheus instances is straightforward. VictoriaMetrics exposes a remote_write compatible endpoint at /api/v1/write:
# prometheus.yml — remote write to VictoriaMetrics
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30

# VictoriaMetrics single-node startup (Docker Compose example)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.95.1
    command:
      - -storageDataPath=/victoria-metrics-data
      # Retain data for 12 months
      - -retentionPeriod=12
      # Enable deduplication (for HA Prometheus pairs)
      - -dedup.minScrapeInterval=15s
      # Memory limit as a percentage of available RAM
      - -memory.allowedPercent=60
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/victoria-metrics-data

volumes:
  vm-data:

VictoriaMetrics also exposes a Prometheus-compatible query API at /api/v1/query and /api/v1/query_range, so Grafana datasources pointing at it need only a URL change — no plugin installation is required for basic use. For MetricsQL-specific functions, use the VictoriaMetrics datasource plugin available in Grafana's plugin catalog.
Observing Your Prometheus Health
Before implementing any of these solutions, establish a baseline understanding of your Prometheus instance’s current health. The following PromQL expressions give you immediate visibility into the key indicators:
# Total active time series in the head block
prometheus_tsdb_head_series

# Series created vs. removed (churn indicator)
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])

# Memory usage of the head block chunks
prometheus_tsdb_head_chunks_storage_size_bytes

# Remote write lag (seconds behind)
(
  prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
)

# Top cardinality contributors
topk(20,
  count by (__name__) ({__name__!=""})
)

# Average rule evaluation duration
rate(prometheus_rule_evaluation_duration_seconds_sum[5m])
  / rate(prometheus_rule_evaluation_duration_seconds_count[5m])

Prometheus also exposes a /api/v1/status/tsdb endpoint (available since v2.14) that returns cardinality statistics, including the top metrics by series count and the top label names by cardinality. This is invaluable for identifying which specific metrics or labels are causing problems and should be your first stop when investigating a new cardinality issue.
Frequently Asked Questions
How do I identify which metrics are causing my cardinality explosion?
Start with the /api/v1/status/tsdb endpoint on your Prometheus instance. It returns a JSON response with seriesCountByMetricName, seriesCountByLabelValuePair, and seriesCountByLabelName arrays, each showing the top contributors to your total series count. This points you directly at the offending metrics and labels without any external tooling. Complement this with topk(20, count by (__name__)({__name__!=""})) in PromQL, which gives you the same information in a queryable format you can alert on. Once you know the metric name, query count by (label1, label2) (your_metric_name) replacing label pairs to identify which specific label dimensions are driving the high count.
Can I run Prometheus in HA without external dependencies?
Yes, but with important caveats. The standard HA pattern for standalone Prometheus is to run two identical Prometheus instances scraping the same targets. Both instances collect data independently, and Alertmanager deduplicates alerts from both using its mesh clustering (run multiple Alertmanager instances in a cluster and point both Prometheus instances at all of them). This provides alerting HA — alerts fire even if one Prometheus instance is down. It does not provide query HA in the traditional sense, because each instance has its own independent data and queries against a failed instance simply fail. Dashboards pointing at a specific instance will show gaps during that instance’s downtime. For true query HA with failover and deduplication, you need Thanos Query (which can deduplicate replica series at query time using the replica external label) or a similar solution. Running Prometheus without any external dependencies means accepting these query HA limitations.
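Concretely, the pattern means every Prometheus replica lists every Alertmanager cluster member; a sketch (service names are illustrative):

```yaml
# Identical on both Prometheus replicas
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager-0.monitoring.svc:9093
            - alertmanager-1.monitoring.svc:9093
            - alertmanager-2.monitoring.svc:9093
```

Because both replicas send every alert to every Alertmanager, the cluster's gossip-based deduplication ensures each alert is delivered once.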
What is a safe maximum cardinality for a single Prometheus instance?
There is no universal number — it depends heavily on your scrape interval, available RAM, and query patterns. A practical guideline: allocate 3–4 GB of RAM per million active series for the head block alone, then add 50% headroom for query processing. A Prometheus instance with 16 GB of RAM can comfortably handle 2–3 million active series under typical workloads. Beyond 5 million series, even well-resourced single instances start showing query performance degradation that impacts alert evaluation reliability. The more meaningful limit to enforce operationally is series churn rate: an instance creating more than 100,000 new series per minute will struggle regardless of total series count, because the head block indexing operations become a bottleneck. Monitor rate(prometheus_tsdb_head_series_created_total[5m]) and treat sustained values above 50,000/minute as a warning condition.
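The churn guideline translates directly into an alert expression (rate() is per-second, hence the multiplication by 60):

```promql
# Sustained series churn above ~50k new series per minute
rate(prometheus_tsdb_head_series_created_total[5m]) * 60 > 50000
```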
Should I use Thanos Sidecar or Thanos Receive for ingestion?
The choice comes down to whether you want to keep Prometheus as the authoritative ingest layer or move toward a push-based architecture. Thanos Sidecar is the simpler, lower-risk option: Prometheus continues operating normally, the sidecar uploads completed blocks to object storage in the background, and you gain long-term storage and global query capability with minimal disruption. The drawback is that Prometheus must have local storage for at least 2 hours (one block duration), and the sidecar requires that min-block-duration equals max-block-duration, which prevents Prometheus from doing its own compaction. Thanos Receive accepts remote_write from any Prometheus instance, which enables active-active HA setups where multiple Prometheus replicas write to a Receive hashring simultaneously, and Receive handles deduplication. This is more complex to operate but provides better ingest-side redundancy. For most teams starting with Thanos, Sidecar is the right first step. Receive makes sense when you are building a centralized monitoring platform that accepts writes from many Prometheus instances across different clusters or environments.
Is it worth migrating from Thanos to VictoriaMetrics or Mimir once you are already running Thanos?
Migration from Thanos to an alternative should be driven by specific pain points, not by benchmark numbers alone. If your team is spending significant time operating Thanos (debugging query Store Gateway cache issues, managing compactor conflicts, handling block upload failures), and your primary need is simplicity and query performance rather than multi-tenancy, VictoriaMetrics is worth evaluating seriously. The migration path is smooth: run VictoriaMetrics alongside Thanos temporarily, migrate remote_write targets to VictoriaMetrics, and decommission Thanos once you are satisfied. Historical data in your object store can be imported using VictoriaMetrics’s vmctl tool. If your pain point is multi-tenancy and governance — multiple teams with independent data isolation, per-tenant rate limits, chargeback requirements — Mimir is the right destination and the operational complexity is justified. The one scenario where staying with Thanos is usually the right call is when your organization has invested heavily in Thanos tooling, has stable operations, and does not have specific unmet needs. Migration carries real costs in engineering time and operational risk; make sure the benefits are concrete and quantified before committing.