Prometheus 3.0 and OpenTelemetry: Native OTLP Support Explained

Seven years is a long time in observability. Since Prometheus 2.0 landed in 2017, the ecosystem has been transformed by cloud-native adoption, the rise of distributed tracing, and the emergence of OpenTelemetry as the de facto standard for instrumentation. Prometheus 3.0, released in November 2024, is the project’s answer to that transformation — and its most significant change is the native ability to ingest OpenTelemetry metrics directly, without an intermediary collector standing in the way.

This article goes deep on what Prometheus 3.0 actually changes for platform engineers and cloud architects who are running — or planning to run — OTel-instrumented workloads alongside Prometheus-based monitoring stacks. We will cover the native OTLP ingestion endpoint, UTF-8 metric name support, Remote Write 2.0, migration considerations, and the architectural patterns that still make sense even when native OTLP is available.

What Changed in Prometheus 3.0: The OTel-Relevant Picture

Prometheus 3.0 ships a substantial set of changes. Not all of them are equally relevant to OpenTelemetry integration, so let’s focus on what actually moves the needle for OTel users before diving into each area in detail.

Native OTLP Ingestion

The flagship feature: Prometheus 3.0 ships with a built-in OTLP receiver that exposes an HTTP endpoint accepting metrics in the OpenTelemetry Protocol format. Applications instrumented with any OTel SDK can now push metrics directly to Prometheus without routing through an OpenTelemetry Collector. This is not a sidecar, not a plugin, not an external adapter — it is a first-class endpoint in the Prometheus binary itself.

UTF-8 Metric Names

Prometheus historically restricted metric names to [a-zA-Z_:][a-zA-Z0-9_:]*. OpenTelemetry uses dots and slashes in metric names by convention — http.server.request.duration is a canonical OTel metric name. Prometheus 3.0 lifts this restriction and supports arbitrary UTF-8 characters in metric names and label names, which is the single most important compatibility change for OTel interoperability.

Remote Write 2.0

Remote Write 2.0 replaces the original protocol with a more efficient encoding based on protobuf, adds native histogram support in the wire format, and reduces bandwidth consumption significantly for large-scale deployments. If you are federating metrics to Thanos, Mimir, or Cortex, this matters for operational cost.

New UI

The Prometheus web UI has been completely rewritten. The new UI uses React, supports metric metadata exploration, and provides a significantly improved query-building experience. This is a quality-of-life improvement rather than an architectural change, but it reduces the dependency on external tools like Grafana for ad-hoc investigation.

Breaking Changes Summary

Prometheus 3.0 removes several features that were deprecated in 2.x. The most operationally significant are: removal of the --web.enable-admin-api deprecated flag path, removal of certain legacy storage format options, changes to default scrape timeouts, and stricter validation of configuration that was previously silently accepted. We cover a migration checklist later in this article.

The OTLP Receiver: How It Works and What It Accepts

The OTLP receiver in Prometheus 3.0 is implemented as an optional feature that must be explicitly enabled. Once enabled, it exposes an HTTP endpoint at /api/v1/otlp/v1/metrics that accepts protobuf-encoded OTLP ExportMetricsServiceRequest payloads — the same wire format used by the OpenTelemetry Collector’s OTLP exporter.

What It Accepts (and What It Does Not)

This is critical to understand before you architect around native OTLP ingestion: Prometheus 3.0 OTLP support is metrics-only. It does not accept traces or logs. OTLP is a unified protocol covering all three signals, but Prometheus is a metrics store — the receiver handles only the metrics portion of the OTLP specification.

Supported metric types in the OTLP receiver:

  • Gauge — maps directly to a Prometheus Gauge
  • Sum (monotonic) — maps to a Prometheus Counter
  • Sum (non-monotonic) — maps to a Prometheus Gauge
  • Histogram (explicit bucket) — maps to a Prometheus Histogram
  • ExponentialHistogram — maps to Prometheus Native Histograms (experimental since 2.40, carried forward in 3.0)
  • Summary — maps to a Prometheus Summary

Resource attributes from the OTLP payload — things like service.name, k8s.pod.name, cloud.region — are handled separately from data point attributes. By default, the identifying attributes service.name, service.namespace, and service.instance.id populate the job and instance labels, the remaining resource attributes are written to a dedicated target_info series, and the promote_resource_attributes list lets you copy selected attributes onto every time series as regular labels, balancing context against cardinality.
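
Attributes you choose not to promote are not lost: they can still be joined in at query time from target_info. A minimal PromQL sketch, assuming an underscore-translated counter name and a deployment_environment label on target_info (the actual names depend on your translation and promotion settings):

# Break down the request rate by an attribute that lives only on target_info
sum by (deployment_environment) (
    rate(http_server_request_duration_seconds_count[5m])
  * on (job, instance) group_left (deployment_environment)
    target_info
)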

Enabling the OTLP Receiver

Enabling native OTLP ingestion requires two things: a feature flag and a configuration block in prometheus.yml.

Start the Prometheus binary with the feature flag:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=otlp-write-receiver

Then add the OTLP receiver configuration to your prometheus.yml:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

otlp:
  # Promote these OTLP resource attributes to Prometheus labels
  promote_resource_attributes:
    - service.name
    - service.namespace
    - service.instance.id
    - k8s.namespace.name
    - k8s.pod.name
    - k8s.node.name
    - cloud.region
    - deployment.environment

With this configuration, Prometheus will listen on port 9090 (default) and accept OTLP metrics at http://<prometheus-host>:9090/api/v1/otlp/v1/metrics.

Resource Attribute Promotion Strategy

The promote_resource_attributes list deserves careful thought. OTLP carries rich resource-level context — every metric payload includes a ResourceMetrics object with attributes describing the source: service name, version, environment, Kubernetes pod, node, cluster, cloud provider details, and more. Prometheus labels are flat key-value pairs on each time series. Promoting too many resource attributes explodes cardinality; promoting too few loses important context.

A pragmatic starting list for Kubernetes deployments:

otlp:
  promote_resource_attributes:
    - service.name          # Critical: identifies the service
    - service.namespace     # Logical grouping
    - deployment.environment  # prod/staging/dev
    - k8s.namespace.name    # Kubernetes namespace
    - k8s.pod.name          # Pod-level cardinality — consider omitting in high-scale
    - k8s.node.name         # Useful for infrastructure correlation

Avoid blindly promoting k8s.pod.name at scale — in a cluster with thousands of short-lived pods, this creates significant cardinality pressure. Prefer service.name and service.namespace for most alerting use cases, reserving pod-level labels for debugging dashboards.

UTF-8 Metric Names: Why This Is the Real Game-Changer

To appreciate why UTF-8 metric name support matters so much, you need to understand the friction it eliminates. OpenTelemetry semantic conventions define metric names using dots as namespace separators. The canonical HTTP server duration metric is http.server.request.duration. The canonical database query duration is db.client.operation.duration. These names are standardized across languages and frameworks — your Go service and your Java service and your Python service all emit the same metric name when instrumented with OTel.

Prometheus 2.x could not store these names. The dots are illegal characters in Prometheus metric naming. Every OTel-to-Prometheus bridge — the OpenTelemetry Collector’s Prometheus exporter, prom-client compatibility layers, the older prometheusremotewrite exporter — had to translate these names, typically by replacing dots with underscores: http_server_request_duration.

This translation is lossy and creates multiple problems:

  • Name collisions: http.server.request_duration and http.server.request.duration both become http_server_request_duration
  • Dashboard breakage: Grafana dashboards built against OTel semantic conventions don’t work against translated Prometheus metrics without modification
  • Cross-signal correlation: Trace attributes use dot notation; when metric names differ, automated correlation tools lose the thread
  • Vendor lock-in pressure: Teams end up with separate naming conventions for “Prometheus metrics” vs “OTel metrics” and maintain both

Prometheus 3.0 with UTF-8 support stores http.server.request.duration natively. No translation. No collision. The metric name you instrument with is the metric name you query.

Enabling UTF-8 Metric Names

UTF-8 metric names require the utf8-names feature flag:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=utf8-names \
  --enable-feature=otlp-write-receiver

Once enabled, PromQL queries must use quoted metric names when the name contains characters outside the legacy character set:

# Legacy metric name — unquoted works fine
http_server_requests_total

# OTel metric name with dots — requires quoting in PromQL
{"http.server.request.duration"}

# Equivalent form using an explicit __name__ matcher
{__name__="http.server.request.duration"}

# Quoted metric names combine with ordinary label matchers
{"http.server.request.duration", service_name="api-gateway"}

The PromQL parser in Prometheus 3.0 has been updated to handle quoted metric names as a first-class construct. Grafana’s PromQL engine has also been updated to handle this syntax — verify your Grafana version (10.3+ has full support) before deploying.

OTel SDK to Prometheus 3.0 Directly: No Collector Required

For teams that only need to get application metrics into Prometheus, native OTLP ingestion enables a dramatically simpler architecture. Here’s what it looks like with different OTel SDKs.

Go (OpenTelemetry SDK)

package main

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
    "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initMetrics(ctx context.Context) (*metric.MeterProvider, error) {
    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName("my-api"),
            semconv.ServiceNamespace("platform"),
            semconv.DeploymentEnvironment("production"),
        ),
    )
    if err != nil {
        return nil, err
    }

    // Point directly at Prometheus 3.0 OTLP endpoint
    exporter, err := otlpmetrichttp.New(ctx,
        otlpmetrichttp.WithEndpoint("prometheus:9090"),
        otlpmetrichttp.WithURLPath("/api/v1/otlp/v1/metrics"),
        otlpmetrichttp.WithInsecure(), // Use WithTLSClientConfig for production
    )
    if err != nil {
        return nil, err
    }

    provider := metric.NewMeterProvider(
        metric.WithResource(res),
        metric.WithReader(
            metric.NewPeriodicReader(exporter,
                metric.WithInterval(30*time.Second),
            ),
        ),
    )

    otel.SetMeterProvider(provider)
    return provider, nil
}

Python (OpenTelemetry SDK)

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME, SERVICE_NAMESPACE

resource = Resource.create({
    SERVICE_NAME: "my-api",
    SERVICE_NAMESPACE: "platform",
    "deployment.environment": "production",
})

exporter = OTLPMetricExporter(
    endpoint="http://prometheus:9090/api/v1/otlp/v1/metrics",
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=30_000,
)

provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

# Use the meter
meter = metrics.get_meter("my-api")
request_counter = meter.create_counter(
    name="http.server.request.count",
    description="Total HTTP server requests",
    unit="1",
)
request_duration = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP server request duration",
    unit="s",
)

Java (OpenTelemetry SDK with Spring Boot)

# application.properties (Spring Boot with OTel auto-instrumentation)
otel.service.name=my-api
otel.resource.attributes=service.namespace=platform,deployment.environment=production

# Configure OTLP exporter to push directly to Prometheus
otel.metrics.exporter=otlp
otel.exporter.otlp.metrics.endpoint=http://prometheus:9090/api/v1/otlp/v1/metrics
otel.exporter.otlp.metrics.protocol=http/protobuf

# Export interval
otel.metric.export.interval=30000

With Spring Boot and the OTel Java agent, no code changes are required beyond configuration — the agent instruments your HTTP server, database clients, and messaging systems automatically and pushes metrics using the names defined in OTel semantic conventions.
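
As a point of reference, attaching the agent is a single JVM flag. The paths below are illustrative; otel.javaagent.configuration-file simply points the agent at a properties file containing the otel.* keys shown above:

# Attach the OTel Java agent at startup (paths are illustrative)
java -javaagent:/opt/otel/opentelemetry-javaagent.jar \
     -Dotel.javaagent.configuration-file=/opt/app/otel.properties \
     -jar my-api.jar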

OTel Collector to Prometheus 3.0: When You Need the Intermediary

Native OTLP ingestion is compelling, but the OpenTelemetry Collector remains relevant for a significant set of use cases. Understanding when each pattern is appropriate is the core architectural decision you will face when adopting Prometheus 3.0 in an OTel environment.

Pattern 1: OTel Collector as Fan-Out Gateway

When you need to send metrics to multiple backends simultaneously — Prometheus for alerting, a long-term store like Thanos for historical analysis, and a commercial observability platform for full-stack correlation — the OTel Collector handles fan-out efficiently. Applications push once to the Collector; the Collector distributes to all backends.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  # Push to Prometheus 3.0 via OTLP
  otlphttp/prometheus:
    endpoint: http://prometheus:9090/api/v1/otlp
    tls:
      insecure: true

  # Fan-out to Thanos via remote_write
  prometheusremotewrite/thanos:
    endpoint: http://thanos-receive:10908/api/v1/receive
    resource_to_telemetry_conversion:
      enabled: true

  # Fan-out to commercial backend
  otlp/datadog:
    endpoint: https://otel-intake.datadoghq.com
    headers:
      DD-API-KEY: "${DD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/prometheus, prometheusremotewrite/thanos, otlp/datadog]

Pattern 2: Collector for Metric Transformation

The OTel Collector’s transform processor and metricstransform processor allow you to reshape metrics before they reach Prometheus: rename labels, add static attributes, filter out high-cardinality series, aggregate metrics to reduce storage cost, or apply unit conversions. These operations are not available in Prometheus’s native OTLP receiver.

processors:
  transform/metrics:
    metric_statements:
      - context: datapoint
        statements:
          # Drop data point attributes whose keys match a pattern
          - delete_matching_keys(attributes, "internal.*")
          # Normalize environment label values
          - set(attributes["deployment.environment"], "prod") where attributes["deployment.environment"] == "production"

  filter/drop_debug:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - '.*\.debug\..*'
          - 'runtime\.go\.internal\..*'

  metricstransform:
    transforms:
      # Rename a metric to match your existing Prometheus naming convention
      - include: http.server.request.duration
        action: update
        new_name: http_server_request_duration_seconds

Pattern 3: Collector for Traces and Logs (Always Required)

If your architecture includes traces and logs alongside metrics — and in 2025 it almost certainly does — you need an OTel Collector regardless of what you do with metrics. Prometheus does not accept traces or logs. Jaeger, Tempo, and Loki all have their own ingestion protocols. The Collector is the universal routing layer for the three pillars of observability.

In this architecture, it is usually simpler to route all three signals through the Collector and let it push metrics to Prometheus via OTLP or remote_write, rather than splitting metrics to go directly and everything else through the Collector.
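
A sketch of that layout is shown below. The Tempo and Loki endpoints are illustrative, and the logs exporter assumes a Loki 3.x deployment exposing its native OTLP ingestion path — adjust to whatever backends you actually run:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/prometheus:   # metrics -> Prometheus 3.0 native OTLP endpoint
    endpoint: http://prometheus:9090/api/v1/otlp
  otlp/tempo:            # traces -> Grafana Tempo via OTLP gRPC
    endpoint: tempo:4317
    tls:
      insecure: true
  otlphttp/loki:         # logs -> Loki's native OTLP ingestion path (Loki 3.x)
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]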

When to Use Native OTLP vs. OTel Collector: Decision Framework

Scenario | Native OTLP | OTel Collector
Single metrics backend (Prometheus only) | Preferred | Overkill
Multiple metrics backends | Not sufficient | Required
Traces + logs in scope | Not applicable | Required
Metric transformation/filtering needed | Not supported | Required
Simple Kubernetes-native deployment | Preferred | Additional complexity
Air-gapped / constrained environments | Preferred (fewer components) | Consider carefully
Mixed OTel + legacy Prometheus targets | Works alongside scraping | Can normalize naming
High-volume, need batching/buffering | Limited control | Preferred

The pragmatic recommendation for most platform engineering teams: if you are already running the OTel Collector (and you should be if traces are in scope), continue routing metrics through it. Use the Collector’s otlphttp exporter to push to Prometheus 3.0. Reserve the direct SDK-to-Prometheus pattern for simple services where the Collector would be the only reason to add complexity.

Remote Write 2.0: What Changes for Existing Setups

Remote Write 2.0 is a significant protocol upgrade with real operational implications for teams using Prometheus as a metrics source for long-term storage systems like Thanos, Mimir, VictoriaMetrics, or Cortex.

Key Protocol Changes

  • More compact protobuf encoding — the Remote Write 2.0 message interns repeated strings in a shared symbol table and remains snappy-compressed, typically a 50-70% reduction in wire size for large metric batches
  • Native histogram support in the wire format — exponential histograms can now be forwarded without converting to classic histograms, preserving full resolution
  • Metadata forwarding — metric type and unit information is now transmitted alongside samples, enabling better downstream processing
  • Created timestamps — the timestamp at which a counter was created is forwarded, enabling more accurate rate calculations across restarts

Configuring Remote Write 2.0

# prometheus.yml
remote_write:
  - url: "http://thanos-receive:10908/api/v1/receive"
    # Remote Write 2.0 is negotiated automatically with compatible receivers
    # Forward native (exponential) histograms in the remote write payload:
    send_native_histograms: true
    metadata_config:
      send: true
      send_interval: 1m
    queue_config:
      capacity: 10000
      max_shards: 200
      max_samples_per_send: 2000
      batch_send_deadline: 5s

Remote Write 2.0 uses protocol content negotiation — Prometheus 3.0 will attempt RW2.0 first and fall back to RW1.0 if the receiver does not support it. This means upgrades are generally backward-compatible. Verify that your receiving system (Thanos Receive 0.35+, Mimir 2.12+, VictoriaMetrics 1.98+) supports RW2.0 before expecting the benefits.

Migration from Prometheus 2.x: Breaking Changes Checklist

Upgrading from Prometheus 2.x to 3.0 requires attention to several breaking changes. This checklist covers the operationally significant ones for teams running production Prometheus deployments.

Configuration Changes

  • Changed: query.lookback-delta default — the default changed from 5 minutes to match the scrape interval. Queries that relied on the 5m default may return different results. Audit alerting rules that use instant queries on counters.
  • Changed: remote_write queue options — the semantics of remote_write[].queue_config.capacity changed. Review and update queue configurations.
  • Removed: storage.tsdb.allow-overlapping-blocks flag — overlapping blocks handling is now automatic. Remove this flag from your startup scripts.
  • Changed: scrape defaults — Prometheus 3.0 negotiates the OpenMetrics format with targets that support it (enabling native histograms but potentially surfacing parsing differences), and the former no-default-scrape-port behavior is now the default, so ports are no longer appended automatically to target addresses. Review scrape configs that relied on the old defaults.
  • Agent mode changes — if using Prometheus Agent mode, review the updated configuration options for WAL management.

PromQL Changes

  • Stricter parsing — some previously accepted but technically invalid PromQL expressions now fail. Run your alerting rules through promtool check rules against a Prometheus 3.0 binary before cutover.
  • Native histogram functions — histogram_fraction() is new, and histogram_quantile() behaves differently when applied to native histograms. Existing dashboard queries using histogram_quantile() on classic histograms continue to work unchanged (see the comparison below).
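
For illustration, here is the same quantile computed against a classic histogram and a native histogram (metric names are illustrative):

# Classic histogram: aggregate the per-bucket series, keeping the le label
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Native histogram: apply the function directly to the histogram series
histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m])))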

Storage Compatibility

Prometheus 3.0 can read existing 2.x TSDB data. The upgrade path does not require a data migration. However, Prometheus 2.x cannot read data blocks written by 3.0 (downgrade is not supported without data loss after any writes have occurred). Take a snapshot before upgrading if you need rollback capability:

# Take a TSDB snapshot before upgrading
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot

# Verify the snapshot exists
ls /prometheus/snapshots/

Pre-Upgrade Validation Steps

# 1. Validate configuration against Prometheus 3.0
docker run --rm --entrypoint promtool \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:v3.0.0 \
  check config /etc/prometheus/prometheus.yml

# 2. Validate alerting rules
docker run --rm --entrypoint sh \
  -v $(pwd)/rules:/etc/prometheus/rules \
  prom/prometheus:v3.0.0 \
  -c 'promtool check rules /etc/prometheus/rules/*.yml'

# 3. Run in parallel (shadow mode) before full cutover
# Deploy Prometheus 3.0 alongside 2.x, scraping the same targets
# Compare query results between versions using promtool query range
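
For the comparison step itself, promtool can run the same expression against both servers; the hostnames and query below are placeholders:

promtool query instant http://prometheus-2x:9090 'sum(rate(http_requests_total[5m]))'
promtool query instant http://prometheus-3x:9090 'sum(rate(http_requests_total[5m]))'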

Practical Kubernetes Deployment Example

Here is a production-ready Kubernetes deployment of Prometheus 3.0 with OTLP ingestion enabled, suitable as a starting point for platform engineering teams.

Prometheus 3.0 ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: production
        region: eu-west-1

    otlp:
      promote_resource_attributes:
        - service.name
        - service.namespace
        - deployment.environment
        - k8s.namespace.name
        - k8s.pod.name

    rule_files:
      - /etc/prometheus/rules/*.yml

    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)

    remote_write:
      - url: http://thanos-receive.monitoring.svc.cluster.local:10908/api/v1/receive
        send_native_histograms: true
        metadata_config:
          send: true

Prometheus 3.0 Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:v3.0.0
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus/data
            - --storage.tsdb.retention.time=15d
            - --web.enable-lifecycle
            - --web.enable-admin-api
            - --enable-feature=otlp-write-receiver
            - --enable-feature=utf8-names
            - --enable-feature=native-histograms
          ports:
            - name: http
              containerPort: 9090
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus/data
          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 8Gi
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: http
            initialDelaySeconds: 30
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /-/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
      volumes:
        - name: config
          configMap:
            name: prometheus-config
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-data
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - name: http
      port: 9090
      targetPort: http
  type: ClusterIP

Configuring Applications to Push OTLP

With this deployment, any application in the cluster can push OTLP metrics by setting the following environment variables (works with any OTel SDK supporting OTLP HTTP):

env:
  - name: OTEL_SERVICE_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['app']
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_METRICS_EXPORTER
    value: "otlp"
  - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
    value: "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/otlp/v1/metrics"
  - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
    value: "http/protobuf"
  - name: OTEL_METRIC_EXPORT_INTERVAL
    value: "30000"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.namespace=platform,deployment.environment=production,k8s.namespace.name=$(NAMESPACE)"

This approach works particularly well in environments using the OTel Operator for Kubernetes, where the Instrumentation CRD can inject these environment variables automatically into pods based on namespace or pod label selectors — zero-touch instrumentation with native Prometheus storage.
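
A minimal sketch of such an Instrumentation resource, assuming the OpenTelemetry Operator is installed (field names may vary slightly between operator versions):

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: monitoring
spec:
  env:
    - name: OTEL_METRICS_EXPORTER
      value: otlp
    - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
      value: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/otlp/v1/metrics
    - name: OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
      value: http/protobuf
    - name: OTEL_METRIC_EXPORT_INTERVAL
      value: "30000"

Pods opt in through the operator's instrumentation.opentelemetry.io/inject-* annotations, and the environment variables are injected at admission time.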

Frequently Asked Questions

Can I use Prometheus 3.0 OTLP ingestion for traces and logs?

No. The Prometheus 3.0 OTLP receiver handles only metrics. Prometheus is a metrics store — it has no data model for traces or logs. For traces, you need a backend like Jaeger or Grafana Tempo. For logs, you need Loki, Elasticsearch, or a similar system. The OTel Collector is the appropriate routing layer when you need to send all three signals to their respective backends from a single application-side push endpoint.

Does the kube-prometheus-stack Helm chart support Prometheus 3.0?

Yes, with caveats. The kube-prometheus-stack chart updated its Prometheus image to 3.0 starting with chart version 66.0.0. However, some bundled recording rules and alerting rules may need adjustment for the PromQL changes and default behavioral differences. The Prometheus Operator itself (version 0.78+) has been updated to support the new configuration options including the otlp configuration block. If you are managing Prometheus via the Operator, you enable the feature flags through the Prometheus CRD (spec.enableFeatures or spec.additionalArgs) and supply the otlp settings through the Operator's corresponding OTLP configuration support on the CRD.

What happens to existing Prometheus 2.x metric names when I enable UTF-8 support?

Existing metrics with underscore-based names continue to work exactly as before. Enabling UTF-8 support is purely additive — it allows the storage and querying of metric names containing dots and other UTF-8 characters, but it does not rename or modify existing metrics. Your existing dashboards, alerting rules, and recording rules continue to function without modification. Only metrics ingested via OTLP (or exposed by exporters using OTel naming conventions) will use dot-separated names.

How does native OTLP ingestion affect Prometheus’s pull model?

It coexists with it. Prometheus 3.0 continues to scrape targets via the pull model on the same 9090 port. The OTLP endpoint is an additional ingestion path, not a replacement for scraping. You can have a Prometheus instance simultaneously scraping Kubernetes pods via service discovery and receiving OTLP push metrics from applications — both are stored in the same TSDB and queryable via the same PromQL interface. This hybrid approach is common during migrations, where legacy components are scraped and new OTel-instrumented services push via OTLP.

Is the Prometheus 3.0 OTLP receiver suitable for high-volume production workloads?

For moderate volumes, yes. The OTLP receiver is synchronous — the HTTP request completes only after the samples are written to the WAL. Under very high ingestion rates (hundreds of thousands of samples per second), this can create back-pressure that affects application latency. The OTel Collector handles this better through internal buffering, retry queues, and batch processing. For high-volume scenarios, the recommended pattern is: applications push to OTel Collector (which acknowledges immediately and buffers), Collector pushes to Prometheus via OTLP or remote_write in optimized batches. For the majority of Kubernetes workloads — dozens to hundreds of services with typical metric cardinality — the native OTLP receiver performs well without an intermediary.

Helm Values JSON Schema: Validate Your values.yaml Before It Breaks Production

Helm is the de facto package manager for Kubernetes, and values.yaml is its primary interface for configuration. Yet for years, that interface has been completely unvalidated by default — a free-form YAML file where any key can be anything, where typos silently pass through, and where misconfigured deployments only reveal themselves when pods fail to start in production. The values.schema.json file changes that equation entirely. This article explains why schema validation matters, how to implement it properly, and how to integrate it into a modern CI/CD pipeline.

The Problem: Silent Failures in Production

Consider a platform team managing dozens of Helm releases across multiple clusters. A developer submits a values override file with replicaCount: "3" instead of replicaCount: 3 — a string where an integer is expected. Or they set image.pullPolicy: Allways with a typo. Or they omit a required secret reference that the application needs to boot. In all three cases, Helm without schema validation will happily render the templates, produce Kubernetes manifests, and apply them to the cluster. The failure surfaces later — sometimes much later — as a CrashLoopBackOff, an ImagePullBackOff, or a subtle runtime error that takes hours to debug.

This is not a hypothetical scenario. It is the daily reality for teams operating at scale without values validation. The root cause is architectural: Helm templates use Go’s text/template engine, which is weakly typed and permissive by design. A template that does {{ .Values.replicaCount }} will render whether the value is an integer, a string, or even a boolean. The resulting Kubernetes manifest may be invalid, but that error only surfaces when the Kubernetes API server rejects it — or worse, accepts it but interprets it differently than intended.

The consequences compound at scale. When a chart is used by multiple teams, the lack of a formal contract for acceptable values means every consumer has to read through template files and comments to understand what inputs are valid. There is no machine-readable specification. There is no IDE support. There is no guardrail. The only documentation is whatever the chart author happened to write in comments inside values.yaml — and comments do not stop a CI pipeline from shipping a broken deployment.

What Is values.schema.json

Since Helm 3.0.0, released in November 2019, Helm supports an optional values.schema.json file at the root of a chart directory — the same level as Chart.yaml and values.yaml. This file is a JSON Schema draft-07 document that formally describes the structure, types, constraints, and required fields for the chart’s values.

When this file is present, Helm automatically validates the merged values (defaults from values.yaml merged with any user-supplied overrides) against the schema at multiple points: during helm install, helm upgrade, helm template, and helm lint. If validation fails, Helm refuses to proceed and prints a human-readable error message identifying exactly which value failed and why. This transforms a class of runtime failures into build-time failures — the correct direction for any production system.

The choice of JSON Schema draft-07 specifically is worth noting. Draft-07 is widely supported by tooling, including the Red Hat YAML extension for VS Code, JetBrains IDEs, and most JSON Schema validators. It introduced the if/then/else conditional keywords that are particularly useful for Helm charts. More recent drafts (2019-09, 2020-12) offer additional features but have less universal tooling support, making draft-07 the pragmatic choice for chart authors today.

Chart Directory Structure

my-app/
├── Chart.yaml
├── values.yaml
├── values.schema.json      ← lives here
├── charts/
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    └── _helpers.tpl

The schema file is included when a chart is packaged with helm package and distributed through chart repositories. Consumers of the chart get schema validation automatically without any additional configuration — the guardrails ship with the chart itself.

How Helm Uses the Schema

Helm’s validation behavior is straightforward but has some nuances worth understanding. When Helm processes a release, it first merges all value sources in order of increasing precedence: chart defaults (values.yaml), parent chart values, -f value files, and finally --set flags. The merged result is then validated against the schema as a single operation.

This means the schema validates the effective values, not each source in isolation. A required field that has a default in values.yaml will pass validation even when not specified by the user, because the merged result includes the default. This is the correct behavior — it validates what will actually be used during rendering.
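
For example, in a hypothetical install where a values file and a --set flag override the chart defaults, the schema is evaluated once against the final merged result:

# defaults from values.yaml <- staging-values.yaml <- --set (highest precedence)
helm install my-release ./my-app \
  -f staging-values.yaml \
  --set image.tag=2.1.0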

The validation happens before template rendering. If schema validation fails, Helm exits with a non-zero status code and prints all validation errors. The error output is structured and actionable:

$ helm install my-release ./my-app --set replicaCount=abc

Error: values don't meet the specifications of the schema(s) in the following chart(s):
my-app:
- replicaCount: Invalid type. Expected: integer, given: string

For helm lint, which is typically used in CI pipelines without installing to a cluster, schema validation also runs. This makes helm lint a powerful pre-deployment gate when schema files are present.

IDE Benefits: Autocompletion and Inline Validation

Beyond Helm’s own validation, values.schema.json unlocks IDE support that significantly improves the developer experience when working with values files. The Red Hat YAML extension for VS Code can reference a JSON Schema file to provide autocompletion, type checking, and inline error highlighting for YAML files.

To enable this, add a yaml.schemas configuration to your VS Code workspace settings or the user settings file:

// .vscode/settings.json
{
  "yaml.schemas": {
    "./my-app/values.schema.json": "./my-app/values.yaml"
  }
}

With this configuration, editing values.yaml in VS Code will show autocompletion for defined keys, inline errors for type mismatches, and hover documentation pulled from the description fields in your schema. For platform teams maintaining internal Helm charts, this transforms the chart into a self-documenting, IDE-aware configuration interface — without any additional tooling investment.

JetBrains IDEs (IntelliJ IDEA, GoLand, etc.) support JSON Schema associations through the Languages & Frameworks > Schemas and DTDs > JSON Schema Mappings settings panel, providing equivalent functionality for teams using those tools.

Building the Schema: A Practical Guide

Let’s build a complete, realistic example. Start with a typical values.yaml for a web application chart:

# values.yaml
replicaCount: 2

image:
  repository: myorg/my-app
  tag: "1.0.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: false
  hostname: ""
  tls: false

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80

config:
  logLevel: info
  databaseUrl: ""

nodeSelector: {}
tolerations: []
affinity: {}

Now the full values.schema.json that validates this structure:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "my-app Helm Chart Values",
  "description": "Configuration values for the my-app Helm chart",
  "type": "object",
  "additionalProperties": false,
  "required": ["image", "service"],
  "$defs": {
    "resourceQuantity": {
      "type": "string",
      "pattern": "^[0-9]+(\\.[0-9]+)?(m|Ki|Mi|Gi|Ti|Pi|Ei|k|M|G|T|P|E)?$",
      "description": "A Kubernetes resource quantity (e.g. 100m, 128Mi, 1Gi)"
    }
  },
  "properties": {
    "replicaCount": {
      "type": "integer",
      "minimum": 0,
      "maximum": 50,
      "default": 2,
      "description": "Number of pod replicas. Set to 0 to scale down."
    },
    "image": {
      "type": "object",
      "additionalProperties": false,
      "required": ["repository", "tag"],
      "description": "Container image configuration",
      "properties": {
        "repository": {
          "type": "string",
          "minLength": 1,
          "description": "Container image repository"
        },
        "tag": {
          "type": "string",
          "pattern": "^[a-zA-Z0-9._-]+$",
          "minLength": 1,
          "description": "Image tag. Avoid using 'latest' in production."
        },
        "pullPolicy": {
          "type": "string",
          "enum": ["Always", "IfNotPresent", "Never"],
          "default": "IfNotPresent",
          "description": "Kubernetes imagePullPolicy"
        }
      }
    },
    "service": {
      "type": "object",
      "additionalProperties": false,
      "required": ["type", "port"],
      "properties": {
        "type": {
          "type": "string",
          "enum": ["ClusterIP", "NodePort", "LoadBalancer", "ExternalName"],
          "description": "Kubernetes Service type"
        },
        "port": {
          "type": "integer",
          "minimum": 1,
          "maximum": 65535,
          "description": "Service port"
        }
      }
    },
    "ingress": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "enabled": {
          "type": "boolean",
          "default": false
        },
        "hostname": {
          "type": "string",
          "description": "Ingress hostname. Required when ingress.enabled is true."
        },
        "tls": {
          "type": "boolean",
          "default": false,
          "description": "Enable TLS for the ingress"
        }
      },
      "if": {
        "properties": {
          "enabled": { "const": true }
        },
        "required": ["enabled"]
      },
      "then": {
        "required": ["hostname"],
        "properties": {
          "hostname": {
            "minLength": 1,
            "pattern": "^[a-zA-Z0-9]([a-zA-Z0-9\\-\\.]+)?[a-zA-Z0-9]$"
          }
        }
      }
    },
    "resources": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "requests": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "cpu": { "$ref": "#/$defs/resourceQuantity" },
            "memory": { "$ref": "#/$defs/resourceQuantity" }
          }
        },
        "limits": {
          "type": "object",
          "additionalProperties": false,
          "properties": {
            "cpu": { "$ref": "#/$defs/resourceQuantity" },
            "memory": { "$ref": "#/$defs/resourceQuantity" }
          }
        }
      }
    },
    "autoscaling": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "enabled": {
          "type": "boolean",
          "default": false
        },
        "minReplicas": {
          "type": "integer",
          "minimum": 1
        },
        "maxReplicas": {
          "type": "integer",
          "minimum": 1,
          "maximum": 100
        },
        "targetCPUUtilizationPercentage": {
          "type": "integer",
          "minimum": 1,
          "maximum": 100
        }
      },
      "if": {
        "properties": {
          "enabled": { "const": true }
        },
        "required": ["enabled"]
      },
      "then": {
        "required": ["minReplicas", "maxReplicas"]
      }
    },
    "config": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "logLevel": {
          "type": "string",
          "enum": ["debug", "info", "warn", "error"],
          "default": "info",
          "description": "Application log level"
        },
        "databaseUrl": {
          "type": "string",
          "description": "Database connection URL"
        }
      }
    },
    "nodeSelector": {
      "type": "object",
      "description": "Node selector labels for pod scheduling"
    },
    "tolerations": {
      "type": "array",
      "description": "Pod tolerations"
    },
    "affinity": {
      "type": "object",
      "description": "Pod affinity rules"
    }
  }
}

Key Schema Patterns Explained

additionalProperties: false

This is arguably the most important pattern in a Helm schema. Without it, unknown keys pass validation silently — which defeats much of the purpose. With "additionalProperties": false, any key not listed in properties causes a validation error. This catches typos like repicaCount instead of replicaCount, which would otherwise silently use the default value and leave the developer wondering why their override had no effect.

Apply it at every nested object level, not just the root. A typo inside image: or resources: is just as dangerous as one at the top level.

$defs for Reusable Definitions

The $defs keyword (formalized in draft 2019-09; draft-07 itself defines the equivalent definitions keyword, and most validators resolve $ref pointers to either location) provides a namespace for reusable schema fragments. In the example above, resourceQuantity is defined once and referenced via $ref in both requests and limits. This avoids duplication and ensures consistent validation logic across related fields.

For larger charts, $defs becomes essential. Common patterns include reusable schemas for image configurations, resource requirements, probe configurations, and environment variable maps.
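
As one hypothetical example, a shared probe definition referenced by both liveness and readiness settings might look like this fragment:

"$defs": {
  "probe": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "enabled": { "type": "boolean", "default": true },
      "initialDelaySeconds": { "type": "integer", "minimum": 0 },
      "periodSeconds": { "type": "integer", "minimum": 1 },
      "timeoutSeconds": { "type": "integer", "minimum": 1 }
    }
  }
},
"properties": {
  "livenessProbe": { "$ref": "#/$defs/probe" },
  "readinessProbe": { "$ref": "#/$defs/probe" }
}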

Conditional Validation with if/then/else

The if/then/else construct in JSON Schema draft-07 is particularly powerful for Helm charts, where many values are conditional on a feature toggle. The ingress example above demonstrates this: when ingress.enabled is true, the hostname field becomes required and must match a valid hostname pattern. When ingress is disabled, the hostname can be empty or omitted entirely.

This pattern can be extended for more complex scenarios. For example, flagging that when autoscaling.enabled is true, the standalone replicaCount is ignored (since the HPA controls the replica count). A hard prohibition would conflict with the replicaCount default in values.yaml after merging, so the schema below only annotates the field rather than forbidding it:

{
  "if": {
    "properties": {
      "autoscaling": {
        "properties": {
          "enabled": { "const": true }
        },
        "required": ["enabled"]
      }
    }
  },
  "then": {
    "properties": {
      "replicaCount": {
        "description": "replicaCount is ignored when autoscaling is enabled"
      }
    }
  }
}

Pattern Validation for Image Tags

The image tag field is a common source of production issues. Teams accidentally deploy with latest, which is non-deterministic and makes rollbacks unreliable. A pattern constraint can enforce semantic versioning or at least ban the latest tag in production charts:

"tag": {
  "type": "string",
  "not": {
    "enum": ["latest", ""]
  },
  "pattern": "^[0-9]+\\.[0-9]+\\.[0-9]+",
  "description": "Semantic version tag required. 'latest' is not permitted."
}

This enforces that image tags start with a semantic version number, immediately rejecting latest, empty strings, or arbitrary branch names that would produce non-reproducible deployments.

Enum for Controlled Vocabularies

Fields with a fixed set of valid values — Kubernetes service types, image pull policies, log levels — should use enum. This is more precise than a pattern and produces clearer error messages. It also enables IDE autocompletion to show exactly the valid options as a pick-list, rather than requiring the developer to remember or look up acceptable values.

CI/CD Integration

GitHub Actions

The most direct integration point is helm lint, which runs schema validation as part of its checks. A minimal GitHub Actions workflow that validates a chart on every pull request looks like this:

# .github/workflows/helm-lint.yaml
name: Helm Lint

on:
  pull_request:
    paths:
      - 'charts/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Helm
        uses: azure/setup-helm@v4
        with:
          version: '3.14.0'

      - name: Lint chart with default values
        run: helm lint charts/my-app

      - name: Lint chart with staging values
        run: helm lint charts/my-app -f charts/my-app/ci/staging-values.yaml

      - name: Lint chart with production values
        run: helm lint charts/my-app -f charts/my-app/ci/production-values.yaml

      - name: Validate template rendering
        run: |
          helm template my-app charts/my-app \
            -f charts/my-app/ci/production-values.yaml \
            --debug > /dev/null

The ci/ directory convention (values files specifically for CI testing) is a pattern from the chart-testing tool and works well for validating multiple realistic value combinations, not just the defaults.

For teams using the ct (chart-testing) CLI tool from the Helm project, schema validation is automatically included in the ct lint command, which also handles chart versioning checks and YAML linting:

      - name: Set up chart-testing
        uses: helm/chart-testing-action@v2.6.1

      - name: Run chart-testing lint
        run: ct lint --target-branch ${{ github.event.repository.default_branch }}

Pre-commit Hooks

For local development, pre-commit hooks catch issues before code is even pushed. The pre-commit framework makes this straightforward:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gruntwork-io/pre-commit
    rev: v0.1.23
    hooks:
      - id: helmlint

  - repo: local
    hooks:
      - id: helm-schema-validate
        name: Helm Schema Validation
        language: script
        entry: scripts/validate-helm-schemas.sh
        files: ^charts/.*values.*\.yaml$

#!/usr/bin/env bash
# scripts/validate-helm-schemas.sh
set -euo pipefail

for chart_dir in charts/*/; do
  if [[ -f "${chart_dir}/values.schema.json" ]]; then
    echo "Linting ${chart_dir}..."
    helm lint "${chart_dir}" --strict
  fi
done

ArgoCD and Flux Integration

Both ArgoCD and Flux (Helm Controller) invoke helm template internally when reconciling Helm releases. Since helm template runs schema validation when a schema file is present, any invalid values in a HelmRelease or ArgoCD Application manifest will cause the reconciliation to fail with a clear error message — visible in the controller logs and surfaced as a degraded resource status. No additional configuration is required; schema validation is automatic.
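
For instance, a Flux HelmRelease like the sketch below (API version and chart source are illustrative) has its values block validated against the chart's schema on every reconciliation; invalid values leave the release in a failed state rather than reaching the cluster:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
  namespace: apps
spec:
  interval: 10m
  chart:
    spec:
      chart: my-app
      version: "1.4.x"
      sourceRef:
        kind: HelmRepository
        name: internal-charts
  values:
    replicaCount: 3          # validated against values.schema.json during reconciliation
    image:
      repository: myorg/my-app
      tag: "1.4.2"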

Generating Schemas from Existing Charts

For charts that already have a well-structured values.yaml, writing a schema from scratch is time-consuming but not starting from zero. Several tools can generate a draft schema that you then refine:

  • helm-values-schema-json — a Helm plugin (helm plugin install https://github.com/losisin/helm-values-schema-json) that introspects values.yaml and generates a draft schema with inferred types. Run with helm schema-gen values.yaml.
  • json-schema-generator online tools — paste your values as JSON (convert YAML to JSON first) and get a draft schema back.
  • Manually from scratch — for new charts, writing the schema alongside the values file from the beginning is the most accurate approach and requires no extra tooling.

Generated schemas are always starting points. They infer types from existing values but cannot know about intended constraints, enums, patterns, required fields in conditional cases, or additionalProperties: false at nested levels. Manual review and refinement is always necessary.

Common Mistakes and How to Avoid Them

Mistake | Symptom | Fix
Missing additionalProperties: false | Typos in key names pass validation silently | Add it at every object level, including nested objects
Schema only at root level | Nested typos go undetected | Apply additionalProperties: false recursively
Not including defaults in schema | IDE shows fields as required when they are optional | Add default to all optional fields
Overly strict patterns blocking valid values | Legitimate deployments fail schema validation | Test patterns against your real value space before shipping
Mixing $defs and definitions | $defs is draft 2019-09+ terminology; definitions is the draft-07 keyword | Most validators resolve $ref to either; pick one and use it consistently
Schema not committed to the chart repo | Consumers get no validation when pulling from repository | Always commit values.schema.json alongside the chart
Validating subchart values through parent schema | Schema errors for subchart values the parent doesn't own | Do not attempt to validate subchart values in parent schema; each chart owns its own schema

The Null Value Problem

A subtle but common issue: in YAML, an unset key with no value (key:) resolves to null, not an empty string or zero. If your schema defines a field as "type": "string", a null value will fail validation. To handle optional fields that users might leave blank, use a type union:

"databaseUrl": {
  "type": ["string", "null"],
  "description": "Database connection URL. Leave null to use the default."
}

Alternatively, ensure your values.yaml defaults use empty strings ("") rather than bare keys, and document that convention for chart consumers.

Schema Drift

As charts evolve, new values get added to values.yaml without corresponding updates to values.schema.json. Over time the schema becomes stale and provides partial coverage. The fix is procedural: treat schema updates as part of the definition of done for any PR that modifies values. Code review should include checking that new or modified values have corresponding schema entries.
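
A lightweight CI guard can also flag drift mechanically. The sketch below (assuming yq v4 and jq are available) compares the top-level keys in values.yaml against the properties declared in the schema:

#!/usr/bin/env bash
# Hypothetical drift check: top-level values keys vs. schema properties
set -euo pipefail

chart_dir="${1:-charts/my-app}"

values_keys=$(yq eval 'keys | .[]' "${chart_dir}/values.yaml" | sort)
schema_keys=$(jq -r '.properties | keys[]' "${chart_dir}/values.schema.json" | sort)

if ! diff <(echo "${values_keys}") <(echo "${schema_keys}"); then
  echo "Schema drift: values.yaml and values.schema.json disagree on top-level keys" >&2
  exit 1
fi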

Frequently Asked Questions

Does values.schema.json validate subchart values?

No. Each chart in a dependency relationship validates only its own values against its own schema. If chart A depends on chart B, and chart B has a schema, chart B’s schema validates the values under the b: key in chart A’s values.yaml — but only when processed in the context of chart B itself. Chart A’s schema should not attempt to describe chart B’s values structure. This is by design: it maintains loose coupling between charts and allows subcharts to evolve their schemas independently.

Can I use JSON Schema draft-2020-12 instead of draft-07?

Technically, Helm does not strictly enforce which draft version you use — it uses the Go library github.com/xeipuuv/gojsonschema, which supports draft-04 through draft-07. Using newer draft keywords that are not supported by this library may cause them to be silently ignored rather than throwing an error. For IDE support, draft-07 has the broadest compatibility. If you need features from newer drafts (like unevaluatedProperties from 2020-12), test carefully to confirm they are enforced by Helm’s validator and not silently skipped.

How do I handle values that differ between environments without schema conflicts?

The schema should describe all valid values across all environments. Use enum to enumerate all valid values for a field, and use if/then/else for constraints that only apply in certain configurations. The schema is a contract for what the chart accepts, not a policy for what a specific environment should use. Environment-specific policies (such as “production must use a minimum of 3 replicas”) are better enforced at a higher level — through admission controllers like OPA Gatekeeper or Kyverno — rather than in the chart schema itself.

Does schema validation run when using helm template for dry runs?

Yes. helm template runs schema validation before rendering templates. This makes it useful as a validation step in CI pipelines even without a live cluster: helm template release-name ./chart -f values-override.yaml will fail with schema errors if the values are invalid, and will output the rendered manifests if they are valid. Piping the output to kubectl apply --dry-run=client -f - adds an additional layer of Kubernetes API validation for a thorough offline check.
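
Putting the two steps together in a pipeline-friendly one-liner:

helm template my-release ./my-app -f values-override.yaml \
  | kubectl apply --dry-run=client -f -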

Should I add values.schema.json to charts I don’t maintain (upstream charts)?

For upstream charts you consume but do not maintain (such as Bitnami charts, ingress-nginx, cert-manager), the recommended approach is to maintain a separate JSON Schema file in your own GitOps repository that validates your specific values overlay files. Tools like jsonschema (Python) or ajv (Node.js) can validate a YAML/JSON values file against a schema in CI without Helm being involved. This gives you schema validation for your environment-specific overrides without needing to modify upstream chart sources.
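
A minimal sketch of that approach, assuming yq v4, the ajv-cli npm package, and a schema file you maintain yourself for the upstream chart's values:

# Convert the environment-specific overrides to JSON, then validate them
yq -o=json eval '.' prod-overrides.yaml > prod-overrides.json
ajv validate -s upstream-values.schema.json -d prod-overrides.json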

After NGINX Ingress Controller: Alternatives and Migration Guide

After NGINX Ingress Controller: Alternatives and Migration Guide

If you manage Kubernetes clusters in production, the last 18 months have been uncomfortable. Two of the most widely deployed NGINX-based Ingress Controllers have faced critical security vulnerabilities, deprecation announcements, and shifting maintenance responsibilities — all while the Kubernetes project accelerates its push toward a new traffic management standard. This is not a drill. Teams running ingress-nginx or the F5/NGINX Ingress Controller need a clear picture of what changed, what it means for their clusters, and what their realistic options are going forward.

First, Clear the Confusion: There Are Two NGINX Ingress Controllers

One of the most persistent sources of confusion in the Kubernetes networking space is that there are two completely different projects both called “NGINX Ingress Controller,” maintained by different organizations, with different architectures and different licensing.

ingress-nginx (kubernetes/ingress-nginx)

This is the community-maintained controller under the Kubernetes project umbrella, hosted at github.com/kubernetes/ingress-nginx. It uses the open-source NGINX as its data plane, configured via Lua scripting and dynamically generated nginx.conf files. This is the controller most teams end up with when they follow the official Kubernetes documentation or install from the Helm chart referenced in the ingress guide. It is free, open-source, and until recently was considered the default choice.

NGINX Ingress Controller (nginxinc/kubernetes-ingress)

This is the commercial and open-source controller maintained by F5/NGINX, hosted at github.com/nginxinc/kubernetes-ingress. It also supports NGINX Plus (the commercial version with enhanced features like active health checks, JWT authentication, and advanced load balancing). The architecture is different — it uses native NGINX APIs rather than the Lua-heavy approach — and it targets enterprise customers looking for support contracts and advanced capabilities.

These two controllers are not interchangeable. Configuration annotations differ, Helm chart values differ, and behavior under edge cases differs substantially. Understanding which one your cluster runs is the necessary starting point for any decision about migration.

# Check which NGINX IC you are actually running
kubectl get pods -n ingress-nginx -o jsonpath='{.items[*].spec.containers[*].image}'

# Community controller image looks like:
# registry.k8s.io/ingress-nginx/controller:v1.x.x

# F5/NGINX controller image looks like:
# nginx/nginx-ingress:x.x.x  or  private-registry.nginx.com/nginx-ic/nginx-plus-ingress:x.x.x

What Actually Happened: A Timeline of Disruption

The ingress-nginx CVEs (2025)

In March 2025, security researchers disclosed a set of critical vulnerabilities in ingress-nginx under the collective name IngressNightmare (CVE-2025-1097, CVE-2025-1098, CVE-2025-1974, CVE-2025-24514). The most severe of these, rated CVSS 9.8, allowed unauthenticated remote code execution against the ingress-nginx admission webhook. An attacker with network access to the admission controller could craft a malicious Ingress object to inject arbitrary NGINX configuration, ultimately achieving code execution in the controller pod — which in many clusters runs with elevated permissions and access to service account tokens across namespaces.

The vulnerabilities affected the vast majority of ingress-nginx deployments in the wild. Wiz Research, which discovered and disclosed the issues, estimated that approximately 43% of cloud environments were exposed. Patches were released in versions 1.11.5 and 1.12.1, but the incident forced uncomfortable questions about the controller’s security posture and the architecture decisions (particularly the admission webhook design) that made it possible.

Maintenance Concerns in ingress-nginx

Beyond the CVEs, the ingress-nginx project has faced ongoing concerns about maintainer bandwidth. The project is maintained by a small group of volunteers and relies heavily on community contributions. Issue response times slowed, pull requests aged, and the pace of feature development declined relative to alternatives. For a component as critical as the cluster ingress layer, this created legitimate concern about long-term sustainability without corporate backing or broader contributor growth.

F5/NGINX Deprecation Announcement

On the commercial side, F5/NGINX announced in early 2025 that the nginxinc/kubernetes-ingress controller — particularly its open-source tier — would undergo significant changes. F5 signaled a strategic shift toward NGINX Gateway Fabric, their implementation of the Kubernetes Gateway API specification. The message was clear: investment in the Ingress-based controller would be reduced, and customers were encouraged to plan migrations toward Gateway API-native solutions.

For teams running NGINX Plus-based ingress with support contracts, this was a significant business concern. The product they had licensed and standardized on was being steered toward end-of-life on the Ingress API, even if exact timelines remained somewhat ambiguous in the initial announcements.

Real Impact on Production Clusters

The practical consequences depend heavily on which controller you run and how your clusters are configured. Here is an honest assessment:

Immediate Security Risk

If you run ingress-nginx and have not patched to 1.11.5+ or 1.12.1+, your admission webhook is a critical attack surface. Patching is non-negotiable and should have happened already. The admission webhook can be disabled if you are not using it for validation (many teams are not), which significantly reduces the attack surface while you plan a longer-term migration.

# Check your current ingress-nginx version
kubectl get deployment ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Verify admission webhook is configured
kubectl get validatingwebhookconfigurations | grep ingress

# If you need to disable the webhook temporarily (reduces but does not eliminate risk):
kubectl delete validatingwebhookconfiguration ingress-nginx-admission

Operational Uncertainty

Even after patching, the underlying questions remain. Teams are now asking: should we invest in hardening and tuning ingress-nginx knowing it may not be the strategic direction? Should we migrate now when it is our choice, rather than later when it may be forced? NGINX IC customers, meanwhile, are evaluating whether their licensing costs justify continued investment in a product being steered toward deprecation.

Configuration Migration Complexity

The real cost of migration is in the annotation-heavy configurations that accumulate over time. Teams that have built complex routing logic using nginx.ingress.kubernetes.io/* annotations — custom headers, rate limiting, auth snippets, rewrite rules, canary traffic splitting — face significant rework when switching controllers. This is the primary reason many teams are reluctant to move despite clear signals that a transition is coming.
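To make the rework concrete, here is a hedged sketch of one common case: a prefix-stripping rewrite expressed as an ingress-nginx annotation versus a roughly equivalent Gateway API HTTPRoute filter. The names, paths, and port are illustrative, and heavier constructs such as auth or configuration snippets have no one-to-one mapping at all.

# ingress-nginx: strip a path prefix via annotation on the Ingress
#   metadata:
#     annotations:
#       nginx.ingress.kubernetes.io/rewrite-target: /

# Roughly equivalent intent as a Gateway API HTTPRoute filter
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: legacy-rewrite           # illustrative name
spec:
  # parentRefs and hostnames omitted for brevity
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /legacy
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - name: legacy-service       # illustrative backend
      port: 8080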

The Alternatives: An Honest Evaluation

There is no shortage of Ingress controller options. The question is which alternatives are mature enough for production workloads at scale, and what trade-offs each brings.

Traefik

Traefik Proxy (and its Kubernetes-native version via Traefik Hub) has emerged as the most popular alternative for teams leaving ingress-nginx. It supports the standard Kubernetes Ingress API for drop-in compatibility, its own IngressRoute CRDs for advanced features, and Kubernetes Gateway API. It is written in Go, has strong TLS automation via Let’s Encrypt, and has excellent observability with built-in metrics and a real-time dashboard.

Trade-offs: Traefik’s configuration model is different enough from NGINX that complex routing logic requires rethinking rather than translating. Performance under very high connection counts is generally good but NGINX has a longer track record in extreme-scale deployments. The commercial offering (Traefik Hub) adds API gateway capabilities but introduces vendor dependency.

Envoy Gateway

Envoy Gateway is now a CNCF project and implements the Kubernetes Gateway API natively using Envoy as its data plane. This is arguably the most strategically aligned option for teams that want to bet on the future of Kubernetes networking. Envoy is battle-tested (it powers Istio, Contour, and large-scale service meshes at companies like Lyft and Google), and the Gateway API implementation is comprehensive and actively developed.

Trade-offs: Envoy Gateway is relatively young as a standalone project. Teams unfamiliar with Envoy will face a steeper learning curve for debugging and custom configuration. The operational model differs significantly from NGINX-based controllers. However, for greenfield deployments or teams willing to invest in the transition, this is a strong forward-looking choice.

Cilium Gateway API

If your cluster already runs Cilium as the CNI, enabling Gateway API support is a natural evolution. Cilium’s Gateway API implementation leverages eBPF for high-performance packet processing, avoiding the overhead of userspace proxy hops entirely. It is deeply integrated with Cilium’s network policy model and observability stack (Hubble).

Trade-offs: This option is only relevant if you are already committed to Cilium as your CNI, or are willing to make that switch simultaneously. Migrating both the CNI and the ingress layer at the same time is a significant operational risk. For Cilium shops, however, this consolidates complexity and provides excellent performance and observability.

HAProxy Ingress

HAProxy Ingress Controller is maintained by the HAProxy Technologies team and has a strong reputation for raw performance and precise traffic control. It supports both Ingress and Gateway API and has a long track record in high-throughput production environments. For teams with existing HAProxy expertise, it provides a familiar mental model for load balancing configuration.

Trade-offs: Smaller community than Traefik or NGINX. Less ecosystem tooling and fewer tutorials. Best suited for teams that specifically want HAProxy’s capabilities (fine-grained connection management, advanced health checking, TCP/HTTP mode flexibility) rather than as a default choice.

Kong Ingress Controller

Kong bridges the gap between an Ingress controller and a full API gateway. It supports Ingress and Gateway API resources alongside its own Kong-native plugin system for authentication, rate limiting, transformation, and observability. For teams that need API gateway capabilities rather than pure L7 routing, Kong provides a unified platform.

Trade-offs: Kong adds operational complexity. Running Kong requires either a PostgreSQL database (DB-mode) or careful management of declarative configuration (DB-less mode). The plugin ecosystem is powerful but introduces additional configuration surface. For teams that just need ingress routing, Kong may be more than necessary. For teams building API platforms, it is worth the overhead.

Istio Gateway

Istio’s ingress gateway (now aligned with Gateway API via its Kubernetes Gateway API integration) provides entry-point traffic management as part of a full service mesh. If your organization is planning or running Istio for east-west traffic, using Istio’s gateway for north-south traffic creates a unified data plane (Envoy) and consistent observability across all service communication.

Trade-offs: Istio is a serious operational commitment. The control plane overhead, the learning curve, and the impact on pod scheduling and sidecar management are significant. Choosing Istio purely for ingress replacement is like buying a race car because you needed a vehicle with good brakes. Consider this path only if service mesh capabilities are on your roadmap.

NGINX Gateway Fabric (F5’s Gateway API implementation)

F5/NGINX is building NGINX Gateway Fabric as their strategic forward path — an NGINX-based implementation of the Kubernetes Gateway API. For teams heavily invested in NGINX and wanting to stay in that ecosystem while moving to Gateway API, this provides a migration path within familiar territory. It is still maturing but represents where F5 is putting its development resources.

Comparison Matrix

Controller | Ingress API | Gateway API | Maturity | Best For | Complexity
ingress-nginx | Yes | Partial | High | Existing deployments, familiar config | Low
Traefik | Yes | Yes | High | General purpose, rapid migration | Low-Medium
Envoy Gateway | No | Yes (native) | Medium | Greenfield, future-aligned | Medium
Cilium Gateway | Yes | Yes | Medium | Cilium CNI clusters | Low (if Cilium)
HAProxy Ingress | Yes | Yes | High | High-throughput, HAProxy expertise | Medium
Kong | Yes | Yes | High | API gateway requirements | High
Istio Gateway | Via Gateway API | Yes | High | Service mesh adopters | Very High
NGINX Gateway Fabric | No | Yes (native) | Low-Medium | NGINX shops moving to Gateway API | Medium

Gateway API: The Strategic Direction You Cannot Ignore

The Kubernetes Gateway API is not simply “Ingress v2.” It is a fundamentally richer traffic management model designed to address the limitations that drove teams to annotation-based workarounds for the past several years. Understanding it is essential regardless of which controller you choose, because the ecosystem is clearly converging on it.

The core resource hierarchy consists of GatewayClass (defines a type of gateway, created by infrastructure providers), Gateway (a specific instance of a listener configuration, typically managed by platform teams), and HTTPRoute, TCPRoute, GRPCRoute, and other route resources (managed by application teams). This separation of concerns maps cleanly onto organizational roles — infrastructure teams control the gateway, application teams control their routing rules.

# Example Gateway API resources replacing an ingress-nginx Ingress
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: infra
spec:
  gatewayClassName: nginx  # or envoy, traefik, cilium, etc.
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: wildcard-tls
    allowedRoutes:
      namespaces:
        from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-app-namespace
spec:
  parentRefs:
  - name: production-gateway
    namespace: infra
  hostnames:
  - "app.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: my-api-service
      port: 8080
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: my-frontend-service
      port: 3000

Gateway API reached v1.0 (GA for HTTPRoute and Gateway) in October 2023, and v1.1 followed in 2024 with GRPCRoute graduation and expanded features. The project has broad support across controllers (Traefik, Envoy Gateway, Cilium, NGINX Gateway Fabric, Kong, Istio, and others all implement it). The Ingress API is not being removed from Kubernetes, but new feature development is effectively frozen — Gateway API is where capabilities like traffic weighting, header manipulation, request mirroring, and backend protocol configuration are being built.

Decision Framework: Stay, Migrate, or Evaluate?

There is no universal right answer. The following framework helps teams make a context-appropriate decision rather than following hype or panic.

Stay on ingress-nginx if:

  • You have patched to 1.11.5+ or 1.12.1+ and have disabled or hardened the admission webhook
  • Your cluster is stable, heavily annotation-dependent, and migration cost outweighs risk
  • You have internal NGINX expertise and can take ownership of monitoring the project’s maintenance health
  • Your organization has a short-term horizon (decommissioning or major platform change within 12-18 months)

Migrate now if:

  • You are running the F5/NGINX IC with a support contract that is being deprecated
  • Your cluster has moderate annotation complexity and you have engineering cycles available
  • You are planning a major Kubernetes version upgrade or cluster rebuild — do it at the same time
  • Your security team has flagged the CVE history as unacceptable for your risk profile
  • You are building a new cluster or platform team and want to standardize on Gateway API from the start

Evaluate before committing if:

  • Your workloads have complex traffic requirements (WebSockets, gRPC, canary deployments, header-based routing) that differ significantly across controllers
  • You are considering Gateway API but the specific controllers in your environment have not graduated their Gateway API implementations yet
  • You have multi-cluster or multi-tenant requirements that change the analysis
  • You need to assess total cost including commercial support, tooling changes, and team retraining

Migration Checklist

For teams that have decided to migrate, the following sequence reduces risk and ensures nothing critical is missed:

Phase 1: Inventory and Assessment

  • Enumerate all Ingress resources across all namespaces and document their annotations
  • Identify annotations with no direct equivalent in your target controller
  • Map TLS certificate sources (cert-manager, Secrets, external providers) and confirm compatibility
  • Document any custom NGINX configuration snippets (nginx.ingress.kubernetes.io/configuration-snippet, server-snippet) — these are high-risk items that require manual translation
  • Inventory any rate limiting, authentication, or WAF configurations layered on the controller
# Enumerate all ingress resources and their annotations across the cluster
kubectl get ingress -A -o json | jq -r '
  .items[] |
  {
    namespace: .metadata.namespace,
    name: .metadata.name,
    annotations: (.metadata.annotations // {} | keys)
  }
'

Phase 2: Target Controller Validation

  • Deploy target controller to a non-production cluster with identical Ingress/HTTPRoute resources
  • Validate TLS termination, redirect behavior, and timeout configurations (see the smoke-test sketch after this list)
  • Run load tests to confirm performance characteristics match expectations
  • Validate observability — metrics, logs, and traces integrate with your existing stack
  • Test failure scenarios: backend unavailability, certificate expiry, controller pod restart
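A minimal smoke-test sketch for the TLS, redirect, and timeout checks; the hostname and path are placeholders for whatever you expose through the candidate controller in the test cluster.

# Verify HTTP-to-HTTPS redirect behavior on the new controller
curl -sI http://app.staging.example.com/ | head -n 5

# Verify TLS termination and inspect the served certificate
curl -svI https://app.staging.example.com/ 2>&1 | grep -E 'subject:|issuer:|expire'

# Spot-check response codes and latency against configured timeouts
curl -s -o /dev/null -w 'http_code=%{http_code} total_time=%{time_total}s\n' \
  https://app.staging.example.com/api/slow-endpoint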

Phase 3: Staged Production Migration

  • Deploy new controller to production alongside existing controller (different IngressClass)
  • Migrate low-risk, low-traffic Ingress resources first by updating their ingressClassName
  • Use DNS-based canary switching (weighted routing at the DNS level) rather than switching entire IngressClass at once
  • Monitor error rates and latency for 24-48 hours after each batch migration
  • Migrate critical services during low-traffic windows with rollback plan documented
  • Decommission old controller only after all resources are migrated and validated
# Migrate individual Ingress to new controller by changing ingressClassName
kubectl patch ingress my-app -n my-namespace \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/ingressClassName", "value": "traefik"}]'

# Or if migrating to Gateway API, create equivalent HTTPRoute first,
# test it, then remove the old Ingress resource
kubectl apply -f my-app-httproute.yaml
# Validate, then:
kubectl delete ingress my-app -n my-namespace

Phase 4: Gateway API Adoption (Optional but Recommended)

  • Install Gateway API CRDs if not already present (kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.1.0/standard-install.yaml)
  • Define GatewayClass resources matching your chosen controller (a sketch follows this list)
  • Migrate Ingress resources to HTTPRoute progressively, starting with simpler configurations
  • Update CI/CD pipelines and Helm charts to generate HTTPRoute instead of Ingress resources for new services
  • Establish a policy: new services use Gateway API; legacy services migrate on their next significant update
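For the GatewayClass step, a minimal sketch looks like the following; the controllerName value is defined by whichever implementation you chose and must be copied from that project's documentation, so the value below is only a placeholder.

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: production
spec:
  # Substitute the controller identifier documented by your implementation
  # (Envoy Gateway, Traefik, Cilium, NGINX Gateway Fabric each publish their own value)
  controllerName: example.com/your-gateway-controller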

Recommendation

For most platform engineering teams reading this in 2025, the pragmatic recommendation is as follows:

Short term (next 30 days): Patch ingress-nginx to the latest release if you are still on it. Assess and harden or disable the admission webhook. This is not optional.

Medium term (3-6 months): Evaluate Traefik or Envoy Gateway against your specific workload requirements. Traefik is the lower-friction migration for teams coming from ingress-nginx on the Ingress API. Envoy Gateway is the stronger strategic choice if you are willing to commit to Gateway API fully. Either way, run a parallel deployment in a non-production environment and measure the delta in operational overhead.

Long term (6-18 months): Plan migration to Gateway API resources regardless of which data plane you choose. The Ingress API will not disappear overnight, but feature parity with Gateway API capabilities will never arrive. Teams that standardize on Gateway API now build the institutional knowledge that will be valuable as the ecosystem continues to evolve.

If you are running F5/NGINX IC under a support contract: engage your F5 account team now to get a clear timeline on the deprecation path and evaluate NGINX Gateway Fabric as a within-ecosystem migration before looking at alternatives. The question is not whether to migrate but when and to what.

Avoid the temptation to treat this as a purely technical decision. The switch of an ingress controller touches CI/CD pipelines, monitoring dashboards, runbooks, on-call playbooks, and engineering team knowledge. Factor in the total transition cost, not just the YAML changes.

Frequently Asked Questions

Is the Kubernetes Ingress API being deprecated or removed?

No. The networking.k8s.io/v1 Ingress API is not deprecated and there are no current plans to remove it from Kubernetes. It will continue to work. What is happening is that the Kubernetes SIG Network has frozen new feature development on the Ingress API and is directing all new traffic management capabilities to Gateway API. In practical terms, if you need a capability that Ingress does not currently provide, you will not get it through Ingress. You will need Gateway API. Existing Ingress resources will continue to function for the foreseeable future.

Can I run two Ingress controllers simultaneously during migration?

Yes, and this is the recommended approach for production migrations. Kubernetes supports multiple IngressClass resources in a cluster, each backed by a different controller. Ingress resources select their controller via the spec.ingressClassName field (or the legacy kubernetes.io/ingress.class annotation). You can run ingress-nginx and Traefik side-by-side, migrating individual Ingress resources by updating their ingressClassName. Once migration is complete and validated, decommission the old controller. Just make sure that only one controller is marked as the default IngressClass at any given time, as two defaults cause conflicts.
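A minimal sketch of the side-by-side setup: two IngressClass resources where only one carries the default-class annotation. The controller strings are the ones each project documents; double-check them against your installed versions.

apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"   # exactly one default per cluster
spec:
  controller: k8s.io/ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: traefik
spec:
  controller: traefik.io/ingress-controller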

What happens to cert-manager if I switch controllers?

cert-manager is independent of your Ingress controller and will continue to work regardless of which controller you use. The HTTP-01 challenge solver in cert-manager creates temporary Ingress resources to complete ACME challenges — these will use whichever IngressClass you configure in your Issuer or ClusterIssuer. If you migrate to Gateway API, cert-manager has added Gateway API support (HTTPRoute-based HTTP-01 challenges) starting from version 1.14. DNS-01 challenges are entirely unaffected by controller choice. Update your Issuer configuration to reference the new IngressClass during migration.
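The corresponding cert-manager change is usually a one-line edit to the HTTP-01 solver; a hedged ClusterIssuer sketch, with the ACME email and secret name as placeholders:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com          # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-account-key      # placeholder
    solvers:
    - http01:
        ingress:
          class: traefik   # point challenge Ingresses at the new controller during migration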

How severe is the performance difference between ingress-nginx and alternatives?

For the vast majority of production workloads, the performance difference between mature controllers (ingress-nginx, Traefik, HAProxy, Envoy) is not the deciding factor. All of them can handle tens of thousands of requests per second on reasonable hardware, and the bottleneck is typically the backend services, not the ingress layer. The notable exception is Cilium with eBPF-based forwarding, which eliminates userspace proxy overhead entirely and can show measurable latency reduction at high percentiles for latency-sensitive workloads. If you are running at a scale where ingress controller throughput is actually the constraint, you already have the engineering resources to benchmark your specific workload profile against candidate controllers before committing.

Should we just move everything to a cloud provider’s managed load balancer and skip the in-cluster controller?

This is a legitimate option for teams on managed Kubernetes (EKS, GKE, AKS). Cloud-native load balancers (AWS ALB via AWS Load Balancer Controller, GKE Gateway, Azure Application Gateway Ingress Controller) eliminate the operational burden of managing an in-cluster controller and integrate deeply with cloud IAM, WAF, and observability services. The trade-offs are cost (cloud LBs charge per rule and per hour), vendor lock-in, and reduced portability. For purely cloud-native workloads with no multi-cloud or on-premises requirements, cloud-managed load balancers are worth serious consideration and sidestep the ingress-nginx problem entirely. For hybrid or multi-cluster environments, in-cluster controllers maintain an advantage in consistency and portability.

Prometheus Scalability: High Cardinality and How to Fix It

Prometheus has become the de facto standard for metrics collection in cloud-native environments. Its pull-based model, powerful query language, and deep Kubernetes integration make it an obvious choice for platform teams. But as organizations scale — more services, more replicas, more labels — Prometheus starts showing cracks. Queries slow down, memory usage balloons, and what was once a reliable monitoring backbone becomes an operational liability. This article examines exactly why that happens and what you can do about it, from quick tactical fixes to full architectural overhauls.

The Cardinality Problem: Why It Kills Prometheus

Cardinality is the single most important concept to understand when troubleshooting Prometheus scalability. In the context of time series databases, cardinality refers to the total number of unique label combinations that exist across all your metrics. Every unique combination creates a distinct time series, and Prometheus must store, index, and query each of them independently.

Consider a simple HTTP request counter: http_requests_total. If you label it with method (GET, POST, PUT, DELETE), status_code (200, 201, 400, 404, 500, 503), and endpoint (50 distinct API paths), you already have 4 × 6 × 50 = 1,200 time series from a single metric. Now add a customer_id label with 10,000 distinct values. You have just created 12 million time series from one counter.

This is the cardinality explosion pattern, and it is the most common cause of Prometheus degradation in production. The problem is compounded by labels that have unbounded or high-entropy values:

  • User IDs or session tokens embedded in labels
  • Request IDs or trace IDs (effectively infinite cardinality)
  • Pod names without proper aggregation, especially in autoscaling environments
  • Free-form error messages or SQL query strings
  • IP addresses, particularly in environments with high churn

The relationship between cardinality and resource consumption is roughly proportional to the number of active series, and each series carries significant fixed overhead in memory indexing structures. Prometheus stores its head block (the most recent data) entirely in memory. Each time series in the head block requires approximately 3–4 KB of RAM for the series itself plus index entries. A Prometheus instance with 1 million active time series will typically consume 4–6 GB of RAM just for the head block, before accounting for query processing overhead.

Memory Explosion Patterns and Real Symptoms

Memory issues in Prometheus rarely announce themselves cleanly. Instead, they manifest through a cascade of symptoms that are easy to misdiagnose. Understanding the failure modes helps you identify the root cause faster and apply the right remedy.

The Head Block Growth Pattern

Prometheus keeps a two-hour window of data in memory as the head block before compacting it to disk. If your series count grows continuously — which happens when pod churn creates new series faster than old ones expire — the head block never shrinks. You can monitor this directly with prometheus_tsdb_head_series and prometheus_tsdb_head_chunks. A healthy instance shows this number plateauing. A cardinality problem shows it growing monotonically until OOM.
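A hedged example of an alert built on that series; the growth threshold and windows are illustrative and should be tuned against your own baseline.

groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusHeadSeriesGrowth
        # Head series grew more than ~20% over six hours: likely a cardinality leak
        expr: |
          prometheus_tsdb_head_series > 1.2 * (prometheus_tsdb_head_series offset 6h)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Active series on {{ $labels.instance }} growing faster than expected"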

Query Timeout Cascades

As series count grows, even well-written PromQL queries that worked fine at 100k series become unbearably slow at 1M. Grafana dashboards start timing out, alert evaluation lags behind schedule, and Alertmanager begins receiving delayed or duplicated firing alerts. The prometheus_rule_evaluation_duration_seconds metric is a reliable early warning — when p99 evaluation time for your recording rules exceeds your evaluation interval, you have a problem.

Scrape Failures Under Memory Pressure

When Prometheus is under heavy memory pressure, its Go garbage collector starts spending more time collecting, which introduces latency into the scrape loop. Scrapes begin timing out, causing gaps in your data. This creates a deceptive situation where you have gaps in metrics precisely when your system is under stress — exactly when you need monitoring most. Watch for drops in the up metric and for increases in prometheus_target_scrapes_exceeded_sample_limit_total to catch these patterns.

Compaction Pressure

High cardinality also stresses the TSDB compaction process. Prometheus compacts head block data into persistent blocks every two hours. With millions of series, compaction can take tens of seconds to minutes, during which write performance degrades. prometheus_tsdb_compaction_duration_seconds rising above 30 seconds is a warning sign. Compaction failures leave orphaned blocks on disk, gradually consuming storage and potentially corrupting the TSDB if left unaddressed.

Short-Term Fixes: Tactical Remediation

When you are dealing with a Prometheus instance under active stress, you need immediate relief before you can implement architectural changes. These techniques can be applied quickly and provide meaningful headroom while longer-term solutions are planned.

Recording Rules: Pre-Computing Aggregations

Recording rules are the most underutilized tool in the Prometheus toolbox. They allow you to pre-compute expensive PromQL expressions and store the results as new time series. The key benefit for scalability is that you can aggregate away high-cardinality dimensions, dramatically reducing the number of series that dashboards and alerts need to query at runtime.

Consider an example where you have per-pod HTTP request rates with labels for pod, namespace, service, method, and status_code. Your dashboards mostly need service-level aggregations, not per-pod breakdowns. A recording rule can produce that aggregation once per evaluation interval:

groups:
  - name: http_aggregations
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: |
          sum by (job, namespace, method, status_code) (
            rate(http_requests_total[5m])
          )

      - record: job:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, namespace, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      - record: namespace:http_requests_total:rate5m
        expr: |
          sum by (namespace, status_code) (
            rate(http_requests_total[5m])
          )

Notice that the pod label is dropped in all three rules. If you had 500 pods, you have just reduced the cardinality of these series by a factor of 500. Dashboards querying job:http_requests_total:rate5m instead of computing rate(http_requests_total[5m]) on the fly will return results orders of magnitude faster.

The naming convention level:metric:operations is the Prometheus community standard. Following it consistently makes recording rules self-documenting and helps teams understand the aggregation level at a glance.

Metric Dropping via Relabeling

Relabeling gives you surgical control over what metrics Prometheus actually ingests. There are two stages where relabeling applies: relabel_configs (applied before scraping, based on target metadata) and metric_relabel_configs (applied after scraping, based on scraped metric names and labels). For cardinality control, metric_relabel_configs is your primary tool.

Dropping entire metric families that you do not use is the most impactful change you can make. Many exporters emit dozens of metrics that are irrelevant for most use cases:

scrape_configs:
  - job_name: kubernetes-pods
    metric_relabel_configs:
      # Drop metrics we never query
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*|process_.*'
        action: drop

      # Drop high-cardinality label values while keeping the metric
      - source_labels: [__name__, pod]
        regex: 'http_requests_total;.*'
        target_label: pod
        replacement: ''

      # Drop unneeded histogram buckets by listing the "le" values to discard
      # (an action of "keep" here would drop every non-matching series from the scrape)
      - source_labels: [__name__, le]
        regex: 'http_request_duration_seconds_bucket;(0\.005|0\.01|0\.025|0\.05|0\.1)'
        action: drop

      # Replace high-cardinality endpoint paths with normalized versions
      - source_labels: [endpoint]
        regex: '/api/v1/users/[0-9]+'
        target_label: endpoint
        replacement: '/api/v1/users/:id'

Be careful with metric_relabel_configs: they are applied to every scraped sample, so broad regex alternations on high-frequency scrapes add measurable CPU overhead. Prometheus anchors relabeling regexes on both ends and evaluates them with Go's RE2 engine (which does not backtrack), so the main lever is keeping patterns specific and testing them against realistic sample sets.

Cardinality Limits as a Safety Net

Prometheus 2.x introduced per-scrape sample limits as a defensive mechanism. These do not solve cardinality problems but prevent a single misbehaving exporter from taking down your entire Prometheus instance:

global:
  # Default per-scrape sample limit applied to every job (0 = no limit)
  sample_limit: 0  # 0 = no limit

scrape_configs:
  - job_name: application-pods
    # Reject scrapes that return more than 50k samples
    sample_limit: 50000

    # Limit the number of labels allowed on each scraped sample
    label_limit: 64

    # Limit label name and value lengths
    label_name_length_limit: 256
    label_value_length_limit: 1024

    kubernetes_sd_configs:
      - role: pod

When a scrape exceeds sample_limit, Prometheus rejects the entire scrape and marks the target as having failed. This is a hard circuit breaker, not a graceful degradation — the target’s up metric goes to 0. Set limits conservatively above your expected maximum to avoid false positives, and alert on prometheus_target_scrapes_exceeded_sample_limit_total > 0.
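A minimal alert for that circuit breaker, built directly on the counter mentioned above:

groups:
  - name: scrape-limit-alerts
    rules:
      - alert: ScrapeSampleLimitExceeded
        # Fires if any target was rejected for exceeding sample_limit in the last 15 minutes
        expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A scrape target exceeded its configured sample_limit"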

Architectural Solutions: Federation and Remote Write

Once you have exhausted tactical optimizations or when your scale genuinely exceeds what a single Prometheus instance can handle, architectural changes become necessary. Prometheus offers two built-in mechanisms for scaling horizontally: federation and remote_write.

Federation: Hierarchical Scraping

Prometheus federation allows one Prometheus instance to scrape aggregated metrics from other Prometheus instances via the /federate endpoint. In a typical setup, leaf-level Prometheus instances collect raw metrics from targets, while a global Prometheus instance federates pre-aggregated recording rule results from the leaves.

# Global Prometheus configuration federating from regional instances
scrape_configs:
  - job_name: federate-regions
    scrape_interval: 15s
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        # Only federate pre-aggregated recording rule metrics
        - '{__name__=~"job:.*"}'
        - '{__name__=~"namespace:.*"}'
        - '{__name__=~"cluster:.*"}'
        # Federate key infrastructure alerts
        - 'up{job="kubernetes-apiservers"}'
    static_configs:
      - targets:
          - prometheus-eu-west.monitoring.svc:9090
          - prometheus-us-east.monitoring.svc:9090
          - prometheus-ap-south.monitoring.svc:9090

Federation works well for multi-region global dashboards and cross-cluster alerting on aggregated signals. Its limitations are significant, though: the /federate endpoint is a point-in-time snapshot, so you cannot run range queries against federated data effectively. It also creates a single point of failure at the global layer and does not provide true long-term storage. For those requirements, remote_write is the better path.

Remote Write: Streaming to Durable Storage

Remote write allows Prometheus to stream all ingested samples to an external storage backend in real time. The external backend handles long-term retention, multi-tenancy, and global query federation. Prometheus itself becomes a stateless collection agent that maintains only a short local retention window for resilience against network outages.

remote_write:
  - url: https://thanos-receive.monitoring.svc:19291/api/v1/receive
    # Authentication for the remote endpoint
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/remote-write-password

    # Tune the write queue for throughput vs. latency
    queue_config:
      # Number of shards (parallel write connections)
      max_shards: 200
      min_shards: 1
      # Samples to batch before flushing
      max_samples_per_send: 500
      # Time to wait before flushing an incomplete batch
      batch_send_deadline: 5s
      # In-memory buffer capacity per shard
      capacity: 2500
      # How long to retry failed writes
      min_backoff: 30ms
      max_backoff: 5s

    # Metadata configuration
    metadata_config:
      send: true
      send_interval: 1m

    # Filter what gets remote-written (reduce egress)
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop

The queue_config tuning is critical and frequently misunderstood. Each shard maintains its own connection to the remote endpoint and its own in-memory queue. Increasing max_shards increases parallelism and throughput but also increases memory consumption and load on the remote endpoint. The right values depend heavily on your sample ingestion rate and network latency to the remote endpoint. Monitor prometheus_remote_storage_queue_highest_sent_timestamp_seconds versus prometheus_remote_storage_highest_timestamp_in_seconds — the lag between them tells you how far behind your remote write queue is.

Long-Term Solutions: Thanos vs Grafana Mimir vs VictoriaMetrics

For production systems that need long-term storage, global query capability, high availability, and genuine horizontal scalability, purpose-built solutions are the right answer. Three projects dominate this space: Thanos, Grafana Mimir, and VictoriaMetrics. They share similar goals but differ significantly in architecture, operational complexity, and trade-offs.

Criterion | Thanos | Grafana Mimir | VictoriaMetrics
Architecture | Sidecar + object store; modular components | Fully distributed; Cortex-derived microservices | Single binary or cluster mode
Storage backend | Any S3-compatible object store | Any S3-compatible object store | Own TSDB format on local or object store
PromQL compatibility | Full PromQL; own query engine | Full PromQL; Mimir-specific extensions | MetricsQL (PromQL superset)
Operational complexity | Medium — multiple components, each simple | High — many microservices with complex config | Low — minimal components, simple config
Ingest scalability | Scales via Thanos Receive fan-out | Horizontally scalable distributors + ingesters | Excellent; handles millions of samples/sec per node
Query performance | Good; Store Gateway caches object store data | Good; query sharding and caching built in | Excellent; highly optimized query engine
Multi-tenancy | Limited; tenant isolation via external labels | Native; per-tenant limits and isolation | Enterprise only; basic in cluster mode
Deduplication | Built-in; replica dedup at query time | Built-in; ingest-time and query-time dedup | Built-in; dedup with downsampling
Downsampling | Yes; Thanos Compactor handles it | Yes; configurable per tenant | Yes; automatic with vmbackupmanager
License | Apache 2.0 (fully open source) | AGPL-3.0 (open source) + enterprise tier | Apache 2.0 (community); proprietary enterprise
Best fit | Teams already running Prometheus wanting minimal disruption | Large orgs needing multi-tenant SaaS-grade monitoring | Teams prioritizing simplicity and raw performance

Thanos: The Incremental Path

Thanos integrates with existing Prometheus deployments through a sidecar process that runs alongside each Prometheus pod. The sidecar uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob) and exposes a gRPC Store API that Thanos Query uses to federate queries across all Prometheus instances plus historical data in the object store. This makes Thanos the lowest-friction path for teams with existing Prometheus infrastructure.

Thanos Receive is an alternative ingest path that accepts remote_write directly, which is useful when you want to decouple Prometheus instances from the query layer or implement active-active HA without relying on Prometheus replication. Thanos Compactor handles block compaction and downsampling on the object store, creating 5-minute and 1-hour resolution downsamples automatically for efficient long-range queries.

Grafana Mimir: Enterprise-Grade Multi-Tenancy

Mimir is a fork of Cortex, rewritten by Grafana Labs to address operational complexity issues in Cortex’s architecture. It follows the same microservices pattern — Distributor, Ingester, Querier, Query Frontend, Store Gateway, Compactor, Ruler — but with significantly improved defaults and a monolithic deployment mode that simplifies small-scale deployments. Mimir’s headline feature is native multi-tenancy with per-tenant cardinality limits, query limits, and ingestion rate limits enforced at the distributor layer.

Mimir is the right choice when you need to run monitoring as an internal platform service for multiple teams or business units, each with independent resource quotas and data isolation. The operational overhead is substantial, but for large organizations it is justified by the isolation and governance capabilities.

VictoriaMetrics: Simplicity and Raw Performance

VictoriaMetrics takes a fundamentally different approach: rather than building on top of Prometheus’s TSDB format, it implements its own highly optimized storage engine. The result is dramatically better compression (often 5–10x better than Prometheus TSDB) and query performance that consistently outperforms Thanos and Mimir in benchmarks, particularly for high-cardinality workloads and large time ranges. The single-node binary handles workloads that would require a full Thanos cluster, and the cluster version adds horizontal scalability with fewer moving parts than Thanos or Mimir.

VictoriaMetrics also supports MetricsQL, a superset of PromQL that adds useful functions like outlierIQR(), limitOffset(), and improved histogram handling. Grafana datasource compatibility is maintained through a PromQL-compatible API, so existing dashboards work without modification.

Practical Guide: Choosing Your Scaling Approach

The right solution depends on your current scale, team capacity, and trajectory. This is not a one-size-fits-all decision. Here is a pragmatic framework for matching the solution to the problem.

Stage 1: Under 1 Million Active Series

A single Prometheus instance with proper tuning should handle this comfortably. Focus on recording rules to eliminate expensive dashboard queries, implement metric_relabel_configs to drop unused metrics, and set sample_limit guards. Increase Prometheus memory limits to give it adequate headroom (at minimum 8 GB, ideally 16 GB for instances approaching 1M series). Set --storage.tsdb.retention.time to the minimum that satisfies your compliance and debugging needs — 15 days is often enough if you have remote_write configured to a longer-term store.

Stage 2: 1–5 Million Active Series

At this scale, a single instance is viable but requires vertical scaling and aggressive optimization. Consider sharding your Prometheus deployment by functional area: one instance for infrastructure metrics, one for application metrics, one for business metrics. This is horizontal scaling via functional decomposition, not true distributed architecture. Add remote_write to object storage for long-term retention. If you are running Kubernetes, the Prometheus Operator with multiple Prometheus custom resources per namespace group is a clean implementation of this pattern.
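With the Prometheus Operator, functional sharding is expressed as separate Prometheus custom resources, each selecting a different group of ServiceMonitors. A hedged sketch of one shard; the label convention (monitoring-tier) and resource sizes are illustrative, not prescriptive.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: infra-metrics
  namespace: monitoring
spec:
  replicas: 2
  retention: 15d
  # Only scrape ServiceMonitors labelled for the infrastructure shard;
  # a sibling Prometheus resource selects monitoring-tier: application
  serviceMonitorSelector:
    matchLabels:
      monitoring-tier: infrastructure
  resources:
    requests:
      memory: 8Gi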

This is also the stage where VictoriaMetrics single-node becomes compelling — it can handle this range comfortably with far less RAM than Prometheus and simpler operations than a full distributed system.

Stage 3: 5 Million+ Active Series or Global Requirements

At this scale, a distributed architecture is necessary. Your choice among Thanos, Mimir, and VictoriaMetrics Cluster depends primarily on:

  • Existing Prometheus investment + incremental migration: Thanos Sidecar is the path of least resistance. Your existing Prometheus instances keep working; you add sidecars and deploy Thanos query components.
  • Multi-tenant platform with governance requirements: Grafana Mimir, accepting the operational complexity in exchange for native tenant isolation and limits.
  • Maximum performance with minimal operational burden: VictoriaMetrics Cluster, replacing Prometheus entirely or alongside it via remote_write, with dramatically simpler operations than Thanos or Mimir.
  • Multi-region, cross-cloud global monitoring: Thanos or Mimir, both have mature multi-region architectures; VictoriaMetrics Enterprise has similar capabilities but is not open source.

Complementary Configuration: Thanos Sidecar Example

For teams adopting Thanos, the sidecar configuration alongside a Prometheus deployment looks like this in a Kubernetes environment:

# Thanos sidecar configuration (as part of Prometheus pod spec)
containers:
  - name: prometheus
    image: prom/prometheus:v2.48.0
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Keep 2h locally; Thanos handles long-term
      - --storage.tsdb.retention.time=2h
      # Thanos requires min-block-duration = max-block-duration for sidecar
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
      - --web.enable-lifecycle

  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.32.0
    args:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://localhost:9090
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      # Object store configuration
      - --objstore.config-file=/etc/thanos/objstore.yml
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
      - name: thanos-objstore-config
        mountPath: /etc/thanos

---
# Object store configuration (s3-compatible)
# /etc/thanos/objstore.yml
type: S3
config:
  bucket: my-thanos-metrics
  endpoint: s3.eu-west-1.amazonaws.com
  region: eu-west-1
  # Use IAM role or provide credentials via environment
  access_key: ""
  secret_key: ""

VictoriaMetrics as Remote Write Target

If you choose VictoriaMetrics as your remote storage backend, the integration with existing Prometheus instances is straightforward. VictoriaMetrics exposes a remote_write compatible endpoint at /api/v1/write:

# prometheus.yml — remote write to VictoriaMetrics
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30

# VictoriaMetrics single-node startup (Docker Compose example)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:v1.95.1
    command:
      - -storageDataPath=/victoria-metrics-data
      # Retain 1 year of data
      - -retentionPeriod=12
      # Enable deduplication (for HA Prometheus pairs)
      - -dedup.minScrapeInterval=15s
      # Memory limit
      - -memory.allowedPercent=60
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/victoria-metrics-data

volumes:
  vm-data:

VictoriaMetrics also exposes a Prometheus-compatible query API at /api/v1/query and /api/v1/query_range, so Grafana datasources pointing at it need only a URL change — no plugin installation required for basic use. For MetricsQL-specific functions, use the VictoriaMetrics datasource plugin available in Grafana’s plugin catalog.
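For completeness, a provisioned Grafana datasource pointing at VictoriaMetrics looks like any other Prometheus datasource; only the URL differs. The file path and service address below are placeholders.

# grafana/provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus        # the PromQL-compatible API works with the standard Prometheus type
    access: proxy
    url: http://victoriametrics:8428
    isDefault: false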

Observing Your Prometheus Health

Before implementing any of these solutions, establish a baseline understanding of your Prometheus instance’s current health. The following PromQL expressions give you immediate visibility into the key indicators:

# Total active time series in the head block
prometheus_tsdb_head_series

# Series created vs. removed (churn indicator)
rate(prometheus_tsdb_head_series_created_total[5m])
rate(prometheus_tsdb_head_series_removed_total[5m])

# Memory usage of the head block chunks
prometheus_tsdb_head_chunks_storage_size_bytes

# Remote write lag (seconds behind)
(
  prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
)

# Top cardinality contributors (requires Prometheus 2.14+)
# Run this in Prometheus /api/v1/query:
topk(20,
  count by (__name__) ({__name__!=""})
)

# Alert evaluation lag
rate(prometheus_rule_evaluation_duration_seconds_sum[5m])
/ rate(prometheus_rule_evaluation_duration_seconds_count[5m])

Prometheus also exposes a /api/v1/status/tsdb endpoint that returns cardinality statistics including the top 10 metrics by series count and the top 10 label names by cardinality. This is invaluable for identifying which specific metrics or labels are causing problems and should be your first stop when investigating a new cardinality issue.

Frequently Asked Questions

How do I identify which metrics are causing my cardinality explosion?

Start with the /api/v1/status/tsdb endpoint on your Prometheus instance. It returns a JSON response with seriesCountByMetricName, seriesCountByLabelValuePair, and labelValueCountByLabelName arrays, showing which metrics, label pairs, and label names contribute most to your series count and label cardinality. This points you directly at the offending metrics and labels without any external tooling. Complement this with topk(20, count by (__name__)({__name__!=""})) in PromQL, which gives you the same information in a queryable format you can alert on. Once you know the metric name, query count by (label1, label2) (your_metric_name) replacing label pairs to identify which specific label dimensions are driving the high count.
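A quick way to pull those statistics from the command line; the Prometheus address is a placeholder for your own instance.

# Top metrics by active series count, straight from the TSDB status endpoint
curl -s http://prometheus.monitoring.svc:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName[:10]'

# Top label name/value pairs by series count
curl -s http://prometheus.monitoring.svc:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByLabelValuePair[:10]'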

Can I run Prometheus in HA without external dependencies?

Yes, but with important caveats. The standard HA pattern for standalone Prometheus is to run two identical Prometheus instances scraping the same targets. Both instances collect data independently, and Alertmanager deduplicates alerts from both using its mesh clustering (run multiple Alertmanager instances in a cluster and point both Prometheus instances at all of them). This provides alerting HA — alerts fire even if one Prometheus instance is down. It does not provide query HA in the traditional sense, because each instance has its own independent data and queries against a failed instance simply fail. Dashboards pointing at a specific instance will show gaps during that instance’s downtime. For true query HA with failover and deduplication, you need Thanos Query (which can deduplicate replica series at query time using the replica external label) or a similar solution. Running Prometheus without any external dependencies means accepting these query HA limitations.
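In that pattern the only configuration difference between the two replicas is an external label identifying each one, which Thanos Query (or a similar layer) later uses for deduplication. A minimal sketch; the label names and values are conventions rather than requirements.

# prometheus.yml on the first replica
global:
  external_labels:
    cluster: prod-eu-west    # illustrative
    replica: prometheus-0    # the second replica sets replica: prometheus-1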

What is a safe maximum cardinality for a single Prometheus instance?

There is no universal number — it depends heavily on your scrape interval, available RAM, and query patterns. A practical guideline: allocate 3–4 GB of RAM per million active series for the head block alone, then add 50% headroom for query processing. A Prometheus instance with 16 GB of RAM can comfortably handle 2–3 million active series under typical workloads. Beyond 5 million series, even well-resourced single instances start showing query performance degradation that impacts alert evaluation reliability. The more meaningful limit to enforce operationally is series churn rate: an instance creating more than 100,000 new series per minute will struggle regardless of total series count, because the head block indexing operations become a bottleneck. Monitor rate(prometheus_tsdb_head_series_created_total[5m]) and treat sustained values above 50,000/minute as a warning condition.

Should I use Thanos Sidecar or Thanos Receive for ingestion?

The choice comes down to whether you want to keep Prometheus as the authoritative ingest layer or move toward a push-based architecture. Thanos Sidecar is the simpler, lower-risk option: Prometheus continues operating normally, the sidecar uploads completed blocks to object storage in the background, and you gain long-term storage and global query capability with minimal disruption. The drawback is that Prometheus must have local storage for at least 2 hours (one block duration), and the sidecar requires that min-block-duration equals max-block-duration, which prevents Prometheus from doing its own compaction. Thanos Receive accepts remote_write from any Prometheus instance, which enables active-active HA setups where multiple Prometheus replicas write to a Receive hashring simultaneously, and Receive handles deduplication. This is more complex to operate but provides better ingest-side redundancy. For most teams starting with Thanos, Sidecar is the right first step. Receive makes sense when you are building a centralized monitoring platform that accepts writes from many Prometheus instances across different clusters or environments.

Is it worth migrating from Thanos to VictoriaMetrics or Mimir once you are already running Thanos?

Migration from Thanos to an alternative should be driven by specific pain points, not by benchmark numbers alone. If your team is spending significant time operating Thanos (debugging query Store Gateway cache issues, managing compactor conflicts, handling block upload failures), and your primary need is simplicity and query performance rather than multi-tenancy, VictoriaMetrics is worth evaluating seriously. The migration path is smooth: run VictoriaMetrics alongside Thanos temporarily, migrate remote_write targets to VictoriaMetrics, and decommission Thanos once you are satisfied. Historical data in your object store can be imported using VictoriaMetrics’s vmctl tool. If your pain point is multi-tenancy and governance — multiple teams with independent data isolation, per-tenant rate limits, chargeback requirements — Mimir is the right destination and the operational complexity is justified. The one scenario where staying with Thanos is usually the right call is when your organization has invested heavily in Thanos tooling, has stable operations, and does not have specific unmet needs. Migration carries real costs in engineering time and operational risk; make sure the benefits are concrete and quantified before committing.

Gateway API Provider Support in 2026: A Critical Evaluation

The Kubernetes Gateway API is no longer a future concept—it’s the present standard for traffic management. With the deprecation signals around the NGINX-based Ingress controllers marking a definitive shift, platform teams and architects are now faced with a critical decision: which Gateway API provider to adopt. The official implementations page lists numerous options, but the real-world picture is one of fragmented support, varying stability, and significant gaps that can derail multi-cluster strategies.

In this evaluation, we move beyond marketing checklists to analyze the practical state of Gateway API support across major cloud providers, ingress controllers, and service meshes. We’ll examine which versions are truly production-ready, where the interoperability pitfalls lie, and what you must account for before standardizing across your infrastructure.

The Gateway API Maturity Spectrum: From Experimental to Standard

Not all Gateway API resources are created equal. The API’s versioning model, in which resources graduate from the Experimental release channel to the Standard channel and individual features are classified as Core, Extended, or Implementation-specific, means provider support is inherently uneven. An implementation might fully support the stable Gateway and HTTPRoute resources while offering only partial or experimental backing for GRPCRoute or TCPRoute.

This creates a fundamental challenge for architects: designing for the lowest common denominator or accepting provider-specific constraints. The decision hinges on accurately mapping your traffic management requirements (HTTP, TLS termination, gRPC, TCP/UDP load balancing) against what each provider actually delivers in a stable form.

Core API Support: The Foundation

Most providers now support the v1 (GA) versions of the foundational resources:

  • GatewayClass & Gateway: Nearly universal support for v1. These are the control plane resources for provisioning and configuring load balancers.
  • HTTPRoute: Universal support for v1. This is the workhorse for HTTP/HTTPS traffic routing and is considered the most stable.

However, support for other route types reveals the fragmentation:

  • GRPCRoute: Often in beta or experimental stages. Critical for modern microservices architectures but not yet universally reliable.
  • TCPRoute & UDPRoute: Patchy support. Some providers implement them as beta, others ignore them entirely, forcing fallbacks to provider-specific annotations or custom resources.
  • TLSRoute: Frequently tied to specific certificate management integrations (e.g., cert-manager).

Major Provider Deep Dive: Implementation Realities

AWS Elastic Kubernetes Service (EKS)

AWS offers an official Gateway API controller for EKS. Its support is pragmatic but currently limited:

  • Supported Resources: GatewayClass, Gateway, HTTPRoute, and GRPCRoute (all v1beta1 as of early 2024). Note the use of v1beta1 for GRPCRoute, indicating it’s not yet at GA stability.
  • Underlying Infrastructure: Maps directly to AWS Application Load Balancer (ALB) and Network Load Balancer (NLB). This is a strength (managed AWS services) and a constraint (you inherit ALB/NLB feature limits).
  • Critical Gap: No support for TCPRoute or UDPRoute. If your workload requires raw TCP/UDP load balancing, you must use the legacy Kubernetes Service type LoadBalancer or a different ingress controller alongside the Gateway API controller, creating a disjointed management model.

Google Kubernetes Engine (GKE) & Azure Kubernetes Service (AKS)

Both Google and Azure have integrated Gateway API support directly into their managed Kubernetes offerings, often with a focus on their global load-balancing infrastructures.

  • GKE: Offers the GKE Gateway controller. It supports v1 resources and can provision Google Cloud Global External Load Balancers. Its integration with Google’s certificate management and CDN is a key advantage. However, advanced routing features may require GCP-specific backend configs.
  • AKS: Provides the Application Gateway Ingress Controller (AGIC) with Gateway API support, mapping to Azure Application Gateway. Support for newer route types like GRPCRoute has historically lagged behind other providers.

The pattern here is clear: cloud providers implement the Gateway API as a facade over their existing, proprietary load-balancing products. This ensures stability and performance but can limit portability and advanced cross-provider features.

NGINX & Kong Ingress Controller

These third-party, cluster-based controllers offer a different value proposition: consistency across any Kubernetes distribution, including on-premises.

  • NGINX: With its stable Ingress APIs deprecated in favor of Gateway API, its Gateway API implementation is now the primary path forward. It generally has excellent support for the full range of experimental and standard resources, as it’s not constrained by a cloud vendor’s underlying service. This makes it a strong choice for hybrid or multi-cloud deployments where feature parity is crucial.
  • Kong Ingress Controller: Kong has been an early and comprehensive supporter of the Gateway API, often implementing features quickly. It leverages Kong Gateway’s extensive plugin ecosystem, which can be a major draw but also introduces vendor lock-in.

Critical Gaps for Enterprise Architects

Beyond checking resource support boxes, several deeper gaps can impact production deployments, especially in complex environments.

1. Multi-Cluster & Hybrid Environment Support

The Gateway API specification includes concepts like ReferenceGrant for cross-namespace and future cross-cluster routing. In practice, very few providers have robust, production-ready multi-cluster stories. Most implementations assume a single cluster. If your architecture spans multiple clusters (for isolation, geography, or failure domains), you will likely need to:

  • Manage separate Gateway resources per cluster.
  • Use an external global load balancer (like a cloud DNS/GSLB) to distribute traffic across cluster-specific gateways.

This negates some of the API’s promise of a unified, abstracted configuration.
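
As a reference, the ReferenceGrant mentioned above handles cross-namespace references within a single cluster. A minimal sketch, assuming an HTTPRoute in an apps namespace that needs to target a backend Service owned by a data namespace:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-routes-from-apps
  namespace: data                 # the namespace that owns the referenced Service
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: apps             # where the referencing HTTPRoute lives
  to:
    - group: ""
      kind: Service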

2. Policy Attachment and Extension Consistency

Gateway API is designed to be extended through policy attachment (e.g., for rate limiting, WAF rules, authentication). There is no standard for how these policies are implemented. One provider might use a custom RateLimitPolicy CRD, while another might rely on annotations or a separate policy engine. This creates massive configuration drift and vendor lock-in, breaking the portability goal.

3. Observability and Debugging Interfaces

While the API defines status fields, the richness of operational data—detailed error logs, granular metrics tied to API resources, distributed tracing integration—varies wildly. Some providers expose deep integration with their monitoring stack; others offer minimal visibility. You must verify that the provider’s observability model meets your SRE team’s needs.

Evaluation Framework: Questions for Your Team

Before selecting a provider, work through this technical checklist:

  1. Route Requirements: Do we need stable support for HTTP only, or also gRPC, TCP, UDP? Is beta support acceptable for non-HTTP routes?
  2. Infrastructure Model: Do we want a cloud-managed load balancer (simpler, less control) or a cluster-based controller (more portable, more operational overhead)?
  3. Multi-Cluster Future: Is our architecture single-cluster today but likely to expand? Does the provider have a credible roadmap for multi-cluster Gateway API?
  4. Policy Needs: What advanced policies (auth, WAF, rate limiting) are required? How does the provider implement them? Can we live with vendor-specific policy CRDs?
  5. Observe & Debug: What logging, metrics, and tracing are exposed for Gateway API resources? Do they integrate with our existing observability platform?
  6. Upgrade Path: What is the provider’s track record for supporting new Gateway API releases? How painful are version upgrades?

Strategic Recommendations

Based on the current landscape, here are pragmatic paths forward:

  • For Single-Cloud Deployments: Start with your cloud provider’s native controller (AWS, GKE, AKS). It’s the path of least resistance and best integration with other cloud services (IAM, certificates, monitoring). Just be acutely aware of its specific limitations regarding unsupported route types.
  • For Hybrid/Multi-Cloud or On-Premises: Standardize on a portable, cluster-based controller such as NGINX or Kong. The consistency across environments will save significant operational complexity, even if it means forgoing some cloud-native integrations.
  • For Greenfield Projects: Design your applications and configurations against the stable v1 resources (Gateway, HTTPRoute) only. Treat any use of beta/experimental resources as a known risk that may require refactoring later.
  • Always Have an Exit Plan: Isolate Gateway API configuration YAMLs from provider-specific policies and annotations. This modularity will make migration less painful when the next generation of providers emerges or when you need to switch.
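
One way to make the exit-plan recommendation concrete is the repository layout itself. A sketch of one possible Kustomize-style structure (directory and file names are hypothetical) that keeps portable Gateway API resources apart from provider-specific policies:

gateway-config/
├── base/                        # portable, Standard-channel resources only
│   ├── gateway.yaml             # Gateway and GatewayClass reference
│   ├── httproutes.yaml          # HTTPRoute definitions
│   └── kustomization.yaml
└── overlays/
    ├── aws/                     # provider-specific pieces live here
    │   ├── target-group-policy.yaml
    │   └── kustomization.yaml
    └── gke/
        ├── backend-config.yaml
        └── kustomization.yaml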

The Gateway API’s evolution is a net positive for the Kubernetes ecosystem, offering a far more expressive model than the original Ingress. However, in 2026, the provider landscape is still maturing. Support is broad but not deep, and critical gaps in multi-cluster management and policy portability remain. The successful architect will choose a provider not based on a feature checklist, but based on how well its specific constraints and capabilities align with their organization’s immediate traffic patterns and long-term platform strategy. The era of a universal, write-once-run-anywhere Gateway API configuration is not yet here—but with careful, informed provider selection, you can build a robust foundation for it.

Kubernetes Housekeeping: How to Clean Up Orphaned ConfigMaps and Secrets

Kubernetes Housekeeping: How to Clean Up Orphaned ConfigMaps and Secrets

If you’ve been running Kubernetes clusters for any meaningful amount of time, you’ve likely encountered a familiar problem: orphaned ConfigMaps and Secrets piling up in your namespaces. These abandoned resources don’t just clutter your cluster—they introduce security risks, complicate troubleshooting, and can even impact cluster performance as your resource count grows.

The reality is that Kubernetes doesn’t automatically clean up ConfigMaps and Secrets when the workloads that reference them are deleted. This gap in Kubernetes’ native garbage collection creates a housekeeping problem that every production cluster eventually faces. In this article, we’ll explore why orphaned resources happen, how to detect them, and most importantly, how to implement sustainable cleanup strategies that prevent them from accumulating in the first place.

Understanding the Orphaned Resource Problem

What Are Orphaned ConfigMaps and Secrets?

Orphaned ConfigMaps and Secrets are configuration resources that no longer have any active references from Pods, Deployments, StatefulSets, or other workload resources in your cluster. They typically become orphaned when:

  • Applications are updated and new ConfigMaps are created while old ones remain
  • Deployments are deleted but their associated configuration resources aren’t
  • Failed rollouts leave behind unused configuration versions
  • Development and testing workflows create temporary resources that never get cleaned up
  • CI/CD pipelines generate unique ConfigMap names (often with hash suffixes) on each deployment

Why This Matters for Production Clusters

While a few orphaned ConfigMaps might seem harmless, the problem compounds over time and introduces real operational challenges:

Security Risks: Orphaned Secrets can contain outdated credentials, API keys, or certificates that should no longer be accessible. If these aren’t removed, they remain attack vectors for unauthorized access—especially problematic if RBAC policies grant broad read access to Secrets within a namespace.

Cluster Bloat: Kubernetes stores these resources in etcd, your cluster’s backing store. As the number of orphaned resources grows, etcd size increases, potentially impacting cluster performance and backup times. In extreme cases, this can contribute to etcd performance degradation or even hit storage quotas.

Operational Complexity: When troubleshooting issues or reviewing configurations, sifting through dozens of unused ConfigMaps makes it harder to identify which resources are actually in use. This “configuration noise” slows down incident response and increases cognitive load for your team.

Cost Implications: While individual ConfigMaps are small, at scale they contribute to storage costs and can trigger alerts in cost monitoring systems, especially in multi-tenant environments where resource quotas matter.

Detecting Orphaned ConfigMaps and Secrets

Before you can clean up orphaned resources, you need to identify them. Let’s explore both manual detection methods and automated tooling approaches.

Manual Detection with kubectl

The simplest approach uses kubectl to cross-reference ConfigMaps and Secrets against active workload resources. Here’s a basic script to identify potentially orphaned ConfigMaps:

#!/bin/bash
# detect-orphaned-configmaps.sh
# Identifies ConfigMaps not referenced by any active Pods

NAMESPACE=${1:-default}

echo "Checking for orphaned ConfigMaps in namespace: $NAMESPACE"
echo "---"

# Get all ConfigMaps in the namespace
CONFIGMAPS=$(kubectl get configmaps -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for cm in $CONFIGMAPS; do
    # Skip kube-root-ca.crt as it's system-managed
    if [[ "$cm" == "kube-root-ca.crt" ]]; then
        continue
    fi

    # Check if any Pod references this ConfigMap
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg cm "$cm" '.items[] |
        select(
            (.spec.volumes[]?.configMap.name == $cm) or
            (.spec.containers[].env[]?.valueFrom.configMapKeyRef.name == $cm) or
            (.spec.containers[].envFrom[]?.configMapRef.name == $cm)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $cm"
    fi
done

A similar script for Secrets would look like this:

#!/bin/bash
# detect-orphaned-secrets.sh

NAMESPACE=${1:-default}

echo "Checking for orphaned Secrets in namespace: $NAMESPACE"
echo "---"

SECRETS=$(kubectl get secrets -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for secret in $SECRETS; do
    # Skip service account tokens and system secrets
    SECRET_TYPE=$(kubectl get secret $secret -n $NAMESPACE -o jsonpath='{.type}')
    if [[ "$SECRET_TYPE" == "kubernetes.io/service-account-token" ]]; then
        continue
    fi

    # Check if any Pod references this Secret
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg secret "$secret" '.items[] |
        select(
            (.spec.volumes[]?.secret.secretName == $secret) or
            (.spec.containers[].env[]?.valueFrom.secretKeyRef.name == $secret) or
            (.spec.containers[].envFrom[]?.secretRef.name == $secret) or
            (.spec.imagePullSecrets[]?.name == $secret)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $secret"
    fi
done

Important caveat: These scripts only check currently running Pods. They won’t catch ConfigMaps or Secrets referenced by Deployments, StatefulSets, or DaemonSets that might currently have zero replicas. For production use, you’ll want to check against all workload resource types.
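
A hedged extension of the same idea, checking Deployments, StatefulSets, and DaemonSets instead of only running Pods, could look like the fragment below. It reuses the NAMESPACE and CONFIGMAPS variables from the script above and assumes jq is available:

# ConfigMap names referenced by Deployments, StatefulSets and DaemonSets in the namespace
WORKLOAD_REFERENCED=$(kubectl get deployments,statefulsets,daemonsets -n "$NAMESPACE" -o json | \
    jq -r '.items[].spec.template.spec |
    [.volumes[]?.configMap.name,
     .containers[].env[]?.valueFrom.configMapKeyRef.name,
     .containers[].envFrom[]?.configMapRef.name] |
    .[] | select(. != null)' | sort -u)

for cm in $CONFIGMAPS; do
    # A ConfigMap is a cleanup candidate only if no workload template references it
    if ! echo "$WORKLOAD_REFERENCED" | grep -qx "$cm"; then
        echo "Not referenced by any Deployment/StatefulSet/DaemonSet: $cm"
    fi
done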

Automated Detection with Specialized Tools

Several open-source tools have emerged to solve this problem more comprehensively:

Kor: Comprehensive Unused Resource Detection

Kor is a purpose-built tool for finding unused resources across your Kubernetes cluster. It checks not just ConfigMaps and Secrets, but also PVCs, Services, and other resource types.

# Install Kor
brew install kor

# Scan for unused ConfigMaps and Secrets
kor all --namespace production --output json

# Check specific resource types
kor configmap --namespace production
kor secret --namespace production --exclude-namespaces kube-system,kube-public

Kor works by analyzing resource relationships and identifying anything without dependent objects. It’s particularly effective because it understands Kubernetes resource hierarchies and checks against Deployments, StatefulSets, and DaemonSets—not just running Pods.

Popeye: Cluster Sanitization Reports

Popeye scans your cluster and generates reports on resource health, including orphaned resources. While broader in scope than just ConfigMap cleanup, it provides valuable context:

# Install Popeye
brew install derailed/popeye/popeye

# Scan cluster
popeye --output json --save

# Focus on specific namespace
popeye --namespace production

Custom Controllers with Kubernetes APIs

For more sophisticated detection, you can build custom controllers using client-go that continuously monitor for orphaned resources. This approach works well when integrated with your existing observability stack:

// Pseudocode example
func detectOrphanedConfigMaps(namespace string) []string {
    configMaps := listConfigMaps(namespace)
    deployments := listDeployments(namespace)
    statefulSets := listStatefulSets(namespace)
    daemonSets := listDaemonSets(namespace)

    referenced := make(map[string]bool)

    // Check all workload types for ConfigMap references
    for _, deploy := range deployments {
        for _, cm := range getReferencedConfigMaps(deploy) {
            referenced[cm] = true
        }
    }
    // ... repeat for other workload types

    orphaned := []string{}
    for _, cm := range configMaps {
        if !referenced[cm.Name] {
            orphaned = append(orphaned, cm.Name)
        }
    }

    return orphaned
}

Prevention Strategies: Stop Orphans Before They Start

The best cleanup strategy is prevention. By implementing proper resource management patterns from the beginning, you can minimize orphaned resources in the first place.

Use Owner References for Automatic Cleanup

Kubernetes provides a built-in mechanism for resource lifecycle management through owner references. When properly configured, child resources are automatically deleted when their owner is removed.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      uid: d9607e19-f88f-11e6-a518-42010a800195
      controller: true
      blockOwnerDeletion: true
data:
  app.properties: |
    database.url=postgres://db:5432

In practice, owner references are usually set by controllers and operators that create ConfigMaps on your behalf rather than written by hand. Combined with deployment tooling that prunes resources it no longer manages, this is one reason GitOps workflows tend to have fewer orphaned resources than imperative deployment approaches.

Implement Consistent Labeling Standards

Labels make it much easier to identify resource relationships and track ownership:

apiVersion: v1
kind: ConfigMap
metadata:
  name: api-gateway-config-v2
  labels:
    app: api-gateway
    component: configuration
    version: v2
    managed-by: argocd
    owner: platform-team
data:
  config.yaml: |
    # configuration here

With consistent labeling, you can easily query for ConfigMaps associated with specific applications:

# Find all ConfigMaps for a specific app
kubectl get configmaps -l app=api-gateway

# Clean up old versions
kubectl delete configmaps -l app=api-gateway,version=v1

Adopt GitOps Practices

GitOps tools like ArgoCD and Flux excel at preventing orphaned resources because they maintain a clear desired state:

  • Declarative management: All resources are defined in Git
  • Automatic pruning: Tools can detect and remove resources not defined in Git
  • Audit trail: Git history shows when and why resources were created or deleted

ArgoCD’s sync policies can automatically prune resources:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
spec:
  syncPolicy:
    automated:
      prune: true  # Remove resources not in Git
      selfHeal: true

Use Kustomize ConfigMap Generators with Hashes

Kustomize’s ConfigMap generator feature appends content hashes to ConfigMap names, ensuring that configuration changes trigger new ConfigMaps:

# kustomization.yaml
configMapGenerator:
  - name: app-config
    files:
      - config.properties
generatorOptions:
  disableNameSuffixHash: false  # Include hash in name

This creates ConfigMaps like app-config-dk9g72hk5f. When you update the configuration, Kustomize creates a new ConfigMap with a different hash. Combined with kubectl apply --prune, old ConfigMaps are automatically removed:

kubectl apply --prune -k ./overlays/production \
  -l app=myapp

Set Resource Quotas

While quotas don’t prevent orphans, they create backpressure that forces teams to clean up:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: config-quota
  namespace: production
spec:
  hard:
    configmaps: "50"
    secrets: "50"

When teams hit quota limits, they’re incentivized to audit and remove unused resources.

Cleanup Strategies for Existing Orphaned Resources

For clusters that already have accumulated orphaned ConfigMaps and Secrets, here are practical cleanup approaches.

One-Time Manual Cleanup

For immediate cleanup, combine detection scripts with kubectl delete:

# Dry run first - review what would be deleted
./detect-orphaned-configmaps.sh production > orphaned-cms.txt
cat orphaned-cms.txt

# Manual review and cleanup
for cm in $(cat orphaned-cms.txt | grep "Orphaned:" | awk '{print $2}'); do
    kubectl delete configmap $cm -n production
done

Critical warning: Always do a dry run and manual review first. Some ConfigMaps might be referenced by workloads that aren’t currently running but will scale up later (HPA scaled to zero, CronJobs, etc.).
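
CronJobs are particularly easy to miss because their Pods only exist while a Job runs. A small, hedged check (again relying on jq) lists the ConfigMaps referenced from CronJob pod templates so you can exclude them from any deletion list:

# ConfigMaps referenced by CronJob pod templates in the production namespace
kubectl get cronjobs -n production -o json | \
    jq -r '.items[].spec.jobTemplate.spec.template.spec |
    [.volumes[]?.configMap.name,
     .containers[].env[]?.valueFrom.configMapKeyRef.name,
     .containers[].envFrom[]?.configMapRef.name] |
    .[] | select(. != null)' | sort -u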

Scheduled Cleanup with CronJobs

For ongoing maintenance, deploy a Kubernetes CronJob that runs cleanup scripts periodically:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: configmap-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"  # Weekly at 2 AM Sunday
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Cleanup script here
              echo "Starting ConfigMap cleanup..."

              for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
                echo "Checking namespace: $ns"

                # Get all workload-referenced ConfigMaps
                REFERENCED_CMS=$(kubectl get deploy,sts,ds -n $ns -o json | \
                  jq -r '.items[].spec.template.spec |
                  [.volumes[]?.configMap.name,
                   .containers[].env[]?.valueFrom.configMapKeyRef.name,
                   .containers[].envFrom[]?.configMapRef.name] |
                  .[] | select(. != null)' | sort -u)

                ALL_CMS=$(kubectl get cm -n $ns -o jsonpath='{.items[*].metadata.name}')

                for cm in $ALL_CMS; do
                  if [[ "$cm" == "kube-root-ca.crt" ]]; then
                    continue
                  fi

                  if ! echo "$REFERENCED_CMS" | grep -q "^$cm$"; then
                    echo "Deleting orphaned ConfigMap: $cm in namespace: $ns"
                    kubectl delete cm $cm -n $ns
                  fi
                done
              done
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cleanup-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cleanup-role
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets", "namespaces"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cleanup-binding
subjects:
- kind: ServiceAccount
  name: cleanup-sa
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cleanup-role
  apiGroup: rbac.authorization.k8s.io

Security consideration: This CronJob needs cluster-wide permissions to read workloads and delete ConfigMaps. Review and adjust the RBAC permissions based on your security requirements. Consider limiting to specific namespaces if you don’t need cluster-wide cleanup.
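
If cluster-wide delete rights are more than you are comfortable granting, a namespaced variant is straightforward. A sketch that scopes the same permissions to a single namespace (a hypothetical production namespace here) using Role and RoleBinding instead of their cluster-wide counterparts:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cleanup-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cleanup-binding
  namespace: production
subjects:
- kind: ServiceAccount
  name: cleanup-sa
  namespace: kube-system
roleRef:
  kind: Role
  name: cleanup-role
  apiGroup: rbac.authorization.k8s.io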

Integration with CI/CD Pipelines

Build cleanup into your deployment workflows. Here’s an example GitLab CI job:

cleanup_old_configs:
  stage: post-deploy
  image: bitnami/kubectl:latest
  script:
    - |
      # Delete ConfigMaps with old version labels after successful deployment
      kubectl delete configmap -n production \
        -l app=myapp,version!=v${CI_COMMIT_TAG}

    - |
      # Keep only the last 3 ConfigMap versions by timestamp
      kubectl get configmap -n production \
        -l app=myapp \
        --sort-by=.metadata.creationTimestamp \
        -o name | head -n -3 | xargs -r kubectl delete -n production
  only:
    - tags
  when: on_success

Safe Deletion Practices

When cleaning up ConfigMaps and Secrets, follow these safety guidelines:

  1. Dry run first: Always review what will be deleted before executing
  2. Backup before deletion: Export resources to YAML files before removing them
  3. Check age: Only delete resources older than a certain threshold (e.g., 30 days)
  4. Exclude system resources: Skip kube-system, kube-public, and other system namespaces
  5. Monitor for impact: Watch application metrics after cleanup to ensure nothing broke

Example backup and conditional deletion:

# Backup before deletion
kubectl get configmap -n production -o yaml > cm-backup-$(date +%Y%m%d).yaml

# Only delete ConfigMaps older than 30 days
kubectl get configmap -n production -o json | \
  jq -r --arg date "$(date -d '30 days ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.items[] | select(.metadata.creationTimestamp < $date) | .metadata.name' | \
  while read cm; do
    echo "Would delete: $cm (created: $(kubectl get cm $cm -n production -o jsonpath='{.metadata.creationTimestamp}'))"
    # Uncomment to actually delete:
    # kubectl delete configmap $cm -n production
  done

Advanced Patterns for Large-Scale Clusters

For organizations running multiple clusters or large multi-tenant platforms, housekeeping requires more sophisticated approaches.

Policy-Based Cleanup with OPA Gatekeeper

Use OPA Gatekeeper to enforce ConfigMap lifecycle policies at admission time:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: configmaprequiredlabels
spec:
  crd:
    spec:
      names:
        kind: ConfigMapRequiredLabels
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package configmaprequiredlabels

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["app"]
          msg := "ConfigMaps must have an 'app' label for lifecycle tracking"
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["owner"]
          msg := "ConfigMaps must have an 'owner' label for lifecycle tracking"
        }

This policy prevents ConfigMaps without proper labels from being created, making future tracking and cleanup much easier.
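
Note that the ConstraintTemplate only defines the rule; to enforce it you also create a Constraint of the generated kind. A minimal sketch that applies it to ConfigMaps cluster-wide, with system-namespace exclusions as an assumption you would adjust:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: ConfigMapRequiredLabels
metadata:
  name: configmaps-must-have-lifecycle-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["ConfigMap"]
    excludedNamespaces:
      - kube-system
      - kube-public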

Centralized Monitoring with Prometheus

Monitor orphaned resource metrics across your clusters:

apiVersion: v1
kind: ConfigMap
metadata:
  name: orphan-detection-exporter
data:
  script.sh: |
    #!/bin/bash
    # Expose metrics for Prometheus scraping
    while true; do
      echo "# HELP k8s_orphaned_configmaps Number of orphaned ConfigMaps"
      echo "# TYPE k8s_orphaned_configmaps gauge"

      for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
        count=$(./detect-orphaned-configmaps.sh $ns | grep -c "Orphaned:")
        echo "k8s_orphaned_configmaps{namespace=\"$ns\"} $count"
      done

      sleep 300  # Update every 5 minutes
    done

Create alerts when orphaned resource counts exceed thresholds:

groups:
- name: kubernetes-housekeeping
  rules:
  - alert: HighOrphanedConfigMapCount
    expr: k8s_orphaned_configmaps > 20
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: "High number of orphaned ConfigMaps in {{ $labels.namespace }}"
      description: "Namespace {{ $labels.namespace }} has {{ $value }} orphaned ConfigMaps"

Multi-Cluster Cleanup with Crossplane or Cluster API

For platform teams managing dozens or hundreds of clusters, extend cleanup automation across your entire fleet:

# Crossplane Composition for cluster-wide cleanup
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: cluster-cleanup-policy
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1
    kind: ClusterCleanupPolicy
  resources:
    - name: cleanup-cronjob
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1
        kind: Object
        spec:
          forProvider:
            manifest:
              apiVersion: batch/v1
              kind: CronJob
              # ... CronJob spec from earlier

Housekeeping Checklist for Production Clusters

Here’s a practical checklist to implement sustainable ConfigMap and Secret housekeeping:

Immediate Actions:

  • [ ] Run detection scripts to audit current orphaned resource count
  • [ ] Backup all ConfigMaps and Secrets before any cleanup
  • [ ] Manually review and delete obvious orphans (with team approval)
  • [ ] Document which ConfigMaps/Secrets are intentionally unused but needed

Short-term (1-4 weeks):

  • [ ] Implement consistent labeling standards across teams
  • [ ] Add owner references to all ConfigMaps and Secrets
  • [ ] Deploy scheduled CronJob for automated detection and reporting
  • [ ] Integrate cleanup steps into CI/CD pipelines

Long-term (1-3 months):

  • [ ] Adopt GitOps tooling (ArgoCD, Flux) with automated pruning
  • [ ] Implement OPA Gatekeeper policies for required labels
  • [ ] Set up Prometheus monitoring for orphaned resource metrics
  • [ ] Create runbooks for incident responders
  • [ ] Establish resource quotas per namespace
  • [ ] Conduct quarterly cluster hygiene reviews

Ongoing Practices:

  • [ ] Review orphaned resource reports weekly
  • [ ] Include cleanup tasks in sprint planning
  • [ ] Train new team members on resource lifecycle best practices
  • [ ] Update cleanup automation as cluster architecture evolves

Conclusion

Kubernetes doesn’t automatically clean up orphaned ConfigMaps and Secrets, but with the right strategies, you can prevent them from becoming a problem. The key is implementing a layered approach: use owner references and GitOps for prevention, deploy automated detection for ongoing monitoring, and run scheduled cleanup jobs for maintenance.

Start with detection to understand your current situation, then focus on prevention strategies like owner references and consistent labeling. For existing clusters with accumulated orphaned resources, implement gradual cleanup with proper safety checks rather than aggressive bulk deletion.

Remember that housekeeping isn’t a one-time task—it’s an ongoing operational practice. By building cleanup into your CI/CD pipelines and establishing clear resource ownership, you’ll maintain a clean, secure, and performant Kubernetes environment over time.

The tools and patterns we’ve covered here—from simple bash scripts to sophisticated policy engines—can be adapted to your organization’s scale and maturity level. Whether you’re managing a single cluster or a multi-cluster platform, investing in proper resource lifecycle management pays dividends in operational efficiency, security posture, and team productivity.

Frequently Asked Questions (FAQ)

Can Kubernetes automatically delete unused ConfigMaps and Secrets?

No. Kubernetes does not garbage-collect ConfigMaps or Secrets by default when workloads are deleted. Unless they have ownerReferences set, these resources remain in the cluster indefinitely and must be cleaned up manually or via automation.

Is it safe to delete ConfigMaps or Secrets that are not referenced by running Pods?

Not always. Some resources may be referenced by workloads scaled to zero, CronJobs, or future rollouts. Always perform a dry run, check workload definitions (Deployments, StatefulSets, DaemonSets), and review resource age before deletion.

What is the safest way to prevent orphaned ConfigMaps and Secrets?

The most effective prevention strategies are:
  • Using ownerReferences so dependent resources are garbage-collected with their owner
  • Adopting GitOps with pruning enabled (ArgoCD / Flux)
  • Applying consistent labeling (app, owner, version)

These ensure unused resources are automatically detected and removed.

Which tools are best for detecting orphaned resources?

Popular and reliable tools include:
  • Kor – purpose-built for detecting unused Kubernetes resources
  • Popeye – broader cluster hygiene and sanitization reports
  • Custom scripts/controllers – useful for tailored environments or integrations

For production clusters, Kor provides the best signal-to-noise ratio.

How often should ConfigMap and Secret cleanup run in production?

A common best practice is:
  • Weekly detection (reporting only)
  • Monthly cleanup for resources older than a defined threshold (e.g. 30–60 days)
  • Immediate cleanup integrated into CI/CD after successful deployments

This balances safety with long-term cluster hygiene.

Kubernetes Gateway API Versions: Complete Compatibility and Upgrade Guide

Kubernetes Gateway API Versions: Complete Compatibility and Upgrade Guide

The Kubernetes Gateway API has rapidly evolved from its experimental roots to become the standard for ingress and service mesh traffic management. But with multiple versions released and various maturity levels, understanding which version to use, how it relates to your Kubernetes cluster, and when to upgrade can be challenging.

In this comprehensive guide, we’ll explore the different Gateway API versions, their relationship to Kubernetes releases, provider support levels, and the upgrade philosophy that will help you make informed decisions for your infrastructure.

Understanding Gateway API Versioning

The Gateway API follows a unique versioning model that differs from standard Kubernetes APIs. Unlike built-in Kubernetes resources that are tied to specific cluster versions, Gateway API CRDs can be installed independently as long as your cluster meets the minimum requirements.

Minimum Kubernetes Version Requirements

As of Gateway API v1.1 and later versions, you need Kubernetes 1.26 or later to run the latest Gateway API releases. The API commits to supporting a minimum of the most recent 5 Kubernetes minor versions, providing a reasonable window for cluster upgrades.

This rolling support window means that if you’re running Kubernetes 1.26, 1.27, 1.28, 1.29, or 1.30, you can safely install and use the latest Gateway API without concerns about compatibility.

Release Channels: Standard vs Experimental

Gateway API uses two distinct release channels to balance stability with innovation. Understanding these channels is critical for choosing the right version for your use case.

Standard Channel

The Standard channel contains only GA (Generally Available, v1) and Beta (v1beta1) level resources and fields. When you install from the Standard channel, you get:

  • Stability guarantees: No breaking changes once a resource reaches Beta or GA
  • Backwards compatibility: Safe to upgrade between minor versions
  • Production readiness: Extensively tested features with multiple implementations
  • Conformance coverage: Full test coverage ensuring portability

Resources in the Standard channel include GatewayClass, Gateway, and HTTPRoute at v1, along with ReferenceGrant (still v1beta1) and GRPCRoute, which graduated to v1 in the v1.1 release.

Experimental Channel

The Experimental channel includes everything from the Standard channel plus Alpha-level resources and experimental fields. This channel is for:

  • Early feature testing: Try new capabilities before they stabilize
  • Cutting-edge functionality: Access the latest Gateway API innovations
  • No stability guarantees: Breaking changes can occur between releases
  • Feature feedback: Help shape the API by testing experimental features

Features may graduate from Experimental to Standard or be dropped entirely based on implementation experience and community feedback.
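
Both channels ship as plain CRD manifests attached to each GitHub release, so choosing a channel is just a matter of which bundle you apply. For example (the version is pinned here purely for illustration):

# Standard channel: GA and Beta resources only
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

# Experimental channel: adds Alpha resources and experimental fields
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml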

Gateway API Version History and Features

Let’s explore the major Gateway API releases and what each introduced.

v1.0 (October 2023)

The v1.0 release marked a significant milestone, graduating core resources to GA status. This release included:

  • Gateway, GatewayClass, and HTTPRoute at v1 (stable)
  • Full backwards compatibility guarantees for v1 resources
  • Production-ready status for ingress traffic management
  • Multiple conformant implementations across vendors

v1.1 (May 2024)

Version 1.1 expanded the API significantly with service mesh support:

  • GRPCRoute: Native support for gRPC traffic routing
  • Service mesh capabilities: East-west traffic management alongside north-south
  • Multiple implementations: Both Istio and other service meshes achieved conformance
  • Enhanced features: Additional matching criteria and routing capabilities

This version bridged the gap between traditional ingress controllers and full service mesh implementations.

v1.2 and v1.3

These intermediate releases introduced structured release cycles and additional features:

  • Refined conformance testing
  • BackendTLSPolicy (experimental in v1.3)
  • Enhanced observability and debugging capabilities
  • Improved cross-namespace routing

v1.4 (October 2025)

The latest GA release as of this writing, v1.4.0, brought:

  • Continued API refinement
  • Additional experimental features for community testing
  • Enhanced conformance profiles
  • Improved documentation and migration guides

Kubernetes Version Compatibility Matrix

Here’s how Gateway API versions relate to Kubernetes releases:

Gateway API Version | Minimum Kubernetes | Recommended Kubernetes | Release Date
v1.0.x              | 1.25               | 1.26+                  | October 2023
v1.1.x              | 1.26               | 1.27+                  | May 2024
v1.2.x              | 1.26               | 1.28+                  | 2024
v1.3.x              | 1.26               | 1.29+                  | 2024
v1.4.x              | 1.26               | 1.30+                  | October 2025

The key takeaway: Gateway API v1.1 and later all support Kubernetes 1.26+, meaning you can run the latest Gateway API on any reasonably modern cluster.

Gateway Provider Support Levels

Different Gateway API implementations support various versions and feature sets. Understanding provider support helps you choose the right implementation for your needs.

Conformance Levels

Gateway API defines three conformance levels for features:

  1. Core: Features that must be supported for an implementation to claim conformance. These are portable across all implementations.
  2. Extended: Standardized optional features. Implementations indicate Extended support separately from Core.
  3. Implementation-specific: Vendor-specific features without conformance requirements.

Major Provider Support

Istio

Istio reached Gateway API GA support in version 1.22 (May 2024). Istio provides:

  • Full Standard channel support (v1 resources)
  • Service mesh (east-west) traffic management via GAMMA
  • Ingress (north-south) traffic control
  • Experimental support for BackendTLSPolicy (Istio 1.26+)

Istio is particularly strong for organizations needing both ingress and service mesh capabilities in a single solution.

Envoy Gateway

Envoy Gateway tracks Gateway API releases closely. Version 1.4.0 includes:

  • Gateway API v1.3.0 support
  • Compatibility matrix for Envoy Proxy versions
  • Focus on ingress use cases
  • Strong experimental feature adoption

Check the Envoy Gateway compatibility matrix to ensure your Envoy Proxy version aligns with your Gateway API and Kubernetes versions.

Cilium

Cilium integrates Gateway API deeply with its CNI implementation:

  • Per-node Envoy proxy architecture
  • Network policy enforcement for Gateway traffic
  • Both ingress and service mesh support
  • eBPF-based packet processing

Cilium’s unique architecture makes it a strong choice for organizations already using Cilium for networking.

Contour

Contour v1.31.0 implements Gateway API v1.2.1, supporting:

  • All Standard channel v1 resources
  • Most v1alpha2 resources (TLSRoute, TCPRoute, GRPCRoute)
  • BackendTLSPolicy support

Checking Provider Conformance

To verify which Gateway API version and features your provider supports:

  1. Visit the official implementations page: The Gateway API project maintains a comprehensive list of implementations with their conformance levels.
  2. Check provider documentation: Most providers publish compatibility matrices showing Gateway API, Kubernetes, and proxy version relationships.
  3. Review conformance reports: Providers submit conformance test results that detail exactly which Core and Extended features they support.
  4. Test in non-production: Before upgrading production, validate your specific use cases in a staging environment.

Upgrade Philosophy: When and How to Upgrade

One of the most common questions about Gateway API is: “Do I need to run the latest version?” The answer depends on your specific needs and risk tolerance.

Staying on Older Versions

You don’t need to always run the latest Gateway API version. It’s perfectly acceptable to:

  • Stay on an older stable release if it meets your needs
  • Upgrade only when you need specific new features
  • Wait for your Gateway provider to officially support newer versions
  • Maintain stability over having the latest features

The Standard channel’s backwards compatibility guarantees mean that when you do upgrade, your existing configurations will continue to work.

When to Consider Upgrading

Consider upgrading when:

  1. You need a specific feature: A new HTTPRoute matcher, GRPCRoute support, or other functionality only available in newer versions
  2. Your provider recommends it: Gateway providers often optimize for specific Gateway API versions
  3. Security considerations: While rare, security issues could prompt upgrades
  4. Kubernetes cluster upgrades: When upgrading Kubernetes, verify your Gateway API version is compatible with the new cluster version

Safe Upgrade Practices

Follow these best practices for Gateway API upgrades:

1. Stick with Standard Channel

Using Standard channel CRDs makes upgrades simpler and safer. Experimental features can introduce breaking changes, while Standard features maintain compatibility.

2. Upgrade One Minor Version at a Time

While it’s usually safe to skip versions, the most tested upgrade path is incremental. Going from v1.2 to v1.3 to v1.4 is safer than jumping directly from v1.2 to v1.4.

3. Test Before Upgrading

Always test upgrades in non-production environments:

# Install specific Gateway API version in test cluster
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

4. Review Release Notes

Each Gateway API release publishes comprehensive release notes detailing:

  • New features and capabilities
  • Graduation of experimental features to standard
  • Deprecation notices
  • Upgrade considerations

5. Check Provider Compatibility

Before upgrading Gateway API CRDs, verify your Gateway provider supports the target version. Installing Gateway API v1.4 won’t help if your controller only supports v1.2.

6. Never Overwrite Different Channels

Implementations should never overwrite Gateway API CRDs that use a different release channel. Keep track of whether you’re using Standard or Experimental channel installations.

CRD Management Best Practices

Gateway API CRD management requires attention to detail:

# Check currently installed Gateway API version
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/bundle-version'

# Verify which channel is installed
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/channel'

Staying Informed About New Releases

Gateway API releases follow a structured release cycle with clear communication channels.

How to Know When New Versions Are Released

  1. GitHub Releases Page: Watch the kubernetes-sigs/gateway-api repository for release announcements
  2. Kubernetes Blog: Major Gateway API releases are announced on the official Kubernetes blog
  3. Mailing Lists and Slack: Join the Gateway API community channels for discussions and announcements
  4. Provider Announcements: Gateway providers announce support for new Gateway API versions through their own channels

Release Cadence

Gateway API follows a quarterly release schedule for minor versions, with patch releases as needed for bug fixes and security issues. This predictable cadence helps teams plan upgrades.

Practical Decision Framework

Here’s a framework to help you decide which Gateway API version to run:

For New Deployments

  • Production workloads: Use the latest GA version supported by your provider
  • Innovation-focused: Consider Experimental channel if you need cutting-edge features
  • Conservative approach: Use v1.1 or later with Standard channel

For Existing Deployments

  • If things are working: Stay on your current version until you need new features
  • If provider recommends upgrade: Follow provider guidance, especially for security
  • If Kubernetes upgrade planned: Verify compatibility, may need to upgrade Gateway API first or simultaneously

Feature-Driven Upgrades

  • Need service mesh support: Upgrade to v1.1 minimum
  • Need GRPCRoute: Upgrade to v1.1 minimum
  • Need BackendTLSPolicy: Requires v1.3+ and provider support for experimental features

Conclusion

Kubernetes Gateway API represents the future of traffic management in Kubernetes, offering a standardized, extensible, and role-oriented API for both ingress and service mesh use cases. Understanding the versioning model, compatibility requirements, and upgrade philosophy empowers you to make informed decisions that balance innovation with stability.

Key takeaways:

  • Gateway API versions install independently from Kubernetes, requiring only version 1.26 or later for recent releases
  • Standard channel provides stability, Experimental channel provides early access to new features
  • You don’t need to always run the latest version—upgrade when you need specific features
  • Verify provider support before upgrading Gateway API CRDs
  • Follow safe upgrade practices: test first, upgrade incrementally, review release notes

By following these guidelines, you can confidently deploy and maintain Gateway API in your Kubernetes infrastructure while making upgrade decisions that align with your organization’s needs and risk tolerance.

Frequently Asked Questions

What is the difference between Kubernetes Ingress and the Gateway API?

Kubernetes Ingress is a legacy API focused mainly on HTTP(S) traffic with limited extensibility. The Gateway API is its successor, offering a more expressive, role-oriented model that supports multiple protocols, advanced routing, better separation of concerns, and consistent behavior across implementations

Which Gateway API version should I use in production today?

For most production environments, you should use the latest GA (v1.x) release supported by your Gateway provider, installed from the Standard channel. This ensures stability, backwards compatibility, and conformance guarantees while still benefiting from ongoing improvements.

Can I upgrade the Gateway API without upgrading my Kubernetes cluster?

Yes. Gateway API CRDs are installed independently of Kubernetes itself. As long as your cluster meets the minimum supported Kubernetes version (1.26+ for recent releases), you can upgrade the Gateway API without upgrading the cluster.

What happens if my Gateway provider does not support the latest Gateway API version?

If your provider lags behind, you should stay on the latest version officially supported by that provider. Installing newer Gateway API CRDs than your controller supports can lead to missing features or undefined behavior. Provider compatibility should always take precedence over running the newest API version.

Is it safe to upgrade Gateway API CRDs without downtime?

In most cases, yes—when using the Standard channel. The Gateway API provides strong backwards compatibility guarantees for GA and Beta resources. However, you should always test upgrades in a non-production environment and verify that your Gateway provider supports the target version.

FreeLens vs OpenLens vs Lens: Choosing the Right Kubernetes IDE

FreeLens vs OpenLens vs Lens: Choosing the Right Kubernetes IDE

Introduction: When a Tool Choice Becomes a Legal and Platform Decision

If you’ve been operating Kubernetes clusters for a while, you’ve probably learned this the hard way:
tooling decisions don’t stay “just tooling” for long.

What starts as a developer convenience can quickly turn into:

  • a licensing discussion with Legal,
  • a procurement problem,
  • or a platform standard you’re stuck with for years.

The Kubernetes IDE ecosystem is a textbook example of this.

Many teams adopted Lens because it genuinely improved day-to-day operations. Then the license changed (we already covered OpenLens vs Lens in the past). Then restrictions appeared. Then forks started to emerge.

Today, the real question is not “Which one looks nicer?” but:

  • Which one is actually maintained?
  • Which one is safe to use in a company?
  • Why is there a fork of a fork?
  • Are they still technically compatible?
  • What is the real switch cost?

Let’s go through this from a production and platform engineering perspective.

The Forking Story: How We Ended Up Here

Understanding the lineage matters because it explains why FreeLens exists at all.

Lens: The Original Product

Lens started as an open-core Kubernetes IDE with a strong community following. Over time, it evolved into a commercial product with:

  • a proprietary license,
  • paid enterprise features,
  • and restrictions on free usage in corporate environments.

This shift was legitimate from a business perspective, but it broke the implicit contract many teams assumed when they standardized on it.

OpenLens: The First Fork

OpenLens was created to preserve:

  • open-source licensing,
  • unrestricted commercial usage,
  • compatibility with Lens extensions.

For a while, OpenLens was the obvious alternative for teams that wanted to stay open-source without losing functionality.

FreeLens: The Fork of the Fork

FreeLens appeared later, and this is where many people raise an eyebrow.

Why fork OpenLens?

Because OpenLens development started to slow down:

  • release cadence became irregular,
  • upstream Kubernetes changes lagged,
  • governance and long-term stewardship became unclear.

FreeLens exists because some contributors were not willing to bet their daily production tooling on a project with uncertain momentum.

This was not ideology. It was operational risk management.

Are the Projects Still Maintained?

Short answer: yes, but not equally.

Lens

  • Actively developed
  • Backed by a commercial vendor
  • Fast adoption of new Kubernetes features

Trade-off:

  • Licensing constraints
  • Paid features
  • Requires legal review in most companies

OpenLens

  • Still maintained
  • Smaller contributor base
  • Slower release velocity

It works, but it no longer feels like a safe long-term default for platform teams.

FreeLens

  • Actively maintained
  • Explicit focus on long-term openness
  • Prioritizes Kubernetes API compatibility and stability

Right now, FreeLens shows the healthiest balance between maintenance and independence.

Technical Compatibility: Can You Switch Without Pain?

This is the good news: yes, mostly.

Cluster Access and Configuration

All three tools:

  • use standard kubeconfig files,
  • support multiple contexts and clusters,
  • work with RBAC, CRDs, and namespaces the same way.

No cluster-side changes are required.
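
Because all three tools read the standard kubeconfig, whatever kubectl can see is what the IDE will show. A quick sanity check before and after switching:

# The contexts listed here are exactly what Lens, OpenLens, or FreeLens will pick up
kubectl config get-contexts
kubectl config current-context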

Extensions and Plugins

  • Most Lens extensions work in OpenLens.
  • Most OpenLens extensions work in FreeLens.
  • Proprietary Lens-only extensions are the main exception.

In real-world usage:

  • ~90% of common workflows are identical
  • differences show up only in edge cases or paid features

UX Differences

There are some UI differences:

  • branding,
  • menu structure,
  • feature gating in Lens.

Nothing that requires retraining or documentation updates.

Legal and Licensing Considerations (This Is Where It Usually Breaks)

This is often the decisive factor in enterprise environments.

Lens

  • Requires license compliance checks
  • Free usage may violate internal policies
  • Paid plans required for broader adoption

If you operate in a regulated or audited environment, this alone can be a blocker.

OpenLens

  • Open-source license
  • Generally safe for corporate use
  • Slight uncertainty due to reduced activity

FreeLens

  • Explicitly open-source
  • No usage restrictions
  • Clear intent to remain free for commercial use

If Legal asks, “Can we standardize this across the company?”
FreeLens is the easiest answer.

Which One Should You Use in a Company?

A pragmatic recommendation:

Use Lens if:

  • you want vendor-backed support,
  • you are willing to pay,
  • you already standardized on Mirantis tooling.

Use OpenLens if:

  • you are already using it,
  • it meets your needs today,
  • you accept slower updates.

Use FreeLens if:

  • you want zero licensing risk,
  • you want an open-source default,
  • you care about long-term maintenance,
  • you need something you can standardize safely.

For most platform and DevOps teams, FreeLens is currently the lowest-risk choice.

Switch Cost: How Expensive Is It Really?

Surprisingly low.

Typical migration:

  • install the new binary,
  • reuse existing kubeconfigs,
  • reinstall extensions if needed.

What you don’t need:

  • cluster changes,
  • CI/CD modifications,
  • platform refactoring.

Downtime: none
Rollback: trivial

This is one of the rare cases where switching early is cheap.

Is a “Fork of a Fork” a Red Flag?

Normally, yes.

In this case, no.

FreeLens exists because:

  • maintenance mattered more than branding,
  • openness mattered more than monetization,
  • predictability mattered more than roadmap promises.

Ironically, this is very aligned with how Kubernetes itself evolved.

Conclusion: A Clear, Boring, Production-Safe Answer

If you strip away GitHub drama and branding:

  • Lens optimizes for revenue and enterprise features.
  • OpenLens preserved openness but lost momentum.
  • FreeLens optimizes for sustainability and freedom.

From a platform engineering perspective:

FreeLens is the safest default Kubernetes IDE today for most organizations.

Low switch cost, strong compatibility, no legal surprises.

And in production environments, boring and predictable almost always wins.

SoapUI Maven Integration: Automate API Testing with Maven Builds

SoapUI Maven Integration: Automate API Testing with Maven Builds

SoapUI is a popular open-source tool used for testing SOAP and REST APIs. It comes with a user-friendly interface and a variety of features to help you test API requests and responses. In this article, we will explore how to use SoapUI integrated with Maven for automation testing.

Why Use SoapUI with Maven?

Maven is a popular build automation tool that simplifies building and managing Java projects. It is widely used in the industry, and it has many features that make it an ideal choice for automation testing with SoapUI.

By integrating SoapUI with Maven, you can easily run your SoapUI tests as part of your Maven build process. This will help you to automate your testing process, reduce the time required to test your APIs, and ensure that your tests are always up-to-date.

Setting Up SoapUI and Maven

Before we can start using SoapUI with Maven, we must set up both tools on our system. First, download and install SoapUI from the official website. Once SoapUI is installed, we can proceed with installing Maven.

To install Maven, follow these steps:

  1. Download the latest version of Maven from the official website.
  2. Extract the downloaded file to a directory on your system.
  3. Add the bin directory of the extracted folder to your system’s PATH environment variable.
  4. Verify that Maven is installed by opening a terminal or command prompt and running the command mvn -version.

Creating a Maven Project for SoapUI Tests

Now that we have both SoapUI and Maven installed, we can create a Maven project for our SoapUI tests. To create a new Maven project, follow these steps:

  1. Open a terminal or command prompt and navigate to the directory where you want to create your project.
  2. Run the following command: mvn archetype:generate -DgroupId=com.example -DartifactId=my-soapui-project -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
  3. This will create a new Maven project with the group ID com.example and the artifact ID my-soapui-project.

Adding SoapUI Tests to the Maven Project

Now that we have a Maven project, we can add our SoapUI tests to the project. To do this, follow these steps:

  1. Create a new SoapUI project by opening SoapUI and selecting File > New SOAP Project.
  2. Follow the prompts to create a new project, including specifying the WSDL file and endpoint for your API.
  3. Once your project is created, create a new test suite and add your test cases.
  4. Save your SoapUI project.

Next, we need to add our SoapUI project to our Maven project. To do this, follow these steps:

  1. In your Maven project directory, create a new directory called src/test/resources.
  2. Copy your SoapUI project file (.xml) to this directory.
  3. In the pom.xml file of your Maven project, add the following code:
<build>
  <plugins>
    <plugin>
      <groupId>com.smartbear.soapui</groupId>
      <artifactId>soapui-maven-plugin</artifactId>
      <version>5.6.0</version>
      <configuration>
        <projectFile>src/test/resources/my-soapui-project.xml</projectFile>
        <outputFolder>target/surefire-reports</outputFolder>
        <junitReport>true</junitReport>
        <exportAll>true</exportAll>
      </configuration>
      <executions>
        <execution>
          <phase>test</phase>
          <goals>
            <goal>test</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

This code configures the SoapUI Maven plugin to run our SoapUI tests during the test phase of the Maven build process.

Creating Assertions in SoapUI Projects

Now that we have our SoapUI tests added to our Maven project, we can create assertions to validate the responses of our API calls. To create assertions in SoapUI, follow these steps:

  1. Open your SoapUI project and navigate to the test case where you want to create an assertion.
  2. Right-click on the step that you want to validate and select Add Assertion.
  3. Choose the type of assertion that you want to create (e.g. Contains, XPath Match, Valid HTTP Status Codes, etc.).
  4. Configure the assertion according to your needs.
  5. Save your SoapUI project.

Running SoapUI Tests with Assertions Using Maven

Now that we have our SoapUI tests and assertions added to our Maven project, we can run them using Maven. To run your SoapUI tests with Maven and validate the responses using assertions, follow these steps:

  1. Open a terminal or command prompt and navigate to your Maven project directory.
  2. Run the following command: mvn clean test
  3. This will run your SoapUI tests and generate a report in the target/surefire-reports directory of your Maven project.

During the test execution, if any assertion fails, the test will fail and an error message will be displayed in the console. By creating assertions, we can ensure that our API calls are returning the expected responses.

Conclusion

In this article, we have learned how to use SoapUI integrated with Maven for automation testing, including how to create assertions in SoapUI projects. By using these two tools together, we can automate our testing process, reduce the time required to test our APIs, and ensure that our tests are always up-to-date. If you are looking to get started with automation testing using SoapUI and Maven, give this tutorial a try!

Kubernetes Autoscaling 1.26 Explained: HPA v2 Changes and Impact on KEDA

Kubernetes Autoscaling 1.26 Explained: HPA v2 Changes and Impact on KEDA

Introduction

Kubernetes autoscaling has undergone a dramatic change. As of the Kubernetes 1.26 release, HorizontalPodAutoscaler objects should be migrated from the v1 API to the v2 API, which has been available since Kubernetes 1.23.

The HorizontalPodAutoscaler is a crucial component for any workload deployed on a Kubernetes cluster, as scalability is one of the great benefits and key features of this kind of environment.

A little bit of History

Kubernetes introduced an autoscaling capability a long time ago, back in version 1.3, released in 2016. The solution is based on a control loop that runs at a fixed interval, which you can configure with the --horizontal-pod-autoscaler-sync-period flag of the kube-controller-manager.

Once per interval, the controller fetches the metrics and evaluates them against the conditions defined in the HorizontalPodAutoscaler object. Initially, scaling could only be based on the compute resources consumed by the pods: memory and CPU.
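For reference, the interval is a flag on the kube-controller-manager command line (for example, in the static pod manifest of a kubeadm-managed control plane), and its documented default is 15 seconds:
--horizontal-pod-autoscaler-sync-period=15s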

This was an excellent feature, but as time passed and Kubernetes adoption grew, it proved too narrow to handle every scenario. This is where other projects we have discussed here, such as KEDA, come into the picture to provide a much more flexible set of features.

Kubernetes Autoscaling Capabilities Introduced in v2

The v2 release of the autoscaling API introduces a range of capabilities that improve the flexibility and options available. The most relevant ones are the following:

  • Scaling on custom metrics: With the new release, you can configure a HorizontalPodAutoscaler object to scale using custom metrics. Custom metrics here means metrics generated from inside the Kubernetes cluster and associated with Kubernetes objects, rather than the built-in CPU and memory resource metrics. You can find a detailed walkthrough of using custom metrics in the official documentation.
  • Scaling on multiple metrics: With the new release, you also have the option to scale based on more than one metric. The HorizontalPodAutoscaler will evaluate each scaling rule, propose a new scale value for each of them, and take the maximum value as the final one.
  • Support for the Metrics APIs: With the new release, the HorizontalPodAutoscaler controller retrieves metrics from a series of registered APIs, such as metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io. For more information on the different metrics available, you can take a look at the design proposal.
  • Configurable scaling behavior: With the new release, a new field, behavior, allows you to configure how the component behaves when scaling up or down. You can define separate policies for scaling up and scaling down, and limit how many replicas can be added or removed in a given period, which helps with the start-up spikes of some workloads such as Java applications. You can also define a stabilization window to avoid thrashing while the metric is still fluctuating (a sample manifest follows this list).
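As a sample, here is a minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler combining these capabilities: a built-in CPU metric, a hypothetical custom metric served through custom.metrics.k8s.io, and a behavior block. The workload name, metric name, and thresholds are placeholders to adapt to your own deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api                      # hypothetical Deployment to scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Built-in resource metric: keep average CPU utilization around 70%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Hypothetical custom metric exposed via custom.metrics.k8s.io
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      policies:
        - type: Percent
          value: 100                    # at most double the replica count
          periodSeconds: 60             # per one-minute window
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of stable metrics
      policies:
        - type: Pods
          value: 2                      # remove at most 2 pods
          periodSeconds: 60             # per one-minute window
You can apply a manifest like this with kubectl apply and then inspect the computed metrics and scaling events with kubectl describe hpa.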

Kubernetes Autoscaling v2 vs KEDA

We have seen all the new benefits that Autoscaling v2 provides, so I’m sure that most of you are asking the same question: Is Kubernetes Autoscaling v2 killing KEDA?

Recent releases of KEDA already use the new objects under the autoscaling/v2 group, as KEDA relies on the native Kubernetes objects. KEDA also simplifies much of the work required to use custom or external metrics, since it provides scalers for pretty much everything you could need now or even in the future.

Even so, there are still features that KEDA provides that autoscaling/v2 does not cover, such as the scaling "from zero" and "to zero" capabilities, which are very relevant for specific kinds of workloads and for optimizing resource usage. Still, it's safe to say that with the new features included in the autoscaling/v2 release, the gap is now smaller. Depending on your needs, you may be able to rely on the out-of-the-box capabilities without adding a new component to your architecture.
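For comparison, here is a minimal sketch of a KEDA ScaledObject that scales the same hypothetical deployment down to zero replicas based on a Prometheus query; the server address, query, and threshold are placeholders:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-api-scaler
spec:
  scaleTargetRef:
    name: orders-api                    # hypothetical Deployment
  minReplicaCount: 0                    # scale to zero when traffic stops
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        query: sum(rate(http_requests_total{app="orders-api"}[2m]))
        threshold: "100"
Under the hood, KEDA creates and manages an autoscaling/v2 HorizontalPodAutoscaler for the one-to-N part of the range and handles the zero-to-one activation itself.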
