Istio ServiceEntry Explained: External Services, DNS, and Traffic Control

Every production Kubernetes cluster talks to the outside world. Your services call payment APIs, connect to managed databases, push events to SaaS analytics platforms, and reach legacy systems that will never run inside the mesh. By default, Istio lets all outbound traffic flow freely — or blocks it entirely if you flip outboundTrafficPolicy to REGISTRY_ONLY. Neither extreme gives you what you actually need: selective, observable, policy-controlled access to external services.

That is exactly what Istio ServiceEntry solves. It registers external endpoints in the mesh’s internal service registry so that Envoy sidecars can apply the same traffic management, security, and observability features to outbound calls that you already enjoy for east-west traffic. No new proxies, no egress gateways required for the basic case — just a YAML resource that tells the mesh “this external thing exists, and here is how to reach it.”

In this guide, I will walk through every field of the ServiceEntry spec, explain the four DNS resolution modes with real-world use cases, and show production-ready patterns for external APIs, databases, TCP services, and legacy workloads. We will also cover how to combine ServiceEntry with DestinationRule and VirtualService to get circuit breaking, retries, connection pooling, and even sticky sessions for external dependencies.

What Is a ServiceEntry

Istio maintains an internal service registry that merges Kubernetes Services with any additional entries you declare. When a sidecar proxy needs to decide how to route a request, it consults this registry. Services inside the mesh are automatically registered. Services outside the mesh are not — unless you create a ServiceEntry.

A ServiceEntry is a custom resource that adds an entry to the mesh’s service registry. Once registered, the external service becomes a first-class citizen: Envoy generates clusters, routes, and listeners for it, which means you get metrics (istio_requests_total), access logs, distributed traces, mTLS origination, retries, timeouts, circuit breaking — the full Istio feature set.

Without a ServiceEntry, outbound traffic to an external host either passes through as a raw TCP connection (in ALLOW_ANY mode) with no telemetry, or gets dropped with a 502/503 (in REGISTRY_ONLY mode). Both outcomes are undesirable in production. The ServiceEntry bridges that gap.

ServiceEntry Anatomy: All Fields Explained

Let us look at a complete ServiceEntry and then break down each field.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: production
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  exportTo:
    - "."
    - "istio-system"

hosts

A list of hostnames associated with the service. For external services, this is typically the DNS name your application uses (e.g., api.stripe.com). For services using HTTP protocols, the hosts field is matched against the HTTP Host header. For non-HTTP protocols and services without a DNS name, you can use a synthetic hostname and pair it with addresses or static endpoints.

addresses

Optional virtual IP addresses associated with the service. Useful for TCP services where you want to assign a VIP that the sidecar will intercept. Not required for HTTP/HTTPS services that use hostname-based routing.

ports

The ports on which the external service is exposed. Each port needs a number, name, and protocol. The protocol matters: setting it to TLS tells Envoy to perform SNI-based routing without terminating TLS. Setting it to HTTPS means HTTP over TLS. For databases, you’ll typically use TCP.

location

MESH_EXTERNAL or MESH_INTERNAL. Use MESH_EXTERNAL for services outside your cluster (third-party APIs, managed databases). Use MESH_INTERNAL for services inside your infrastructure that are not part of the mesh — for example, VMs running in the same VPC that do not have a sidecar, or a Kubernetes Service in a namespace without injection enabled. The location affects how mTLS is applied and how metrics are labeled.

resolution

How the sidecar resolves the endpoint addresses. This is the most critical field and I will dedicate the next section to it. Options: NONE, STATIC, DNS, DNS_ROUND_ROBIN.

endpoints

An explicit list of network endpoints. Required when resolution is STATIC. Optional with DNS resolution to provide labels or locality information. Each endpoint can have an address, ports, labels, network, locality, and weight.

exportTo

Controls the visibility of this ServiceEntry across namespaces. Use "." for the current namespace only, "*" for all namespaces. In multi-team clusters, restrict exports to avoid namespace pollution.

Resolution Types: NONE vs STATIC vs DNS vs DNS_ROUND_ROBIN

The resolution field determines how Envoy discovers the IP addresses behind the service. Getting this wrong is the number one cause of ServiceEntry misconfigurations. Here is a clear breakdown.

  • NONE: Envoy uses the original destination IP from the connection and performs no DNS lookup of its own. Best for wildcard entries, pass-through scenarios, and services where the application has already resolved the IP.
  • STATIC: Envoy routes to the IPs listed in the endpoints field; no DNS is involved. Best for services with stable, known IPs (e.g., on-prem databases, VMs with fixed IPs).
  • DNS: Envoy resolves the hostname asynchronously and creates an endpoint per returned IP, with health checking per IP. Best for external APIs behind load balancers and managed databases with DNS endpoints (RDS, CloudSQL).
  • DNS_ROUND_ROBIN: Envoy re-resolves the hostname at connection time and uses a single logical endpoint, with no per-IP health checking. Best for simple external services where you do not need per-endpoint circuit breaking.

When to Use NONE

Use NONE when you want to register a range of external IPs or wildcard hosts without Envoy performing any address resolution. This is common for broad egress policies: “allow traffic to *.googleapis.com on port 443.” Envoy will simply forward traffic to whatever IP the application resolved via kube-dns. The downside: Envoy has limited ability to apply per-endpoint policies.
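A minimal sketch of such a broad egress entry (the resource name and namespace are illustrative; the wildcard host and resolution mode follow the pattern just described):

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: google-apis          # illustrative name
  namespace: istio-system
spec:
  hosts:
    - "*.googleapis.com"     # wildcard hosts only work with resolution NONE
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE           # forward to whatever IP the application resolved
```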

When to Use STATIC

Use STATIC when the external service has known, stable IP addresses that rarely change. This avoids DNS dependencies entirely. You define the IPs in the endpoints list. Classic use case: a legacy Oracle database on a fixed IP in your data center.

When to Use DNS

Use DNS for most external API integrations. Envoy performs asynchronous DNS resolution and creates a cluster endpoint for each returned IP address. This enables per-endpoint health checking and circuit breaking — critical for production reliability. This is the mode you want for services like api.stripe.com or your RDS instance endpoint.

When to Use DNS_ROUND_ROBIN

Use DNS_ROUND_ROBIN when the external hostname returns many IPs and you do not need per-IP circuit breaking. Envoy maps this mode to its logical-DNS behavior: the hostname is re-resolved at connection time and treated as a single logical endpoint, so traffic rotates across IPs only to the extent that the DNS server rotates its answers. This is lighter weight than DNS mode and avoids creating a large number of endpoints in Envoy’s cluster configuration.
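A sketch of an entry in this mode (the hostname is hypothetical; only the resolution field differs from a standard DNS entry):

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: cdn-assets                # illustrative name
spec:
  hosts:
    - assets.example-cdn.com      # hypothetical host with many A records
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS_ROUND_ROBIN     # one logical endpoint, re-resolved per connection
```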

Practical Patterns

Pattern 1: External HTTP API (api.stripe.com)

The most common ServiceEntry pattern. Your application calls a third-party HTTPS API. You want Istio telemetry, and optionally retries and timeouts.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS

Note the protocol is TLS, not HTTPS. Since your application initiates the TLS handshake directly, Envoy handles this as opaque TLS using SNI-based routing. If you were terminating TLS at the sidecar and doing TLS origination via a DestinationRule, you would set the protocol to HTTP and handle the upgrade separately — but for most external APIs, let the application manage its own TLS.

Pattern 2: External Managed Database (RDS / CloudSQL)

Managed databases expose a DNS endpoint that resolves to one or more IPs. During failover, the DNS record changes. You need Envoy to respect DNS TTLs and route to the current primary.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: orders-database
  namespace: orders
spec:
  hosts:
    - orders-db.abc123.us-east-1.rds.amazonaws.com
  location: MESH_EXTERNAL
  ports:
    - number: 5432
      name: postgres
      protocol: TCP
  resolution: DNS

For TCP services, Envoy cannot use HTTP headers to route, so it relies on IP-based matching. The DNS resolution mode ensures Envoy periodically re-resolves the hostname and updates its endpoint list. This is critical for RDS multi-AZ failover scenarios where the DNS endpoint flips to a new IP.
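Because managed databases cap their connection counts, it is worth pairing the entry above with a connection-pool DestinationRule. A sketch; the limits and keepalive values are illustrative, not recommendations:

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders-database-dr
  namespace: orders
spec:
  host: orders-db.abc123.us-east-1.rds.amazonaws.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100      # keep below the instance's max_connections
        connectTimeout: 3s
        tcpKeepalive:            # surface dead peers faster after a failover
          time: 300s
          interval: 30s
```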

Pattern 3: Legacy Internal Service Not in the Mesh

You have a monitoring service running on a set of VMs at known IP addresses inside your VPC. It is not part of the mesh, but your meshed services need to talk to it.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-monitoring
  namespace: observability
spec:
  hosts:
    - legacy-monitoring.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.5.10
    - address: 10.0.5.11
    - address: 10.0.5.12

Key differences: location is MESH_INTERNAL because the service lives inside your network, and resolution is STATIC because we know the IPs. The hostname legacy-monitoring.internal is synthetic — your application uses it, and Istio’s DNS proxy (or a CoreDNS entry) resolves it to one of the listed endpoints.

Pattern 4: TCP Services with Multiple Ports

Some external services expose multiple TCP ports — for example, an Elasticsearch cluster with both data (9200) and transport (9300) ports.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-elasticsearch
  namespace: search
spec:
  hosts:
    - es.example.com
  location: MESH_EXTERNAL
  ports:
    - number: 9200
      name: http
      protocol: HTTP
    - number: 9300
      name: transport
      protocol: TCP
  resolution: DNS

Each port gets its own Envoy listener configuration. The HTTP port benefits from full Layer 7 telemetry and traffic management. The TCP port gets Layer 4 metrics and connection-level policies.

Combining ServiceEntry with DestinationRule

A ServiceEntry alone registers the external service. To apply traffic policies — connection pooling, circuit breaking, TLS origination, load balancing — you pair it with a DestinationRule. This is where things get powerful.

Connection Pooling and Circuit Breaking

External APIs have rate limits. Your managed database has a maximum connection count. Protecting these dependencies at the mesh level prevents cascading failures.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: stripe-api-dr
  namespace: payments
spec:
  host: api.stripe.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50
        connectTimeout: 5s
      http:
        h2UpgradePolicy: DO_NOT_UPGRADE
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 100

This configuration caps outbound connections to Stripe at 50, sets a 5-second connection timeout, and ejects endpoints that return 3 consecutive 5xx errors. In production, this prevents a degraded third-party API from consuming all your connection slots and causing a domino effect across your services.

TLS Origination

Sometimes your application speaks plain HTTP, but the external service requires HTTPS. Instead of modifying application code, you can offload TLS origination to the sidecar.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: default
spec:
  hosts:
    - api.external-service.com
  location: MESH_EXTERNAL
  ports:
    - number: 80
      name: http
      protocol: HTTP
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: external-api-tls
  namespace: default
spec:
  host: api.external-service.com
  trafficPolicy:
    portLevelSettings:
      - port:
          number: 443
        tls:
          mode: SIMPLE

Your application sends HTTP to port 80. A VirtualService can then redirect that traffic to port 443, where the DestinationRule initiates TLS to the external endpoint. The application never knows TLS happened.
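A sketch of the VirtualService that performs that port redirect (resource names match the ServiceEntry and DestinationRule above; the VirtualService name itself is illustrative):

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: external-api-redirect
  namespace: default
spec:
  hosts:
    - api.external-service.com
  http:
    - match:
        - port: 80               # plain-HTTP traffic from the application
      route:
        - destination:
            host: api.external-service.com
            port:
              number: 443        # the DestinationRule originates TLS on this port
```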

Combining ServiceEntry with VirtualService

VirtualService gives you Layer 7 traffic management for external services: retries, timeouts, fault injection, header-based routing, and traffic shifting. This is invaluable when you are migrating between API providers or need resilience policies for unreliable external dependencies.

Retries and Timeouts

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: stripe-api-vs
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  http:
    - route:
        - destination:
            host: api.stripe.com
            port:
              number: 443
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes
        retryRemoteLocalities: true

This applies a 10-second overall timeout with up to 3 retry attempts (3 seconds each) for specific failure conditions. Note that this only works for HTTP-protocol ServiceEntries. For TLS-protocol entries where Envoy cannot see the HTTP layer, you are limited to TCP-level connection retries configured via the DestinationRule.

Traffic Shifting Between External Providers

Migrating from one external API to another? Use weighted routing to shift traffic gradually.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: geocoding-primary
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
  endpoints:
    - address: api.old-geocoding-provider.com
      labels:
        provider: old
    - address: api.new-geocoding-provider.com
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: geocoding-dr
  namespace: geo
spec:
  host: geocoding.internal
  trafficPolicy:
    tls:
      mode: SIMPLE
  subsets:
    - name: old-provider
      labels:
        provider: old
    - name: new-provider
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: geocoding-vs
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  http:
    - route:
        - destination:
            host: geocoding.internal
            subset: old-provider
          weight: 80
        - destination:
            host: geocoding.internal
            subset: new-provider
          weight: 20

This sends 80% of geocoding traffic to the old provider and 20% to the new one. Adjust the weights as you gain confidence. Fully reversible — just set the old provider back to 100%.

DNS Resolution Patterns: Istio DNS Proxy vs kube-dns

Istio DNS resolution for external services involves two layers: how your application resolves the hostname (kube-dns / CoreDNS), and how the sidecar resolves the hostname (Envoy’s async DNS or Istio’s DNS proxy). Understanding the interplay is crucial for reliable Istio DNS behavior.

Default Flow (Without Istio DNS Proxy)

Your application calls api.stripe.com. kube-dns resolves it to an IP. The application opens a connection to that IP. The sidecar intercepts the connection and — if the ServiceEntry uses DNS resolution — Envoy independently resolves api.stripe.com to determine its endpoint list. Two separate DNS lookups happen, which can lead to inconsistencies if DNS records change between the two resolutions.

With the Istio DNS Proxy

Istio’s sidecar includes a DNS proxy that intercepts DNS queries from the application. When enabled (via meshConfig.defaultConfig.proxyMetadata.ISTIO_META_DNS_CAPTURE and ISTIO_META_DNS_AUTO_ALLOCATE), the proxy can:

  • Auto-allocate virtual IPs for ServiceEntry hosts that do not have addresses defined, which is critical for TCP ServiceEntries that need IP-based matching.
  • Resolve ServiceEntry hosts directly, avoiding the round-trip to kube-dns for known mesh services.
  • Ensure consistency between the application’s DNS resolution and the sidecar’s endpoint resolution.
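Enabling both settings through an IstioOperator overlay looks roughly like this (a sketch; the same keys can also be set in the mesh config directly):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"        # intercept DNS queries in the sidecar
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"  # auto-assign VIPs to address-less entries
```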

DNS capture is not enabled by default in sidecar-based installations; it must be switched on through the proxyMetadata settings above. Verify whether your mesh has it enabled with:

istioctl proxy-config bootstrap <pod-name> -n <namespace> | grep -A2 "ISTIO_META_DNS"

When DNS Proxy Matters Most

The DNS proxy is especially important for TCP ServiceEntries without an explicit addresses field. Without a VIP, Envoy cannot match an incoming TCP connection to the correct ServiceEntry because there is no HTTP Host header to inspect. The DNS proxy solves this by auto-allocating a VIP from the 240.240.0.0/16 range and returning that VIP when the application resolves the hostname. The sidecar then intercepts traffic to that VIP and routes it to the correct external endpoint.

Sticky Sessions with ServiceEntry

Some external services require session affinity — for example, a legacy service that stores session state in memory, or a WebSocket endpoint that must maintain a persistent connection to the same backend. Istio supports sticky sessions for external services through consistent hashing in a DestinationRule.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-session-service
  namespace: default
spec:
  hosts:
    - legacy-session.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.1.10
    - address: 10.0.1.11
    - address: 10.0.1.12
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: legacy-session-dr
  namespace: default
spec:
  host: legacy-session.internal
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: SERVERID
          ttl: 3600s

This configuration hashes on an HTTP cookie named SERVERID. If the cookie does not exist, Envoy generates one and sets it on the response so that subsequent requests from the same client stick to the same endpoint. You can also hash on:

  • HTTP header: consistentHash.httpHeaderName: "x-user-id" — useful when your application sends a user identifier in every request.
  • Source IP: consistentHash.useSourceIp: true — simplest option but breaks in environments with NAT or shared egress IPs.
  • Query parameter: consistentHash.httpQueryParameterName: "session_id" — for REST APIs that include a session identifier in the URL.

Sticky sessions with ServiceEntry work identically to in-mesh sticky sessions. The key requirement is that the ServiceEntry must use STATIC or DNS resolution (not NONE) so that Envoy has multiple endpoints to hash across. With DNS_ROUND_ROBIN, there is only one logical endpoint, so consistent hashing has no effect.

Troubleshooting Common Issues

503 Errors When Calling External Services

The most common ServiceEntry issue. Start with this diagnostic sequence:

# Check if the ServiceEntry is applied and visible to the proxy
istioctl proxy-config cluster <pod-name> -n <namespace> | grep <external-host>

# Check the listeners
istioctl proxy-config listener <pod-name> -n <namespace> --port <port>

# Look at Envoy access logs for the specific request
kubectl logs <pod-name> -n <namespace> -c istio-proxy | grep <external-host>

Common causes of 503 errors:

  • Wrong protocol: Setting protocol: HTTPS when your application initiates TLS. Use TLS for pass-through; use HTTP only if the sidecar does TLS origination.
  • Missing ServiceEntry in REGISTRY_ONLY mode: If outboundTrafficPolicy is REGISTRY_ONLY, any host without a ServiceEntry is blocked.
  • exportTo restriction: The ServiceEntry is in namespace A, exported only to ".", and the calling pod is in namespace B.
  • DNS resolution failure: Envoy cannot resolve the hostname. Check that the DNS servers are reachable from the pod.

DNS Resolution Failures

When Envoy’s async DNS resolver fails, you will see UH (no healthy upstream hosts) or UF (upstream connection failure) response flags in the access logs.

# Verify DNS resolution from inside the pod (if the image ships nslookup)
kubectl exec <pod-name> -n <namespace> -- nslookup api.stripe.com

# Query Envoy's admin interface via pilot-agent (works on distroless proxy images)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET clusters | grep api.stripe.com

# Check Envoy cluster health
istioctl proxy-config endpoint <pod-name> -n <namespace> | grep <external-host>

If the endpoint shows UNHEALTHY, Envoy resolved the DNS but the outlier detection ejected the host. If no endpoint appears at all, DNS resolution is failing. Common fix: ensure your pods can reach an external DNS server, or that CoreDNS is configured to forward queries for the external domain.

TLS Origination Not Working

If you configured TLS origination via a DestinationRule but traffic still fails:

  • Ensure the ServiceEntry port protocol is HTTP, not TLS. If you set it to TLS, Envoy treats the connection as opaque TLS pass-through and will not apply the DestinationRule’s TLS settings.
  • Verify the DestinationRule’s host field exactly matches the ServiceEntry’s hosts entry.
  • Check that the VirtualService (if used) routes to the correct port number.

TCP ServiceEntry Not Intercepting Traffic

For TCP-protocol ServiceEntries without the DNS proxy, Envoy cannot match traffic by hostname. You must either:

  • Set an explicit addresses field with a VIP that your application targets.
  • Enable Istio’s DNS proxy to auto-allocate VIPs.
  • Ensure the destination IP matches what the ServiceEntry resolves to.

Without one of these, TCP traffic goes through the PassthroughCluster and bypasses your ServiceEntry entirely.
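The first option, an explicit VIP, looks like this sketch (the host, VIP, and port are illustrative):

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-redis        # illustrative name
spec:
  hosts:
    - redis.example.com       # hypothetical external TCP service
  addresses:
    - 240.240.10.1            # VIP the application (or the DNS proxy) must target
  location: MESH_EXTERNAL
  ports:
    - number: 6379
      name: tcp-redis
      protocol: TCP
  resolution: DNS
```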

Frequently Asked Questions

Do I need a ServiceEntry if outboundTrafficPolicy is set to ALLOW_ANY?

You do not need one for connectivity — your services can reach external hosts without it. But you should create ServiceEntries anyway. Without them, outbound traffic goes through the PassthroughCluster, which means no detailed metrics per destination, no access logging with the external hostname, no circuit breaking, no retries, and no timeout policies. A ServiceEntry is the difference between “it works” and “it works reliably with observability.”

What is the difference between protocol TLS and HTTPS in a ServiceEntry port?

TLS tells Envoy to treat the connection as opaque TLS. Envoy reads the SNI header to determine routing but does not decrypt the payload. Use this when your application initiates TLS directly. HTTPS tells Envoy the protocol is HTTP over TLS, which implies Envoy should handle TLS. In practice, for external services where the application manages its own TLS, use TLS. Use HTTP with a DestinationRule TLS origination when you want the sidecar to handle TLS.

Can I use wildcards in ServiceEntry hosts?

Yes, but with limitations. You can use *.example.com to match any subdomain of example.com. However, wildcard entries only work with resolution: NONE because Envoy cannot perform DNS lookups for wildcard hostnames. This means you lose the ability to apply per-endpoint traffic policies. Wildcard ServiceEntries are best used for broad egress access control rather than fine-grained traffic management.

How do I configure sticky sessions for an external service behind a ServiceEntry?

Create a ServiceEntry with STATIC or DNS resolution (so Envoy has multiple endpoints), then pair it with a DestinationRule that configures consistentHash under trafficPolicy.loadBalancer. You can hash on an HTTP cookie, header, source IP, or query parameter. The ServiceEntry must expose multiple endpoints for consistent hashing to have any effect. See the “Sticky Sessions with ServiceEntry” section above for a complete YAML example.

How does ServiceEntry interact with NetworkPolicy and Istio AuthorizationPolicy?

A ServiceEntry does not bypass Kubernetes NetworkPolicy. If a NetworkPolicy blocks egress to the external IP, traffic will be dropped at the CNI level before Envoy can route it. Istio AuthorizationPolicy can also restrict which workloads are allowed to call specific ServiceEntry hosts. For defense in depth, use ServiceEntry for traffic management and observability, AuthorizationPolicy for workload-level access control, and NetworkPolicy for network-level enforcement.

Wrapping Up

ServiceEntry is one of the most practical Istio resources you will use in production. It transforms opaque outbound connections into managed, observable, policy-controlled traffic — and it does so without requiring changes to your application code. Start with the basics: create a ServiceEntry for each external dependency, set the correct resolution type, and pair it with a DestinationRule for connection limits and circuit breaking. As you mature, add VirtualServices for retries and timeouts, configure sticky sessions where needed, and enable the DNS proxy for seamless TCP service integration.

The pattern is always the same: register the service, apply policies, observe the traffic. Every external dependency you formalize with a ServiceEntry is one fewer blind spot in your production mesh.

Prometheus Alertmanager vs Grafana Alerting (2026): Architecture, Features, and When to Use Each

Most observability stacks that have been running in production for more than a year end up with alerting spread across two systems: Prometheus Alertmanager handling metric-based alerts and Grafana Alerting managing everything else. Engineers add a Slack integration in Grafana because it is convenient, then realize their Alertmanager routing tree already covers the same service. Before long, the on-call team receives duplicated pages, silencing rules live in two places, and nobody is confident which system is authoritative.

This is the alerting consolidation problem, and it affects teams of every size. The question is straightforward: should you standardize on Prometheus Alertmanager, move everything into Grafana Alerting, or deliberately run both? The answer depends on your datasource mix, your GitOps maturity, and how your organization manages on-call routing. This guide breaks down the architecture, features, and operational trade-offs of each system so you can make a deliberate choice instead of drifting into accidental complexity.

Architecture Overview

Before comparing features, you need to understand how each system fits into the alerting pipeline. They occupy the same logical space — “receive a condition, route a notification” — but they get there from fundamentally different starting points.

Prometheus Alertmanager: The Standalone Receiver

Alertmanager is a dedicated, standalone component in the Prometheus ecosystem. It does not evaluate alert rules itself. Instead, Prometheus (or any compatible sender like Thanos Ruler, Cortex, or Mimir Ruler) evaluates PromQL expressions and pushes firing alerts to the Alertmanager API. Alertmanager then handles deduplication, grouping, inhibition, silencing, and notification delivery.

# Simplified Prometheus → Alertmanager flow
#
# [Prometheus] --evaluates rules--> [firing alerts]
#        |
#        +--POST /api/v2/alerts--> [Alertmanager]
#                                      |
#                          +-----------+-----------+
#                          |           |           |
#                       [Slack]    [PagerDuty]  [Email]

The entire configuration lives in a single YAML file (alertmanager.yml). This includes the routing tree, receiver definitions, inhibition rules, and silence templates. There is no database, no UI-driven state — just a config file and an optional local storage directory for notification state and silences. This makes it trivially reproducible and ideal for GitOps workflows.

For high availability, you run multiple Alertmanager instances in a gossip-based cluster. They use a mesh protocol to share silence and notification state, ensuring that failover does not result in duplicate or lost notifications. The HA model is well-understood and has been stable for years.
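The cluster is formed with a couple of command-line flags; a sketch using the upstream flag names, with illustrative hostnames:

```shell
# Two-node HA pair: each instance listens for gossip traffic and peers with the other
alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1.example.internal:9094
```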

Grafana Alerting: The Integrated Platform

Grafana Alerting (sometimes called “Grafana Unified Alerting,” introduced in Grafana 8 and significantly matured through Grafana 11 and 12) takes a different architectural approach. It embeds the entire alerting lifecycle — rule evaluation, state management, routing, and notification — inside the Grafana server process. Under the hood, it actually uses a fork of Alertmanager for the routing and notification layer, but this is an implementation detail that is invisible to users.

# Simplified Grafana Alerting flow
#
# [Grafana Server]
#   ├── Rule Evaluation Engine
#   │     ├── queries Prometheus
#   │     ├── queries Loki
#   │     ├── queries CloudWatch
#   │     └── queries any supported datasource
#   │
#   ├── Alert State Manager (internal)
#   │
#   └── Embedded Alertmanager (routing + notifications)
#           |
#           +-----------+-----------+
#           |           |           |
#        [Slack]    [PagerDuty]  [Email]

The critical distinction is that Grafana Alerting evaluates alert rules itself, querying any configured datasource — not just Prometheus. It can fire alerts based on Loki log queries, Elasticsearch searches, CloudWatch metrics, PostgreSQL queries, or any of the 100+ datasource plugins available in Grafana. Rule definitions, contact points, notification policies, and mute timings are stored in the Grafana database (or provisioned via YAML files and the Grafana API).
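File-based provisioning makes some of that database-held state declarative. A sketch of a contact-point provisioning file, assuming the schema from Grafana's alerting provisioning documentation (the path, names, and uid are illustrative):

```yaml
# /etc/grafana/provisioning/alerting/contact-points.yaml (illustrative path)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: platform-slack
    receivers:
      - uid: platform-slack-01     # illustrative uid
        type: slack
        settings:
          url: https://hooks.slack.com/services/T00/B00/XXX
```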

For high availability in self-hosted environments, Grafana Alerting relies on a shared database and a peer-discovery mechanism between Grafana instances. In Grafana Cloud, HA is fully managed by Grafana Labs.

Feature Comparison

The following table provides a side-by-side comparison of the capabilities that matter most in production alerting systems. Both systems are mature, but they prioritize different things.

  • Datasources: Alertmanager accepts alerts only from Prometheus-compatible senders (Prometheus, Thanos, Mimir, VictoriaMetrics); Grafana Alerting can evaluate rules against any Grafana datasource (Prometheus, Loki, Elasticsearch, CloudWatch, SQL databases, etc.).
  • Rule evaluation: external for Alertmanager (Prometheus or a Ruler evaluates rules and pushes alerts); built into Grafana, which evaluates rules directly.
  • Routing: Alertmanager uses a hierarchical YAML routing tree with match/match_re, continue, and group_by; Grafana uses notification policies with label matchers, nested policies, and mute timings.
  • Grouping: full support in both, via group_by, group_wait, and group_interval in Alertmanager and equivalent controls in Grafana notification policies.
  • Inhibition: native inhibition rules in Alertmanager (suppress alerts when a related alert is firing); supported in Grafana since 10.3 but less flexible.
  • Silencing: label-based, time-limited silences via API or UI in Alertmanager; Grafana offers mute timings (recurring schedules) plus ad-hoc, label-based silences.
  • Notification channels: Alertmanager ships email, Slack, PagerDuty, OpsGenie, VictorOps, webhook, WeChat, Telegram, SNS, and Webex; Grafana contact points cover all of these plus Teams, Discord, Google Chat, LINE, Threema, OnCall, and more.
  • Templating: Go templates in the notification config for Alertmanager; Go templates with access to Grafana template variables and functions in Grafana.
  • Multi-tenancy: not built into Alertmanager (achieved via separate instances or the Mimir Alertmanager); native in Grafana via organizations and RBAC.
  • High availability: gossip-based peer cluster for Alertmanager (well-proven); database-backed HA with peer discovery between Grafana instances.
  • Configuration model: a single, fully declarative YAML file for Alertmanager; UI plus API plus provisioning files, stored in the database, for Grafana.
  • GitOps compatibility: excellent for Alertmanager, whose config file lives in version control natively; possible for Grafana via provisioning files or the Terraform provider, but it requires extra tooling.
  • External alert sources: any system that can POST to the Alertmanager API; supported in Grafana via the Grafana Alerting API.
  • Managed service: Alertmanager is available via Grafana Cloud (as the Mimir Alertmanager) and Amazon Managed Prometheus; Grafana Alerting is available via Grafana Cloud.

Alertmanager Strengths

Alertmanager has been a production staple since 2015. Over a decade of use across thousands of organizations has made it one of the most battle-tested components in the CNCF ecosystem. Here is where it genuinely excels.

Declarative, GitOps-Native Configuration

The entire Alertmanager configuration is a single YAML file. There is no hidden state in a database, no click-driven configuration that someone forgets to document. You check it into Git, review it in a pull request, and deploy it through your CI/CD pipeline like any other infrastructure code. This is a significant operational advantage for teams that have invested in GitOps.

# alertmanager.yml — everything in one file
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXX"

route:
  receiver: platform-team
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      group_wait: 10s
    - match_re:
        team: "^(payments|checkout)$"
      receiver: payments-slack
      continue: true

receivers:
  - name: platform-team
    slack_configs:
      - channel: "#platform-alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: ""
  - name: payments-slack
    slack_configs:
      - channel: "#payments-oncall"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]

Every change is auditable. Rollbacks are a git revert away. This matters enormously when you are debugging why an alert did not fire at 3 AM.

Lightweight and Single-Purpose

Alertmanager does one thing: route and deliver notifications. It has no dashboard, no query engine, no datasource plugins. This single-purpose design makes it operationally simple. Resource consumption is minimal — a small Alertmanager instance handles thousands of active alerts on a few hundred megabytes of memory. It starts in milliseconds and requires almost no maintenance.

Mature Inhibition and Routing

Alertmanager’s inhibition rules are first-class citizens. You can suppress downstream warnings when a critical alert is already firing, preventing alert storms from overwhelming your on-call team. The hierarchical routing tree with continue flags allows for nuanced delivery: send to the team channel AND escalate to PagerDuty simultaneously, with different grouping strategies at each level.

Proven High Availability

The gossip-based HA cluster has been stable for years. Running three Alertmanager replicas behind a load balancer (or using Kubernetes service discovery) gives you reliable notification delivery without shared storage. The protocol handles deduplication across instances automatically, which is the hardest part of distributed alerting.
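In Kubernetes, enabling the gossip cluster usually amounts to pointing each replica at its peers. A minimal sketch of the container arguments for a three-replica StatefulSet, assuming a headless service named alertmanager-headless in the monitoring namespace (both names are placeholders):

```yaml
# Container args excerpt for a 3-replica Alertmanager StatefulSet.
# The --cluster.* flags are Alertmanager's real clustering options;
# the service and namespace names below are placeholders.
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-0.alertmanager-headless.monitoring.svc:9094
  - --cluster.peer=alertmanager-1.alertmanager-headless.monitoring.svc:9094
  - --cluster.peer=alertmanager-2.alertmanager-headless.monitoring.svc:9094
```

Note that Prometheus should be configured to send alerts to every replica directly, not through a load balancer; the gossip layer is what deduplicates notifications across instances.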

Grafana Alerting Strengths

Grafana Alerting has matured considerably since its rocky introduction in Grafana 8. By Grafana 11 and 12, it has become a legitimate production alerting platform with capabilities that Alertmanager cannot match on its own.

Multi-Datasource Alert Rules

This is Grafana Alerting’s strongest differentiator. You can write alert rules that query Loki for error log spikes, CloudWatch for AWS resource utilization, Elasticsearch for application errors, or a PostgreSQL database for business metrics — all from the same alerting system. If your observability stack includes more than just Prometheus, this eliminates the need for separate alerting tools per datasource.

# Grafana alert rule provisioning example — alerting on Loki log errors
apiVersion: 1
groups:
  - orgId: 1
    name: application-errors
    folder: Production
    interval: 1m
    rules:
      - uid: loki-error-spike
        title: "High error rate in payment service"
        condition: C
        data:
          - refId: A
            datasourceUid: loki-prod
            model:
              expr: 'sum(rate({app="payment-service"} |= "ERROR" [5m]))'
          - refId: B
            datasourceUid: "__expr__"
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [10]
        for: 5m
        labels:
          severity: warning
          team: payments

This is something Alertmanager simply cannot do. Alertmanager only receives pre-evaluated alerts — it has no concept of datasources or query execution.
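To make "pre-evaluated" concrete: any external system can construct an alert itself and push it to Alertmanager's v2 API for routing and delivery. A minimal sketch; the host is a placeholder, while /api/v2/alerts is Alertmanager's standard push endpoint.

```python
# Sketch: push a pre-evaluated alert to Alertmanager. Alertmanager does
# no query evaluation; it just routes what it receives.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_alert(labels: dict, annotations: dict, minutes: int = 5) -> dict:
    """One element of the JSON list body expected by POST /api/v2/alerts."""
    now = datetime.now(timezone.utc)
    return {
        "labels": labels,           # routing happens on these
        "annotations": annotations, # free-form context for notifications
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }

alert = build_alert(
    {"alertname": "BatchJobFailed", "severity": "warning", "job": "nightly-export"},
    {"summary": "Nightly export job exited non-zero"},
)
payload = json.dumps([alert]).encode()
# Host below is a placeholder for your Alertmanager service:
# urllib.request.urlopen(urllib.request.Request(
#     "http://alertmanager.monitoring.svc:9093/api/v2/alerts",
#     data=payload, headers={"Content-Type": "application/json"}))
```

Because routing keys off labels, the pushing system controls which receiver fires simply by setting severity, team, and similar labels.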

Unified UI for Alert Management

Grafana provides a single pane of glass for alert rule creation, visualization, notification policy management, contact point configuration, and silence management. For teams where not every engineer is comfortable editing YAML routing trees, the visual notification policy editor significantly reduces the barrier to entry. You can see the state of every alert rule, its evaluation history, and the exact notification path it will take — all without leaving the browser.

Native Multi-Tenancy and RBAC

Grafana’s organization model and role-based access control extend naturally to alerting. Different teams can manage their own alert rules, contact points, and notification policies within their organization or folder scope, without seeing or interfering with other teams. Achieving this with standalone Alertmanager requires either running separate instances per tenant or using Mimir’s multi-tenant Alertmanager.

Mute Timings and Richer Scheduling

While Alertmanager supports silences (ad-hoc, time-limited suppressions), Grafana Alerting adds mute timings — recurring time-based windows where notifications are suppressed. This is useful for scheduled maintenance windows, business-hours-only alerting, or suppressing non-critical alerts on weekends. Alertmanager requires external tooling or manual silence creation for recurring windows.
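Mute timings can be file-provisioned like other Grafana Alerting resources. A sketch that suppresses non-critical weekend notifications; the muteTimes schema here follows Grafana's file-provisioning format as an assumption to verify against your Grafana version:

```yaml
# Grafana Alerting provisioning sketch: a recurring weekend mute window.
# Field names are assumptions based on Grafana's provisioning docs.
apiVersion: 1
muteTimes:
  - orgId: 1
    name: weekend-noncritical
    time_intervals:
      - weekdays: ["saturday", "sunday"]
```

The mute timing is then referenced from a notification policy, so only the routes that opt in are silenced during the window.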

Grafana Cloud as a Managed Option

For teams that want to avoid managing alerting infrastructure entirely, Grafana Cloud provides a fully managed Grafana Alerting stack. This includes HA, state persistence, and notification delivery without any self-hosted components. The Grafana Cloud alerting stack also includes a managed Mimir Alertmanager, which means you can use Prometheus-native alerting rules if you prefer that model while still benefiting from the managed infrastructure.

When to Use Prometheus Alertmanager

Alertmanager is the right choice when the following conditions describe your environment:

  • Your metrics stack is Prometheus-native. If all your alert rules are PromQL expressions evaluated by Prometheus, Thanos Ruler, or Mimir Ruler, Alertmanager is the natural fit. There is no added value in routing those alerts through Grafana.
  • GitOps is non-negotiable. If every infrastructure change must go through a pull request and be fully declarative, Alertmanager’s single-file configuration model is significantly easier to manage than Grafana’s database-backed state. Tools like amtool provide config validation in CI pipelines.
  • You need fine-grained routing with inhibition. Complex routing trees with multiple levels of grouping, inhibition rules, and continue flags are more naturally expressed in Alertmanager’s YAML format. The routing logic has been stable and well-documented for years.
  • You run microservices with per-team routing. If each team owns its routing subtree and the routing logic is complex, Alertmanager’s hierarchical model scales better than UI-driven configuration. Teams can own their section of the config file via CODEOWNERS in Git.
  • You want minimal operational overhead. Alertmanager is a single binary with minimal resource requirements. There is no database to back up, no migrations to run, and no UI framework to keep updated.
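As an illustration of the GitOps point, here is a hypothetical CI job that validates the Alertmanager config on every pull request. `amtool check-config` is the real validation command; the release URL and version are assumptions you should pin yourself:

```yaml
# Hypothetical GitHub Actions job: fail the PR if alertmanager.yml is invalid.
name: validate-alerting
on: [pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install amtool
        run: |
          curl -sL https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz \
            | tar xz --strip-components=1
      - name: Validate config
        run: ./amtool check-config alertmanager.yml
```

A broken routing tree is caught at review time instead of at notification time.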

When to Use Grafana Alerting

Grafana Alerting is the right choice when these conditions apply:

  • You alert on more than just Prometheus metrics. If you need alert rules based on Loki logs, Elasticsearch queries, CloudWatch metrics, or database queries, Grafana Alerting is the only option that handles all of these natively. The alternative is running separate alerting tools per datasource, which is worse.
  • Your team prefers UI-driven configuration. Not every engineer wants to edit YAML routing trees. If your organization values a visual interface for managing alerts, contact points, and notification policies, Grafana’s UI is a major productivity advantage.
  • You are using Grafana Cloud. If you are already on Grafana Cloud, using its built-in alerting is the path of least resistance. You get HA, managed notification delivery, and a unified experience without running any additional infrastructure.
  • Multi-tenancy is a requirement. If multiple teams need isolated alerting configurations with RBAC, Grafana’s native organization and folder-based access model is significantly easier to set up than running per-tenant Alertmanager instances.
  • You want mute timings for recurring maintenance windows. If your team regularly needs to suppress alerts during scheduled windows (deploy windows, batch processing hours, weekend non-critical suppression), Grafana’s mute timings feature is more ergonomic than creating and managing recurring silences in Alertmanager.

Running Both Together: The Hybrid Pattern

In practice, many production environments run both Alertmanager and Grafana Alerting. This is not necessarily a mistake — it can be a deliberate architectural choice when done with clear boundaries.

Common Hybrid Architecture

The most common pattern looks like this:

  • Prometheus Alertmanager handles all metric-based alerts. PromQL rules are evaluated by Prometheus or a long-term storage ruler (Thanos, Mimir). Alertmanager owns routing, grouping, and notification for these alerts.
  • Grafana Alerting handles non-Prometheus alerts: log-based alerts from Loki, business metrics from SQL datasources, and cross-datasource correlation rules.

The key to making this work without chaos is establishing clear ownership rules:

# Ownership boundaries for hybrid alerting
#
# Prometheus Alertmanager owns:
#   - All PromQL-based alert rules
#   - Infrastructure alerts (node, kubelet, etcd, CoreDNS)
#   - Application SLO/SLI alerts based on metrics
#
# Grafana Alerting owns:
#   - Log-based alert rules (Loki, Elasticsearch)
#   - Business metric alerts (SQL datasources)
#   - Cross-datasource correlation rules
#   - Alerts for teams that prefer UI-driven management
#
# Shared:
#   - Contact points / receivers use the same Slack channels and PagerDuty services
#   - On-call rotations are managed externally (PagerDuty, Grafana OnCall)

Both systems can deliver to the same notification channels. The critical discipline is ensuring that silencing and maintenance windows are applied in both systems when needed. This is the primary operational cost of the hybrid approach.
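One way to enforce that discipline is to script silence creation against both APIs at once. The sketch below assumes the standard Alertmanager v2 silence endpoint and Grafana's built-in Alertmanager proxy path (/api/alertmanager/grafana); hosts, token, and both paths should be verified against your deployment.

```python
# Sketch: apply one maintenance silence to both systems so neither
# keeps paging during the window. URLs below are placeholders.
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(matchers: dict, hours: float, comment: str) -> dict:
    """Build an Alertmanager v2 silence payload from label equality matchers."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": k, "value": v, "isRegex": False, "isEqual": True}
            for k, v in sorted(matchers.items())
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": "maintenance-bot",
        "comment": comment,
    }

def post_silence(base_url: str, silence: dict, token: str = "") -> None:
    """POST to <base_url>/api/v2/silences; pass a token for Grafana."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(silence).encode(),
        headers=headers,
    )
    urllib.request.urlopen(req)

silence = build_silence({"cluster": "prod-eu", "severity": "warning"}, 2, "DB maintenance")
# post_silence("http://alertmanager.monitoring.svc:9093", silence)
# post_silence("https://grafana.example.com/api/alertmanager/grafana", silence,
#              token="service-account-token")
```

Grafana's embedded Alertmanager exposes a compatible v2 API behind its proxy path, which is what makes a single payload reusable for both targets.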

Grafana as a Viewer for Alertmanager

Even if you use Alertmanager exclusively for routing and notification, Grafana can serve as a read-only viewer. Grafana natively supports connecting to an external Alertmanager datasource, allowing you to see firing alerts, active silences, and alert groups in the Grafana UI. This gives you the operational visibility of Grafana without moving your alerting logic into it.

# Grafana datasource provisioning for external Alertmanager
apiVersion: 1
datasources:
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager.monitoring.svc:9093
    access: proxy
    jsonData:
      implementation: prometheus

Migration Considerations

If you are moving from one system to the other, here are the practical considerations to plan for.

Migrating from Alertmanager to Grafana Alerting

  • Rule conversion. Your PromQL-based recording and alerting rules defined in Prometheus rule files need to be recreated as Grafana alert rules. Grafana provides a migration tool that can import Prometheus-format rules, but complex expressions may need manual adjustment.
  • Routing tree translation. Alertmanager’s hierarchical routing tree maps to Grafana’s notification policies, but the semantics are not identical. Test the notification routing thoroughly — the continue flag behavior and default routes may differ.
  • Silence and inhibition migration. Active silences are ephemeral and do not need migration. Inhibition rules need to be recreated in Grafana’s format. Recurring maintenance windows should be converted to mute timings.
  • Run in parallel first. The safest migration strategy is to run both systems in parallel for two to four weeks, sending notifications from both, then cutting over when you have confidence in the Grafana setup. Accept the temporary noise of duplicate alerts — it is far cheaper than missing a critical page during migration.

Migrating from Grafana Alerting to Alertmanager

  • Datasource limitation. You can only migrate alerts that are based on Prometheus-compatible datasources. Alerts querying Loki, Elasticsearch, or SQL datasources have no equivalent in Alertmanager — you will need an alternative solution for those.
  • Rule export. Export Grafana alert rules and convert them to Prometheus-format rule files. The Grafana API (GET /api/v1/provisioning/alert-rules) provides structured output that can be transformed with a script.
  • Contact point mapping. Map Grafana contact points to Alertmanager receivers. The configuration format is different, but the concepts are equivalent.
  • State loss. Alertmanager does not carry over Grafana’s alert evaluation history. You start fresh. Plan for a brief period where alerts may re-fire as Prometheus evaluates rules that were previously managed by Grafana.
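A sketch of the rule-export transformation. It assumes each exported rule carries its PromQL in data[0].model.expr, which holds only for simple single-query rules; the field names follow Grafana's provisioning API output but should be verified against your version.

```python
# Sketch: convert Grafana provisioning-API rule objects into a
# Prometheus rule-group structure (serialize the result with a YAML
# library of your choice to produce the rule file).
import json

def to_prometheus_rules(grafana_rules: list, group_name: str) -> dict:
    """Convert simple single-query Grafana alert rules to a Prometheus group."""
    rules = []
    for r in grafana_rules:
        query = r["data"][0]["model"].get("expr")
        if not query:
            continue  # Loki/SQL/expression-only rules have no Alertmanager equivalent
        rules.append({
            "alert": r["title"],
            "expr": query,
            "for": r.get("for", "0s"),
            "labels": r.get("labels", {}),
            "annotations": r.get("annotations", {}),
        })
    return {"groups": [{"name": group_name, "rules": rules}]}

# Trimmed example of the shape returned by GET /api/v1/provisioning/alert-rules:
exported = [{
    "title": "HighErrorRate",
    "for": "5m",
    "labels": {"severity": "warning"},
    "data": [{"refId": "A", "model": {"expr": "rate(http_errors_total[5m]) > 0.05"}}],
}]
print(json.dumps(to_prometheus_rules(exported, "migrated-from-grafana"), indent=2))
```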

Decision Framework

If you want a quick decision path, use this framework:

Start here:
│
├── Do you alert on non-Prometheus datasources (Loki, ES, SQL, CloudWatch)?
│   ├── YES → Grafana Alerting (at least for those datasources)
│   └── NO ↓
│
├── Is GitOps/declarative config a hard requirement?
│   ├── YES → Alertmanager
│   └── NO ↓
│
├── Do you need multi-tenancy with RBAC?
│   ├── YES → Grafana Alerting (or Mimir Alertmanager)
│   └── NO ↓
│
├── Are you on Grafana Cloud?
│   ├── YES → Grafana Alerting (path of least resistance)
│   └── NO ↓
│
└── Default → Alertmanager (simpler, lighter, well-proven)

For many teams, the honest answer is “both” — Alertmanager for the Prometheus-native metric pipeline, Grafana Alerting for everything else. That is a valid architecture as long as the ownership boundaries are documented and the on-call team knows where to look.

Frequently Asked Questions

What is the difference between Alertmanager and Grafana Alerting?

Prometheus Alertmanager is a standalone notification routing engine that receives pre-evaluated alerts from Prometheus and delivers them to receivers like Slack, PagerDuty, or email. It does not evaluate alert rules itself. Grafana Alerting is an integrated alerting platform embedded in Grafana that both evaluates alert rules (querying any supported datasource) and handles notification routing. Alertmanager is configured entirely via YAML, while Grafana Alerting offers a UI, API, and file-based provisioning. The fundamental difference is scope: Alertmanager handles only the routing and notification phase, while Grafana Alerting handles the full lifecycle from query evaluation to notification.

Can Grafana Alerting replace Prometheus Alertmanager?

Yes, for many use cases. Grafana Alerting can evaluate PromQL rules directly against your Prometheus datasource, so you do not strictly need a separate Alertmanager instance. However, there are scenarios where Alertmanager remains the better choice: heavily GitOps-driven environments, teams that need Alertmanager’s mature inhibition rules, or architectures where Prometheus rule evaluation happens externally (Thanos Ruler, Mimir Ruler) and a dedicated Alertmanager is already in the pipeline. If your only datasource is Prometheus and you value declarative configuration, Alertmanager is still simpler and lighter.

Is Grafana Alertmanager the same as Prometheus Alertmanager?

Not exactly. Grafana Alerting uses a fork of the Prometheus Alertmanager code internally for its notification routing engine, but it is not the same product. The Grafana “Alertmanager” you see in the UI is a managed, embedded component with a different configuration interface (notification policies, contact points, mute timings) compared to the standalone Prometheus Alertmanager (routing tree, receivers, inhibition rules in YAML). Grafana can also connect to an external Prometheus Alertmanager as a datasource, which adds to the confusion. When people refer to “Grafana Alertmanager,” they usually mean the embedded routing engine inside Grafana Alerting.

What are the best alternatives to Prometheus Alertmanager?

The most direct alternative is Grafana Alerting, which can receive and route Prometheus alerts while also supporting other datasources. Beyond that, other options include: Grafana OnCall for on-call management and escalation (often used alongside Alertmanager rather than replacing it), PagerDuty or Opsgenie as managed incident response platforms that can receive alerts directly, Keep as an open-source AIOps alert management platform, and Mimir Alertmanager for multi-tenant environments running Grafana Mimir. The choice depends on whether you need an Alertmanager replacement (routing and notification) or a complementary tool for escalation and incident response.

Should I use Prometheus alerts or Grafana alerts for Kubernetes monitoring?

For Kubernetes monitoring specifically, the kube-prometheus-stack (which includes Prometheus, Alertmanager, and a comprehensive set of pre-built alerting rules) remains the industry standard. These rules are PromQL-based and are designed to work with Alertmanager. If you are deploying kube-prometheus-stack, using Alertmanager for metric-based alerts is the straightforward choice. Add Grafana Alerting on top if you also need to alert on logs (via Loki) or non-metric datasources. For Kubernetes-specific monitoring, the combination of Prometheus rules with Alertmanager for routing is the most mature and well-supported path.

Final Thoughts

The Alertmanager vs Grafana Alerting debate is not really about which tool is better — it is about which tool fits your operational context. Alertmanager is simpler, lighter, and more GitOps-friendly. Grafana Alerting is more versatile, more accessible to UI-oriented teams, and the only option if you need multi-datasource alerting. Running both is perfectly valid when the boundaries are clear.

The worst outcome is not picking the “wrong” tool. The worst outcome is running both accidentally, with overlapping coverage, duplicated notifications, and no clear ownership. Whatever you choose, document the decision, define the ownership boundaries, and make sure your on-call team knows exactly where to go when they need to silence an alert at 3 AM.

Gateway API Provider Support in 2026: A Critical Evaluation

Gateway API Provider Support in 2026: A Critical Evaluation

The Kubernetes Gateway API is no longer a future concept; it is the present standard for traffic management. With the retirement of the Ingress NGINX controller signaling a definitive shift, platform teams and architects now face a critical decision: which Gateway API provider to adopt. The official implementations page lists numerous options, but the real-world picture is one of fragmented support, varying stability, and significant gaps that can derail multi-cluster strategies.
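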

In this evaluation, we move beyond marketing checklists to analyze the practical state of Gateway API support across major cloud providers, ingress controllers, and service meshes. We’ll examine which versions are truly production-ready, where the interoperability pitfalls lie, and what you must account for before standardizing across your infrastructure.

The Gateway API Maturity Spectrum: From Experimental to Standard

Not all Gateway API resources are created equal. The API's versioning model is unusual: resources graduate from the Experimental release channel to the Standard channel, and individual capabilities within them are classified as Core or Extended support. This means provider support is inherently uneven. An implementation might fully support the stable Gateway and HTTPRoute resources while offering only partial or experimental backing for GRPCRoute or TCPRoute.

This creates a fundamental challenge for architects: designing for the lowest common denominator or accepting provider-specific constraints. The decision hinges on accurately mapping your traffic management requirements (HTTP, TLS termination, gRPC, TCP/UDP load balancing) against what each provider actually delivers in a stable form.

Core API Support: The Foundation

Most providers now support the v1 (GA) versions of the foundational resources:

  • GatewayClass & Gateway: Nearly universal support for v1. These are the control plane resources for provisioning and configuring load balancers.
  • HTTPRoute: Universal support for v1. This is the workhorse for HTTP/HTTPS traffic routing and is considered the most stable.

However, support for other route types reveals the fragmentation:

  • GRPCRoute: Often in beta or experimental stages. Critical for modern microservices architectures but not yet universally reliable.
  • TCPRoute & UDPRoute: Patchy support. Some providers implement them as beta, others ignore them entirely, forcing fallbacks to provider-specific annotations or custom resources.
  • TLSRoute: Frequently tied to specific certificate management integrations (e.g., cert-manager).
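For teams targeting only the stable core, a provider-agnostic configuration stays within the v1 resources. A minimal sketch; the gatewayClassName, namespaces, and hostname are placeholders for whatever your provider registers:

```yaml
# Stable-channel (v1) Gateway plus HTTPRoute. Names are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web-gateway
  namespace: infra
spec:
  gatewayClassName: example-provider
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All   # permit routes from app namespaces to attach
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-route
  namespace: shop
spec:
  parentRefs:
    - name: web-gateway
      namespace: infra
  hostnames: ["shop.example.com"]
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: shop-api
          port: 8080
```

Everything above is Standard-channel and should behave identically across conformant implementations; the divergence starts with the route types below.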

Major Provider Deep Dive: Implementation Realities

AWS Elastic Kubernetes Service (EKS)

AWS offers an official Gateway API controller for EKS. Its support is pragmatic but currently limited:

  • Supported Resources: GatewayClass, Gateway, HTTPRoute, and GRPCRoute (all v1beta1 as of early 2024). Note the use of v1beta1 for GRPCRoute, indicating it’s not yet at GA stability.
  • Underlying Infrastructure: Maps directly to AWS Application Load Balancer (ALB) and Network Load Balancer (NLB). This is a strength (managed AWS services) and a constraint (you inherit ALB/NLB feature limits).
  • Critical Gap: No support for TCPRoute or UDPRoute. If your workload requires raw TCP/UDP load balancing, you must use the legacy Kubernetes Service type LoadBalancer or a different ingress controller alongside the Gateway API controller, creating a disjointed management model.

Google Kubernetes Engine (GKE) & Azure Kubernetes Service (AKS)

Both Google and Azure have integrated Gateway API support directly into their managed Kubernetes offerings, often with a focus on their global load-balancing infrastructures.

  • GKE: Offers the GKE Gateway controller. It supports v1 resources and can provision Google Cloud Global External Load Balancers. Its integration with Google’s certificate management and CDN is a key advantage. However, advanced routing features may require GCP-specific backend configs.
  • AKS: Azure’s Gateway API implementation comes via Application Gateway for Containers and its ALB controller (the successor to the older Application Gateway Ingress Controller, which speaks only the Ingress API). Support for newer route types like GRPCRoute has historically lagged behind other providers.

The pattern here is clear: cloud providers implement the Gateway API as a facade over their existing, proprietary load-balancing products. This ensures stability and performance but can limit portability and advanced cross-provider features.

NGINX & Kong Ingress Controller

These third-party, cluster-based controllers offer a different value proposition: consistency across any Kubernetes distribution, including on-premises.

  • NGINX: With the Ingress NGINX controller retired in favor of the Gateway API, its Gateway API implementation (NGINX Gateway Fabric) is now the primary path forward. It generally has excellent support for the full range of Experimental and Standard resources, as it is not constrained by a cloud vendor’s underlying service. This makes it a strong choice for hybrid or multi-cloud deployments where feature parity is crucial.
  • Kong Ingress Controller: Kong has been an early and comprehensive supporter of the Gateway API, often implementing features quickly. It leverages Kong Gateway’s extensive plugin ecosystem, which can be a major draw but also introduces vendor lock-in.

Critical Gaps for Enterprise Architects

Beyond checking resource support boxes, several deeper gaps can impact production deployments, especially in complex environments.

1. Multi-Cluster & Hybrid Environment Support

The Gateway API specification includes concepts like ReferenceGrant for cross-namespace and future cross-cluster routing. In practice, very few providers have robust, production-ready multi-cluster stories. Most implementations assume a single cluster. If your architecture spans multiple clusters (for isolation, geography, or failure domains), you will likely need to:

  • Manage separate Gateway resources per cluster.
  • Use an external global load balancer (like a cloud DNS/GSLB) to distribute traffic across cluster-specific gateways.

This negates some of the API’s promise of a unified, abstracted configuration.

2. Policy Attachment and Extension Consistency

Gateway API is designed to be extended through policy attachment (e.g., for rate limiting, WAF rules, authentication). There is no standard for how these policies are implemented. One provider might use a custom RateLimitPolicy CRD, while another might rely on annotations or a separate policy engine. This creates massive configuration drift and vendor lock-in, breaking the portability goal.
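To illustrate the pattern, here is what a vendor-specific policy attachment often looks like. RateLimitPolicy and its API group are invented for this example, but the targetRef shape follows the Gateway API policy-attachment convention:

```yaml
# Hypothetical provider CRD attaching a rate limit to one HTTPRoute.
# Only the targetRef structure is standardized; everything under
# spec.limit is vendor-specific and will not port between providers.
apiVersion: policy.example.io/v1alpha1
kind: RateLimitPolicy
metadata:
  name: api-rate-limit
  namespace: payments
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: payments-route
  limit:
    requests: 100
    window: 1m
```

Swapping providers means rewriting every such policy, even if the Gateway and HTTPRoute resources carry over unchanged.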

3. Observability and Debugging Interfaces

While the API defines status fields, the richness of operational data—detailed error logs, granular metrics tied to API resources, distributed tracing integration—varies wildly. Some providers expose deep integration with their monitoring stack; others offer minimal visibility. You must verify that the provider’s observability model meets your SRE team’s needs.

Evaluation Framework: Questions for Your Team

Before selecting a provider, work through this technical checklist:

  1. Route Requirements: Do we need stable support for HTTP only, or also gRPC, TCP, UDP? Is beta support acceptable for non-HTTP routes?
  2. Infrastructure Model: Do we want a cloud-managed load balancer (simpler, less control) or a cluster-based controller (more portable, more operational overhead)?
  3. Multi-Cluster Future: Is our architecture single-cluster today but likely to expand? Does the provider have a credible roadmap for multi-cluster Gateway API?
  4. Policy Needs: What advanced policies (auth, WAF, rate limiting) are required? How does the provider implement them? Can we live with vendor-specific policy CRDs?
  5. Observe & Debug: What logging, metrics, and tracing are exposed for Gateway API resources? Do they integrate with our existing observability platform?
  6. Upgrade Path: What is the provider’s track record for supporting new Gateway API releases? How painful are version upgrades?

Strategic Recommendations

Based on the current landscape, here are pragmatic paths forward:

  • For Single-Cloud Deployments: Start with your cloud provider’s native controller (AWS, GKE, AKS). It’s the path of least resistance and best integration with other cloud services (IAM, certificates, monitoring). Just be acutely aware of its specific limitations regarding unsupported route types.
  • For Hybrid/Multi-Cloud or On-Premises: Standardize on a portable, cluster-based controller like NGINX Gateway Fabric or Kong. The consistency across environments will save significant operational complexity, even if it means forgoing some cloud-native integrations.
  • For Greenfield Projects: Design your applications and configurations against the stable v1 resources (Gateway, HTTPRoute) only. Treat any use of beta/experimental resources as a known risk that may require refactoring later.
  • Always Have an Exit Plan: Isolate Gateway API configuration YAMLs from provider-specific policies and annotations. This modularity will make migration less painful when the next generation of providers emerges or when you need to switch.

The Gateway API’s evolution is a net positive for the Kubernetes ecosystem, offering a far more expressive model than the original Ingress. However, in 2026, the provider landscape is still maturing. Support is broad but not deep, and critical gaps in multi-cluster management and policy portability remain. The successful architect will choose a provider not based on a feature checklist, but based on how well its specific constraints and capabilities align with their organization’s immediate traffic patterns and long-term platform strategy. The era of a universal, write-once-run-anywhere Gateway API configuration is not yet here—but with careful, informed provider selection, you can build a robust foundation for it.

Building a Kubernetes Migration Framework: Lessons from Ingress-NGINX

Building a Kubernetes Migration Framework: Lessons from Ingress-NGINX

The recent announcement regarding the deprecation of the Ingress-NGINX controller sent a ripple through the Kubernetes community. For many organizations, it’s the first major deprecation of a foundational, widely-adopted ecosystem component. While the immediate reaction is often tactical—”What do we replace it with?”—the more valuable long-term question is strategic: “How do we systematically manage this and future migrations?”

This event isn’t an anomaly; it’s a precedent. As Kubernetes matures, core add-ons, APIs, and patterns will evolve or sunset. Platform engineering teams need a repeatable, low-risk framework for navigating these changes. Drawing from the Ingress-NGINX transition and established deployment management principles, we can abstract a robust Kubernetes Migration Framework applicable to any major component, from service meshes to CSI drivers.

Why Ad-Hoc Migrations Fail in Production

Attempting a “big bang” replacement or a series of manual, one-off changes is a recipe for extended downtime, configuration drift, and undetected regression. Production Kubernetes environments are complex systems with deep dependencies:

  • Interdependent Workloads: Multiple applications often share the same ingress controller, relying on specific annotations, custom snippets, or behavioral quirks.
  • Automation and GitOps Dependencies: Helm charts, Kustomize overlays, and ArgoCD/Flux manifests are tightly coupled to the existing component’s API and schema.
  • Observability and Security Integration: Monitoring dashboards, logging parsers, and security policies are tuned for the current implementation.
  • Knowledge Silos: Tribal knowledge about workarounds and specific configurations isn’t documented.

A structured framework mitigates these risks by enforcing discipline, creating clear validation gates, and ensuring the capability to roll back at any point.

The Four-Phase Kubernetes Migration Framework

This framework decomposes the migration into four distinct phases: Assessment, Parallel Run, Cutover, and Decommission. Each phase has defined inputs, activities, and exit criteria.

Phase 1: Deep Assessment & Dependency Mapping

Before writing a single line of new configuration, understand the full scope. The goal is to move from “we use Ingress-NGINX” to a precise inventory of how it’s used.

  • Inventory All Ingress Resources: Use kubectl get ingress --all-namespaces as a starting point, but go deeper.
  • Analyze Annotation Usage: Script an analysis to catalog every annotation in use (e.g., nginx.ingress.kubernetes.io/rewrite-target, nginx.ingress.kubernetes.io/configuration-snippet). This reveals functional dependencies.
  • Map to Backend Services: For each Ingress, identify the backend Services and Namespaces. This highlights critical applications and potential blast radius.
  • Review Customizations: Document any custom ConfigMaps for main NGINX configuration, custom template patches, or modifications to the controller deployment itself.
  • Evaluate Alternatives: Based on the inventory, evaluate candidate replacements (e.g., Gateway API with a compatible implementation, another Ingress controller like Emissary-ingress or Traefik). The Google Cloud migration framework provides a useful decision tree for ingress-specific migrations.

The output of this phase is a migration manifest: a concrete list of what needs to be converted, grouped by complexity and criticality.
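The annotation analysis above can be scripted with jq against a one-time dump of all Ingress resources. A sketch, assuming you first capture the dump with `kubectl get ingress --all-namespaces -o json > ingresses.json`:

```shell
# count_nginx_annotations: catalog every ingress-nginx annotation in use,
# with usage counts, from a JSON dump of all Ingress resources.
count_nginx_annotations() {
  jq -r '.items[].metadata.annotations // {} | keys[]' "$1" \
    | grep '^nginx\.ingress\.kubernetes\.io/' \
    | sort | uniq -c | sort -rn
}
```

Annotations like configuration snippets and rewrites bubble to the top of the output and flag the applications that will be hardest to migrate.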

Phase 2: Phased Rollout & Parallel Run

This is the core of a low-risk migration. Instead of replacing, you run the new and old systems in parallel, shifting traffic gradually. For ingress, this often means installing the new controller alongside the old one.

  • Dual Installation: Deploy the new controller in the same cluster under a distinct class — a separate ingressClassName if you are adopting another Ingress controller, or a gatewayClassName on the Gateway resource if you are moving to the Gateway API.
  • Create Canary Ingress Resources: For a low-risk application, create a parallel Ingress or Gateway resource pointing to the new controller. Use techniques like managed deployments with canary patterns to control exposure.
    # Example: A new Gateway API HTTPRoute for a canary service
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: app-canary
    spec:
      parentRefs:
        - name: company-gateway
      rules:
        - backendRefs:
            - name: app-service
              port: 8080
              weight: 90
            - name: app-service-canary
              port: 8080
              weight: 10 # Weights are relative within the rule: ~10% of traffic

  • Validate Equivalency: Use traffic mirroring (if supported) or direct synthetic testing against both ingress paths. Compare logs, response headers, latency, and error rates.
  • Iterate and Expand: Gradually increase traffic weight or add more applications to the new stack, group by group, based on the assessment from Phase 1.
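Part of the equivalency check can be automated by capturing response headers through each ingress path and diffing them with volatile fields stripped. A sketch — the ignored header names (Date, request IDs, cookies) are examples and will vary per stack:

```shell
# diff_response_headers: compare two captured header files (one per ingress
# path), ignoring headers that legitimately differ between requests.
diff_response_headers() {
  local ignore='^(date|x-request-id|set-cookie):'
  diff \
    <(tr -d '\r' < "$1" | tr '[:upper:]' '[:lower:]' | grep -Ev "$ignore" | sort) \
    <(tr -d '\r' < "$2" | tr '[:upper:]' '[:lower:]' | grep -Ev "$ignore" | sort)
}
```

Capture with something like `curl -sI --resolve app.example.com:443:<old-lb-ip> https://app.example.com/ > old.txt`, repeat against the new load balancer, then run `diff_response_headers old.txt new.txt`.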

This phase relies heavily on your observability stack. Dashboards comparing error rates, latency (p50, p99), and throughput between the old and new paths are essential.

Phase 3: Validation & Automated Cutover

The cutover is not a manual event. It’s the final step in a validation process.

  • Define Validation Tests: Create a suite of tests that must pass before full cutover. This includes:
    • Smoke tests for all critical user journeys.
    • Load tests to verify performance under expected traffic patterns.
    • Security scan validation (e.g., no unintended ports open).
    • Compliance checks (e.g., specific headers are present).
  • Automate the Switch: For each application, the cutover is ultimately a change in its Ingress or Gateway resource. This should be done via your GitOps pipeline. Update the source manifests (e.g., change the ingressClassName), merge, and let automation apply it. This ensures the state is declarative and recorded.
  • Maintain Rollback Capacity: The old system must remain operational and routable (with reduced capacity) during this phase. The GitOps rollback is simply reverting the manifest change.
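A validation gate can be as simple as a script your pipeline runs before the cutover change merges, failing if any critical journey misbehaves. A minimal sketch — the URL list and expected status codes are placeholders for your own smoke tests:

```shell
# cutover_gate: read "URL expected-status" pairs from stdin and fail
# if any endpoint returns an unexpected HTTP status code.
cutover_gate() {
  local url expected code failed=0
  while read -r url expected; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" != "$expected" ]; then
      echo "FAIL $url: got $code, want $expected"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: `printf '%s\n' 'https://app.example.com/healthz 200' | cutover_gate` as a pipeline step; a nonzero exit blocks the merge.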

Phase 4: Observability & Decommission

Once all traffic is successfully migrated and validated over a sustained period (e.g., 72 hours), you can decommission the old component.

  • Monitor Aggressively: Keep a close watch on all key metrics for at least one full business cycle (a week).
  • Remove Old Resources: Delete the old controller’s Deployment, Service, ConfigMaps, and CRDs (if no longer needed).
  • Clean Up Auxiliary Artifacts: Remove old RBAC bindings, service accounts, and any custom monitoring alerts or dashboards specific to the old component.
  • Document Lessons Learned: Update runbooks and architecture diagrams. Note any surprises, gaps in the process, or validation tests that were particularly valuable.

Key Principles for a Resilient Framework

Beyond the phases, these principles should guide your framework’s design:

  • Always Maintain Rollback Capability: Every step should be reversible with minimal disruption. This is a core tenet of managing Kubernetes deployments.
  • Leverage GitOps for State Management: All desired state changes (Ingress resources, controller deployments) must flow through version-controlled manifests. This provides an audit trail, consistency, and the simplest rollback mechanism (git revert).
  • Validate with Production Traffic Patterns: Synthetic tests are insufficient. Use canary weights and traffic mirroring to validate with real user traffic in a controlled manner.
  • Communicate Transparently: Platform teams should maintain a clear migration status page for internal stakeholders, showing which applications have been migrated, which are in progress, and the overall timeline.

Conclusion: Building a Migration-Capable Platform

The deprecation of Ingress-NGINX is a wake-up call. The next major change is a matter of “when,” not “if.” By investing in a structured migration framework now, platform teams transform a potential crisis into a manageable, repeatable operational procedure.

This framework—Assess, Run in Parallel, Validate, and Decommission—abstracts the specific lessons from the ingress migration into a generic pattern. It can be applied to migrating from PodSecurityPolicies to Pod Security Standards, from a deprecated CSI driver, or from one service mesh to another. The tools (GitOps, canary deployments, observability) are already in your stack. The value is in stitching them together into a disciplined process that ensures platform evolution doesn’t compromise platform stability.

Start by documenting this framework as a runbook template. Then, apply it to your next significant component update, even a minor one, to refine the process. When the next major deprecation announcement lands in your inbox, you’ll be ready.

Kubernetes Housekeeping: How to Clean Up Orphaned ConfigMaps and Secrets


If you’ve been running Kubernetes clusters for any meaningful amount of time, you’ve likely encountered a familiar problem: orphaned ConfigMaps and Secrets piling up in your namespaces. These abandoned resources don’t just clutter your cluster—they introduce security risks, complicate troubleshooting, and can even impact cluster performance as your resource count grows.

The reality is that Kubernetes doesn’t automatically clean up ConfigMaps and Secrets when the workloads that reference them are deleted. This gap in Kubernetes’ native garbage collection creates a housekeeping problem that every production cluster eventually faces. In this article, we’ll explore why orphaned resources happen, how to detect them, and most importantly, how to implement sustainable cleanup strategies that prevent them from accumulating in the first place.

Understanding the Orphaned Resource Problem

What Are Orphaned ConfigMaps and Secrets?

Orphaned ConfigMaps and Secrets are configuration resources that no longer have any active references from Pods, Deployments, StatefulSets, or other workload resources in your cluster. They typically become orphaned when:

  • Applications are updated and new ConfigMaps are created while old ones remain
  • Deployments are deleted but their associated configuration resources aren’t
  • Failed rollouts leave behind unused configuration versions
  • Development and testing workflows create temporary resources that never get cleaned up
  • CI/CD pipelines generate unique ConfigMap names (often with hash suffixes) on each deployment

Why This Matters for Production Clusters

While a few orphaned ConfigMaps might seem harmless, the problem compounds over time and introduces real operational challenges:

Security Risks: Orphaned Secrets can contain outdated credentials, API keys, or certificates that should no longer be accessible. If these aren’t removed, they remain attack vectors for unauthorized access—especially problematic if RBAC policies grant broad read access to Secrets within a namespace.

Cluster Bloat: Kubernetes stores these resources in etcd, your cluster’s backing store. As the number of orphaned resources grows, etcd size increases, potentially impacting cluster performance and backup times. In extreme cases, this can contribute to etcd performance degradation or even hit storage quotas.

Operational Complexity: When troubleshooting issues or reviewing configurations, sifting through dozens of unused ConfigMaps makes it harder to identify which resources are actually in use. This “configuration noise” slows down incident response and increases cognitive load for your team.

Cost Implications: While individual ConfigMaps are small, at scale they contribute to storage costs and can trigger alerts in cost monitoring systems, especially in multi-tenant environments where resource quotas matter.
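Before any deeper analysis, a quick per-namespace count tells you whether bloat is actually a problem in your cluster. A small sketch, assuming kubectl access:

```shell
# count_by_namespace: number of ConfigMaps (or Secrets) per namespace,
# largest offenders first. Pass "configmaps" or "secrets".
count_by_namespace() {
  kubectl get "$1" --all-namespaces --no-headers \
    | awk '{print $1}' | sort | uniq -c | sort -rn
}
```

Run `count_by_namespace configmaps` and `count_by_namespace secrets`; a namespace with hundreds of entries for a handful of workloads is a strong signal of accumulation.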

Detecting Orphaned ConfigMaps and Secrets

Before you can clean up orphaned resources, you need to identify them. Let’s explore both manual detection methods and automated tooling approaches.

Manual Detection with kubectl

The simplest approach uses kubectl to cross-reference ConfigMaps and Secrets against active workload resources. Here’s a basic script to identify potentially orphaned ConfigMaps:

#!/bin/bash
# detect-orphaned-configmaps.sh
# Identifies ConfigMaps not referenced by any active Pods

NAMESPACE=${1:-default}

echo "Checking for orphaned ConfigMaps in namespace: $NAMESPACE"
echo "---"

# Get all ConfigMaps in the namespace
CONFIGMAPS=$(kubectl get configmaps -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for cm in $CONFIGMAPS; do
    # Skip kube-root-ca.crt as it's system-managed
    if [[ "$cm" == "kube-root-ca.crt" ]]; then
        continue
    fi

    # Check if any Pod references this ConfigMap
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg cm "$cm" '.items[] |
        select(
            (.spec.volumes[]?.configMap.name == $cm) or
            (.spec.containers[].env[]?.valueFrom.configMapKeyRef.name == $cm) or
            (.spec.containers[].envFrom[]?.configMapRef.name == $cm)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $cm"
    fi
done

A similar script for Secrets would look like this:

#!/bin/bash
# detect-orphaned-secrets.sh

NAMESPACE=${1:-default}

echo "Checking for orphaned Secrets in namespace: $NAMESPACE"
echo "---"

SECRETS=$(kubectl get secrets -n $NAMESPACE -o jsonpath='{.items[*].metadata.name}')

for secret in $SECRETS; do
    # Skip service account tokens and system secrets
    SECRET_TYPE=$(kubectl get secret $secret -n $NAMESPACE -o jsonpath='{.type}')
    if [[ "$SECRET_TYPE" == "kubernetes.io/service-account-token" ]]; then
        continue
    fi

    # Check if any Pod references this Secret
    REFERENCED=$(kubectl get pods -n $NAMESPACE -o json | \
        jq -r --arg secret "$secret" '.items[] |
        select(
            (.spec.volumes[]?.secret.secretName == $secret) or
            (.spec.containers[].env[]?.valueFrom.secretKeyRef.name == $secret) or
            (.spec.containers[].envFrom[]?.secretRef.name == $secret) or
            (.spec.imagePullSecrets[]?.name == $secret)
        ) | .metadata.name' | head -1)

    if [[ -z "$REFERENCED" ]]; then
        echo "Orphaned: $secret"
    fi
done

Important caveat: These scripts only check currently running Pods, and only the most common reference paths — they miss init containers, ephemeral containers, and projected volumes. They also won’t catch ConfigMaps or Secrets referenced by Deployments, StatefulSets, or DaemonSets that currently have zero replicas. For production use, you’ll want to check against all workload resource types and reference paths.
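A sketch of that broader check, run against a workload dump (`kubectl get deploy,sts,ds -n <ns> -o json > workloads.json`); note that it also covers init containers, which the Pod-level scripts above skip:

```shell
# referenced_configmaps: list every ConfigMap name referenced in the pod
# templates of a workload dump (Deployments, StatefulSets, DaemonSets).
referenced_configmaps() {
  jq -r '.items[].spec.template.spec |
    [.volumes[]?.configMap.name,
     .containers[].env[]?.valueFrom.configMapKeyRef.name,
     .containers[].envFrom[]?.configMapRef.name,
     .initContainers[]?.env[]?.valueFrom.configMapKeyRef.name,
     .initContainers[]?.envFrom[]?.configMapRef.name] |
    .[] | select(. != null)' "$1" | sort -u
}
```

Anything flagged by the Pod-level scripts that is also absent from this list is a much stronger orphan candidate. CronJobs and Jobs nest their pod templates differently (`.spec.jobTemplate.spec.template.spec`) and need a separate pass.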

Automated Detection with Specialized Tools

Several open-source tools have emerged to solve this problem more comprehensively:

Kor: Comprehensive Unused Resource Detection

Kor is a purpose-built tool for finding unused resources across your Kubernetes cluster. It checks not just ConfigMaps and Secrets, but also PVCs, Services, and other resource types.

# Install Kor
brew install kor

# Scan for unused ConfigMaps and Secrets
kor all --namespace production --output json

# Check specific resource types
kor configmap --namespace production
kor secret --namespace production --exclude-namespaces kube-system,kube-public

Kor works by analyzing resource relationships and identifying anything without dependent objects. It’s particularly effective because it understands Kubernetes resource hierarchies and checks against Deployments, StatefulSets, and DaemonSets—not just running Pods.

Popeye: Cluster Sanitization Reports

Popeye scans your cluster and generates reports on resource health, including orphaned resources. While broader in scope than just ConfigMap cleanup, it provides valuable context:

# Install Popeye
brew install derailed/popeye/popeye

# Scan cluster
popeye --output json --save

# Focus on specific namespace
popeye --namespace production

Custom Controllers with Kubernetes APIs

For more sophisticated detection, you can build custom controllers using client-go that continuously monitor for orphaned resources. This approach works well when integrated with your existing observability stack:

// Pseudocode example
func detectOrphanedConfigMaps(namespace string) []string {
    configMaps := listConfigMaps(namespace)
    deployments := listDeployments(namespace)
    statefulSets := listStatefulSets(namespace)
    daemonSets := listDaemonSets(namespace)

    referenced := make(map[string]bool)

    // Check all workload types for ConfigMap references
    for _, deploy := range deployments {
        for _, cm := range getReferencedConfigMaps(deploy) {
            referenced[cm] = true
        }
    }
    // ... repeat for other workload types

    orphaned := []string{}
    for _, cm := range configMaps {
        if !referenced[cm.Name] {
            orphaned = append(orphaned, cm.Name)
        }
    }

    return orphaned
}

Prevention Strategies: Stop Orphans Before They Start

The best cleanup strategy is prevention. By implementing proper resource management patterns from the beginning, you can minimize orphaned resources in the first place.

Use Owner References for Automatic Cleanup

Kubernetes provides a built-in mechanism for resource lifecycle management through owner references. When properly configured, child resources are automatically deleted when their owner is removed.

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      uid: d9607e19-f88f-11e6-a518-42010a800195  # must match the live Deployment's UID
      controller: true
      blockOwnerDeletion: true
data:
  app.properties: |
    database.url=postgres://db:5432

Note that Helm and Kustomize don’t set owner references themselves — ownerReferences are set by controllers (a ReplicaSet owned by a Deployment, for example) or added explicitly as above. Helm instead tracks release-owned resources through its own release metadata and deletes them on helm uninstall, and GitOps tools prune anything missing from the desired state, which is one reason declarative workflows tend to have fewer orphaned resources than imperative deployment approaches.

Implement Consistent Labeling Standards

Labels make it much easier to identify resource relationships and track ownership:

apiVersion: v1
kind: ConfigMap
metadata:
  name: api-gateway-config-v2
  labels:
    app: api-gateway
    component: configuration
    version: v2
    managed-by: argocd
    owner: platform-team
data:
  config.yaml: |
    # configuration here

With consistent labeling, you can easily query for ConfigMaps associated with specific applications:

# Find all ConfigMaps for a specific app
kubectl get configmaps -l app=api-gateway

# Clean up old versions
kubectl delete configmaps -l app=api-gateway,version=v1

Adopt GitOps Practices

GitOps tools like ArgoCD and Flux excel at preventing orphaned resources because they maintain a clear desired state:

  • Declarative management: All resources are defined in Git
  • Automatic pruning: Tools can detect and remove resources not defined in Git
  • Audit trail: Git history shows when and why resources were created or deleted

ArgoCD’s sync policies can automatically prune resources:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
spec:
  syncPolicy:
    automated:
      prune: true  # Remove resources not in Git
      selfHeal: true

Use Kustomize ConfigMap Generators with Hashes

Kustomize’s ConfigMap generator feature appends content hashes to ConfigMap names, ensuring that configuration changes trigger new ConfigMaps:

# kustomization.yaml
configMapGenerator:
  - name: app-config
    files:
      - config.properties
generatorOptions:
  disableNameSuffixHash: false  # Include hash in name

This creates ConfigMaps like app-config-dk9g72hk5f. When you update the configuration, Kustomize creates a new ConfigMap with a different hash. Combined with kubectl apply’s --prune flag (an alpha feature; GitOps-based pruning is the more robust option), old ConfigMaps are removed automatically:

kubectl apply --prune -k ./overlays/production \
  -l app=myapp

Set Resource Quotas

While quotas don’t prevent orphans, they create backpressure that forces teams to clean up:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: config-quota
  namespace: production
spec:
  hard:
    configmaps: "50"
    secrets: "50"

When teams hit quota limits, they’re incentivized to audit and remove unused resources.

Cleanup Strategies for Existing Orphaned Resources

For clusters that already have accumulated orphaned ConfigMaps and Secrets, here are practical cleanup approaches.

One-Time Manual Cleanup

For immediate cleanup, combine detection scripts with kubectl delete:

# Dry run first - review what would be deleted
./detect-orphaned-configmaps.sh production > orphaned-cms.txt
cat orphaned-cms.txt

# Manual review and cleanup
for cm in $(cat orphaned-cms.txt | grep "Orphaned:" | awk '{print $2}'); do
    kubectl delete configmap $cm -n production
done

Critical warning: Always do a dry run and manual review first. Some ConfigMaps might be referenced by workloads that aren’t currently running but will scale up later (HPA scaled to zero, CronJobs, etc.).

Scheduled Cleanup with CronJobs

For ongoing maintenance, deploy a Kubernetes CronJob that runs cleanup scripts periodically:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: configmap-cleanup
  namespace: kube-system
spec:
  schedule: "0 2 * * 0"  # Weekly at 2 AM Sunday
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cleanup-sa
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              # Cleanup script here
              echo "Starting ConfigMap cleanup..."

              for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
                echo "Checking namespace: $ns"

                # Get all workload-referenced ConfigMaps
                REFERENCED_CMS=$(kubectl get deploy,sts,ds -n $ns -o json | \
                  jq -r '.items[].spec.template.spec |
                  [.volumes[]?.configMap.name,
                   .containers[].env[]?.valueFrom.configMapKeyRef.name,
                   .containers[].envFrom[]?.configMapRef.name] |
                  .[] | select(. != null)' | sort -u)

                ALL_CMS=$(kubectl get cm -n $ns -o jsonpath='{.items[*].metadata.name}')

                for cm in $ALL_CMS; do
                  if [[ "$cm" == "kube-root-ca.crt" ]]; then
                    continue
                  fi

                  if ! echo "$REFERENCED_CMS" | grep -q "^$cm$"; then
                    echo "Deleting orphaned ConfigMap: $cm in namespace: $ns"
                    kubectl delete cm $cm -n $ns
                  fi
                done
              done
          restartPolicy: OnFailure
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cleanup-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cleanup-role
rules:
- apiGroups: [""]
  resources: ["configmaps", "secrets", "namespaces"]
  verbs: ["get", "list", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments", "statefulsets", "daemonsets"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cleanup-binding
subjects:
- kind: ServiceAccount
  name: cleanup-sa
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cleanup-role
  apiGroup: rbac.authorization.k8s.io

Security consideration: This CronJob needs cluster-wide permissions to read workloads and delete ConfigMaps. Review and adjust the RBAC permissions based on your security requirements. Consider limiting to specific namespaces if you don’t need cluster-wide cleanup.

Integration with CI/CD Pipelines

Build cleanup into your deployment workflows. Here’s an example GitLab CI job:

cleanup_old_configs:
  stage: post-deploy
  image: bitnami/kubectl:latest
  script:
    - |
      # Delete ConfigMaps with old version labels after successful deployment
      kubectl delete configmap -n production \
        -l app=myapp,version!=v${CI_COMMIT_TAG}

    - |
      # Keep only the last 3 ConfigMap versions by timestamp
      kubectl get configmap -n production \
        -l app=myapp \
        --sort-by=.metadata.creationTimestamp \
        -o name | head -n -3 | xargs -r kubectl delete -n production
  only:
    - tags
  when: on_success

Safe Deletion Practices

When cleaning up ConfigMaps and Secrets, follow these safety guidelines:

  1. Dry run first: Always review what will be deleted before executing
  2. Backup before deletion: Export resources to YAML files before removing them
  3. Check age: Only delete resources older than a certain threshold (e.g., 30 days)
  4. Exclude system resources: Skip kube-system, kube-public, and other system namespaces
  5. Monitor for impact: Watch application metrics after cleanup to ensure nothing broke

Example backup and conditional deletion:

# Backup before deletion
kubectl get configmap -n production -o yaml > cm-backup-$(date +%Y%m%d).yaml

# Only delete ConfigMaps older than 30 days
kubectl get configmap -n production -o json | \
  jq -r --arg date "$(date -d '30 days ago' -u +%Y-%m-%dT%H:%M:%SZ)" \
  '.items[] | select(.metadata.creationTimestamp < $date) | .metadata.name' | \
  while read cm; do
    echo "Would delete: $cm (created: $(kubectl get cm $cm -n production -o jsonpath='{.metadata.creationTimestamp}'))"
    # Uncomment to actually delete:
    # kubectl delete configmap $cm -n production
  done

Advanced Patterns for Large-Scale Clusters

For organizations running multiple clusters or large multi-tenant platforms, housekeeping requires more sophisticated approaches.

Policy-Based Cleanup with OPA Gatekeeper

Use OPA Gatekeeper to enforce ConfigMap lifecycle policies at admission time:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: configmaprequiredlabels
spec:
  crd:
    spec:
      names:
        kind: ConfigMapRequiredLabels
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package configmaprequiredlabels

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["app"]
          msg := "ConfigMaps must have an 'app' label for lifecycle tracking"
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "ConfigMap"
          not input.review.object.metadata.labels["owner"]
          msg := "ConfigMaps must have an 'owner' label for lifecycle tracking"
        }

This policy prevents ConfigMaps without proper labels from being created, making future tracking and cleanup much easier.
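The ConstraintTemplate only defines the policy; enforcing it requires a corresponding Constraint instance. A sketch — the metadata name is arbitrary, but the kind must match the template’s names.kind:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: ConfigMapRequiredLabels
metadata:
  name: configmaps-need-lifecycle-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["ConfigMap"]
    # Skip system namespaces where labels aren't under your control
    excludedNamespaces: ["kube-system", "kube-public"]
```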

Centralized Monitoring with Prometheus

Monitor orphaned resource metrics across your clusters:

apiVersion: v1
kind: ConfigMap
metadata:
  name: orphan-detection-exporter
data:
  script.sh: |
    #!/bin/bash
    # Write metrics to a file on a shared volume; serve it with a sidecar
    # (e.g. node_exporter's textfile collector) for Prometheus to scrape
    METRICS_FILE=${METRICS_FILE:-/metrics/orphans.prom}
    while true; do
      {
        echo "# HELP k8s_orphaned_configmaps Number of orphaned ConfigMaps"
        echo "# TYPE k8s_orphaned_configmaps gauge"

        for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
          count=$(./detect-orphaned-configmaps.sh $ns | grep -c "Orphaned:")
          echo "k8s_orphaned_configmaps{namespace=\"$ns\"} $count"
        done
      } > "${METRICS_FILE}.tmp" && mv "${METRICS_FILE}.tmp" "$METRICS_FILE"

      sleep 300  # Update every 5 minutes
    done

Create alerts when orphaned resource counts exceed thresholds:

groups:
- name: kubernetes-housekeeping
  rules:
  - alert: HighOrphanedConfigMapCount
    expr: k8s_orphaned_configmaps > 20
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: "High number of orphaned ConfigMaps in {{ $labels.namespace }}"
      description: "Namespace {{ $labels.namespace }} has {{ $value }} orphaned ConfigMaps"

Multi-Cluster Cleanup with Crossplane or Cluster API

For platform teams managing dozens or hundreds of clusters, extend cleanup automation across your entire fleet:

# Crossplane Composition for cluster-wide cleanup
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: cluster-cleanup-policy
spec:
  compositeTypeRef:
    apiVersion: platform.example.com/v1
    kind: ClusterCleanupPolicy
  resources:
    - name: cleanup-cronjob
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1
        kind: Object
        spec:
          forProvider:
            manifest:
              apiVersion: batch/v1
              kind: CronJob
              # ... CronJob spec from earlier

Housekeeping Checklist for Production Clusters

Here’s a practical checklist to implement sustainable ConfigMap and Secret housekeeping:

Immediate Actions:

  • [ ] Run detection scripts to audit current orphaned resource count
  • [ ] Backup all ConfigMaps and Secrets before any cleanup
  • [ ] Manually review and delete obvious orphans (with team approval)
  • [ ] Document which ConfigMaps/Secrets are intentionally unused but needed

Short-term (1-4 weeks):

  • [ ] Implement consistent labeling standards across teams
  • [ ] Add owner references to all ConfigMaps and Secrets
  • [ ] Deploy scheduled CronJob for automated detection and reporting
  • [ ] Integrate cleanup steps into CI/CD pipelines

Long-term (1-3 months):

  • [ ] Adopt GitOps tooling (ArgoCD, Flux) with automated pruning
  • [ ] Implement OPA Gatekeeper policies for required labels
  • [ ] Set up Prometheus monitoring for orphaned resource metrics
  • [ ] Create runbooks for incident responders
  • [ ] Establish resource quotas per namespace
  • [ ] Conduct quarterly cluster hygiene reviews

Ongoing Practices:

  • [ ] Review orphaned resource reports weekly
  • [ ] Include cleanup tasks in sprint planning
  • [ ] Train new team members on resource lifecycle best practices
  • [ ] Update cleanup automation as cluster architecture evolves

Conclusion

Kubernetes doesn’t automatically clean up orphaned ConfigMaps and Secrets, but with the right strategies, you can prevent them from becoming a problem. The key is implementing a layered approach: use owner references and GitOps for prevention, deploy automated detection for ongoing monitoring, and run scheduled cleanup jobs for maintenance.

Start with detection to understand your current situation, then focus on prevention strategies like owner references and consistent labeling. For existing clusters with accumulated orphaned resources, implement gradual cleanup with proper safety checks rather than aggressive bulk deletion.

Remember that housekeeping isn’t a one-time task—it’s an ongoing operational practice. By building cleanup into your CI/CD pipelines and establishing clear resource ownership, you’ll maintain a clean, secure, and performant Kubernetes environment over time.

The tools and patterns we’ve covered here—from simple bash scripts to sophisticated policy engines—can be adapted to your organization’s scale and maturity level. Whether you’re managing a single cluster or a multi-cluster platform, investing in proper resource lifecycle management pays dividends in operational efficiency, security posture, and team productivity.

Frequently Asked Questions (FAQ)

Can Kubernetes automatically delete unused ConfigMaps and Secrets?

No. Kubernetes does not garbage-collect ConfigMaps or Secrets by default when workloads are deleted. Unless they have ownerReferences set, these resources remain in the cluster indefinitely and must be cleaned up manually or via automation.

Is it safe to delete ConfigMaps or Secrets that are not referenced by running Pods?

Not always. Some resources may be referenced by workloads scaled to zero, CronJobs, or future rollouts. Always perform a dry run, check workload definitions (Deployments, StatefulSets, DaemonSets), and review resource age before deletion.

What is the safest way to prevent orphaned ConfigMaps and Secrets?

The most effective prevention strategies are:
  • Using ownerReferences (set by controllers, or added explicitly in manifests)
  • Adopting GitOps with pruning enabled (ArgoCD / Flux)
  • Applying consistent labeling (app, owner, version)
These ensure unused resources are automatically detected and removed.

Which tools are best for detecting orphaned resources?

Popular and reliable tools include:
  • Kor – purpose-built for detecting unused Kubernetes resources
  • Popeye – broader cluster hygiene and sanitization reports
  • Custom scripts/controllers – useful for tailored environments or integrations
For production clusters, Kor provides the best signal-to-noise ratio.

How often should ConfigMap and Secret cleanup run in production?

A common best practice is:
  • Weekly detection (reporting only)
  • Monthly cleanup for resources older than a defined threshold (e.g., 30–60 days)
  • Immediate cleanup integrated into CI/CD after successful deployments
This balances safety with long-term cluster hygiene.


Kubernetes Gateway API Versions: Complete Compatibility and Upgrade Guide


The Kubernetes Gateway API has rapidly evolved from its experimental roots to become the standard for ingress and service mesh traffic management. But with multiple versions released and various maturity levels, understanding which version to use, how it relates to your Kubernetes cluster, and when to upgrade can be challenging.

In this comprehensive guide, we’ll explore the different Gateway API versions, their relationship to Kubernetes releases, provider support levels, and the upgrade philosophy that will help you make informed decisions for your infrastructure.

Understanding Gateway API Versioning

The Gateway API follows a unique versioning model that differs from standard Kubernetes APIs. Unlike built-in Kubernetes resources that are tied to specific cluster versions, Gateway API CRDs can be installed independently as long as your cluster meets the minimum requirements.

Minimum Kubernetes Version Requirements

As of Gateway API v1.1, you need Kubernetes 1.26 or later to run the latest Gateway API releases. The project commits to supporting at least the five most recent Kubernetes minor versions, providing a reasonable window for cluster upgrades.

This rolling support window means that if you’re running Kubernetes 1.26, 1.27, 1.28, 1.29, or 1.30, you can safely install and use the latest Gateway API without concerns about compatibility.
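You can check which bundle version and channel are installed in a cluster by reading the annotations the Gateway API project stamps onto its CRDs (annotation names as published in the project’s install manifests):

```shell
# gateway_api_version: print the installed Gateway API bundle version and
# release channel from the Gateway CRD's annotations.
gateway_api_version() {
  kubectl get crd gateways.gateway.networking.k8s.io -o jsonpath='{.metadata.annotations.gateway\.networking\.k8s\.io/bundle-version} {.metadata.annotations.gateway\.networking\.k8s\.io/channel}{"\n"}'
}
```

A typical result looks like `v1.1.0 standard`, telling you both the release and whether the Experimental channel is in play.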

Release Channels: Standard vs Experimental

Gateway API uses two distinct release channels to balance stability with innovation. Understanding these channels is critical for choosing the right version for your use case.

Standard Channel

The Standard channel contains only GA (Generally Available, v1) and Beta (v1beta1) level resources and fields. When you install from the Standard channel, you get:

  • Stability guarantees: No breaking changes once a resource reaches Beta or GA
  • Backwards compatibility: Safe to upgrade between minor versions
  • Production readiness: Extensively tested features with multiple implementations
  • Conformance coverage: Full test coverage ensuring portability

Resources in the Standard channel include GatewayClass, Gateway, HTTPRoute, and ReferenceGrant at the v1 level, plus stable features like GRPCRoute.
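To make the Standard channel concrete, here is a minimal sketch of a v1 Gateway and HTTPRoute pair. The gateway class, hostname, and service names are hypothetical placeholders; substitute whatever your controller and application actually use:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway                  # hypothetical name
  namespace: default
spec:
  gatewayClassName: example-class     # must match a GatewayClass your controller provides
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-route
  namespace: default
spec:
  parentRefs:
    - name: demo-gateway              # attach the route to the Gateway above
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: demo-service          # hypothetical backend Service
          port: 8080
```

Because both resources are at the v1 API version, this manifest is covered by the Standard channel's backwards-compatibility guarantees across upgrades.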

Experimental Channel

The Experimental channel includes everything from the Standard channel plus Alpha-level resources and experimental fields. Choose this channel for:

  • Early feature testing: Try new capabilities before they stabilize
  • Cutting-edge functionality: Access the latest Gateway API innovations
  • Feature feedback: Help shape the API by testing experimental features

The trade-off is that there are no stability guarantees: breaking changes can occur between releases.

Features may graduate from Experimental to Standard or be dropped entirely based on implementation experience and community feedback.

Gateway API Version History and Features

Let’s explore the major Gateway API releases and what each introduced.

v1.0 (October 2023)

The v1.0 release marked a significant milestone, graduating core resources to GA status. This release included:

  • Gateway, GatewayClass, and HTTPRoute at v1 (stable)
  • Full backwards compatibility guarantees for v1 resources
  • Production-ready status for ingress traffic management
  • Multiple conformant implementations across vendors

v1.1 (May 2024)

Version 1.1 expanded the API significantly with service mesh support:

  • GRPCRoute: Native support for gRPC traffic routing
  • Service mesh capabilities: East-west traffic management alongside north-south
  • Multiple implementations: Both Istio and other service meshes achieved conformance
  • Enhanced features: Additional matching criteria and routing capabilities

This version bridged the gap between traditional ingress controllers and full service mesh implementations.
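A GRPCRoute looks much like an HTTPRoute, but matches on gRPC service and method names instead of HTTP paths. A minimal sketch, with hypothetical gateway, service, and backend names:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: demo-grpc-route               # hypothetical name
spec:
  parentRefs:
    - name: demo-gateway              # hypothetical Gateway
  hostnames:
    - "grpc.example.com"
  rules:
    - matches:
        - method:
            service: example.EchoService   # fully-qualified gRPC service to match
            method: Echo
      backendRefs:
        - name: echo-backend          # hypothetical backend Service
          port: 9000
```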

v1.2 and v1.3

These intermediate releases introduced structured release cycles and additional features:

  • Refined conformance testing
  • BackendTLSPolicy (experimental in v1.3)
  • Enhanced observability and debugging capabilities
  • Improved cross-namespace routing
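For illustration, a BackendTLSPolicy tells the gateway to originate TLS toward a backend and how to validate its certificate. A rough sketch is below; since this resource is still experimental, the API version and field names have shifted between releases, so treat this as an approximation and check the spec shipped with your installed Gateway API version (all names here are placeholders):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha3
kind: BackendTLSPolicy
metadata:
  name: demo-backend-tls              # hypothetical name
spec:
  targetRefs:
    - group: ""
      kind: Service
      name: payments-backend          # hypothetical backend Service
  validation:
    hostname: payments.internal.example.com   # expected SAN on the backend cert
    wellKnownCACertificates: System           # trust the system CA bundle
```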

v1.4 (October 2025)

The latest GA release as of this writing, v1.4.0, brought:

  • Continued API refinement
  • Additional experimental features for community testing
  • Enhanced conformance profiles
  • Improved documentation and migration guides

Kubernetes Version Compatibility Matrix

Here’s how Gateway API versions relate to Kubernetes releases:

Gateway API Version | Minimum Kubernetes | Recommended Kubernetes | Release Date
v1.0.x              | 1.25               | 1.26+                  | October 2023
v1.1.x              | 1.26               | 1.27+                  | May 2024
v1.2.x              | 1.26               | 1.28+                  | 2024
v1.3.x              | 1.26               | 1.29+                  | 2024
v1.4.x              | 1.26               | 1.30+                  | October 2025

The key takeaway: Gateway API v1.1 and later all support Kubernetes 1.26+, meaning you can run the latest Gateway API on any reasonably modern cluster.

Gateway Provider Support Levels

Different Gateway API implementations support various versions and feature sets. Understanding provider support helps you choose the right implementation for your needs.

Conformance Levels

Gateway API defines three conformance levels for features:

  1. Core: Features that must be supported for an implementation to claim conformance. These are portable across all implementations.
  2. Extended: Standardized optional features. Implementations indicate Extended support separately from Core.
  3. Implementation-specific: Vendor-specific features without conformance requirements.

Major Provider Support

Istio

Istio reached Gateway API GA support in version 1.22 (May 2024). Istio provides:

  • Full Standard channel support (v1 resources)
  • Service mesh (east-west) traffic management via GAMMA
  • Ingress (north-south) traffic control
  • Experimental support for BackendTLSPolicy (Istio 1.26+)

Istio is particularly strong for organizations needing both ingress and service mesh capabilities in a single solution.

Envoy Gateway

Envoy Gateway tracks Gateway API releases closely. Version 1.4.0 includes:

  • Gateway API v1.3.0 support
  • Compatibility matrix for Envoy Proxy versions
  • Focus on ingress use cases
  • Strong experimental feature adoption

Check the Envoy Gateway compatibility matrix to ensure your Envoy Proxy version aligns with your Gateway API and Kubernetes versions.

Cilium

Cilium integrates Gateway API deeply with its CNI implementation:

  • Per-node Envoy proxy architecture
  • Network policy enforcement for Gateway traffic
  • Both ingress and service mesh support
  • eBPF-based packet processing

Cilium’s unique architecture makes it a strong choice for organizations already using Cilium for networking.

Contour

Contour v1.31.0 implements Gateway API v1.2.1, supporting:

  • All Standard channel v1 resources
  • Most v1alpha2 resources (TLSRoute, TCPRoute, GRPCRoute)
  • BackendTLSPolicy support

Checking Provider Conformance

To verify which Gateway API version and features your provider supports:

  1. Visit the official implementations page: The Gateway API project maintains a comprehensive list of implementations with their conformance levels.
  2. Check provider documentation: Most providers publish compatibility matrices showing Gateway API, Kubernetes, and proxy version relationships.
  3. Review conformance reports: Providers submit conformance test results that detail exactly which Core and Extended features they support.
  4. Test in non-production: Before upgrading production, validate your specific use cases in a staging environment.

Upgrade Philosophy: When and How to Upgrade

One of the most common questions about Gateway API is: “Do I need to run the latest version?” The answer depends on your specific needs and risk tolerance.

Staying on Older Versions

You don’t always need to run the latest Gateway API version. It’s perfectly acceptable to:

  • Stay on an older stable release if it meets your needs
  • Upgrade only when you need specific new features
  • Wait for your Gateway provider to officially support newer versions
  • Maintain stability over having the latest features

The Standard channel’s backwards compatibility guarantees mean that when you do upgrade, your existing configurations will continue to work.

When to Consider Upgrading

Consider upgrading when:

  1. You need a specific feature: A new HTTPRoute matcher, GRPCRoute support, or other functionality only available in newer versions
  2. Your provider recommends it: Gateway providers often optimize for specific Gateway API versions
  3. Security considerations: While rare, security issues could prompt upgrades
  4. Kubernetes cluster upgrades: When upgrading Kubernetes, verify your Gateway API version is compatible with the new cluster version

Safe Upgrade Practices

Follow these best practices for Gateway API upgrades:

1. Stick with Standard Channel

Using Standard channel CRDs makes upgrades simpler and safer. Experimental features can introduce breaking changes, while Standard features maintain compatibility.

2. Upgrade One Minor Version at a Time

While it’s usually safe to skip versions, the most tested upgrade path is incremental. Going from v1.2 to v1.3 to v1.4 is safer than jumping directly from v1.2 to v1.4.

3. Test Before Upgrading

Always test upgrades in non-production environments:

# Install specific Gateway API version in test cluster
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

4. Review Release Notes

Each Gateway API release publishes comprehensive release notes detailing:

  • New features and capabilities
  • Graduation of experimental features to standard
  • Deprecation notices
  • Upgrade considerations

5. Check Provider Compatibility

Before upgrading Gateway API CRDs, verify your Gateway provider supports the target version. Installing Gateway API v1.4 won’t help if your controller only supports v1.2.

6. Never Overwrite Different Channels

Implementations should never overwrite Gateway API CRDs that use a different release channel. Keep track of whether you’re using Standard or Experimental channel installations.

CRD Management Best Practices

Gateway API CRD management requires attention to detail:

# Check currently installed Gateway API version
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/bundle-version'

# Verify which channel is installed
kubectl get crd gateways.gateway.networking.k8s.io -o yaml | grep 'gateway.networking.k8s.io/channel'

Staying Informed About New Releases

Gateway API releases follow a structured release cycle with clear communication channels.

How to Know When New Versions Are Released

  1. GitHub Releases Page: Watch the kubernetes-sigs/gateway-api repository for release announcements
  2. Kubernetes Blog: Major Gateway API releases are announced on the official Kubernetes blog
  3. Mailing Lists and Slack: Join the Gateway API community channels for discussions and announcements
  4. Provider Announcements: Gateway providers announce support for new Gateway API versions through their own channels

Release Cadence

Gateway API follows a quarterly release schedule for minor versions, with patch releases as needed for bug fixes and security issues. This predictable cadence helps teams plan upgrades.

Practical Decision Framework

Here’s a framework to help you decide which Gateway API version to run:

For New Deployments

  • Production workloads: Use the latest GA version supported by your provider
  • Innovation-focused: Consider Experimental channel if you need cutting-edge features
  • Conservative approach: Use v1.1 or later with Standard channel

For Existing Deployments

  • If things are working: Stay on your current version until you need new features
  • If provider recommends upgrade: Follow provider guidance, especially for security
  • If Kubernetes upgrade planned: Verify compatibility, may need to upgrade Gateway API first or simultaneously

Feature-Driven Upgrades

  • Need service mesh support: Upgrade to v1.1 minimum
  • Need GRPCRoute: Upgrade to v1.1 minimum
  • Need BackendTLSPolicy: Requires v1.3+ and provider support for experimental features

Conclusion

Kubernetes Gateway API represents the future of traffic management in Kubernetes, offering a standardized, extensible, and role-oriented API for both ingress and service mesh use cases. Understanding the versioning model, compatibility requirements, and upgrade philosophy empowers you to make informed decisions that balance innovation with stability.

Key takeaways:

  • Gateway API versions install independently from Kubernetes, requiring only Kubernetes 1.26 or later for recent releases
  • Standard channel provides stability, Experimental channel provides early access to new features
  • You don’t need to always run the latest version—upgrade when you need specific features
  • Verify provider support before upgrading Gateway API CRDs
  • Follow safe upgrade practices: test first, upgrade incrementally, review release notes

By following these guidelines, you can confidently deploy and maintain Gateway API in your Kubernetes infrastructure while making upgrade decisions that align with your organization’s needs and risk tolerance.

Frequently Asked Questions

What is the difference between Kubernetes Ingress and the Gateway API?

Kubernetes Ingress is a legacy API focused mainly on HTTP(S) traffic with limited extensibility. The Gateway API is its successor, offering a more expressive, role-oriented model that supports multiple protocols, advanced routing, better separation of concerns, and consistent behavior across implementations.

Which Gateway API version should I use in production today?

For most production environments, you should use the latest GA (v1.x) release supported by your Gateway provider, installed from the Standard channel. This ensures stability, backwards compatibility, and conformance guarantees while still benefiting from ongoing improvements.

Can I upgrade the Gateway API without upgrading my Kubernetes cluster?

Yes. Gateway API CRDs are installed independently of Kubernetes itself. As long as your cluster meets the minimum supported Kubernetes version (1.26+ for recent releases), you can upgrade the Gateway API without upgrading the cluster.

What happens if my Gateway provider does not support the latest Gateway API version?

If your provider lags behind, you should stay on the latest version officially supported by that provider. Installing newer Gateway API CRDs than your controller supports can lead to missing features or undefined behavior. Provider compatibility should always take precedence over running the newest API version.

Is it safe to upgrade Gateway API CRDs without downtime?

In most cases, yes—when using the Standard channel. The Gateway API provides strong backwards compatibility guarantees for GA and Beta resources. However, you should always test upgrades in a non-production environment and verify that your Gateway provider supports the target version.

Kubernetes Dashboard Alternatives in 2026: Best Web UI Options After Official Retirement

The Kubernetes Dashboard, once a staple tool for cluster visualization and management, has been officially archived and is no longer maintained. For many teams who relied on its straightforward web interface to monitor pods, deployments, and services, this retirement marks the end of an era. But it also signals something important: the Kubernetes ecosystem has evolved far beyond what the original dashboard was designed to handle.

Today’s Kubernetes environments are multi-cluster by default, driven by GitOps principles, guarded by strict RBAC policies, and operated by platform teams serving dozens or hundreds of developers. The operating model has simply outgrown the traditional dashboard’s capabilities.

So what comes next? If you’ve been using Kubernetes Dashboard and need to migrate to something more capable, or if you’re simply curious about modern alternatives, this guide will walk you through the best options available in 2026.

Why Kubernetes Dashboard Was Retired

The Kubernetes Dashboard served its purpose well in the early days of Kubernetes adoption. It provided a simple, browser-based interface for viewing cluster resources without needing to master kubectl commands. But as Kubernetes matured, several limitations became apparent:

  • Single-cluster focus: Most organizations now manage multiple clusters across different environments, but the dashboard was designed for viewing one cluster at a time
  • Limited RBAC capabilities: Modern platform teams need fine-grained access controls at the cluster, namespace, and workload levels
  • No GitOps integration: Contemporary workflows rely on declarative configuration and continuous deployment pipelines
  • Minimal observability: Beyond basic resource listing, the dashboard lacked advanced monitoring, alerting, and troubleshooting features
  • Security concerns: The dashboard’s architecture required careful configuration to avoid exposing cluster access

The community recognized these constraints, and the official recommendation now points toward Headlamp as the successor. But Headlamp isn’t the only option worth considering.

Top Kubernetes Dashboard Alternatives for 2026

1. Headlamp: The Official Successor

Headlamp is now the official recommendation from the Kubernetes SIG UI group. It’s a CNCF Sandbox project developed by Kinvolk (now part of Microsoft) that brings a modern approach to cluster visualization.

Key Features:

  • Clean, intuitive interface built with modern web technologies
  • Extensive plugin system for customization
  • Works both as an in-cluster deployment and desktop application
  • Uses your existing kubeconfig file for authentication
  • OpenID Connect (OIDC) support for enterprise SSO
  • Read and write operations based on RBAC permissions

Installation Options:

# Using Helm
helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/
helm install my-headlamp headlamp/headlamp --namespace kube-system

# As Minikube addon
minikube addons enable headlamp
minikube service headlamp -n headlamp

Headlamp excels at providing a familiar dashboard experience while being extensible enough to grow with your needs. The plugin architecture means you can customize it for your specific workflows without waiting for upstream changes.

Best for: Teams transitioning from Kubernetes Dashboard who want a similar experience with modern features and official backing.

2. Portainer: Enterprise Multi-Cluster Management

Portainer has evolved from a Docker management tool into a comprehensive Kubernetes platform. It’s particularly strong when you need to manage multiple clusters from a single interface. We’ve already covered Portainer in detail, so take a look at that article for a deeper dive.

Key Features:

  • Multi-cluster management dashboard
  • Enterprise-grade RBAC with fine-grained access controls
  • Visual workload deployment and scaling
  • GitOps integration support
  • Comprehensive audit logging
  • Support for both Kubernetes and Docker environments

Best for: Organizations managing multiple clusters across different environments who need enterprise RBAC and centralized control.

3. Skooner (formerly K8Dash): Lightweight and Fast

Skooner keeps things simple. If you appreciated the straightforward nature of the original Kubernetes Dashboard, Skooner delivers a similar philosophy with a cleaner, faster interface.

Key Features:

  • Fast, real-time updates
  • Clean and minimal interface
  • Easy installation with minimal configuration
  • Real-time metrics visualization
  • Built-in OIDC authentication

Best for: Teams that want a simple, no-frills dashboard without complex features or steep learning curves.

4. Devtron: Complete DevOps Platform

Devtron goes beyond simple cluster visualization to provide an entire application delivery platform built on Kubernetes.

Key Features:

  • Multi-cluster application deployment
  • Built-in CI/CD pipelines
  • Advanced security scanning and compliance
  • Application-centric view rather than resource-centric
  • Support for seven different SSO providers
  • Chart store for Helm deployments

Best for: Platform teams building internal developer platforms who need comprehensive deployment pipelines alongside cluster management.

5. KubeSphere: Full-Stack Container Platform

KubeSphere positions itself as a distributed operating system for cloud-native applications, using Kubernetes as its kernel.

Key Features:

  • Multi-tenant architecture
  • Integrated DevOps workflows
  • Service mesh integration (Istio)
  • Multi-cluster federation
  • Observability and monitoring built-in
  • Plug-and-play architecture for third-party integrations

Best for: Organizations building comprehensive container platforms who want an opinionated, batteries-included experience.

6. Rancher: Battle-Tested Enterprise Platform

Rancher from SUSE has been in the Kubernetes management space for years and offers one of the most mature platforms available.

Key Features:

  • Manage any Kubernetes cluster (EKS, GKE, AKS, on-premises)
  • Centralized authentication and RBAC
  • Built-in monitoring with Prometheus and Grafana
  • Application catalog with Helm charts
  • Policy management and security scanning

Best for: Enterprise organizations managing heterogeneous Kubernetes environments across multiple cloud providers.

7. Octant: Developer-Focused Cluster Exploration

Octant (originally developed by VMware) takes a developer-centric approach to cluster visualization with a focus on understanding application architecture.

Key Features:

  • Plugin-based extensibility
  • Resource relationship visualization
  • Port forwarding directly from the UI
  • Log streaming
  • Context-aware resource inspection

Best for: Application developers who need to understand how their applications run on Kubernetes without being cluster administrators.

Desktop and CLI Alternatives Worth Considering

While this article focuses on web-based dashboards, it’s worth noting that not everyone needs a browser interface. Some of the most powerful Kubernetes management tools work as desktop applications or terminal UIs.

If you’re considering client-side tools, you might find the related articles on my blog helpful.

These client tools offer advantages that web dashboards can’t match: offline access, better performance, and tighter integration with your local development workflow. FreeLens, in particular, has emerged as the lowest-risk choice for most organizations looking for a desktop Kubernetes IDE.

Choosing the Right Alternative for Your Team

With so many options available, how do you choose? Here’s a decision framework:

Choose Headlamp if:

  • You want the officially recommended path forward
  • You need a lightweight dashboard similar to what you had before
  • Plugin extensibility is important for future customization
  • You prefer CNCF-backed open source projects

Choose Portainer if:

  • You manage multiple Kubernetes clusters
  • Enterprise RBAC is a critical requirement
  • You also work with Docker environments
  • Visual deployment tools would benefit your team

Choose Skooner if:

  • You want the simplest possible alternative
  • Your needs are straightforward: view and manage resources
  • You don’t need advanced features or multi-cluster support

Choose Devtron or KubeSphere if:

  • You’re building an internal developer platform
  • You need integrated CI/CD pipelines
  • Application-centric workflows matter more than resource-centric views

Choose Rancher if:

  • You’re managing enterprise-scale, multi-cloud Kubernetes
  • You need battle-tested stability and vendor support
  • Policy management and compliance are critical

Consider desktop tools like FreeLens if:

  • You work primarily from a local development environment
  • You need offline access to cluster information
  • You prefer richer desktop application experiences

Migration Considerations

If you’re actively using Kubernetes Dashboard today, here’s what to think about when migrating:

  1. Authentication method: Most modern alternatives support OIDC/SSO, but verify your specific identity provider is supported
  2. RBAC policies: Review your existing ClusterRole and RoleBinding configurations to ensure they translate properly
  3. Custom workflows: If you’ve built automation around Dashboard URLs or specific features, you’ll need to adapt these
  4. User training: Even similar-looking alternatives have different UIs and workflows; budget time for team training
  5. Ingress configuration: If you expose your dashboard externally, you’ll need to reconfigure ingress rules
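For point 2, a common starting point when validating RBAC against a new dashboard is a cluster-wide read-only role bound to the group your identity provider maps dashboard users into. A minimal sketch, where the group name is a placeholder you would replace with your OIDC group claim:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dashboard-readonly
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "services", "deployments", "replicasets", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch"]    # read-only: no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dashboard-readonly-binding
subjects:
  - kind: Group
    name: dashboard-viewers            # placeholder: map to your SSO/OIDC group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: dashboard-readonly
  apiGroup: rbac.authorization.k8s.io
```

Because modern dashboards like Headlamp surface exactly what the authenticated user's RBAC allows, a binding like this is also a quick way to confirm your existing policies translate properly.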

The Future of Kubernetes UI Management

The retirement of Kubernetes Dashboard isn’t a step backward—it’s recognition that the ecosystem has matured. Modern platforms need to handle multi-cluster management, GitOps workflows, comprehensive observability, and sophisticated RBAC out of the box.

The alternatives listed here represent different philosophies about what a Kubernetes interface should be:

  • Minimalist dashboards (Headlamp, Skooner) that stay close to the original vision
  • Enterprise platforms (Portainer, Rancher) that centralize multi-cluster management
  • Developer platforms (Devtron, KubeSphere) that integrate the entire application lifecycle
  • Desktop experiences (FreeLens, OpenLens) that bring IDE-like capabilities

The right choice depends on your team’s size, your infrastructure complexity, and whether you’re managing platforms or building applications. For most teams migrating from Kubernetes Dashboard, starting with Headlamp makes sense—it’s officially recommended, actively maintained, and provides a familiar experience. From there, you can evaluate whether you need to scale up to more comprehensive platforms.

Whatever you choose, the good news is that the Kubernetes ecosystem in 2026 offers more sophisticated, capable, and secure dashboard alternatives than ever before.

Frequently Asked Questions (FAQ)

Is Kubernetes Dashboard officially deprecated or just unmaintained?

The Kubernetes Dashboard has been officially archived by the Kubernetes project and is no longer actively maintained. While it may still run in existing clusters, it no longer receives security updates, bug fixes, or new features, making it unsuitable for production use in modern environments.

What is the official replacement for Kubernetes Dashboard?

Headlamp is the officially recommended successor by the Kubernetes SIG UI group. It provides a modern web interface, supports plugins, integrates with existing kubeconfig files, and aligns with current Kubernetes security and RBAC best practices.

Is Headlamp production-ready for enterprise environments?

Yes. Headlamp supports OIDC authentication, fine-grained RBAC, and can run either in-cluster or as a desktop application. While still evolving, it is actively maintained and suitable for many production use cases, especially when combined with proper access controls.

Are there lightweight alternatives similar to the old Kubernetes Dashboard?

Yes. Skooner is a lightweight, fast alternative that closely mirrors the simplicity of the original Kubernetes Dashboard while offering a cleaner UI and modern authentication options like OIDC.

Do I still need a web-based dashboard to manage Kubernetes?

Not necessarily. Many teams prefer desktop or CLI-based tools such as FreeLens, OpenLens, or K9s. These tools often provide better performance, offline access, and deeper integration with developer workflows compared to browser-based dashboards.

Is it safe to expose Kubernetes dashboards over the internet?

Exposing any Kubernetes dashboard publicly requires extreme caution. If external access is necessary, always use:

  • Strong authentication (OIDC / SSO)
  • Strict RBAC policies
  • Network restrictions (VPN, IP allowlists)
  • TLS termination and hardened ingress rules

In many cases, dashboards should only be accessible from internal networks.

Can these dashboards replace kubectl?

No. Dashboards are complementary tools, not replacements for kubectl. While they simplify visualization and some management tasks, advanced operations, automation, and troubleshooting still rely heavily on CLI tools and GitOps workflows.

What should I consider before migrating away from Kubernetes Dashboard?

Before migrating, review:

  • Authentication and identity provider compatibility
  • Existing RBAC roles and permissions
  • Multi-cluster requirements
  • GitOps and CI/CD integrations
  • Training needs for platform teams and developers

Starting with Headlamp is often the lowest-risk migration path.

Which Kubernetes dashboard is best for developers rather than platform teams?

Tools like Octant and Devtron are more developer-focused. They emphasize application-centric views, resource relationships, and deployment workflows, making them ideal for developers who want insight without managing cluster infrastructure directly.

Which Kubernetes dashboard is best for multi-cluster management?

For multi-cluster environments, Portainer, Rancher, and KubeSphere are strong options. These platforms are designed to manage multiple clusters from a single control plane and offer enterprise-grade RBAC, auditing, and centralized authentication.

FreeLens vs OpenLens vs Lens: Choosing the Right Kubernetes IDE

Introduction: When a Tool Choice Becomes a Legal and Platform Decision

If you’ve been operating Kubernetes clusters for a while, you’ve probably learned this the hard way:
tooling decisions don’t stay “just tooling” for long.

What starts as a developer convenience can quickly turn into:

  • a licensing discussion with Legal,
  • a procurement problem,
  • or a platform standard you’re stuck with for years.

The Kubernetes IDE ecosystem is a textbook example of this.

Many teams adopted Lens because it genuinely improved day-to-day operations. Then the license changed (we covered the OpenLens vs Lens split in the past). Then restrictions appeared. Then forks started to emerge.

Today, the real question is not “Which one looks nicer?” but:

  • Which one is actually maintained?
  • Which one is safe to use in a company?
  • Why is there a fork of a fork?
  • Are they still technically compatible?
  • What is the real switch cost?

Let’s go through this from a production and platform engineering perspective.

The Forking Story: How We Ended Up Here

Understanding the lineage matters because it explains why FreeLens exists at all.

Lens: The Original Product

Lens started as an open-core Kubernetes IDE with a strong community following. Over time, it evolved into a commercial product with:

  • a proprietary license,
  • paid enterprise features,
  • and restrictions on free usage in corporate environments.

This shift was legitimate from a business perspective, but it broke the implicit contract many teams assumed when they standardized on it.

OpenLens: The First Fork

OpenLens was created to preserve:

  • open-source licensing,
  • unrestricted commercial usage,
  • compatibility with Lens extensions.

For a while, OpenLens was the obvious alternative for teams that wanted to stay open-source without losing functionality.

FreeLens: The Fork of the Fork

FreeLens appeared later, and this is where many people raise an eyebrow.

Why fork OpenLens?

Because OpenLens development started to slow down:

  • release cadence became irregular,
  • upstream Kubernetes changes lagged,
  • governance and long-term stewardship became unclear.

FreeLens exists because some contributors were not willing to bet their daily production tooling on a project with uncertain momentum.

This was not ideology. It was operational risk management.

Are the Projects Still Maintained?

Short answer: yes, but not equally.

Lens

  • Actively developed
  • Backed by a commercial vendor
  • Fast adoption of new Kubernetes features

Trade-off:

  • Licensing constraints
  • Paid features
  • Requires legal review in most companies

OpenLens

  • Still maintained
  • Smaller contributor base
  • Slower release velocity

It works, but it no longer feels like a safe long-term default for platform teams.

FreeLens

  • Actively maintained
  • Explicit focus on long-term openness
  • Prioritizes Kubernetes API compatibility and stability

Right now, FreeLens shows the healthiest balance between maintenance and independence.

Technical Compatibility: Can You Switch Without Pain?

This is the good news: yes, mostly.

Cluster Access and Configuration

All three tools:

  • use standard kubeconfig files,
  • support multiple contexts and clusters,
  • work with RBAC, CRDs, and namespaces the same way.

No cluster-side changes are required.
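Since all three IDEs read the same kubeconfig format, switching tools usually means pointing the new binary at the file you already have. A minimal multi-context sketch (cluster names, endpoints, and users are placeholders):

```yaml
apiVersion: v1
kind: Config
clusters:
  - name: prod
    cluster:
      server: https://prod.example.com:6443      # placeholder API server
  - name: staging
    cluster:
      server: https://staging.example.com:6443
contexts:
  - name: prod-admin
    context:
      cluster: prod
      user: admin
  - name: staging-admin
    context:
      cluster: staging
      user: admin
current-context: prod-admin
users:
  - name: admin
    user: {}                                     # credentials elided
```

Lens, OpenLens, and FreeLens will all list these contexts and respect whatever RBAC the referenced credentials carry.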

Extensions and Plugins

  • Most Lens extensions work in OpenLens.
  • Most OpenLens extensions work in FreeLens.
  • Proprietary Lens-only extensions are the main exception.

In real-world usage:

  • ~90% of common workflows are identical
  • differences show up only in edge cases or paid features

UX Differences

There are some UI differences:

  • branding,
  • menu structure,
  • feature gating in Lens.

Nothing that requires retraining or documentation updates.

Legal and Licensing Considerations (This Is Where It Usually Breaks)

This is often the decisive factor in enterprise environments.

Lens

  • Requires license compliance checks
  • Free usage may violate internal policies
  • Paid plans required for broader adoption

If you operate in a regulated or audited environment, this alone can be a blocker.

OpenLens

  • Open-source license
  • Generally safe for corporate use
  • Slight uncertainty due to reduced activity

FreeLens

  • Explicitly open-source
  • No usage restrictions
  • Clear intent to remain free for commercial use

If Legal asks, “Can we standardize this across the company?”
FreeLens is the easiest answer.

Which One Should You Use in a Company?

A pragmatic recommendation:

Use Lens if:

  • you want vendor-backed support,
  • you are willing to pay,
  • you already standardized on Mirantis tooling.

Use OpenLens if:

  • you are already using it,
  • it meets your needs today,
  • you accept slower updates.

Use FreeLens if:

  • you want zero licensing risk,
  • you want an open-source default,
  • you care about long-term maintenance,
  • you need something you can standardize safely.

For most platform and DevOps teams, FreeLens is currently the lowest-risk choice.

Switch Cost: How Expensive Is It Really?

Surprisingly low.

Typical migration:

  • install the new binary,
  • reuse existing kubeconfigs,
  • reinstall extensions if needed.

What you don’t need:

  • cluster changes,
  • CI/CD modifications,
  • platform refactoring.

Downtime: none
Rollback: trivial

This is one of the rare cases where switching early is cheap.

Is a “Fork of a Fork” a Red Flag?

Normally, yes.

In this case, no.

FreeLens exists because:

  • maintenance mattered more than branding,
  • openness mattered more than monetization,
  • predictability mattered more than roadmap promises.

Ironically, this is very aligned with how Kubernetes itself evolved.

Conclusion: A Clear, Boring, Production-Safe Answer

If you strip away GitHub drama and branding:

  • Lens optimizes for revenue and enterprise features.
  • OpenLens preserved openness but lost momentum.
  • FreeLens optimizes for sustainability and freedom.

From a platform engineering perspective:

FreeLens is the safest default Kubernetes IDE today for most organizations.

Low switch cost, strong compatibility, no legal surprises.

And in production environments, boring and predictable almost always wins.

Helm Chart Testing in Production: Layers, Tools, and a Minimum CI Pipeline


When a Helm chart fails in production, the impact is immediate and visible. A misconfigured ServiceAccount, a typo in a ConfigMap key, or an untested conditional in templates can trigger incidents that cascade through your entire deployment pipeline. The irony is that most teams invest heavily in testing application code while treating Helm charts as “just configuration.”

Chart testing is fundamental for production-quality Helm deployments. For comprehensive coverage of testing along with all other Helm best practices, visit our complete Helm guide.

Helm charts are infrastructure code. They define how your applications run, scale, and integrate with the cluster. Treating them with less rigor than your application logic is a risk most production environments cannot afford.

The Real Cost of Untested Charts

In late 2024, a medium-sized SaaS company experienced a 4-hour outage because a chart update introduced a breaking change in RBAC permissions. The chart had been tested locally with helm install --dry-run, but the dry-run validation doesn’t interact with the API server’s RBAC layer. The deployment succeeded syntactically but failed operationally.

The incident revealed three gaps in their workflow:

  1. No schema validation against the target Kubernetes version
  2. No integration tests in a live cluster
  3. No policy enforcement for security baselines

These gaps are common. According to a 2024 CNCF survey on GitOps practices, fewer than 40% of organizations systematically test Helm charts before production deployment.

The problem is not a lack of tools—it’s understanding which layer each tool addresses.

Testing Layers: What Each Level Validates

Helm chart testing is not a single operation. It requires validation at multiple layers, each catching different classes of errors.

Layer 1: Syntax and Structure Validation

What it catches: Malformed YAML, invalid chart structure, missing required fields

Tools:

  • helm lint: Built-in, minimal validation following Helm best practices
  • yamllint: Strict YAML formatting rules

Example failure caught:

# Invalid indentation breaks the chart
resources:
  limits:
      cpu: "500m"
    memory: "512Mi"  # Incorrect indentation
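
Both tools catch this class of error in seconds; a minimal script that runs the whole layer (the chart path is a placeholder):

```shell
#!/usr/bin/env bash
# Layer 1: structural lint only. Catches malformed YAML and invalid
# chart structure, but not whether the rendered manifests are valid
# Kubernetes objects (that is Layer 2's job).
set -u
CHART="${1:-charts/my-chart}"   # placeholder path
LAYER1="passed"

for tool in helm yamllint; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "skip: $tool not installed"
    LAYER1="skipped"
  fi
done

if [ "$LAYER1" = "passed" ]; then
  helm lint "$CHART" || LAYER1="failed"
  yamllint "$CHART" || LAYER1="failed"
fi
echo "layer1: $LAYER1"
```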

Limitation: Does not validate whether the rendered manifests are valid Kubernetes objects.

Layer 2: Schema Validation

What it catches: Manifests that would be rejected by the Kubernetes API

Primary tool: kubeconform

Kubeconform is the actively maintained successor to the deprecated kubeval. It validates against OpenAPI schemas for specific Kubernetes versions and can include custom CRDs.

Project Profile:

  • Maintenance: Active, community-driven
  • Strengths: CRD support, multi-version validation, fast execution
  • Why it matters: helm lint validates chart structure, but not if rendered manifests match Kubernetes schemas

Example failure caught:

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: nginx:latest
# Missing required field: spec.selector

Configuration example:

helm template my-chart . | kubeconform \
  -kubernetes-version 1.30.0 \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  -summary

Example CI integration:

#!/bin/bash
set -e

KUBE_VERSION="1.30.0"

echo "Rendering chart..."
helm template my-release ./charts/my-chart > manifests.yaml

echo "Validating against Kubernetes $KUBE_VERSION..."
kubeconform \
  -kubernetes-version "$KUBE_VERSION" \
  -schema-location default \
  -summary \
  -output json \
  manifests.yaml | jq -e '.summary.invalid == 0'

Alternative: kubectl --dry-run=server (requires cluster access, validates against actual API server)

Layer 3: Unit Testing

What it catches: Logic errors in templates, incorrect conditionals, wrong value interpolation

Unit tests validate that given a set of input values, the chart produces the expected manifests. This is where template logic is verified before reaching a cluster.

Primary tool: helm-unittest

helm-unittest is the most widely adopted unit testing framework for Helm charts.

Project Profile:

  • GitHub: 3.3k+ stars, ~100 contributors
  • Maintenance: Active (releases every 2-3 months)
  • Primary maintainers: community contributors organized under the helm-unittest GitHub organization
  • Commercial backing: None
  • Bus Factor: Medium-High (no institutional backing, but consistent community engagement)

Strengths:

  • Fast execution (no cluster required)
  • Familiar test syntax (similar to Jest/Mocha)
  • Snapshot testing support
  • Good documentation

Limitations:

  • Doesn’t validate runtime behavior
  • Cannot test interactions with admission controllers
  • No validation against actual Kubernetes API

Example test scenario:

# tests/deployment_test.yaml
suite: test deployment
templates:
  - deployment.yaml
tests:
  - it: should set resource limits when provided
    set:
      resources.limits.cpu: "1000m"
      resources.limits.memory: "1Gi"
    asserts:
      - equal:
          path: spec.template.spec.containers[0].resources.limits.cpu
          value: "1000m"
      - equal:
          path: spec.template.spec.containers[0].resources.limits.memory
          value: "1Gi"

  - it: should not create HPA when autoscaling disabled
    set:
      autoscaling.enabled: false
    template: hpa.yaml
    asserts:
      - hasDocuments:
          count: 0
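
With the plugin installed, a suite like the one above runs entirely offline; a sketch (the plugin URL is the community repository, and the chart path is a placeholder):

```shell
#!/usr/bin/env bash
set -u
if ! command -v helm >/dev/null 2>&1; then
  echo "helm not installed; skipping"
  UNIT="skipped"
else
  # One-time plugin install; ignore the error if it is already present.
  helm plugin install https://github.com/helm-unittest/helm-unittest.git \
    2>/dev/null || true
  # Runs every suite under charts/my-chart/tests/ -- no cluster needed.
  if helm unittest charts/my-chart; then UNIT="passed"; else UNIT="failed"; fi
fi
echo "unittest: $UNIT"
```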

Alternative: Terratest (Helm module)

Terratest is a Go-based testing framework from Gruntwork that includes first-class Helm support. Unlike helm-unittest, Terratest deploys charts to real clusters and allows programmatic assertions in Go.

Example Terratest test:

package test

import (
    "testing"
    "time"

    "github.com/gruntwork-io/terratest/modules/helm"
    "github.com/gruntwork-io/terratest/modules/k8s"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func TestHelmChartDeployment(t *testing.T) {
    kubectlOptions := k8s.NewKubectlOptions("", "", "default")
    options := &helm.Options{
        KubectlOptions: kubectlOptions,
        SetValues: map[string]string{
            "replicaCount": "3",
        },
    }

    // Clean up the release even if assertions fail.
    defer helm.Delete(t, options, "my-release", true)
    helm.Install(t, options, "../charts/my-chart", "my-release")

    // Wait for 3 pods matching the label, retrying 30 times, 10s apart.
    k8s.WaitUntilNumPodsCreated(t, kubectlOptions, metav1.ListOptions{
        LabelSelector: "app=my-app",
    }, 3, 30, 10*time.Second)
}

When to use Terratest vs helm-unittest:

  • Use helm-unittest for fast, template-focused validation in CI
  • Use Terratest when you need full integration testing with Go flexibility

Layer 4: Integration Testing

What it catches: Runtime failures, resource conflicts, actual Kubernetes behavior

Integration tests deploy the chart to a real (or ephemeral) cluster and verify it works end-to-end.

Primary tool: chart-testing (ct)

chart-testing is the official Helm project for testing charts in live clusters.

Project Profile:

  • Ownership: Official Helm project (CNCF)
  • Maintainers: Helm team (contributors from Microsoft, IBM, Google)
  • Governance: CNCF-backed with public roadmap
  • LTS: Aligned with Helm release cycle
  • Bus Factor: Low (institutional backing from CNCF provides strong long-term guarantees)

Strengths:

  • De facto standard for public Helm charts
  • Built-in upgrade testing (validates migrations)
  • Detects which charts changed in a PR (efficient for monorepos)
  • Integration with GitHub Actions via official action

Limitations:

  • Requires a live Kubernetes cluster
  • Initial setup more complex than unit testing
  • Does not include security scanning

What ct validates:

  • Chart installs successfully
  • Upgrades work without breaking state
  • Linting passes
  • Version constraints are respected

Example ct configuration:

# ct.yaml
target-branch: main
chart-dirs:
  - charts
chart-repos:
  - bitnami=https://charts.bitnami.com/bitnami
helm-extra-args: --timeout 600s
check-version-increment: true

Typical GitHub Actions workflow:

name: Lint and Test Charts

on: pull_request

jobs:
  lint-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Set up Helm
        uses: azure/setup-helm@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Set up chart-testing
        uses: helm/chart-testing-action@v2

      - name: Run chart-testing (lint)
        run: ct lint --config ct.yaml

      - name: Create kind cluster
        uses: helm/kind-action@v1

      - name: Run chart-testing (install)
        run: ct install --config ct.yaml

When ct is essential:

  • Public chart repositories (expected by community)
  • Charts with complex upgrade paths
  • Multi-chart repositories with CI optimization needs

Layer 5: Security and Policy Validation

What it catches: Security misconfigurations, policy violations, compliance issues

This layer prevents deploying charts that pass functional tests but violate organizational security baselines or contain vulnerabilities.

Policy Enforcement: Conftest (Open Policy Agent)

Conftest is the CLI interface to Open Policy Agent for policy-as-code validation.

Project Profile:

  • Parent: Open Policy Agent (CNCF Graduated Project)
  • Governance: Strong CNCF backing, multi-vendor support
  • Production adoption: Netflix, Pinterest, Goldman Sachs
  • Bus Factor: Low (graduated CNCF project with multi-vendor backing)

Strengths:

  • Policies written in Rego (reusable, composable)
  • Works with any YAML/JSON input (not Helm-specific)
  • Can enforce organizational standards programmatically
  • Integration with admission controllers (Gatekeeper)

Limitations:

  • Rego has a learning curve
  • Does not replace functional testing

Example Conftest policy:

# policy/security.rego
package main

import future.keywords.contains
import future.keywords.if
import future.keywords.in

deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  not container.resources.limits.memory
  msg := sprintf("Container '%s' must define memory limits", [container.name])
}

deny contains msg if {
  input.kind == "Deployment"
  some container in input.spec.template.spec.containers
  not container.resources.limits.cpu
  msg := sprintf("Container '%s' must define CPU limits", [container.name])
}

Running the validation:

helm template my-chart . | conftest test -p policy/ -

Alternative: Kyverno

Kyverno offers policy enforcement using native Kubernetes manifests instead of Rego. Policies are written in YAML and can validate, mutate, or generate resources.

Example Kyverno policy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-container-limits
    match:
      resources:
        kinds:
        - Pod
    validate:
      message: "All containers must have CPU and memory limits"
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"

Conftest vs Kyverno:

  • Conftest: Policies run in CI, flexible for any YAML
  • Kyverno: Runtime enforcement in-cluster, Kubernetes-native

Both can coexist: Conftest in CI for early feedback, Kyverno in cluster for runtime enforcement.

Vulnerability Scanning: Trivy

Trivy by Aqua Security provides comprehensive security scanning for Helm charts.

Project Profile:

  • Maintainer: Aqua Security (commercial backing with open-source core)
  • Scope: Vulnerability scanning + misconfiguration detection
  • Helm integration: charts are rendered and scanned via the trivy config subcommand
  • Bus Factor: Low (commercial backing + strong open-source adoption)

What Trivy scans in Helm charts:

  1. Vulnerabilities in referenced container images
  2. Misconfigurations (similar to Conftest but pre-built rules)
  3. Secrets accidentally committed in templates

Example scan:

trivy config ./charts/my-chart --severity HIGH,CRITICAL --exit-code 1

Sample output:

myapp/templates/deployment.yaml (helm)
====================================

Tests: 12 (SUCCESSES: 10, FAILURES: 2)
Failures: 2 (HIGH: 1, CRITICAL: 1)

HIGH: Container 'app' of Deployment 'myapp' should set 'securityContext.runAsNonRoot' to true
════════════════════════════════════════════════════════════════════════════════════════════════
Ensure containers run as non-root users

See https://kubernetes.io/docs/concepts/security/pod-security-standards/
────────────────────────────────────────────────────────────────────────────────────────────────
 myapp/templates/deployment.yaml:42

Commercial support:
Aqua Security offers Trivy Enterprise with advanced features (centralized scanning, compliance reporting). For most teams, the open-source version is sufficient.

Other Security Tools

Polaris (Fairwinds)

Polaris scores charts based on security and reliability best practices. Unlike enforcement tools, it provides a health score and actionable recommendations.

Use case: Dashboard for chart quality across a platform

Checkov (Bridgecrew/Palo Alto)

Similar to Trivy but with a broader IaC focus (Terraform, CloudFormation, Kubernetes, Helm). Pre-built policies for compliance frameworks (CIS, PCI-DSS).

When to use Checkov:

  • Multi-IaC environment (not just Helm)
  • Compliance-driven validation requirements

Enterprise Selection Criteria

Bus Factor and Long-Term Viability

For production infrastructure, tool sustainability matters as much as features. Community support channels like Helm CNCF Slack (#helm-users, #helm-dev) and CNCF TAG Security provide valuable insights into which projects have active maintainer communities.

Questions to ask:

  • Is the project backed by a foundation (CNCF, Linux Foundation)?
  • Are multiple companies contributing?
  • Is the project used in production by recognizable organizations?
  • Is there a public roadmap?

Risk Classification:

Tool            Governance         Bus Factor    Notes
chart-testing   CNCF               Low           Helm official project
Conftest/OPA    CNCF (Graduated)   Low           Multi-vendor backing
Trivy           Aqua Security      Low           Commercial backing + OSS
kubeconform     Community          Medium        Active, but single maintainer
helm-unittest   Community          Medium-High   No institutional backing
Polaris         Fairwinds          Medium        Company-sponsored OSS

Kubernetes Version Compatibility

Tools must explicitly support the Kubernetes versions you run in production.

Red flags:

  • No documented compatibility matrix
  • Hard-coded dependencies on old K8s versions
  • No testing against multiple K8s versions in CI

Example compatibility check:

# Does the tool support your K8s version?
kubeconform --help | grep -A5 "kubernetes-version"

For tools like ct, always verify they test against a matrix of Kubernetes versions in their own CI.
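
The same discipline applies to your own charts: render once and validate against every Kubernetes version you actually run (the version list below is illustrative, and the chart path is a placeholder):

```shell
#!/usr/bin/env bash
# Validate one chart against a matrix of Kubernetes versions, mirroring
# what a well-maintained tool does in its own CI.
set -u
CHART="charts/my-chart"            # placeholder path
VERSIONS="1.29.0 1.30.0 1.31.0"    # illustrative matrix

if command -v helm >/dev/null 2>&1 && command -v kubeconform >/dev/null 2>&1; then
  MATRIX="passed"
  helm template my-release "$CHART" > /tmp/manifests.yaml || MATRIX="failed"
  for v in $VERSIONS; do
    echo "validating against Kubernetes $v"
    kubeconform -kubernetes-version "$v" -summary /tmp/manifests.yaml \
      || MATRIX="failed"
  done
else
  echo "helm/kubeconform not installed; skipping"
  MATRIX="skipped"
fi
echo "matrix: $MATRIX"
```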

Commercial Support Options

When commercial support matters:

  • Regulatory compliance requirements (SOC2, HIPAA, etc.)
  • Limited internal expertise
  • SLA-driven operations

Available options:

  • Trivy: Aqua Security offers Trivy Enterprise
  • OPA/Conftest: Styra provides OPA Enterprise
  • Terratest: Gruntwork offers consulting and premium modules

Most teams don’t need commercial support for chart testing specifically, but it’s valuable in regulated industries where audits require vendor SLAs.

Security Scanner Integration

For enterprise pipelines, chart testing tools should integrate cleanly with:

  • SIEM/SOAR platforms
  • CI/CD notification systems
  • Security dashboards (e.g., Grafana, Datadog)

Required features:

  • Structured output formats (JSON, SARIF)
  • Exit codes for CI failure
  • Support for custom policies
  • Webhook or API for event streaming

Example: Integrating Trivy with SIEM

# .github/workflows/security.yaml
- name: Run Trivy scan
  run: trivy config ./charts --format json --output trivy-results.json

- name: Send to SIEM
  run: |
    curl -X POST https://siem.company.com/api/events \
      -H "Content-Type: application/json" \
      -d @trivy-results.json

Testing Pipeline Architecture

A production-grade Helm chart pipeline combines multiple layers, ordered so that the cheapest checks run first and the most expensive run last.
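
Such a pipeline can be sketched as a single script, with cheap checks first (the helper and chart path are illustrative; the tool flags match the examples used throughout this article):

```shell
#!/usr/bin/env bash
# Multi-layer chart pipeline: stages are ordered by cost, and any
# stage whose tool is missing is skipped rather than failing the run.
set -u
CHART="charts/my-chart"   # placeholder path
STATUS="ok"

run() {   # run a stage if its tool exists; record failures in STATUS
  local name="$1" tool="$2"; shift 2
  if command -v "$tool" >/dev/null 2>&1; then
    echo "STAGE $name"
    "$@" || STATUS="failed"
  else
    echo "SKIP  $name ($tool missing)"
  fi
}

run lint        helm        helm lint "$CHART"
run yaml-lint   yamllint    yamllint "$CHART"
run schema      kubeconform sh -c "helm template r '$CHART' | kubeconform -summary"
run unit-tests  helm        helm unittest "$CHART"
run security    trivy       trivy config "$CHART" --exit-code 1 --severity CRITICAL,HIGH
run integration ct          ct install --charts "$CHART"

echo "pipeline: $STATUS"
```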

Pipeline efficiency principles:

  1. Fail fast: syntax and schema errors should never reach integration tests
  2. Parallel execution where possible (unit tests + security scans)
  3. Cache ephemeral cluster images to reduce setup time
  4. Skip unchanged charts (ct built-in change detection)

Decision Matrix: When to Use What

Scenario 1: Small Team / Early-Stage Startup

Requirements: Minimal overhead, fast iteration, reasonable safety

Recommended Stack:

Linting:      helm lint + yamllint
Validation:   kubeconform
Security:     trivy config

Optional: helm-unittest (if template logic becomes complex)

Rationale: Zero-dependency baseline that catches 80% of issues without operational complexity.

Scenario 2: Enterprise with Compliance Requirements

Requirements: Auditable, comprehensive validation, commercial support available

Recommended Stack:

Linting:      helm lint + yamllint
Validation:   kubeconform
Unit Tests:   helm-unittest
Security:     Trivy Enterprise + Conftest (custom policies)
Integration:  chart-testing (ct)
Runtime:      Kyverno (admission control)

Optional: Terratest for complex upgrade scenarios

Rationale: Multi-layer defense with both pre-deployment and runtime enforcement. Commercial support available for security components.

Scenario 3: Multi-Tenant Internal Platform

Requirements: Prevent bad charts from affecting other tenants, enforce standards at scale

Recommended Stack:

CI Pipeline:
  • helm lint → kubeconform → helm-unittest → ct
  • Conftest (enforce resource quotas, namespaces, network policies)
  • Trivy (block critical vulnerabilities)

Runtime:
  • Kyverno or Gatekeeper (enforce policies at admission)
  • ResourceQuotas per namespace
  • NetworkPolicies by default

Additional tooling:

  • Polaris dashboard for chart quality scoring
  • Custom admission webhooks for platform-specific rules

Rationale: Multi-tenant environments cannot tolerate “soft” validation. Runtime enforcement is mandatory.

Scenario 4: Open Source Public Charts

Requirements: Community trust, transparent testing, broad compatibility

Recommended Stack:

Must-have:
  • chart-testing (expected standard)
  • Public CI (GitHub Actions with full logs)
  • Test against multiple K8s versions

Nice-to-have:
  • helm-unittest with high coverage
  • Automated changelog generation
  • Example values for common scenarios

Rationale: Public charts are judged by testing transparency. Missing ct is a red flag for potential users.

The Minimum Viable Testing Stack

For any environment deploying Helm charts to production, this is the baseline:

Layer 1: Pre-Commit (Developer Laptop)

helm lint charts/my-chart
yamllint charts/my-chart

Layer 2: CI Pipeline (Automated on PR)

# Fast validation
helm template my-chart ./charts/my-chart | kubeconform \
  -kubernetes-version 1.30.0 \
  -summary

# Security baseline
trivy config ./charts/my-chart --exit-code 1 --severity CRITICAL,HIGH

Layer 3: Pre-Production (Staging Environment)

# Integration test with real cluster
ct install --config ct.yaml --charts charts/my-chart

Time investment:

  • Initial setup: 4-8 hours
  • Per-PR overhead: 3-5 minutes
  • Maintenance: ~1 hour/month

ROI calculation:

Average production incident caused by untested chart:

  • Detection: 15 minutes
  • Triage: 30 minutes
  • Rollback: 20 minutes
  • Post-mortem: 1 hour
  • Total: ~2.5 hours of engineering time

If chart testing prevents even one incident per quarter, it pays for itself in the first month.

Common Anti-Patterns to Avoid

Anti-Pattern 1: Only using --dry-run

helm install --dry-run validates syntax but skips:

  • Admission controller logic
  • RBAC validation
  • Actual resource creation

Better: Combine dry-run with kubeconform and at least one integration test.
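
Server-side dry-run closes part of this gap, because the rendered manifests travel through admission controllers and RBAC on a real API server without being persisted (release and chart names are placeholders):

```shell
#!/usr/bin/env bash
set -u
if command -v helm >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  # Client-side --dry-run only templates locally; --dry-run=server
  # submits the objects to the API server (admission webhooks, RBAC,
  # schema validation) and then discards them instead of persisting.
  if helm template my-release charts/my-chart \
      | kubectl apply --dry-run=server -f -; then
    DRYRUN="passed"
  else
    DRYRUN="failed"
  fi
else
  echo "helm/kubectl not installed; skipping"
  DRYRUN="skipped"
fi
echo "dry-run: $DRYRUN"
```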

Anti-Pattern 2: Testing only in production-like clusters

“We test in staging, which is identical to production.”

Problem: Staging clusters rarely match production exactly (node counts, storage classes, network policies). Integration tests should run in isolated, ephemeral environments.

Anti-Pattern 3: Security scanning without enforcement

Running a Trivy scan without failing the build on critical findings is theater.

Better: Set --exit-code 1 and enforce in CI.

Anti-Pattern 4: Ignoring upgrade paths

Most chart failures happen during upgrades, not initial installs. Chart-testing addresses this with ct install --upgrade.

Conclusion: Testing is Infrastructure Maturity

The gap between teams that test Helm charts and those that don’t is not about tooling availability—it’s about treating infrastructure code with the same discipline as application code.

The cost of testing is measured in minutes per PR. The cost of not testing is measured in hours of production incidents, eroded trust in automation, and teams reverting to manual deployments because “Helm is too risky.”

The testing stack you choose matters less than the fact that you have one. Start with the minimal viable stack (lint + schema + security), run it consistently, and expand as your charts become more complex.

By implementing a structured testing pipeline, you catch 95% of chart issues before they reach production. The remaining 5% are edge cases that require production observability, not more testing layers.

Helm chart testing is not about achieving perfection—it’s about eliminating the preventable failures that undermine confidence in your deployment pipeline.

Frequently Asked Questions (FAQ)

What is Helm chart testing and why is it important in production?

Helm chart testing ensures that Kubernetes manifests generated from Helm templates are syntactically correct, schema-compliant, secure, and function correctly when deployed. In production, untested charts can cause outages, security incidents, or failed upgrades, even if application code itself is stable.

Is helm lint enough to validate a Helm chart?

No. helm lint only validates chart structure and basic best practices. It does not validate rendered manifests against Kubernetes API schemas, test template logic, or verify runtime behavior. Production-grade testing requires additional layers such as schema validation, unit tests, and integration tests.

What is the difference between Helm unit tests and integration tests?

Unit tests (e.g., using helm-unittest) validate template logic by asserting expected output for given input values without deploying anything. Integration tests (e.g., using chart-testing or Terratest) deploy charts to a real Kubernetes cluster and validate runtime behavior, upgrades, and interactions with the API server.

Which tools are recommended for validating Helm charts against Kubernetes schemas?

The most commonly recommended tool is kubeconform, which validates rendered manifests against Kubernetes OpenAPI schemas for specific Kubernetes versions and supports CRDs. An alternative is kubectl --dry-run=server, which validates against a live API server.

How can Helm chart testing prevent production outages?

Testing catches common failure modes before deployment, such as missing selectors in Deployments, invalid RBAC permissions, incorrect conditionals, or incompatible API versions. Many production outages originate from configuration and chart logic errors rather than application bugs.

What is the role of security scanning in Helm chart testing?

Security scanning detects misconfigurations, policy violations, and vulnerabilities that functional tests may miss. Tools like Trivy and Conftest (OPA) help enforce security baselines, prevent unsafe defaults, and block deployments that violate organizational or compliance requirements.

Is chart-testing (ct) required for private Helm charts?

While not strictly required, chart-testing is highly recommended for any chart deployed to production. It is considered the de facto standard for integration testing, especially for charts with upgrades, multiple dependencies, or shared cluster environments.

What is the minimum viable Helm testing pipeline for CI?

At a minimum, a production-ready pipeline should include:

  • helm lint for structural validation
  • kubeconform for schema validation
  • trivy config for security scanning

Integration tests can be added as charts grow in complexity or criticality.

MinIO Maintenance Mode Explained: Impact on Users & S3 Alternatives


Background: MinIO and the Maintenance Mode announcement

MinIO has long been one of the most popular self-hosted S3-compatible object storage solutions for Kubernetes. Its simplicity, performance, and API compatibility made it a common default choice for backups, artifacts, logs, and internal object storage in on-premise and cloud-native environments.

In late 2025, MinIO marked its upstream repository as Maintenance Mode and clarified that the Community Edition would be distributed source-only, without official pre-built binaries or container images. This move triggered renewed discussion across the industry about sustainability, governance, and the risks of relying on a single-vendor-controlled “open core” storage layer.

A detailed industry analysis of this shift, including its broader ecosystem impact, can be found in this InfoQ article.

What exactly changed?

1. Maintenance Mode

Maintenance Mode means:

  • No new features
  • No roadmap-driven improvements
  • Limited fixes, typically only for critical issues
  • No active review of community pull requests

As highlighted by InfoQ, this effectively freezes MinIO Community as a stable but stagnant codebase, pushing innovation and evolution exclusively toward the commercial offerings.

2. Source-only distribution

Official binaries and container images are no longer published for the Community Edition. Users must:

  • Build MinIO from source
  • Maintain their own container images
  • Handle signing, scanning, and provenance themselves

This aligns with a broader industry pattern noted by InfoQ: infrastructure projects increasingly shifting operational burden back to users unless they adopt paid tiers.
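
What "build it yourself" entails can be sketched as a pipeline. Everything here is an assumption about your environment (registry name, pinned tag, and the upstream repo's Dockerfile and Go layout), not a documented MinIO procedure:

```shell
#!/usr/bin/env bash
# Sketch of a self-maintained MinIO release process. MINIO_TAG and the
# registry are placeholders you must supply; the build and signing
# commands are illustrative examples, not official instructions.
set -u
PINNED_TAG="${MINIO_TAG:-}"        # e.g. a specific upstream release tag
REGISTRY="registry.internal"       # placeholder internal registry

if [ -z "$PINNED_TAG" ]; then
  echo "Set MINIO_TAG to a pinned upstream tag first; nothing to do."
  RELEASE="skipped"
else
  # 1. Build from a pinned tag -- never from a moving branch.
  git clone --branch "$PINNED_TAG" --depth 1 https://github.com/minio/minio.git
  (cd minio && CGO_ENABLED=0 go build -o minio .)
  # 2. Package, scan, and sign -- now your responsibility, e.g.:
  docker build -t "$REGISTRY/minio:$PINNED_TAG" minio/
  trivy image "$REGISTRY/minio:$PINNED_TAG"
  cosign sign "$REGISTRY/minio:$PINNED_TAG"
  RELEASE="built"
fi
echo "release: $RELEASE"
```

The point is less the individual commands than the shape: version pinning, reproducible builds, scanning, and signing all move from upstream's CI into yours.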

Direct implications for Community users

Security and patching

With no active upstream development:

  • Vulnerability response times may increase
  • Users must monitor security advisories independently
  • Regulated environments may find Community harder to justify

InfoQ emphasizes that this does not make MinIO insecure by default, but it changes the shared-responsibility model significantly.

Operational overhead

Teams now need to:

  • Pin commits or tags explicitly
  • Build and test their own releases
  • Maintain CI pipelines for a core storage dependency

This is a non-trivial cost for what was previously perceived as a “drop‑in” component.

Support and roadmap

The strategic message is clear: active development, roadmap influence, and predictable maintenance live behind the commercial subscription.

Impact on OEM and embedded use cases

The InfoQ analysis draws an important distinction between API consumers and technology embedders.

Using MinIO as an external S3 service

If your application simply consumes an S3 endpoint:

  • The impact is moderate
  • Migration is largely operational
  • Application code usually remains unchanged

Embedding or redistributing MinIO

If your product:

  • Ships MinIO internally
  • Builds gateways or features on MinIO internals
  • Depends on MinIO-specific operational tooling

Then the impact is high:

  • You inherit maintenance and security responsibility
  • Long-term internal forking becomes likely
  • Licensing (AGPL) implications must be reassessed carefully

For OEM vendors, this often forces a strategic re-evaluation rather than a tactical upgrade.

Forks and community reactions

At the time of writing:

  • Several community forks focus on preserving the MinIO Console / UI experience
  • No widely adopted, full replacement fork of the MinIO server exists
  • Community discussion, as summarized by InfoQ, reflects caution rather than rapid consolidation

The absence of a strong server-side fork suggests that most organizations are choosing migration over replacement-by-fork.

Best S3-Compatible Alternatives to MinIO: Ceph RGW, SeaweedFS, Garage

InfoQ highlights that the industry response is not about finding a single “new MinIO”, but about selecting storage systems whose governance and maintenance models better match long-term needs.

Ceph RGW

Best for: Enterprise-grade, highly available environments
Strengths: Mature ecosystem, large community, strong governance
Trade-offs: Operational complexity

SeaweedFS

Best for: Teams seeking simplicity and permissive licensing
Strengths: Apache-2.0 license, active development, integrated S3 API
Trade-offs: Partial S3 compatibility for advanced edge cases

Garage

Best for: Self-hosted and geo-distributed systems
Strengths: Resilience-first design, active open-source development
Trade-offs: AGPL license considerations

Zenko / CloudServer

Best for: Multi-cloud and Scality-aligned architectures
Strengths: Open-source S3 API implementation
Trade-offs: Different architectural assumptions than MinIO

Recommended strategies by scenario

If you need to reduce risk immediately

  • Freeze your current MinIO version
  • Build, scan, and sign your own images
  • Define and rehearse a migration path

If you operate Kubernetes on-prem with HA requirements

  • Ceph RGW is often the most future-proof option

If licensing flexibility is critical

  • Start evaluation with SeaweedFS

If operational UX matters

  • Shift toward automation-first workflows
  • Treat UI forks as secondary tooling, not core infrastructure

Conclusion

MinIO’s shift of the Community Edition into Maintenance Mode is less about short-term breakage and more about long-term sustainability and control.

As the InfoQ analysis makes clear, the real risk is not technical incompatibility but governance misalignment. Organizations that treat object storage as critical infrastructure should favor solutions with transparent roadmaps, active communities, and predictable maintenance models.

For many teams, this moment serves as a natural inflection point: either commit to self-maintaining MinIO, move to a commercially supported path, or migrate to a fully open-source alternative designed for the long run.


Frequently Asked Questions

What does Maintenance Mode mean for MinIO Community Edition?

Maintenance Mode for MinIO Community Edition means the upstream codebase is effectively frozen. There will be no new features, only critical bug fixes, and community pull requests will not be actively reviewed. Furthermore, official pre-built binaries and container images are no longer provided; users must build from source.

Is MinIO Community Edition still safe to use?

The code itself isn’t inherently insecure, but the shared-responsibility model changes. Security patching for non-critical issues will be slower or non-existent. For production, especially in regulated environments, you must now actively monitor advisories and build, scan, and maintain your own images, which increases operational risk.

What is the best open-source alternative to MinIO for Kubernetes?

The best alternative depends on your needs. For enterprise-grade, high-availability setups, Ceph RGW is the most robust choice. For simplicity and Apache 2.0 licensing, SeaweedFS is excellent. For geo-distributed, resilience-first designs, evaluate Garage. There is no direct drop-in replacement; each requires evaluation.

Should I fork MinIO or migrate to another solution?

For most organizations, migration is preferable to forking. Maintaining a full fork of a complex storage server involves significant long-term commitment to security, bug fixes, and potential feature backporting. The industry trend, as noted, is toward migration to systems with active upstream development and clear governance.

How does this affect products that embed MinIO (OEM use)?

The impact on OEMs is high. You inherit full maintenance and security responsibility. Long-term internal forking becomes likely, and the AGPL licensing implications must be carefully reassessed. This often forces a strategic re-evaluation of your embedded storage layer rather than a simple tactical update.
