Kubernetes Security Best Practices: A Production Hardening Guide

Kubernetes security is not a single feature you enable — it is a layered discipline that spans the control plane, workloads, networking, supply chain, and runtime. This guide covers the security controls that matter most in production, why each one exists, and how to implement them without breaking your cluster.

The Kubernetes Attack Surface

Before hardening anything, understand what you are protecting. A Kubernetes cluster has several distinct attack surfaces:

  • API server — The central control plane. Any entity that can reach it with valid credentials can read cluster state, modify workloads, or escalate privileges.
  • etcd — Stores all cluster state in plain text, including Secrets. Direct etcd access is equivalent to root on every node.
  • Nodes — A compromised node can access all Secrets mounted on pods running on it, access the kubelet API, and potentially escape to the hypervisor.
  • Pods — Privileged pods, host-network pods, and pods with excessive capabilities can break container isolation.
  • Supply chain — Malicious images, compromised registries, and unsigned artifacts can introduce attacker-controlled code into your cluster.
  • RBAC — Overly permissive roles allow lateral movement and privilege escalation once an attacker gains any foothold.

The controls below address each of these surfaces. Prioritize based on your threat model — a public-facing multi-tenant cluster needs all of them; an internal development cluster can relax some.

1. RBAC: Least Privilege from Day One

Role-Based Access Control is Kubernetes’ primary authorization mechanism. Most clusters fail at RBAC not because the mechanism itself is broken, but because permissions are granted far more broadly than workloads need and nobody reviews them systematically.

Common RBAC Mistakes

  • Binding to cluster-admin for convenience. Almost no workload needs cluster-admin. Use namespaced roles wherever possible.
  • Using * verbs or resources in roles. Wildcard permissions are almost always broader than intended.
  • Not auditing ServiceAccount token usage. Every pod gets a ServiceAccount. The default ServiceAccount in most namespaces has no permissions, but custom workloads often get over-permissive SAs.
  • Forgetting automountServiceAccountToken: false. If a workload does not need to talk to the Kubernetes API, disable token mounting entirely.
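
A minimal sketch of the last point: a Pod that opts out of token mounting entirely (the name, namespace, and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-app
spec:
  serviceAccountName: my-app
  # No API access needed, so no token is projected into the pod
  automountServiceAccountToken: false
  containers:
  - name: app
    image: registry.company.com/my-app:1.0.0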

Practical RBAC Patterns

For a workload that only needs to read ConfigMaps in its own namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: my-app
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-configmap-reader
  namespace: my-app
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: my-app
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io

Audit existing RBAC with kubectl-who-can or rbac-tool to find overly permissive bindings before attackers do.
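
For example, assuming the kubectl-who-can plugin is installed (e.g. via krew), a quick review might look like this:

# Who can read Secrets in the production namespace?
kubectl who-can get secrets -n production

# Who can create RoleBindings, a common privilege escalation path?
kubectl who-can create rolebindings -n production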

2. Pod Security Standards

PodSecurityPolicy was deprecated in Kubernetes 1.21 and removed in 1.25. Its replacement is Pod Security Admission (PSA), a built-in admission controller that enforces one of three security profiles at the namespace level:

  • Privileged — No restrictions. For system components only.
  • Baseline — Prevents the most critical privilege escalations: privileged containers, hostPID, hostIPC, hostNetwork, dangerous capabilities.
  • Restricted — Enforces current hardening best practices. Requires running as non-root, dropping all capabilities, and using a restricted seccomp profile.

Enable enforcement at the namespace level with labels:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: v1.30
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/warn-version: v1.30
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/audit-version: v1.30

A pod that runs as root or requests host-network in a namespace enforcing restricted will be rejected at admission. The warn and audit modes let you test before enforcing.

PSA covers the most critical pod-level escalations, but it is coarse-grained. For fine-grained policy control, use Kyverno alongside PSA.
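
As an illustration of what finer-grained policy looks like, here is a minimal Kyverno sketch that rejects images using the :latest tag, something PSA cannot express (assumes Kyverno is installed; the policy name is illustrative):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
  - name: require-pinned-image-tag
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must be pinned to a tag other than :latest."
      pattern:
        spec:
          containers:
          - image: "!*:latest"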

3. Network Policies: Micro-Segmentation

By default, every pod in a Kubernetes cluster can communicate with every other pod across all namespaces. This is a flat network model that gives attackers unrestricted lateral movement once they compromise any workload.

Network Policies define L3/L4 allow-rules for pod-to-pod communication. They are enforced by your CNI plugin (Calico, Cilium, Weave — not Flannel, which does not support NetworkPolicy).

Default Deny Pattern

Start by denying all ingress and egress in every namespace, then open only what is explicitly needed:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

Then allow specific traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 5432

Do not forget DNS egress — most workloads need to resolve names via kube-dns, which requires UDP port 53 egress to the kube-system namespace.
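
A sketch of that DNS allowance, assuming your cluster sets the standard kubernetes.io/metadata.name label on namespaces (automatic since Kubernetes 1.21):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    # CoreDNS also serves TCP, used for large responses
    - protocol: TCP
      port: 53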

4. Secrets Management

Kubernetes Secrets are base64-encoded, not encrypted, and are stored in etcd in plain text by default. Anyone with get permission on Secrets can read them. This is not a vulnerability — it is a design decision that puts the responsibility on you to:

  • Enable encryption at rest for etcd. Configure EncryptionConfiguration with an AES-CBC or AES-GCM provider. This encrypts Secrets before they are written to etcd.
  • Use external secret stores. HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault with the External Secrets Operator keeps the source of truth for secret values outside the cluster (see the ExternalSecret sketch below).
  • Restrict Secret RBAC aggressively. Never give list on Secrets cluster-wide — it returns all values. Use get on named resources where possible.
  • Avoid environment variables for secrets. Prefer volume mounts. Env vars are visible in pod inspect output and can leak through application logging.

# etcd encryption at rest - in kube-apiserver config
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded-32-byte-key>
  - identity: {}
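
For the external-store approach from the list above, a minimal ExternalSecret sketch, assuming the External Secrets Operator is installed and a ClusterSecretStore named vault-backend already points at your Vault instance (names and the Vault path are illustrative):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: db-credentials      # Kubernetes Secret created and kept in sync by the operator
  data:
  - secretKey: password
    remoteRef:
      key: secret/data/production/db
      property: password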

5. Image Security and Supply Chain

Your runtime security posture is only as good as the images you run. A compromised image from a public registry bypasses every runtime control you have.

Scan images in CI

Use Trivy, Grype, or Snyk to scan images as part of your CI pipeline. Block deployments of images with critical CVEs:

# In your CI pipeline
trivy image --exit-code 1 --severity CRITICAL your-image:tag

Use a private registry with admission control

Only allow images from your private registry using an admission webhook (Kyverno, OPA Gatekeeper). This prevents developers from running arbitrary public images in production:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must come from registry.company.com"
      pattern:
        spec:
          containers:
          - image: "registry.company.com/*"

Use distroless or minimal base images

Distroless images contain only the application and its runtime dependencies — no shell, no package manager, no debugging tools. This drastically reduces the attack surface and the number of CVEs. Google’s distroless images are available for Java, Node.js, Python, and Go.

Sign and verify images

Cosign (from the Sigstore project) lets you sign container images and verify signatures at admission time using Kyverno or Connaisseur. This prevents image substitution attacks where an attacker replaces a legitimate image in your registry.
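
A sketch of the key-based Cosign flow (Sigstore also supports keyless signing with OIDC identities; the registry and image names are illustrative):

# Generate a keypair (writes cosign.key and cosign.pub)
cosign generate-key-pair

# Sign the image after it is pushed to your registry
cosign sign --key cosign.key registry.company.com/my-app:1.4.2

# Verify the signature, e.g. in CI or at admission time
cosign verify --key cosign.pub registry.company.com/my-app:1.4.2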

6. Runtime Security

Runtime security detects and responds to malicious activity after a container is running. The primary tool in this space is Falco — a CNCF project that uses eBPF to monitor system calls and raise alerts when containers behave unexpectedly.

Default Falco rules catch common attack patterns:

  • Shell spawned in a container
  • Network connection to an unexpected IP
  • Write to a sensitive file path (/etc/passwd, /etc/shadow)
  • Privilege escalation via setuid binaries
  • Container drift (new executable files written at runtime)

Combine Falco with seccomp profiles to restrict the system calls a container can make at the kernel level. The RuntimeDefault seccomp profile (the container runtime’s default profile, which the kubelet can apply to all workloads via its SeccompDefault setting, stable since Kubernetes 1.27) blocks dozens of system calls that containers virtually never need.

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 65534
      capabilities:
        drop: ["ALL"]

These four securityContext settings together (allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, runAsNonRoot: true, capabilities.drop: ALL) make container escape significantly harder and satisfy the Kubernetes Restricted pod security standard.

7. API Server Hardening

The API server is the most critical component to harden. Key settings:

  • Disable anonymous authentication. --anonymous-auth=false ensures every request is authenticated.
  • Enable audit logging. Log all API server requests to a file or webhook. Without audit logs, you cannot investigate incidents or detect RBAC abuse.
  • Enable the NodeRestriction admission plugin. It prevents node kubelets from modifying Node and Pod objects that do not belong to their own node.
  • Do not expose the API server to the internet. Use a VPN, bastion host, or private endpoint. If you must expose it, restrict access by IP.

# Minimal audit policy - log Secret access at metadata level only
# (request bodies would write secret values into the audit log),
# full request bodies for ConfigMaps, metadata for everything else
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
- level: RequestResponse
  resources:
  - group: ""
    resources: ["configmaps"]
- level: Metadata
  omitStages: ["RequestReceived"]

8. etcd Security

etcd stores all cluster state. Treat it as sensitive as your production database:

  • Enable TLS for all etcd communication. Both peer communication (etcd-to-etcd) and client communication (apiserver-to-etcd) must use mutual TLS.
  • Restrict network access to etcd. etcd should only be reachable by the API server. Use firewall rules or security groups to enforce this.
  • Enable encryption at rest. As described in the Secrets section above.
  • Backup etcd regularly. An etcd snapshot is a complete copy of all cluster state, including all Secrets. Encrypt backups and store them separately from the cluster.
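
A backup sketch using etcdctl, assuming a kubeadm-style control plane where the etcd client certificates live under /etc/kubernetes/pki/etcd (adjust paths for your distribution):

# Take a snapshot over mutual TLS
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Inspect the snapshot before shipping it to encrypted off-cluster storage
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db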

9. CIS Kubernetes Benchmark

The CIS Kubernetes Benchmark is a comprehensive checklist of security controls covering the control plane, nodes, and workloads. Running kube-bench against your cluster gives you a scored assessment against the CIS controls:

kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs $(kubectl get pods -l app=kube-bench -o name)

kube-bench outputs PASS/FAIL/WARN for each control with remediation guidance. Run it after initial cluster setup and after major configuration changes.

10. Continuous Security Posture with Kubescape

Kubescape and similar tools (Trivy Operator, formerly Starboard, and kube-score) provide continuous security scanning of live cluster state — not just a one-time audit. They check workloads against NSA/CISA hardening guidelines, the MITRE ATT&CK framework, and the CIS benchmark in real time.

Deploy Trivy Operator for continuous in-cluster scanning:

helm repo add aquasecurity https://aquasecurity.github.io/helm-charts/
helm install trivy-operator aquasecurity/trivy-operator \
  --namespace trivy-system \
  --create-namespace \
  --set="trivy.ignoreUnfixed=true"

Trivy Operator creates VulnerabilityReport, ConfigAuditReport, and RbacAssessmentReport custom resources in the same namespace as each workload. These can be scraped by Prometheus and displayed in Grafana for a security dashboard.
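
Once the operator has scanned your workloads, the reports are plain custom resources you can query with kubectl (the namespace is illustrative):

# Vulnerability findings per workload
kubectl get vulnerabilityreports -n production

# Configuration audit findings (securityContext, probes, resource limits, ...)
kubectl get configauditreports -n production -o wide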

Security Hardening Checklist

  • ✅ RBAC reviewed — no wildcard roles, no unnecessary cluster-admin bindings
  • ✅ ServiceAccount token automount disabled for workloads that do not need API access
  • ✅ Pod Security Standards enforced at namespace level (at least Baseline, Restricted where possible)
  • ✅ Network policies deployed — default deny with explicit allows
  • ✅ Secrets encrypted at rest in etcd
  • ✅ Images scanned in CI — no critical CVEs in production
  • ✅ Private registry enforced via admission control
  • ✅ Container securityContext hardened (non-root, read-only filesystem, no capabilities)
  • ✅ seccomp RuntimeDefault profile enabled
  • ✅ API server audit logging enabled
  • ✅ etcd TLS and network access restricted
  • ✅ kube-bench run and critical/high findings remediated
  • ✅ Runtime security (Falco) deployed and alerts routed to on-call
  • ✅ Continuous scanning (Trivy Operator or Kubescape) deployed

FAQ

Where do I start if my cluster has no security controls today?

Start with the highest-impact, lowest-effort controls first: audit your RBAC (revoke cluster-admin where not needed), enable Pod Security Admission in warn mode on all namespaces, and deploy Trivy Operator. These three steps give you immediate visibility and prevent the most common privilege escalations without breaking anything.

Does enabling Network Policies break DNS resolution?

Yes, if you deploy a default-deny egress policy without explicitly allowing DNS. Add an egress rule allowing UDP port 53 to the kube-dns service in kube-system when applying default-deny network policies.

Is Kubernetes certified for PCI-DSS or SOC 2?

Kubernetes itself is not certified — your configuration and the controls you implement determine compliance. The CIS Kubernetes Benchmark maps to many PCI-DSS and SOC 2 requirements. Managed Kubernetes offerings (EKS, GKE, AKS) have their own compliance certifications for the underlying infrastructure.

Should I use OPA Gatekeeper or Kyverno?

Both enforce admission policies, but Kyverno is Kubernetes-native (policies are written as YAML) while Gatekeeper uses Rego (a purpose-built policy language). For teams without Rego expertise, Kyverno is significantly faster to adopt and maintain. For teams already using OPA elsewhere in their stack, Gatekeeper offers consistency. Both integrate well with GitOps workflows.

How often should I update Kubernetes for security patches?

Follow a patch release within 30 days of release for CVEs rated High or Critical. Minor version upgrades (e.g., 1.29 → 1.30) should happen within the support window — Kubernetes maintains the last three minor versions. Falling more than one minor version behind means running without security patches for a growing subset of the codebase.

For a deeper look at how security fits into the broader Kubernetes platform architecture, see the Kubernetes architecture patterns guide and the guide on building a security-first Kubernetes culture.

ArgoCD Guide: GitOps Continuous Delivery for Kubernetes

ArgoCD has become the de facto standard for GitOps-based continuous delivery in Kubernetes. If you are running production workloads on Kubernetes and still deploying with raw kubectl apply or untracked Helm releases, ArgoCD solves a class of problems you may not even know you have yet. This guide covers everything from core concepts to production-grade configuration.

The Problem ArgoCD Solves

Traditional CI/CD pushes deployments into a cluster. A CI system runs tests, builds an image, and then executes kubectl apply or helm upgrade against the cluster. This model has several structural problems:

  • Drift goes undetected. Someone applies a hotfix directly to the cluster. Now your Git repository no longer reflects reality, and nobody knows it.
  • No single source of truth. The cluster state is authoritative, not Git. Your desired state and actual state can diverge silently.
  • Rollback is painful. Rolling back a bad deployment means re-running old CI pipelines or manually reversing changes, neither of which is fast.
  • Multi-cluster management compounds the problem. Each cluster becomes a snowflake with its own history of undocumented changes.

GitOps inverts this model. Git is the source of truth. The cluster pulls its desired state from Git and continuously reconciles toward it. ArgoCD is the most mature GitOps operator for Kubernetes, implementing this pull-based model with a production-ready feature set.

How ArgoCD Works: Core Architecture

ArgoCD runs as a set of controllers inside your Kubernetes cluster. The core components are:

  • Application Controller — Watches both the Git repository and the live cluster state. Computes the diff and drives reconciliation.
  • API Server — Exposes the gRPC/REST API consumed by the CLI, UI, and external systems.
  • Repository Server — Generates Kubernetes manifests from source (Helm, Kustomize, plain YAML, Jsonnet).
  • Redis — Caches cluster state and repository data to reduce API server load.
  • Dex (optional) — Provides OIDC authentication for SSO integration.

The fundamental unit in ArgoCD is an Application — a CRD that maps a source (a path in a Git repo at a specific revision) to a destination (a namespace in a cluster). ArgoCD continuously compares the desired state from Git with the live state in the cluster and reports on the sync status.

Sync Status vs Health Status

Two orthogonal concepts you need to understand from day one:

  • Sync Status — Does the live state match what Git says it should be? Values: Synced, OutOfSync, Unknown.
  • Health Status — Is the application actually working? Values: Healthy, Progressing, Degraded, Suspended, Missing, Unknown.

An application can be Synced but Degraded — the manifests were applied correctly, but a pod is crash-looping. Conversely, it can be OutOfSync but Healthy — someone applied a change directly to the cluster outside of Git.
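
Both statuses are visible from the CLI or directly on the Application resource; for example (the application name is illustrative):

# Shows sync status, health status, and per-resource detail
argocd app get my-app

# Or read the two fields straight from the Application object
kubectl get application my-app -n argocd \
  -o jsonpath='{.status.sync.status}{"  "}{.status.health.status}{"\n"}'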

Installing ArgoCD

The official installation method uses a single manifest. For production, always pin to a specific version:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.11.0/manifests/install.yaml

This deploys ArgoCD in the argocd namespace with full cluster-admin access. For a production HA setup, use the manifests/ha/install.yaml variant, which deploys multiple replicas of the API server and application controller.

Accessing the UI and CLI

The initial admin password is auto-generated and stored in a secret:

argocd admin initial-password -n argocd

For local access, port-forward the API server:

kubectl port-forward svc/argocd-server -n argocd 8080:443

Then log in via the CLI:

argocd login localhost:8080 --username admin --password <password> --insecure

For production, expose the ArgoCD server via an Ingress or LoadBalancer with a proper TLS certificate. If you’re using NGINX Ingress Controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server-ingress
  namespace: argocd
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  ingressClassName: nginx
  rules:
  - host: argocd.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: argocd-server
            port:
              number: 443

Defining Your First Application

Applications can be created via the UI, the CLI, or declaratively with a YAML manifest. The declarative approach is the recommended one — it means your ArgoCD configuration itself is in Git:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-app
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

Key fields to understand:

  • targetRevision — Can be a branch name, tag, or commit SHA. For production, pin to a tag rather than HEAD.
  • path — The directory within the repo containing your Kubernetes manifests.
  • automated.prune — Automatically delete resources that are no longer in Git. Required for true GitOps but use carefully — it will delete things.
  • automated.selfHeal — Automatically revert manual changes made directly to the cluster. This is what enforces Git as the single source of truth.

Helm Integration

ArgoCD has native Helm support. It can deploy Helm charts directly from chart repositories or from your Git repository. You can override values per environment:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 58.4.0
    helm:
      releaseName: prometheus-stack
      valuesObject:
        grafana:
          adminPassword: "${GRAFANA_PASSWORD}"
        prometheus:
          prometheusSpec:
            retention: 30d
            storageSpec:
              volumeClaimTemplate:
                spec:
                  storageClassName: fast-ssd
                  resources:
                    requests:
                      storage: 50Gi
  destination:
    server: https://kubernetes.default.svc
    namespace: observability

One important nuance: ArgoCD renders Helm charts with helm template and applies the resulting manifests itself; it does not run helm install. Helm hooks (pre-install, post-upgrade, etc.) are mapped to ArgoCD hooks, but the release is not tracked in Helm’s release history, so helm list will not show ArgoCD-managed releases.

Projects: Multi-Tenancy and Access Control

ArgoCD Projects provide multi-tenancy within a single ArgoCD instance. They let you restrict which source repositories, destination clusters, and namespaces a team can deploy to. Every Application belongs to a Project.

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: platform-team
  namespace: argocd
spec:
  description: Platform team applications
  sourceRepos:
  - 'https://github.com/your-org/*'
  destinations:
  - namespace: 'platform-*'
    server: https://kubernetes.default.svc
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  namespaceResourceBlacklist:
  - group: ''
    kind: ResourceQuota

Projects are where you define the boundaries of what each team can do. The default project has no restrictions — never use it for production workloads. Create dedicated projects per team or per environment.

RBAC Configuration

ArgoCD has its own RBAC system layered on top of Kubernetes RBAC. It is configured via the argocd-rbac-cm ConfigMap. Roles are defined per project or globally:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    # Platform team has full access to platform-team project
    p, role:platform-admin, applications, *, platform-team/*, allow
    p, role:platform-admin, projects, get, platform-team, allow
    p, role:platform-admin, repositories, *, *, allow

    # Dev team can sync but not delete
    p, role:developer, applications, get, */*, allow
    p, role:developer, applications, sync, */*, allow
    p, role:developer, applications, action/*, */*, allow

    # Bind SSO groups to roles
    g, your-org:platform-team, role:platform-admin
    g, your-org:developers, role:developer

The policy.default: role:readonly ensures that any authenticated user who has no explicit role assignment gets read-only access — a safe default for production.

Multi-Cluster Management

ArgoCD can manage multiple Kubernetes clusters from a single control plane. Register external clusters with the CLI:

# First, ensure the target cluster context is in your kubeconfig
argocd cluster add production-eu-west --name production-eu-west

# Verify registration
argocd cluster list

ArgoCD will create a ServiceAccount in the target cluster and store its credentials as a Kubernetes secret in the ArgoCD namespace. Applications can then target this cluster by its API server URL in destination.server, or by its registered name in destination.name.

For large-scale multi-cluster setups, consider the App of Apps pattern or ApplicationSets. ApplicationSets are a controller that generates Applications dynamically based on generators — cluster lists, Git directory structures, or matrix combinations:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  generators:
  - clusters:
      selector:
        matchLabels:
          environment: production
  template:
    metadata:
      name: '{{name}}-addons'
    spec:
      project: platform
      source:
        repoURL: https://github.com/your-org/cluster-addons
        targetRevision: HEAD
        path: 'addons/{{metadata.labels.region}}'
      destination:
        server: '{{server}}'
        namespace: kube-system

This single ApplicationSet deploys the appropriate addons to every cluster labeled environment: production, using each cluster’s region label to select the correct path in the repository.

Sync Strategies and Waves

When deploying complex applications with dependencies between resources, you need to control the order of deployment. ArgoCD provides two mechanisms:

Sync Phases

Resources are deployed in three phases: PreSync, Sync, and PostSync. Use Sync Hooks for resources that must complete before the main sync proceeds (database migrations, certificate issuance, etc.):

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: your-app:v1.2.3
        command: ["./migrate.sh"]
      restartPolicy: Never

Sync Waves

Within the Sync phase, waves control ordering. Resources with a lower wave number are applied and must become healthy before resources with higher wave numbers are applied:

# Applied first
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"

# Applied after wave 1 is healthy
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"

Notifications and Alerting

ArgoCD Notifications is a standalone controller that sends alerts when Application state changes. It supports Slack, PagerDuty, GitHub commit status, email, and a dozen other providers. Configure it via the argocd-notifications-cm ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  template.app-sync-failed: |
    slack:
      attachments: |
        [{
          "title": "{{.app.metadata.name}}",
          "color": "#E96D76",
          "fields": [{
            "title": "Sync Status",
            "value": "{{.app.status.sync.status}}",
            "short": true
          },{
            "title": "Message",
            "value": "{{range .app.status.conditions}}{{.message}}{{end}}",
            "short": false
          }]
        }]
  trigger.on-sync-failed: |
    - when: app.status.sync.status == 'Unknown'
      send: [app-sync-failed]
    - when: app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-failed]

Secret Management with ArgoCD

ArgoCD intentionally has no secret management built in — storing secrets in Git as plain text is never acceptable. The common patterns are:

  • Sealed Secrets (Bitnami) — Encrypts secrets with a cluster-specific key. The encrypted secret can be committed to Git; only the cluster can decrypt it.
  • External Secrets Operator — Syncs secrets from Vault, AWS Secrets Manager, GCP Secret Manager, etc. into Kubernetes secrets. The ArgoCD Application manages the ExternalSecret CRD, not the actual secret value.
  • argocd-vault-plugin — A plugin that replaces placeholder values in manifests with secrets retrieved from Vault at sync time.

The External Secrets Operator approach is the most flexible for teams already using a centralized secrets backend. The Application in ArgoCD deploys ExternalSecret objects, which the ESO controller resolves at runtime without ever touching Git.

Production Best Practices

  • Run ArgoCD in HA mode. Use manifests/ha/install.yaml with 3 replicas of the API server and multiple application controller shards for large clusters (100+ applications).
  • Pin image versions. Never use latest for the ArgoCD image itself. Pin to a specific version and upgrade deliberately.
  • Use the App of Apps pattern for bootstrapping. A single root Application deploys all other Applications. This makes cluster bootstrapping idempotent and reproducible.
  • Separate ArgoCD config from application config. Store ArgoCD Application manifests in a dedicated gitops repository, separate from application source code.
  • Enable resource tracking via annotations. Use application.resourceTrackingMethod: annotation in argocd-cm instead of the default label-based tracking, which can conflict with Helm’s own labels (see the example after this list).
  • Set resource limits on ArgoCD controllers. Application controller CPU and memory scale with the number of resources tracked. Monitor and tune accordingly.
  • Restrict auto-sync in production. Consider requiring manual sync approval for production environments even when using GitOps — or at minimum require a PR approval gate before changes reach the target branch.
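
The annotation-based tracking setting mentioned in the list above is a one-line change in the argocd-cm ConfigMap; a minimal sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  application.resourceTrackingMethod: annotation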

ArgoCD vs Flux

Flux v2 is the other major GitOps operator. Both are CNCF projects. The main differences in practice:

Feature            | ArgoCD                                      | Flux v2
UI                 | Built-in web UI                             | No official UI (use Weave GitOps)
Multi-cluster      | Single control plane manages many clusters  | Agent per cluster, pull model
ApplicationSets    | Native                                      | Kustomization + HelmRelease
Secret management  | Plugin-based                                | SOPS native integration
Learning curve     | Steeper (more concepts)                     | Lower (Kubernetes-native CRDs)
CNCF status        | Graduated                                   | Graduated

ArgoCD wins when you need the UI, multi-cluster management from a central plane, or have a large operations team that benefits from the visual application topology view. Flux wins when you want a simpler, purely Kubernetes-native approach with better SOPS integration for secret management.

FAQ

Can ArgoCD deploy to the cluster it runs in?

Yes. The https://kubernetes.default.svc destination refers to the local cluster. ArgoCD can manage both its own cluster and external clusters simultaneously.

Does ArgoCD support private Git repositories?

Yes. Configure repository credentials via argocd repo add with SSH keys, HTTPS username/password, or GitHub App credentials. Credentials are stored as Kubernetes secrets in the ArgoCD namespace.
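
For example, assuming a private repository and a deploy key or access token (paths and URLs are illustrative):

# SSH key authentication
argocd repo add git@github.com:your-org/your-app.git \
  --ssh-private-key-path ~/.ssh/argocd_deploy_key

# HTTPS with a token
argocd repo add https://github.com/your-org/your-app \
  --username git --password <personal-access-token>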

How does ArgoCD handle CRD installation?

CRDs can be managed by ArgoCD, but there is a chicken-and-egg problem: if a CRD is not yet installed, ArgoCD cannot validate resources that use it. The recommended pattern is to put CRDs in wave 1 and dependent resources in wave 2, or to use a separate Application for CRDs.

What is the difference between an Application and an AppProject?

An Application is the unit of deployment — it maps a Git source to a cluster destination. An AppProject is a grouping and access control boundary — it restricts what sources and destinations an Application within the project can use. Every Application belongs to exactly one AppProject.

How do I roll back a deployment with ArgoCD?

The GitOps way: revert the commit in Git and let ArgoCD reconcile. ArgoCD also provides a UI-based rollback to any previous sync revision, but this is considered a temporary measure — the Git history should always be updated to match.
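
The UI rollback has a CLI equivalent (the application name and history ID are illustrative; note that rollback requires automated sync to be disabled first):

# List previous sync revisions and their history IDs
argocd app history my-app

# Roll back to a specific history ID as a stopgap, then fix Git
argocd app rollback my-app 42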

Getting Started

The fastest path from zero to a working ArgoCD setup on a local cluster:

# 1. Create a local cluster (kind or minikube)
kind create cluster --name argocd-demo

# 2. Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 3. Wait for pods
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=120s

# 4. Get the initial admin password
argocd admin initial-password -n argocd

# 5. Port-forward and log in
kubectl port-forward svc/argocd-server -n argocd 8080:443 &
argocd login localhost:8080 --username admin --insecure

# 6. Deploy your first application
argocd app create guestbook \
  --repo https://github.com/argoproj/argocd-example-apps.git \
  --path guestbook \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace guestbook \
  --sync-policy automated

From here, the natural next steps are integrating ArgoCD with your existing CI pipeline (CI builds and pushes the image, updates the image tag in Git, ArgoCD detects the change and syncs), configuring SSO via Dex, and setting up the App of Apps pattern for managing multiple applications declaratively.

For teams looking to go deeper on GitOps and ArgoCD in production, the Kubernetes architecture patterns guide covers how ArgoCD fits into a broader platform engineering stack alongside service mesh, policy enforcement, and observability tooling.

Debugging Distroless Containers: kubectl debug, Ephemeral Containers, and When to Use Each

The container works fine in CI. It deploys successfully to staging. Then something goes wrong in production and you type the command you always type: kubectl exec -it my-pod -- /bin/bash. The response is immediate: OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: no such file or directory.

You try /bin/sh. Same error. You try ls. Same error. The container image is distroless — it ships only your application binary and its runtime dependencies, with no shell, no package manager, no debugging tools of any kind. This is intentional and correct from a security standpoint. It is also a significant operational challenge the first time you face it in production.

This article covers every practical technique for debugging distroless containers in Kubernetes: kubectl debug with ephemeral containers (the standard approach), pod copy strategy (for Kubernetes versions without ephemeral container support, or when you need to modify the running pod spec), debug image variants (the pragmatic developer shortcut), cdebug (a purpose-built tool that simplifies the process), and node-level debugging (the last resort with the most power). For each technique I will explain what it can and cannot do, what Kubernetes version or RBAC permissions it requires, and in which scenario — developer in local, platform engineer in staging, ops in production — it is the appropriate choice.

Why Distroless Breaks the Normal Debugging Workflow

Traditional container debugging assumes you can exec into the container and use shell tools: ps, netstat, strace, curl, a text editor. Distroless images remove all of this by design. The Google distroless project, Chainguard’s Wolfi-based images, and the broader minimal image ecosystem deliberately exclude everything that is not required to run the application. The result is a dramatically smaller attack surface: no shell means no RCE via shell injection, no package manager means no easy escalation path, fewer binaries means fewer CVEs in the image scan.

The tradeoff is operational: when something goes wrong, you cannot use the tools that the process itself is not allowed to run. A Java application in gcr.io/distroless/java17-debian12 has the JRE and nothing else. A Go binary compiled with CGO disabled and shipped in gcr.io/distroless/static-debian12 has literally only the binary and the necessary CA certificates and timezone data. There is no wget to download a debug binary, no apt to install one, no bash to run a script.

Kubernetes solves this at the platform level with ephemeral containers, added as stable in Kubernetes 1.25. The principle is that a debug container — which can have a full shell and any tools you want — can be injected into a running pod and share its process namespace, network namespace, and filesystem mounts without modifying the original container or restarting the pod.

Option 1: kubectl debug with Ephemeral Containers

Ephemeral containers are the canonical solution. Since Kubernetes 1.25 (stable), kubectl debug can inject a temporary container into a running pod. The container shares the target pod’s network namespace by default, and with --target it can also share the process namespace of a specific container, allowing you to inspect its running processes and open file descriptors.

The basic invocation is:

kubectl debug -it my-pod \
  --image=busybox:latest \
  --target=my-container

The --target flag is the critical piece. Without it, the ephemeral container gets its own process namespace. With it, it shares the process namespace of the specified container — meaning you can run ps aux and see the application’s processes, use ls -la /proc/<pid>/fd to inspect open file descriptors, and read the application’s environment via cat /proc/<pid>/environ.

For a more capable debug environment, replace busybox with a richer image:

kubectl debug -it my-pod \
  --image=nicolaka/netshoot \
  --target=my-container

nicolaka/netshoot includes tcpdump, curl, dig, nmap, ss, iperf3, and dozens of other network diagnostic tools, making it the standard choice for network debugging scenarios.

What You Can and Cannot Do

Ephemeral containers share the pod’s network namespace and, when --target is used, the process namespace. This gives you:

  • Full visibility into the application’s network traffic from inside the pod (tcpdump, ss, netstat)
  • Process inspection via /proc/<pid> — open files, memory maps, environment variables, CPU/memory usage
  • Access to the pod’s DNS resolution context — exactly the same /etc/resolv.conf the application sees
  • Ability to make outbound network calls from the same network namespace (testing service endpoints, DNS resolution)

What you do not get with ephemeral containers:

  • Access to the application container’s filesystem. The ephemeral container has its own root filesystem. You cannot cat /app/config.yaml from the application container’s filesystem unless you access it via /proc/<pid>/root/.
  • Ability to remove the container once added. Ephemeral containers are permanent until the pod is deleted. This is by design — the Kubernetes API does not allow removing them after creation.
  • Volume mount modifications via CLI. You cannot add volume mounts to an ephemeral container via kubectl debug (though the API spec supports it, the CLI does not expose this).
  • Resource limits. Ephemeral containers do not support resource requests and limits in the kubectl debug CLI, though this is evolving.

Accessing the Application Filesystem

The most common surprise for developers new to ephemeral containers is that they cannot directly browse the application container’s filesystem. The workaround is the /proc filesystem:

# Find the application's PID
ps aux

# Browse its filesystem via /proc
ls /proc/1/root/app/
cat /proc/1/root/etc/config.yaml

# Or set the root to the application's root
chroot /proc/1/root /bin/sh  # only if /bin/sh exists in the app image

The /proc/<pid>/root path is a symlink to the container’s root filesystem as seen from the process namespace. Because the ephemeral container shares the process namespace with --target, the application’s PID is typically 1, and /proc/1/root gives you full read access to its filesystem.

RBAC Requirements

Ephemeral containers require the pods/ephemeralcontainers subresource permission. This is separate from pods/exec, which controls kubectl exec. A common mistake is to grant pods/exec for debugging purposes without realizing that ephemeral containers require an additional grant:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ephemeral-debugger
rules:
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]

In production environments, this permission should be tightly scoped: time-limited via RoleBinding rather than permanent ClusterRoleBinding, restricted to specific namespaces, and ideally gated behind an approval workflow. The debug container runs as root by default, which can create privilege escalation paths if the application container runs as a non-root user with shared process namespace — the debug container can attach to the application’s processes with higher privileges.

Option 2: kubectl debug --copy-to (Pod Copy Strategy)

When you need to modify the pod’s container spec — replace the image, change environment variables, add a sidecar with a shared filesystem — the --copy-to flag creates a full copy of the pod with your modifications applied:

kubectl debug my-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=my-app:debug \
  --share-processes

This creates a new pod named my-pod-debug that is a copy of my-pod but with the container image replaced by my-app:debug. If my-app:debug is your application image built with debug tooling included (or a debug variant from your registry), this lets you interact with the exact same binary in the exact same configuration as the original pod.

A more common use of --copy-to is to attach a debug container alongside the existing application container while keeping the original image unchanged:

kubectl debug my-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=busybox \
  --share-processes \
  --container=debugger

This creates the copy-pod with both the original containers and a new debugger container sharing the process namespace. Unlike ephemeral containers, this approach supports volume mounts and resource limits, and the debug pod can be deleted cleanly when you are done.

Limitations of the Copy Strategy

The pod copy approach has a critical limitation: it is not debugging the original pod. It creates a new pod that may behave differently because:

  • It does not share the original pod’s in-memory state — if the issue is a goroutine leak or heap corruption that has been accumulating for hours, the fresh copy will not exhibit it immediately
  • It creates a new Pod UID, which means any admission webhooks, network policies, or pod-level security contexts that depend on pod identity may apply differently
  • If the original pod is crashing (CrashLoopBackOff), the copy will also crash — this technique does not help for crash debugging unless you also change the entrypoint

For crash debugging specifically, combine --copy-to with a modified entrypoint to keep the container alive:

kubectl debug my-crashing-pod \
  -it \
  --copy-to=my-pod-debug \
  --image=busybox \
  --share-processes \
  -- sleep 3600

Option 3: Debug Image Variants

The most pragmatic approach — and the one most appropriate for developer workflows — is to maintain a debug variant of your application image that includes shell tooling. Both the Google distroless project and Chainguard provide this pattern officially.

Google distroless images have a :debug tag that adds BusyBox to the image:

# Production image
FROM gcr.io/distroless/java17-debian12

# Debug variant — identical but with BusyBox shell
FROM gcr.io/distroless/java17-debian12:debug

Chainguard images follow a similar convention with :latest-dev variants that include apk, a shell, and common utilities:

# Production (zero shell, minimal footprint)
FROM cgr.dev/chainguard/go:latest

# Development/debug variant
FROM cgr.dev/chainguard/go:latest-dev

If you build your own base images, the recommended approach is to use multi-stage builds and maintain separate build targets:

FROM golang:1.22 AS builder
WORKDIR /app
COPY . .
RUN go build -o myapp .

# Production: static distroless image
FROM gcr.io/distroless/static-debian12 AS production
COPY --from=builder /app/myapp /myapp
ENTRYPOINT ["/myapp"]

# Debug variant: same binary, with shell tools
FROM gcr.io/distroless/static-debian12:debug AS debug
COPY --from=builder /app/myapp /myapp
ENTRYPOINT ["/myapp"]

In your CI/CD pipeline, build both targets and push my-app:${VERSION} (production) and my-app:${VERSION}-debug (debug variant) to your registry. The debug image is never deployed to production by default, but it exists and is ready to be used with kubectl debug --copy-to when needed.
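
A sketch of the corresponding CI steps, assuming the multi-stage Dockerfile above and an illustrative registry name:

# Build both targets from the same Dockerfile and source
docker build --target production -t registry.company.com/my-app:${VERSION} .
docker build --target debug -t registry.company.com/my-app:${VERSION}-debug .

# Push both; only the production tag is referenced by deployment manifests
docker push registry.company.com/my-app:${VERSION}
docker push registry.company.com/my-app:${VERSION}-debug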

Security Considerations for Debug Variants

Debug image variants defeat much of the security benefit of distroless if they are used in production, even temporarily. Track usage carefully: log when debug images are deployed, require explicit approval, and ensure they are removed after the debugging session. In regulated environments, consider whether deploying a debug variant to production namespaces is permitted by your security policy — in many cases it is not, and you must use ephemeral containers (which add a debug process to the pod without modifying the application image) instead.

Option 4: cdebug

cdebug is an open-source CLI tool that simplifies distroless debugging by wrapping kubectl debug with more ergonomic defaults and additional capabilities. Its primary value is in making ephemeral container debugging feel like a native shell experience:

# Install
brew install cdebug
# or: go install github.com/iximiuz/cdebug@latest

# Debug a running pod
cdebug exec -it my-pod

# Specify a namespace and container
cdebug exec -it -n production my-pod -c my-container

# Use a specific debug image
cdebug exec -it my-pod --image=nicolaka/netshoot

What cdebug adds over raw kubectl debug:

  • Automatic filesystem chroot. cdebug exec automatically sets the filesystem root of the debug container to the target container’s filesystem, so you browse / and see the application’s files — not the debug image’s files. This addresses the most common friction point with kubectl debug.
  • Docker integration. cdebug exec works identically for Docker containers (cdebug exec -it <container-id>), making it the same muscle memory for local and cluster debugging.
  • No RBAC complications for Docker-based local development — useful for developer workflows before the code reaches Kubernetes.

The tradeoff: cdebug is a third-party dependency and requires installation. In environments with strict tooling policies (regulated industries, air-gapped clusters), it may not be an option. In those cases, the raw kubectl debug workflow with /proc/1/root filesystem navigation is the baseline.

Option 5: Node-Level Debugging

When everything else fails — the pod is in CrashLoopBackOff too fast to attach to, the issue is a kernel-level problem, or you need tools like strace that require elevated privileges — node-level debugging gives you direct access to the container’s processes from the host node.

kubectl debug node/ creates a privileged pod on the target node that mounts the node’s root filesystem under /host:

kubectl debug node/my-node-name \
  -it \
  --image=nicolaka/netshoot

From this privileged pod, you can use nsenter to enter the namespaces of any container running on the node:

# Find the container's PID on the node
# (from within the node debug pod)
crictl ps | grep my-container
crictl inspect <container-id> | grep pid

# Enter the container's namespaces
nsenter -t <pid> -m -u -i -n -p -- /bin/sh

# Or just the network namespace (for network debugging)
nsenter -t <pid> -n -- ip a

The nsenter approach lets you run tools from the node’s or debug container’s toolset while operating in the namespaces of the target container. This is how you run strace against a distroless process: strace is not in the application container, but you can run it from the node level while targeting the application’s PID.

# Trace all syscalls from the application process
nsenter -t <pid> -- strace -p <pid> -f -e trace=network

RBAC and Security for Node Debugging

Node-level debugging requires nodes/proxy and the ability to create privileged pods, which in most production clusters is restricted to cluster administrators. The debug pod runs with hostPID: true and hostNetwork: true, giving it visibility into all processes and network traffic on the node — not just the target container. This is significant: every process running on the node, including those in other tenants’ namespaces, is visible.

This technique should be treated as a break-glass procedure: log the access, require dual approval in production environments, and clean up immediately after the debugging session by deleting the node-debugger pod that kubectl debug created (it is named node-debugger-<node-name>-<suffix>).

Choosing the Right Approach: Access Profile and Environment Matrix

The technique you should use depends on two axes: who you are (developer, platform engineer, ops/SRE) and where the issue is (local development, staging, production). The requirements and constraints differ significantly across these combinations.

Developer — Local or Development Cluster

Goal: Reproduce and understand a bug, inspect configuration, verify network connectivity to services.
Constraints: None material — full cluster admin on local or personal dev namespace.
Recommended approach: Debug image variants or cdebug.

In local development (Minikube, Kind, Docker Desktop), the fastest path is to build the debug variant of your image and deploy it directly. If you are working with another team’s service, cdebug exec gives you a shell in the container with automatic filesystem root without any special RBAC. The goal is speed and iteration — reserve the more structured approaches for higher environments.

Developer — Staging Cluster

Goal: Debug integration issues, inspect live configuration, verify environment-specific behavior.
Constraints: Shared cluster — cannot deploy arbitrary workloads to other teams’ namespaces, but has pods/ephemeralcontainers in own namespace.
Recommended approach: kubectl debug with ephemeral containers (--target), scoped to own namespace.

Staging is where ephemeral containers earn their keep. You can attach to a running pod without restarting it, without modifying the deployment spec, and without affecting other users of the same cluster. Grant developers pods/ephemeralcontainers in their team’s namespaces and they can self-service debug without needing ops involvement.

Platform Engineer / SRE — Production

Goal: Diagnose a live production incident. The pod is behaving unexpectedly — high latency, memory growth, unexpected connections, incorrect responses.
Constraints: Changes to running pods are high-risk. Any debug image deployment must be gated. The issue is live and affecting users.
Recommended approach: kubectl debug with ephemeral containers (ephemeral containers do not restart the pod, do not modify the deployment, and are auditable via API audit logs).

The key production requirements are auditability and minimal blast radius. Ephemeral containers satisfy both: they are recorded in the Kubernetes API audit log (who attached, when, to which pod), they do not modify the running application container, and they are limited to the pod’s own network and process namespaces. Document the debug session in your incident ticket: pod name, time, what was observed, who ran the debug container.

The --copy-to strategy is generally inappropriate for production incident response: it creates a new pod that may or may not exhibit the issue, it adds load to the cluster during an incident, and if it is attached to the same services (databases, downstream APIs), it produces additional traffic that complicates forensics.

Platform Engineer — Production, Node-Level Issue

Goal: Diagnose a kernel-level issue, a container runtime problem, a networking issue that spans multiple pods, or a situation where the pod is crashing too fast to attach to.
Constraints: Maximum privilege required. High operational risk.
Recommended approach: Node-level debug pod with nsenter. Treat as break-glass.

For this scenario, create a dedicated RBAC role that grants nodes/proxy access and the ability to create pods with hostPID: true in a dedicated debug namespace. Bind it only to specific users, require a separate authentication step (e.g., kubectl auth can-i check against a time-limited binding), and log all access. This level of access should generate a PagerDuty-style alert so that the security team knows a privileged debug session is active in production.

Common Errors and Solutions

Error: “ephemeral containers are disabled for this cluster”

Ephemeral containers require Kubernetes 1.16+ (alpha, behind feature gate) and are stable from 1.25. If you are on 1.16–1.22, you need to enable the EphemeralContainers feature gate on the API server and kubelet. From 1.23 it was beta and enabled by default. From 1.25 it is stable and always on. On managed Kubernetes services (EKS, GKE, AKS), check the cluster version — versions older than 1.25 may still have it disabled depending on your configuration.

Error: “cannot update ephemeralcontainers” (RBAC)

You have pods/exec but not pods/ephemeralcontainers. Add the grant shown in the RBAC section above. Note that pods/exec and pods/ephemeralcontainers are separate subresources — having one does not imply the other.

Error: “container not found” with --target

The container name in --target must match exactly the container name as defined in the Pod spec — not the image name. Check with kubectl get pod my-pod -o jsonpath='{.spec.containers[*].name}' to get the exact container names.

Error: Can see processes but cannot read /proc/1/root

The application container runs as a non-root user (e.g., UID 1000) and the ephemeral container runs as root. The application’s filesystem may have files owned by UID 1000 that are not readable by other UIDs depending on permissions. The /proc/<pid>/root path itself requires CAP_SYS_PTRACE capability. If your cluster’s PodSecurityStandards (PSS) are set to restricted, the debug container may not have this capability. Use the Baseline PSS profile for debug namespaces or explicitly add SYS_PTRACE to the ephemeral container’s securityContext.

Error: tcpdump shows no traffic

When using nicolaka/netshoot for network debugging, ensure the ephemeral container is created without --target if your goal is to capture all traffic on the pod’s network interface (not just the specific container’s process). With --target, you share the process namespace but the network namespace is shared at the pod level regardless. Run tcpdump -i any to capture on all interfaces including loopback, which is where inter-container traffic within a pod travels.

Decision Framework

Use this as a starting point to select the right technique for your situation:

Scenario                                 | Technique                                      | Requirement
Active production incident, pod running  | kubectl debug + ephemeral container            | pods/ephemeralcontainers RBAC, k8s 1.25+
Pod crashing too fast to attach          | kubectl debug --copy-to + modified entrypoint  | Ability to create pods in namespace
Developer debugging in dev/staging       | cdebug exec or kubectl debug                   | pods/ephemeralcontainers or pod create
Need full filesystem access              | kubectl debug --copy-to + debug image variant  | Debug image in registry, pod create
Need strace or kernel tracing            | Node-level debug with nsenter                  | nodes/proxy, cluster admin equivalent
Network packet capture                   | kubectl debug + nicolaka/netshoot              | pods/ephemeralcontainers
Local Docker debugging                   | cdebug exec <container-id>                     | Docker socket access
CI-reproducible debug environment        | Debug image variant in separate build target   | Separate image tag in registry

Production RBAC Design

A clean RBAC design for production distroless debugging separates three roles with different privilege levels:

# Tier 1: Developer self-service in team namespaces
# Allows attaching ephemeral containers, no node access
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: distroless-debugger
  namespace: team-namespace
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
---
# Tier 2: SRE production incident access
# Ephemeral containers across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-distroless-debugger
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/ephemeralcontainers"]
  verbs: ["update", "patch"]
- apiGroups: [""]
  resources: ["pods/attach"]
  verbs: ["create", "get"]
---
# Tier 3: Break-glass node access
# Only for platform team, time-limited binding recommended
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-debugger
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list", "delete"]
  # Restrict to debug namespace via RoleBinding, not ClusterRoleBinding

Bind Tier 1 permanently to your developers. Bind Tier 2 to SREs permanently but with audit alerts on use. Bind Tier 3 only on-demand (via a Kubernetes operator that creates time-limited RoleBindings) and never as a permanent ClusterRoleBinding.
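
To make the “RoleBinding, not ClusterRoleBinding” point concrete, a time-limited grant created during an incident might look like the sketch below; the debug namespace and user are placeholders, not values from this guide. One caveat: nodes/proxy is cluster-scoped, so that part of the role still requires a (time-limited) ClusterRoleBinding — the RoleBinding below only scopes the pod-creation permissions to the debug namespace.

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: break-glass-node-debugger
  namespace: debug                    # hypothetical dedicated debug namespace
subjects:
- kind: User
  name: jane@example.com              # placeholder; in practice created by an operator with an expiry
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-debugger
  apiGroup: rbac.authorization.k8s.io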

Summary

Distroless containers are the correct choice for production workloads. They reduce attack surface, eliminate unnecessary CVEs, and force a cleaner separation between application and tooling. The operational cost is that your traditional debugging workflow — exec into the container, run some commands — no longer works by default.

Kubernetes provides a clean answer with ephemeral containers and kubectl debug: inject a debug container with whatever tools you need into the running pod, sharing its network and process namespaces, without restarting or modifying the application. For scenarios where ephemeral containers are insufficient — filesystem access, crash debugging, kernel-level investigation — the copy strategy and node-level debug fill the remaining gaps.

The key to making this work at scale is not the technique itself but the access model: developers get self-service ephemeral container access in their own namespaces, SREs get cluster-wide ephemeral container access for production incidents, and node-level access is a break-glass procedure with audit trail and time limits. With that model in place, distroless becomes an operational non-issue rather than an obstacle.

Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does


There is a configuration that appears in virtually every Kubernetes cluster: a HorizontalPodAutoscaler targeting 70% CPU utilization and 70% memory utilization. It looks reasonable. It follows the examples in the official documentation. And in many cases, it silently causes more harm than good.

The problems surface in predictable ways: workloads that do nothing get scaled up because their memory footprint is naturally high. Latency-sensitive APIs scale too slowly because the CPU spike is already over by the time new pods are ready. Batch jobs oscillate between scaling up and down during normal operation. And teams spend hours debugging autoscaling behavior that should have been straightforward.

This article is about understanding why the default HPA configuration fails, under which exact conditions memory-based HPA is appropriate (and when it is not), and what alternative metrics — custom metrics, event-driven triggers, and external signals — produce autoscaling behavior that actually matches workload demand.

How HPA Actually Decides to Scale

Before diagnosing the problems, it is worth understanding the mechanics precisely. HPA computes a desired replica count using this formula:

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))

For a CPU target of 70%, with 2 replicas currently consuming an average of 140% of their CPU request, HPA computes ceil(2 × (140 / 70)) = 4 replicas. This is conceptually simple but has a critical dependency that most configurations ignore: the metric value is expressed relative to the resource request, not the resource limit.

This distinction is fundamental to understanding every failure mode that follows. If a container has a CPU request of 100m and a limit of 2000m, and it is currently consuming 80m, HPA sees 80% utilization — even though the container is using only 4% of its allowed ceiling. Set an HPA threshold of 70% on a container with a CPU request of 100m and any nontrivial workload will trigger scaling immediately.

The HPA controller polls metrics every 15 seconds by default (--horizontal-pod-autoscaler-sync-period). Scale-up happens quickly — within one to three polling cycles when the threshold is consistently exceeded. Scale-down is deliberately slow: by default the controller waits 5 minutes (--horizontal-pod-autoscaler-downscale-stabilization) before reducing replicas, to avoid thrashing. This asymmetry matters when debugging oscillation.

CPU-Based HPA: When It Works and When It Doesn’t

CPU is a compressible resource. When a container hits its CPU limit, the kernel throttles it — the process slows down but does not crash or get evicted. This property makes CPU a reasonable proxy for load in many, but not all, scenarios.

Where CPU HPA Works Well

Stateless request-processing workloads are the sweet spot for CPU-based HPA. If your service does CPU-bound work per request — REST APIs performing data transformation, compute-heavy business logic, image processing — then CPU utilization correlates strongly with request volume. More requests means more CPU consumed, which means HPA adds replicas, which distributes the load.

The key prerequisites for CPU HPA to work correctly are:

  • Accurate CPU requests. Set requests to the actual sustained consumption of the workload under normal load, not a low placeholder. Use VPA in recommendation mode or historical Prometheus data to right-size requests before enabling HPA.
  • Reasonable request-to-limit ratio. A ratio of 1:4 or less keeps HPA thresholds meaningful. A container with request 100m and limit 4000m makes percentage-based thresholds nearly useless.
  • CPU consumption that tracks user load linearly. If your service does CPU-heavy background work independent of incoming requests, CPU utilization will trigger scaling regardless of actual demand.

Where CPU HPA Fails

Latency-sensitive services with sharp traffic spikes. HPA reacts to average CPU utilization measured over the polling window. For a service that handles traffic bursts — a flash sale, a cron-triggered batch of API calls, a notification broadcast — by the time the HPA controller detects the spike, queues new pods, and those pods pass readiness checks, the burst may already be over. The result is replicas added after the damage is done, with the added cost of a scale-down cycle afterward.

I/O-bound workloads. A service that spends most of its time waiting on database queries, external API calls, or message queue reads will show low CPU utilization even under heavy load. HPA will not add replicas while the service is degraded — it sees idle CPUs while goroutines or threads are blocked waiting on I/O.

Workloads with cold-start costs. If a new replica takes 30-60 seconds to warm up (loading ML models, establishing connection pools, populating caches), scaling decisions need to happen earlier — before CPU peaks — not in reaction to it.

Memory-Based HPA: Why It Almost Always Breaks

Memory is an incompressible resource. Unlike CPU — which can be throttled without killing a process — when a container exhausts its memory limit, the OOM killer terminates it. This single property cascades into a set of fundamental problems with using memory as an HPA trigger.

The Core Problem: Memory Doesn’t Naturally Correlate With Load

For most well-architected services, memory consumption is relatively stable. A Go service allocates memory at startup for its runtime structures, connection pools, and caches — and then maintains roughly that footprint regardless of traffic. A JVM application allocates a heap at startup and uses garbage collection to manage it. In both cases, memory usage under 10 requests per second and under 10,000 requests per second may be nearly identical.

This means a memory-based HPA with a 70% threshold will either:

  • Never trigger, because the workload’s memory is stable and always below the threshold — rendering the HPA useless.
  • Always trigger, because the workload’s baseline memory consumption is naturally above the threshold — causing the workload to scale out permanently and never scale back in.

Neither outcome corresponds to actual scaling need.

The Request Misconfiguration Trap

This is the most common cause of “my workload scales up for no reason.” Consider a Java service that needs 512Mi of heap to run normally. The team sets the memory request to 256Mi — too conservative, either to save cost or because the initial estimate was wrong. The service immediately consumes 200% of its memory request just by being alive. An HPA with a 70% memory target will scale this workload to maximum replicas within minutes of deployment, and it will stay there forever.

The fix is never “adjust the HPA threshold.” The fix is right-sizing the memory request. But this reveals the deeper issue: memory-based HPA is extremely sensitive to the accuracy of your resource requests, and most teams do not have accurate requests — especially for newer workloads or after code changes that alter memory footprint.

JVM and Go Runtime Memory Behavior

JVM workloads are particularly problematic. By default, the JVM allocates heap up to a maximum (-Xmx) and then holds that memory — it does not release heap back to the OS aggressively, even after garbage collection. A JVM service that handles one request per hour will show nearly the same memory footprint as one handling thousands of requests per minute. Furthermore, the JVM’s garbage collector introduces memory spikes during collection cycles that are unrelated to load.

In containerized JVM environments, you also need to account for the container-awareness flag (-XX:+UseContainerSupport, enabled by default since JDK 10 and backported to 8u191), which affects how the JVM calculates its heap ceiling relative to the container limit. Without proper tuning, the JVM may allocate a heap that fills 80-90% of the container’s memory limit — immediately triggering any memory-based HPA.
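
One common mitigation — a sketch only, with illustrative container names, image, and numbers — is to cap the heap as a percentage of the container limit via JAVA_TOOL_OPTIONS in the Deployment’s container spec, so the steady-state footprint stays well below the limit:

containers:
- name: orders-api                          # hypothetical service
  image: registry.example.com/orders-api:1.4.2
  env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=60.0"      # cap heap at roughly 60% of the container memory limit
  resources:
    requests:
      memory: "768Mi"
    limits:
      memory: "1Gi"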

Go workloads behave differently but also poorly with memory HPA. Go’s garbage collector is designed to maintain low latency rather than minimal memory use. The runtime may hold memory above what is strictly needed, and the memory footprint can vary based on GC tuning parameters (GOGC, GOMEMLIMIT) in ways that are not correlated with incoming request load.

When Memory HPA Is Actually Appropriate

There are narrow cases where memory-based HPA makes sense:

  • Workloads where memory consumption genuinely tracks with load linearly. Some data processing pipelines, in-memory caches that grow with request volume, or streaming applications that buffer data proportionally to throughput. If you can demonstrate from metrics that memory and load have a strong linear correlation, memory HPA is defensible.
  • As a safety valve alongside CPU HPA. Using memory as a secondary metric (not primary) to protect against memory leaks or runaway allocations in a service that normally scales on CPU. In this case, set the memory threshold high — 85-90% — so it only triggers in genuine overconsumption scenarios.
  • Caching services where eviction is not desirable. If a service uses memory as a performance cache and you want to scale out before memory pressure causes cache eviction, memory utilization can be a useful trigger — provided requests are accurately sized.

Outside these specific cases, removing memory from your HPA spec and relying on the signals below will produce better behavior in virtually every scenario.
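
As a sketch of the safety-valve pattern described above (names and thresholds are illustrative), an autoscaling/v2 HPA can combine a primary CPU target with a high secondary memory threshold:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60      # primary scaling signal
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85      # safety valve for leaks and runaway allocations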

Right-Sizing Requests Before You Add HPA

No HPA strategy works correctly without accurate resource requests. Before adding any autoscaler — CPU, memory, or custom metrics — run your workload under representative load and measure actual consumption. The easiest way to do this is with VPA in recommendation mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only — don't auto-apply

After 24-48 hours of traffic, check the VPA recommendations:

kubectl describe vpa my-service-vpa

The lowerBound, target, and upperBound values give you a data-driven baseline for setting requests. Set your requests at or near the VPA target value before configuring HPA. This single step eliminates the most common cause of HPA misbehavior.

Note that VPA and HPA cannot both manage the same resource metric simultaneously. If VPA is set to auto-update CPU or memory, and HPA is also scaling on those metrics, the two controllers will fight each other. The safe combination is: HPA on CPU/memory + VPA in recommendation-only mode, or HPA on custom metrics + VPA on CPU/memory in auto mode. See the Kubernetes VPA guide for the full details.

Better Signals: What to Scale On Instead

The fundamental shift is moving from resource consumption metrics (which describe the past) to demand metrics (which describe what the workload is being asked to do right now or will be asked to do in seconds).

Requests Per Second (RPS)

For HTTP services, requests per second per replica is usually the most accurate proxy for load. Unlike CPU, it measures demand directly — not a side-effect of demand. An HPA that maintains 500 RPS per replica will scale predictably as traffic grows, regardless of whether the service is CPU-bound, memory-bound, or I/O-bound.

RPS is available as a custom metric from your service mesh (Istio exposes it as istio_requests_total), from your ingress controller (NGINX exposes request rates via Prometheus), or from your application’s own Prometheus metrics. Configuring HPA on custom metrics requires the Prometheus Adapter or a compatible custom metrics API implementation.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # 500 RPS per replica

Queue Depth and Lag

For consumer workloads — services reading from Kafka, RabbitMQ, SQS, or any message queue — the right scaling signal is consumer lag: how many messages are waiting to be processed. A lag of zero means consumers are keeping up; a growing lag means you need more consumers.

CPU will not give you this signal reliably. A consumer blocked on a slow database write will show low CPU but growing lag. An idle consumer will show low CPU even if the queue contains millions of unprocessed messages. Scaling on lag directly solves both problems.

This is precisely the use case that KEDA was built for. KEDA’s Kafka scaler, for example, reads consumer group lag directly and scales replicas to maintain a configurable lag threshold — no custom metrics pipeline required.
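
A minimal ScaledObject for this pattern might look like the following sketch — the broker address, topic, consumer group, deployment name, and lag threshold are placeholders:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer-scaler
  namespace: orders
spec:
  scaleTargetRef:
    name: orders-consumer            # Deployment running the consumer
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.example.com:9092
      consumerGroup: orders-consumer
      topic: orders
      lagThreshold: "100"            # target lag per replica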

Latency

P99 latency per replica is an excellent scaling signal for latency-sensitive services. If your SLO is a 200ms P99 response time and latency starts climbing toward 400ms, that is a direct signal that the service is overloaded — regardless of what CPU or memory shows.

Latency-based autoscaling requires custom metrics from your service mesh or APM tool, but the added complexity is often justified for user-facing APIs where latency directly impacts experience.

Scheduled and Predictive Scaling

For workloads with predictable traffic patterns — business-hours services, weekly batch jobs, end-of-month processing peaks — proactive scaling outperforms reactive scaling by definition. Rather than waiting for CPU to spike and then scrambling to add replicas, you pre-scale before the expected load increase.

KEDA’s Cron scaler enables this pattern declaratively, defining scale rules based on time windows rather than observed metrics.
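
A sketch of a Cron trigger that pre-scales a business-hours service (timezone, schedule, and replica count are illustrative):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler
spec:
  scaleTargetRef:
    name: my-service
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Berlin
      start: 0 8 * * 1-5            # scale up on weekdays at 08:00
      end: 0 19 * * 1-5             # scale back down at 19:00
      desiredReplicas: "10"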

HPA Configuration Best Practices

Always Set minReplicas ≥ 2 for Production

A minReplicas: 1 HPA means your service has a single point of failure during scale-in events. When HPA scales down to 1 replica and that pod is evicted for node maintenance, your service has zero available instances for the duration of the new pod’s startup time. For any production workload, set minReplicas: 2 as a baseline.

Tune Stabilization Windows

The default 5-minute scale-down stabilization window is too aggressive for many workloads. A service that processes jobs in 3-minute batches will show a predictable CPU trough between batches — HPA will attempt to scale down, only to scale back up when the next batch arrives. Increase the stabilization window to match your workload’s natural cycle:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 minutes
      policies:
      - type: Percent
        value: 25                        # Scale down max 25% of replicas at once
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

The behavior block (available in HPA v2, GA since Kubernetes 1.23) gives you independent control over scale-up and scale-down behavior. Aggressive scale-up with conservative scale-down is the right default for most production services.

Use a Lower CPU Threshold Than You Think

A CPU target of 70% sounds like it leaves headroom, but it does not account for the time required to scale. If your service takes 45 seconds to pass readiness checks after a new pod starts, and you scale at 70% CPU, the existing pods will be at 100%+ CPU (throttled) for 45 seconds before relief arrives. Set CPU targets at 50-60% for services where scale-up latency matters. This keeps more headroom available during the scaling reaction window.

Combine HPA with PodDisruptionBudgets

HPA scale-down terminates pods. Without a PodDisruptionBudget, HPA can terminate multiple replicas simultaneously during a scale-down event, potentially taking your service below its minimum healthy instance count during cluster maintenance. Always pair an HPA with a PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: my-service

Don’t Mix VPA Auto-Update with HPA on the Same Metric

If VPA is set to auto-update CPU or memory requests, and HPA is also scaling on CPU or memory utilization, you create a control loop conflict. VPA changes the request (the denominator of the utilization calculation), which immediately changes the apparent utilization, which triggers HPA to change replica count, which changes the per-pod load, which triggers VPA again. Use VPA in Off or Initial mode when HPA is managing the same workload on resource metrics.

Decision Framework: Which Autoscaler for Which Workload

Use this as a starting point when configuring autoscaling for a new workload:

| Workload type | Recommended signal | Tool |
| --- | --- | --- |
| Stateless HTTP API, CPU-bound | CPU utilization at 50-60% | HPA |
| Stateless HTTP API, I/O-bound | RPS per replica or P99 latency | HPA + custom metrics |
| Message queue consumer | Consumer lag / queue depth | KEDA |
| Event-driven / Kafka / SQS | Event rate or lag | KEDA |
| Predictable traffic pattern | Schedule (time-based) | KEDA Cron scaler |
| Workload with memory leak risk | CPU primary + memory at 85% secondary | HPA (v2 multi-metric) |
| Right-sizing before HPA | Historical CPU/memory recommendations | VPA recommendation mode |

Going Beyond HPA: KEDA and Custom Metrics

Once you outgrow what HPA v2 can express — particularly for event-driven architectures, external system triggers, or composite scaling conditions — KEDA provides a Kubernetes-native autoscaling framework that extends the HPA model without replacing it.

KEDA works by implementing a custom metrics API that HPA can consume, plus its own ScaledObject CRD that abstracts the configuration of over 60 built-in scalers: Kafka, RabbitMQ, Azure Service Bus, AWS SQS, Prometheus queries, Datadog metrics, HTTP request rate, and more. The important architectural point is that KEDA does not replace HPA — it feeds it. Under the hood, KEDA creates and manages an HPA resource targeting the scaled deployment. You get HPA’s stabilization windows, replica bounds, and Kubernetes-native behavior, driven by signals that HPA itself cannot access natively.

For a detailed walkthrough of KEDA scalers and real-world event-driven patterns, see Event-Driven Autoscaling in Kubernetes with KEDA.

For workloads where the right scaling signal comes from a Prometheus metric — request rates, custom business metrics, queue sizes exposed via exporters — the Kubernetes Autoscaling 1.26 and HPA v2 article covers how the custom metrics API pipeline works and how changes in Kubernetes 1.26 affected KEDA behavior.

❓ FAQ

Can I use both CPU and memory in the same HPA?

Yes. HPA v2 supports multiple metrics simultaneously — it scales to satisfy the most demanding metric. If CPU is at 40% (below threshold) but memory is at 80% (above threshold), HPA will scale up. This multi-metric capability is useful for using memory as a safety valve while CPU drives normal scaling behavior. Set the CPU threshold at 60% and the memory threshold at 85% so memory only triggers in genuine overconsumption scenarios.

Why does my workload scale up immediately after deployment?

Almost always a resource request misconfiguration. Check kubectl top pods immediately after deployment and compare the actual consumption to the configured request. If the workload is consuming 200% of its request by simply being alive, the request is set too low. Use VPA in recommendation mode for 24 hours and adjust the request to match actual usage before re-enabling HPA.

Why does HPA scale down too aggressively and cause latency spikes?

Increase the scaleDown.stabilizationWindowSeconds in the HPA behavior block. The default 300 seconds is too short for workloads with cyclical load patterns. Also add a Percent policy to scale down at most 25% of replicas per minute, preventing simultaneous termination of multiple pods during a rapid scale-down event.

Should I set HPA on every deployment?

No. HPA is appropriate for workloads where replica count meaningfully affects capacity — stateless services, consumers, request handlers. It is not appropriate for stateful workloads (databases, caches) where scaling requires more than just adding replicas, for singleton controllers that should never have more than one replica, or for batch jobs that should run to completion without scaling. Adding HPA to every deployment creates operational noise and potential instability without benefit.

What is the minimum CPU request I should set to use HPA reliably?

There is no absolute minimum, but requests below 100m make percentage thresholds very coarse-grained. At 50m CPU request and a 70% threshold, HPA triggers when the pod consumes 35m CPU — essentially any non-trivial activity. In practice, if your workload genuinely needs less than 100m CPU under load, it probably should not be using CPU-based HPA at all. Consider RPS or custom metrics instead.

How do I debug HPA scaling decisions?

Start with kubectl describe hpa <name> — it shows the current metric values, the computed desired replica count, and the last scaling event reason. For deeper inspection, check HPA events with kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler. If using custom metrics, verify the metrics server is returning expected values with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1".


Istio ServiceEntry Explained: External Services, DNS, and Traffic Control

Every production Kubernetes cluster talks to the outside world. Your services call payment APIs, connect to managed databases, push events to SaaS analytics platforms, and reach legacy systems that will never run inside the mesh. By default, Istio lets all outbound traffic flow freely — or blocks it entirely if you flip outboundTrafficPolicy to REGISTRY_ONLY. Neither extreme gives you what you actually need: selective, observable, policy-controlled access to external services.

That is exactly what Istio ServiceEntry solves. It registers external endpoints in the mesh’s internal service registry so that Envoy sidecars can apply the same traffic management, security, and observability features to outbound calls that you already enjoy for east-west traffic. No new proxies, no egress gateways required for the basic case — just a YAML resource that tells the mesh “this external thing exists, and here is how to reach it.”

In this guide, I will walk through every field of the ServiceEntry spec, explain the four DNS resolution modes with real-world use cases, and show production-ready patterns for external APIs, databases, TCP services, and legacy workloads. We will also cover how to combine ServiceEntry with DestinationRule and VirtualService to get circuit breaking, retries, connection pooling, and even sticky sessions for external dependencies.

What Is a ServiceEntry

Istio maintains an internal service registry that merges Kubernetes Services with any additional entries you declare. When a sidecar proxy needs to decide how to route a request, it consults this registry. Services inside the mesh are automatically registered. Services outside the mesh are not — unless you create a ServiceEntry.

A ServiceEntry is a custom resource that adds an entry to the mesh’s service registry. Once registered, the external service becomes a first-class citizen: Envoy generates clusters, routes, and listeners for it, which means you get metrics (istio_requests_total), access logs, distributed traces, mTLS origination, retries, timeouts, circuit breaking — the full Istio feature set.

Without a ServiceEntry, outbound traffic to an external host either passes through as a raw TCP connection (in ALLOW_ANY mode) with no telemetry, or gets dropped with a 502/503 (in REGISTRY_ONLY mode). Both outcomes are undesirable in production. The ServiceEntry bridges that gap.

ServiceEntry Anatomy: All Fields Explained

Let us look at a complete ServiceEntry and then break down each field.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: production
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  exportTo:
    - "."
    - "istio-system"

hosts

A list of hostnames associated with the service. For external services, this is typically the DNS name your application uses (e.g., api.stripe.com). For services using HTTP protocols, the hosts field is matched against the HTTP Host header. For non-HTTP protocols and services without a DNS name, you can use a synthetic hostname and pair it with addresses or static endpoints.

addresses

Optional virtual IP addresses associated with the service. Useful for TCP services where you want to assign a VIP that the sidecar will intercept. Not required for HTTP/HTTPS services that use hostname-based routing.

ports

The ports on which the external service is exposed. Each port needs a number, name, and protocol. The protocol matters: setting it to TLS tells Envoy to perform SNI-based routing without terminating TLS. Setting it to HTTPS means HTTP over TLS. For databases, you’ll typically use TCP.

location

MESH_EXTERNAL or MESH_INTERNAL. Use MESH_EXTERNAL for services outside your cluster (third-party APIs, managed databases). Use MESH_INTERNAL for services inside your infrastructure that are not part of the mesh — for example, VMs running in the same VPC that do not have a sidecar, or a Kubernetes Service in a namespace without injection enabled. The location affects how mTLS is applied and how metrics are labeled.

resolution

How the sidecar resolves the endpoint addresses. This is the most critical field and I will dedicate the next section to it. Options: NONE, STATIC, DNS, DNS_ROUND_ROBIN.

endpoints

An explicit list of network endpoints. Required when resolution is STATIC. Optional with DNS resolution to provide labels or locality information. Each endpoint can have an address, ports, labels, network, locality, and weight.

exportTo

Controls the visibility of this ServiceEntry across namespaces. Use "." for the current namespace only, "*" for all namespaces. In multi-team clusters, restrict exports to avoid namespace pollution.

Resolution Types: NONE vs STATIC vs DNS vs DNS_ROUND_ROBIN

The resolution field determines how Envoy discovers the IP addresses behind the service. Getting this wrong is the number one cause of ServiceEntry misconfigurations. Here is a clear breakdown.

| Resolution | How It Works | Best For |
| --- | --- | --- |
| NONE | Envoy uses the original destination IP from the connection. No DNS lookup by the proxy. | Wildcard entries, pass-through scenarios, services where the application already resolved the IP. |
| STATIC | Envoy routes to the IPs listed in the endpoints field. No DNS involved. | Services with stable, known IPs (e.g., on-prem databases, VMs with fixed IPs). |
| DNS | Envoy resolves the hostname at connection time and creates an endpoint per returned IP. Uses async DNS with health checking per IP. | External APIs behind load balancers, managed databases with DNS endpoints (RDS, CloudSQL). |
| DNS_ROUND_ROBIN | Envoy resolves the hostname and uses a single logical endpoint, rotating across returned IPs. No per-IP health checking. | Simple external services, services where you do not need per-endpoint circuit breaking. |

When to Use NONE

Use NONE when you want to register a range of external IPs or wildcard hosts without Envoy performing any address resolution. This is common for broad egress policies: “allow traffic to *.googleapis.com on port 443.” Envoy will simply forward traffic to whatever IP the application resolved via kube-dns. The downside: Envoy has limited ability to apply per-endpoint policies.
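
A sketch of such a broad egress entry — the namespace is an example; the wildcard host is the one mentioned above:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: google-apis
  namespace: istio-system
spec:
  hosts:
    - "*.googleapis.com"
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE        # Envoy forwards to whatever IP the application resolved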

When to Use STATIC

Use STATIC when the external service has known, stable IP addresses that rarely change. This avoids DNS dependencies entirely. You define the IPs in the endpoints list. Classic use case: a legacy Oracle database on a fixed IP in your data center.

When to Use DNS

Use DNS for most external API integrations. Envoy performs asynchronous DNS resolution and creates a cluster endpoint for each returned IP address. This enables per-endpoint health checking and circuit breaking — critical for production reliability. This is the mode you want for services like api.stripe.com or your RDS instance endpoint.

When to Use DNS_ROUND_ROBIN

Use DNS_ROUND_ROBIN when the external hostname returns many IPs and you do not need per-IP circuit breaking. Envoy treats all resolved IPs as a single logical endpoint and round-robins across them. This is lighter weight than DNS mode and avoids creating a large number of endpoints in Envoy’s cluster configuration.

Practical Patterns

Pattern 1: External HTTP API (api.stripe.com)

The most common ServiceEntry pattern. Your application calls a third-party HTTPS API. You want Istio telemetry, and optionally retries and timeouts.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS

Note the protocol is TLS, not HTTPS. Since your application initiates the TLS handshake directly, Envoy handles this as opaque TLS using SNI-based routing. If you were terminating TLS at the sidecar and doing TLS origination via a DestinationRule, you would set the protocol to HTTP and handle the upgrade separately — but for most external APIs, let the application manage its own TLS.

Pattern 2: External Managed Database (RDS / CloudSQL)

Managed databases expose a DNS endpoint that resolves to one or more IPs. During failover, the DNS record changes. You need Envoy to respect DNS TTLs and route to the current primary.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: orders-database
  namespace: orders
spec:
  hosts:
    - orders-db.abc123.us-east-1.rds.amazonaws.com
  location: MESH_EXTERNAL
  ports:
    - number: 5432
      name: postgres
      protocol: TCP
  resolution: DNS

For TCP services, Envoy cannot use HTTP headers to route, so it relies on IP-based matching. The DNS resolution mode ensures Envoy periodically re-resolves the hostname and updates its endpoint list. This is critical for RDS multi-AZ failover scenarios where the DNS endpoint flips to a new IP.

Pattern 3: Legacy Internal Service Not in the Mesh

You have a monitoring service running on a set of VMs at known IP addresses inside your VPC. It is not part of the mesh, but your meshed services need to talk to it.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-monitoring
  namespace: observability
spec:
  hosts:
    - legacy-monitoring.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.5.10
    - address: 10.0.5.11
    - address: 10.0.5.12

Key differences: location is MESH_INTERNAL because the service lives inside your network, and resolution is STATIC because we know the IPs. The hostname legacy-monitoring.internal is synthetic — your application uses it, and Istio’s DNS proxy (or a CoreDNS entry) resolves it to one of the listed endpoints.

Pattern 4: TCP Services with Multiple Ports

Some external services expose multiple TCP ports — for example, an Elasticsearch cluster with both data (9200) and transport (9300) ports.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-elasticsearch
  namespace: search
spec:
  hosts:
    - es.example.com
  location: MESH_EXTERNAL
  ports:
    - number: 9200
      name: http
      protocol: HTTP
    - number: 9300
      name: transport
      protocol: TCP
  resolution: DNS

Each port gets its own Envoy listener configuration. The HTTP port benefits from full Layer 7 telemetry and traffic management. The TCP port gets Layer 4 metrics and connection-level policies.

Combining ServiceEntry with DestinationRule

A ServiceEntry alone registers the external service. To apply traffic policies — connection pooling, circuit breaking, TLS origination, load balancing — you pair it with a DestinationRule. This is where things get powerful.

Connection Pooling and Circuit Breaking

External APIs have rate limits. Your managed database has a maximum connection count. Protecting these dependencies at the mesh level prevents cascading failures.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: stripe-api-dr
  namespace: payments
spec:
  host: api.stripe.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50
        connectTimeout: 5s
      http:
        h2UpgradePolicy: DO_NOT_UPGRADE
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 100

This configuration caps outbound connections to Stripe at 50, sets a 5-second connection timeout, and ejects endpoints that return 3 consecutive 5xx errors. In production, this prevents a degraded third-party API from consuming all your connection slots and causing a domino effect across your services.

TLS Origination

Sometimes your application speaks plain HTTP, but the external service requires HTTPS. Instead of modifying application code, you can offload TLS origination to the sidecar.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: default
spec:
  hosts:
    - api.external-service.com
  location: MESH_EXTERNAL
  ports:
    - number: 80
      name: http
      protocol: HTTP
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: external-api-tls
  namespace: default
spec:
  host: api.external-service.com
  trafficPolicy:
    portLevelSettings:
      - port:
          number: 443
        tls:
          mode: SIMPLE

Your application sends HTTP to port 80. A VirtualService (shown in the next section) redirects that to port 443. The DestinationRule initiates TLS to the external endpoint. The application never knows TLS happened.

Combining ServiceEntry with VirtualService

VirtualService gives you Layer 7 traffic management for external services: retries, timeouts, fault injection, header-based routing, and traffic shifting. This is invaluable when you are migrating between API providers or need resilience policies for unreliable external dependencies.

Retries and Timeouts

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: stripe-api-vs
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  http:
    - route:
        - destination:
            host: api.stripe.com
            port:
              number: 443
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes
        retryRemoteLocalities: true

This applies a 10-second overall timeout with up to 3 retry attempts (3 seconds each) for specific failure conditions. Note that this only works for HTTP-protocol ServiceEntries. For TLS-protocol entries where Envoy cannot see the HTTP layer, you are limited to TCP-level connection retries configured via the DestinationRule.

Traffic Shifting Between External Providers

Migrating from one external API to another? Use weighted routing to shift traffic gradually.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: geocoding-primary
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS          # endpoint addresses are DNS names; STATIC requires IP addresses
  endpoints:
    - address: api.old-geocoding-provider.com
      labels:
        provider: old
    - address: api.new-geocoding-provider.com
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: geocoding-dr
  namespace: geo
spec:
  host: geocoding.internal
  trafficPolicy:
    tls:
      mode: SIMPLE
  subsets:
    - name: old-provider
      labels:
        provider: old
    - name: new-provider
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: geocoding-vs
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  http:
    - route:
        - destination:
            host: geocoding.internal
            subset: old-provider
          weight: 80
        - destination:
            host: geocoding.internal
            subset: new-provider
          weight: 20

This sends 80% of geocoding traffic to the old provider and 20% to the new one. Adjust the weights as you gain confidence. Fully reversible — just set the old provider back to 100%.

DNS Resolution Patterns: Istio DNS Proxy vs kube-dns

Istio DNS resolution for external services involves two layers: how your application resolves the hostname (kube-dns / CoreDNS), and how the sidecar resolves the hostname (Envoy’s async DNS or Istio’s DNS proxy). Understanding the interplay is crucial for reliable Istio DNS behavior.

Default Flow (Without Istio DNS Proxy)

Your application calls api.stripe.com. kube-dns resolves it to an IP. The application opens a connection to that IP. The sidecar intercepts the connection and — if the ServiceEntry uses DNS resolution — Envoy independently resolves api.stripe.com to determine its endpoint list. Two separate DNS lookups happen, which can lead to inconsistencies if DNS records change between the two resolutions.

With Istio DNS Proxy (dns.istio.io)

Istio’s sidecar includes a DNS proxy that intercepts DNS queries from the application. When enabled (via meshConfig.defaultConfig.proxyMetadata.ISTIO_META_DNS_CAPTURE and ISTIO_META_DNS_AUTO_ALLOCATE), the proxy can:

  • Auto-allocate virtual IPs for ServiceEntry hosts that do not have addresses defined, which is critical for TCP ServiceEntries that need IP-based matching.
  • Resolve ServiceEntry hosts directly, avoiding the round-trip to kube-dns for known mesh services.
  • Ensure consistency between the application’s DNS resolution and the sidecar’s endpoint resolution.

In most sidecar-based Istio installations, DNS capture is opt-in rather than enabled by default. Verify whether your proxies have it enabled with:

istioctl proxy-config bootstrap <pod-name> -n <namespace> | grep -A2 "ISTIO_META_DNS"
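
If it is not enabled, a minimal sketch of turning it on mesh-wide, assuming an IstioOperator-based installation (the same proxyMetadata keys can also be set per workload via the proxy.istio.io/config annotation):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"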

When DNS Proxy Matters Most

The DNS proxy is especially important for TCP ServiceEntries without an explicit addresses field. Without a VIP, Envoy cannot match an incoming TCP connection to the correct ServiceEntry because there is no HTTP Host header to inspect. The DNS proxy solves this by auto-allocating a VIP from the 240.240.0.0/16 range and returning that VIP when the application resolves the hostname. The sidecar then intercepts traffic to that VIP and routes it to the correct external endpoint.

Sticky Sessions with ServiceEntry

Some external services require session affinity — for example, a legacy service that stores session state in memory, or a WebSocket endpoint that must maintain a persistent connection to the same backend. Istio supports sticky sessions for external services through consistent hashing in a DestinationRule.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-session-service
  namespace: default
spec:
  hosts:
    - legacy-session.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.1.10
    - address: 10.0.1.11
    - address: 10.0.1.12
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: legacy-session-dr
  namespace: default
spec:
  host: legacy-session.internal
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: SERVERID
          ttl: 3600s

This configuration hashes on an HTTP cookie named SERVERID. If the cookie does not exist, Envoy generates one and sets it on the response so that subsequent requests from the same client stick to the same endpoint. You can also hash on:

  • HTTP header: consistentHash.httpHeaderName: "x-user-id" — useful when your application sends a user identifier in every request.
  • Source IP: consistentHash.useSourceIp: true — simplest option but breaks in environments with NAT or shared egress IPs.
  • Query parameter: consistentHash.httpQueryParameterName: "session_id" — for REST APIs that include a session identifier in the URL.

Sticky sessions with ServiceEntry work identically to in-mesh sticky sessions. The key requirement is that the ServiceEntry must use STATIC or DNS resolution (not NONE) so that Envoy has multiple endpoints to hash across. With DNS_ROUND_ROBIN, there is only one logical endpoint, so consistent hashing has no effect.

Troubleshooting Common Issues

503 Errors When Calling External Services

The most common ServiceEntry issue. Start with this diagnostic sequence:

# Check if the ServiceEntry is applied and visible to the proxy
istioctl proxy-config cluster <pod-name> -n <namespace> | grep <external-host>

# Check the listeners
istioctl proxy-config listener <pod-name> -n <namespace> --port <port>

# Look at Envoy access logs for the specific request
kubectl logs <pod-name> -n <namespace> -c istio-proxy | grep <external-host>

Common causes of 503 errors:

  • Wrong protocol: Setting protocol: HTTPS when your application initiates TLS. Use TLS for pass-through; use HTTP only if the sidecar does TLS origination.
  • Missing ServiceEntry in REGISTRY_ONLY mode: If outboundTrafficPolicy is REGISTRY_ONLY, any host without a ServiceEntry is blocked.
  • exportTo restriction: The ServiceEntry is in namespace A, exported only to ".", and the calling pod is in namespace B.
  • DNS resolution failure: Envoy cannot resolve the hostname. Check that the DNS servers are reachable from the pod.

DNS Resolution Failures

When Envoy’s async DNS resolver fails, you will see UH (upstream unhealthy) or UF (upstream connection failure) flags in access logs.

# Verify DNS works from inside the sidecar
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET /dns_resolve?proxyID=<pod-name>.<namespace>&host=api.stripe.com

# Check Envoy cluster health
istioctl proxy-config endpoint <pod-name> -n <namespace> | grep <external-host>

If the endpoint shows UNHEALTHY, Envoy resolved the DNS but the outlier detection ejected the host. If no endpoint appears at all, DNS resolution is failing. Common fix: ensure your pods can reach an external DNS server, or that CoreDNS is configured to forward queries for the external domain.

TLS Origination Not Working

If you configured TLS origination via a DestinationRule but traffic still fails:

  • Ensure the ServiceEntry port protocol is HTTP, not TLS. If you set it to TLS, Envoy treats the connection as opaque TLS pass-through and will not apply the DestinationRule’s TLS settings.
  • Verify the DestinationRule’s host field exactly matches the ServiceEntry’s hosts entry.
  • Check that the VirtualService (if used) routes to the correct port number.

TCP ServiceEntry Not Intercepting Traffic

For TCP-protocol ServiceEntries without the DNS proxy, Envoy cannot match traffic by hostname. You must either:

  • Set an explicit addresses field with a VIP that your application targets.
  • Enable Istio’s DNS proxy to auto-allocate VIPs.
  • Ensure the destination IP matches what the ServiceEntry resolves to.

Without one of these, TCP traffic goes through the PassthroughCluster and bypasses your ServiceEntry entirely.

Frequently Asked Questions

Do I need a ServiceEntry if outboundTrafficPolicy is set to ALLOW_ANY?

You do not need one for connectivity — your services can reach external hosts without it. But you should create ServiceEntries anyway. Without them, outbound traffic goes through the PassthroughCluster, which means no detailed metrics per destination, no access logging with the external hostname, no circuit breaking, no retries, and no timeout policies. A ServiceEntry is the difference between “it works” and “it works reliably with observability.”

What is the difference between protocol TLS and HTTPS in a ServiceEntry port?

TLS tells Envoy to treat the connection as opaque TLS. Envoy reads the SNI header to determine routing but does not decrypt the payload. Use this when your application initiates TLS directly. HTTPS tells Envoy the protocol is HTTP over TLS, which implies Envoy should handle TLS. In practice, for external services where the application manages its own TLS, use TLS. Use HTTP with a DestinationRule TLS origination when you want the sidecar to handle TLS.

Can I use wildcards in ServiceEntry hosts?

Yes, but with limitations. You can use *.example.com to match any subdomain of example.com. However, wildcard entries only work with resolution: NONE because Envoy cannot perform DNS lookups for wildcard hostnames. This means you lose the ability to apply per-endpoint traffic policies. Wildcard ServiceEntries are best used for broad egress access control rather than fine-grained traffic management.

How do I configure sticky sessions for an external service behind a ServiceEntry?

Create a ServiceEntry with STATIC or DNS resolution (so Envoy has multiple endpoints), then pair it with a DestinationRule that configures consistentHash under trafficPolicy.loadBalancer. You can hash on an HTTP cookie, header, source IP, or query parameter. The ServiceEntry must expose multiple endpoints for consistent hashing to have any effect. See the “Sticky Sessions with ServiceEntry” section above for a complete YAML example.

How does ServiceEntry interact with NetworkPolicy and Istio AuthorizationPolicy?

A ServiceEntry does not bypass Kubernetes NetworkPolicy. If a NetworkPolicy blocks egress to the external IP, traffic will be dropped at the CNI level before Envoy can route it. Istio AuthorizationPolicy can also restrict which workloads are allowed to call specific ServiceEntry hosts. For defense in depth, use ServiceEntry for traffic management and observability, AuthorizationPolicy for workload-level access control, and NetworkPolicy for network-level enforcement.

Wrapping Up

ServiceEntry is one of the most practical Istio resources you will use in production. It transforms opaque outbound connections into managed, observable, policy-controlled traffic — and it does so without requiring changes to your application code. Start with the basics: create a ServiceEntry for each external dependency, set the correct resolution type, and pair it with a DestinationRule for connection limits and circuit breaking. As you mature, add VirtualServices for retries and timeouts, configure sticky sessions where needed, and enable the DNS proxy for seamless TCP service integration.

The pattern is always the same: register the service, apply policies, observe the traffic. Every external dependency you formalize with a ServiceEntry is one fewer blind spot in your production mesh.

Prometheus Alertmanager vs Grafana Alerting (2026): Architecture, Features, and When to Use Each

Most observability stacks that have been running in production for more than a year end up with alerting spread across two systems: Prometheus Alertmanager handling metric-based alerts and Grafana Alerting managing everything else. Engineers add a Slack integration in Grafana because it is convenient, then realize their Alertmanager routing tree already covers the same service. Before long, the on-call team receives duplicated pages, silencing rules live in two places, and nobody is confident which system is authoritative.

This is the alerting consolidation problem, and it affects teams of every size. The question is straightforward: should you standardize on Prometheus Alertmanager, move everything into Grafana Alerting, or deliberately run both? The answer depends on your datasource mix, your GitOps maturity, and how your organization manages on-call routing. This guide breaks down the architecture, features, and operational trade-offs of each system so you can make a deliberate choice instead of drifting into accidental complexity.

Architecture Overview

Before comparing features, you need to understand how each system fits into the alerting pipeline. They occupy the same logical space — “receive a condition, route a notification” — but they get there from fundamentally different starting points.

Prometheus Alertmanager: The Standalone Receiver

Alertmanager is a dedicated, standalone component in the Prometheus ecosystem. It does not evaluate alert rules itself. Instead, Prometheus (or any compatible sender like Thanos Ruler, Cortex, or Mimir Ruler) evaluates PromQL expressions and pushes firing alerts to the Alertmanager API. Alertmanager then handles deduplication, grouping, inhibition, silencing, and notification delivery.

# Simplified Prometheus → Alertmanager flow
#
# [Prometheus] --evaluates rules--> [firing alerts]
#        |
#        +--POST /api/v2/alerts--> [Alertmanager]
#                                      |
#                          +-----------+-----------+
#                          |           |           |
#                       [Slack]    [PagerDuty]  [Email]

The entire configuration lives in a single YAML file (alertmanager.yml). This includes the routing tree, receiver definitions, inhibition rules, and silence templates. There is no database, no UI-driven state — just a config file and an optional local storage directory for notification state and silences. This makes it trivially reproducible and ideal for GitOps workflows.

For high availability, you run multiple Alertmanager instances in a gossip-based cluster. They use a mesh protocol to share silence and notification state, ensuring that failover does not result in duplicate or lost notifications. The HA model is well-understood and has been stable for years.

Grafana Alerting: The Integrated Platform

Grafana Alerting (sometimes called “Grafana Unified Alerting,” introduced in Grafana 8 and significantly matured through Grafana 11 and 12) takes a different architectural approach. It embeds the entire alerting lifecycle — rule evaluation, state management, routing, and notification — inside the Grafana server process. Under the hood, it actually uses a fork of Alertmanager for the routing and notification layer, but this is an implementation detail that is invisible to users.

# Simplified Grafana Alerting flow
#
# [Grafana Server]
#   ├── Rule Evaluation Engine
#   │     ├── queries Prometheus
#   │     ├── queries Loki
#   │     ├── queries CloudWatch
#   │     └── queries any supported datasource
#   │
#   ├── Alert State Manager (internal)
#   │
#   └── Embedded Alertmanager (routing + notifications)
#           |
#           +-----------+-----------+
#           |           |           |
#        [Slack]    [PagerDuty]  [Email]

The critical distinction is that Grafana Alerting evaluates alert rules itself, querying any configured datasource — not just Prometheus. It can fire alerts based on Loki log queries, Elasticsearch searches, CloudWatch metrics, PostgreSQL queries, or any of the 100+ datasource plugins available in Grafana. Rule definitions, contact points, notification policies, and mute timings are stored in the Grafana database (or provisioned via YAML files and the Grafana API).
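
For file-based provisioning, a contact point definition looks roughly like the sketch below; it follows Grafana’s alerting provisioning file format, and the name, uid, and webhook URL are placeholders:

# provisioning/alerting/contact-points.yaml — sketch only
apiVersion: 1
contactPoints:
  - orgId: 1
    name: team-slack
    receivers:
      - uid: team-slack-01
        type: slack
        settings:
          url: https://hooks.slack.com/services/T00/B00/XXX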

For high availability in self-hosted environments, Grafana Alerting relies on a shared database and a peer-discovery mechanism between Grafana instances. In Grafana Cloud, HA is fully managed by Grafana Labs.

Feature Comparison

The following table provides a side-by-side comparison of the capabilities that matter most in production alerting systems. Both systems are mature, but they prioritize different things.

| Feature | Prometheus Alertmanager | Grafana Alerting |
| --- | --- | --- |
| Datasources | Prometheus-compatible only (Prometheus, Thanos, Mimir, VictoriaMetrics) | Any Grafana datasource (Prometheus, Loki, Elasticsearch, CloudWatch, SQL databases, etc.) |
| Rule evaluation | External (Prometheus/Ruler evaluates rules and pushes alerts) | Built-in (Grafana evaluates rules directly) |
| Routing tree | Hierarchical YAML-based routing with match/match_re, continue, group_by | Notification policies with label matchers, nested policies, mute timings |
| Grouping | Full support via group_by, group_wait, group_interval | Full support via notification policies with equivalent controls |
| Inhibition | Native inhibition rules (suppress alerts when a related alert is firing) | Supported since Grafana 10.3 but less flexible than Alertmanager |
| Silencing | Label-based silences via API or UI, time-limited | Mute timings (recurring schedules) and silences (ad-hoc, label-based) |
| Notification channels | Email, Slack, PagerDuty, OpsGenie, VictorOps, webhook, WeChat, Telegram, SNS, Webex | All of the above plus Teams, Discord, Google Chat, LINE, Threema, Oncall, and more via contact points |
| Templating | Go templates in notification config | Go templates with access to Grafana template variables and functions |
| Multi-tenancy | Not built-in; achieved via separate instances or Mimir Alertmanager | Native multi-tenancy via Grafana organizations and RBAC |
| High availability | Gossip-based cluster (peer mesh, well-proven) | Database-backed HA with peer discovery between Grafana instances |
| Configuration model | Single YAML file, fully declarative | UI + API + provisioning YAML files, stored in database |
| GitOps compatibility | Excellent — config file lives in version control natively | Possible via provisioning files or Terraform provider, but requires extra tooling |
| External alert sources | Any system that can POST to the Alertmanager API | Supported via the Grafana Alerting API (external alerts can be pushed) |
| Managed service | Available via Grafana Cloud (as Mimir Alertmanager), Amazon Managed Prometheus | Available via Grafana Cloud |

Alertmanager Strengths

Alertmanager has been a production staple since 2015. Over a decade of use across thousands of organizations has made it one of the most battle-tested components in the CNCF ecosystem. Here is where it genuinely excels.

Declarative, GitOps-Native Configuration

The entire Alertmanager configuration is a single YAML file. There is no hidden state in a database, no click-driven configuration that someone forgets to document. You check it into Git, review it in a pull request, and deploy it through your CI/CD pipeline like any other infrastructure code. This is a significant operational advantage for teams that have invested in GitOps.

# alertmanager.yml — everything in one file
global:
  resolve_timeout: 5m
  slack_api_url: "https://hooks.slack.com/services/T00/B00/XXX"

route:
  receiver: platform-team
  group_by: [alertname, cluster, namespace]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
      group_wait: 10s
    - match_re:
        team: "^(payments|checkout)$"
      receiver: payments-slack
      continue: true

receivers:
  - name: platform-team
    slack_configs:
      - channel: "#platform-alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: ""
  - name: payments-slack
    slack_configs:
      - channel: "#payments-oncall"

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]

Every change is auditable. Rollbacks are a git revert away. This matters enormously when you are debugging why an alert did not fire at 3 AM.

Lightweight and Single-Purpose

Alertmanager does one thing: route and deliver notifications. It has no dashboard, no query engine, no datasource plugins. This single-purpose design makes it operationally simple. Resource consumption is minimal — a small Alertmanager instance handles thousands of active alerts on a few hundred megabytes of memory. It starts in milliseconds and requires almost no maintenance.

Mature Inhibition and Routing

Alertmanager’s inhibition rules are first-class citizens. You can suppress downstream warnings when a critical alert is already firing, preventing alert storms from overwhelming your on-call team. The hierarchical routing tree with continue flags allows for nuanced delivery: send to the team channel AND escalate to PagerDuty simultaneously, with different grouping strategies at each level.

Proven High Availability

The gossip-based HA cluster has been stable for years. Running three Alertmanager replicas behind a load balancer (or using Kubernetes service discovery) gives you reliable notification delivery without shared storage. The protocol handles deduplication across instances automatically, which is the hardest part of distributed alerting.

Grafana Alerting Strengths

Grafana Alerting has matured considerably since its rocky introduction in Grafana 8. By Grafana 11 and 12, it has become a legitimate production alerting platform with capabilities that Alertmanager cannot match on its own.

Multi-Datasource Alert Rules

This is Grafana Alerting’s strongest differentiator. You can write alert rules that query Loki for error log spikes, CloudWatch for AWS resource utilization, Elasticsearch for application errors, or a PostgreSQL database for business metrics — all from the same alerting system. If your observability stack includes more than just Prometheus, this eliminates the need for separate alerting tools per datasource.

# Grafana alert rule provisioning example — alerting on Loki log errors
apiVersion: 1
groups:
  - orgId: 1
    name: application-errors
    folder: Production
    interval: 1m
    rules:
      - uid: loki-error-spike
        title: "High error rate in payment service"
        condition: C
        data:
          - refId: A
            datasourceUid: loki-prod
            model:
              expr: 'sum(rate({app="payment-service"} |= "ERROR" [5m]))'
          - refId: B
            datasourceUid: "__expr__"
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [10]
        for: 5m
        labels:
          severity: warning
          team: payments

This is something Alertmanager simply cannot do. Alertmanager only receives pre-evaluated alerts — it has no concept of datasources or query execution.

Unified UI for Alert Management

Grafana provides a single pane of glass for alert rule creation, visualization, notification policy management, contact point configuration, and silence management. For teams where not every engineer is comfortable editing YAML routing trees, the visual notification policy editor significantly reduces the barrier to entry. You can see the state of every alert rule, its evaluation history, and the exact notification path it will take — all without leaving the browser.

Native Multi-Tenancy and RBAC

Grafana’s organization model and role-based access control extend naturally to alerting. Different teams can manage their own alert rules, contact points, and notification policies within their organization or folder scope, without seeing or interfering with other teams. Achieving this with standalone Alertmanager requires either running separate instances per tenant or using Mimir’s multi-tenant Alertmanager.

Mute Timings and Richer Scheduling

While Alertmanager supports silences (ad-hoc, time-limited suppressions), Grafana Alerting adds mute timings — recurring time-based windows where notifications are suppressed. This is useful for scheduled maintenance windows, business-hours-only alerting, or suppressing non-critical alerts on weekends. Alertmanager can also express recurring windows via time_intervals and mute_time_intervals in its YAML configuration (available since v0.22), but they are config-only: there is no UI for creating or adjusting them, and ad-hoc silences remain one-off.
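
For comparison, a recurring window in Alertmanager looks like this in alertmanager.yml (a minimal sketch; the interval name and route matcher are illustrative):

time_intervals:
  - name: weekend
    time_intervals:
      - weekdays: ['saturday', 'sunday']

route:
  routes:
    - match:
        severity: warning
      receiver: platform-team
      mute_time_intervals: [weekend]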

Grafana Cloud as a Managed Option

For teams that want to avoid managing alerting infrastructure entirely, Grafana Cloud provides a fully managed Grafana Alerting stack. This includes HA, state persistence, and notification delivery without any self-hosted components. The Grafana Cloud alerting stack also includes a managed Mimir Alertmanager, which means you can use Prometheus-native alerting rules if you prefer that model while still benefiting from the managed infrastructure.

When to Use Prometheus Alertmanager

Alertmanager is the right choice when the following conditions describe your environment:

  • Your metrics stack is Prometheus-native. If all your alert rules are PromQL expressions evaluated by Prometheus, Thanos Ruler, or Mimir Ruler, Alertmanager is the natural fit. There is no added value in routing those alerts through Grafana.
  • GitOps is non-negotiable. If every infrastructure change must go through a pull request and be fully declarative, Alertmanager’s single-file configuration model is significantly easier to manage than Grafana’s database-backed state. Tools like amtool provide config validation in CI pipelines (see the snippet after this list).
  • You need fine-grained routing with inhibition. Complex routing trees with multiple levels of grouping, inhibition rules, and continue flags are more naturally expressed in Alertmanager’s YAML format. The routing logic has been stable and well-documented for years.
  • You run microservices with per-team routing. If each team owns its routing subtree and the routing logic is complex, Alertmanager’s hierarchical model scales better than UI-driven configuration. Teams can own their section of the config file via CODEOWNERS in Git.
  • You want minimal operational overhead. Alertmanager is a single binary with minimal resource requirements. There is no database to back up, no migrations to run, and no UI framework to keep updated.
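
The amtool validation mentioned above fits naturally into a CI step — a minimal sketch (the file path and test labels are illustrative):

# Fail the pipeline if the config does not parse
amtool check-config alertmanager.yml

# Verify that a given label set is routed to the receiver you expect
amtool config routes test --config.file=alertmanager.yml severity=critical team=payments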

When to Use Grafana Alerting

Grafana Alerting is the right choice when these conditions apply:

  • You alert on more than just Prometheus metrics. If you need alert rules based on Loki logs, Elasticsearch queries, CloudWatch metrics, or database queries, Grafana Alerting is the only option that handles all of these natively. The alternative is running separate alerting tools per datasource, which is worse.
  • Your team prefers UI-driven configuration. Not every engineer wants to edit YAML routing trees. If your organization values a visual interface for managing alerts, contact points, and notification policies, Grafana’s UI is a major productivity advantage.
  • You are using Grafana Cloud. If you are already on Grafana Cloud, using its built-in alerting is the path of least resistance. You get HA, managed notification delivery, and a unified experience without running any additional infrastructure.
  • Multi-tenancy is a requirement. If multiple teams need isolated alerting configurations with RBAC, Grafana’s native organization and folder-based access model is significantly easier to set up than running per-tenant Alertmanager instances.
  • You want mute timings for recurring maintenance windows. If your team regularly needs to suppress alerts during scheduled windows (deploy windows, batch processing hours, weekend non-critical suppression), Grafana’s mute timings feature is more ergonomic than creating and managing recurring silences in Alertmanager.

Running Both Together: The Hybrid Pattern

In practice, many production environments run both Alertmanager and Grafana Alerting. This is not necessarily a mistake — it can be a deliberate architectural choice when done with clear boundaries.

Common Hybrid Architecture

The most common pattern looks like this:

  • Prometheus Alertmanager handles all metric-based alerts. PromQL rules are evaluated by Prometheus or a long-term storage ruler (Thanos, Mimir). Alertmanager owns routing, grouping, and notification for these alerts.
  • Grafana Alerting handles non-Prometheus alerts: log-based alerts from Loki, business metrics from SQL datasources, and cross-datasource correlation rules.

The key to making this work without chaos is establishing clear ownership rules:

# Ownership boundaries for hybrid alerting
#
# Prometheus Alertmanager owns:
#   - All PromQL-based alert rules
#   - Infrastructure alerts (node, kubelet, etcd, CoreDNS)
#   - Application SLO/SLI alerts based on metrics
#
# Grafana Alerting owns:
#   - Log-based alert rules (Loki, Elasticsearch)
#   - Business metric alerts (SQL datasources)
#   - Cross-datasource correlation rules
#   - Alerts for teams that prefer UI-driven management
#
# Shared:
#   - Contact points / receivers use the same Slack channels and PagerDuty services
#   - On-call rotations are managed externally (PagerDuty, Grafana OnCall)

Both systems can deliver to the same notification channels. The critical discipline is ensuring that silencing and maintenance windows are applied in both systems when needed. This is the primary operational cost of the hybrid approach.
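
In practice that means a maintenance silence has to be created twice — once against Alertmanager (shown below with amtool; the URL and matchers are illustrative) and once in Grafana, either in the UI or via its Alertmanager-compatible API:

amtool silence add \
  --alertmanager.url=http://alertmanager.monitoring.svc:9093 \
  --comment="DB maintenance window" \
  --duration=2h \
  alertname=HighErrorRate namespace=payments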

Grafana as a Viewer for Alertmanager

Even if you use Alertmanager exclusively for routing and notification, Grafana can serve as a read-only viewer. Grafana natively supports connecting to an external Alertmanager datasource, allowing you to see firing alerts, active silences, and alert groups in the Grafana UI. This gives you the operational visibility of Grafana without moving your alerting logic into it.

# Grafana datasource provisioning for external Alertmanager
apiVersion: 1
datasources:
  - name: Alertmanager
    type: alertmanager
    url: http://alertmanager.monitoring.svc:9093
    access: proxy
    jsonData:
      implementation: prometheus

Migration Considerations

If you are moving from one system to the other, here are the practical considerations to plan for.

Migrating from Alertmanager to Grafana Alerting

  • Rule conversion. Your PromQL-based recording and alerting rules defined in Prometheus rule files need to be recreated as Grafana alert rules. Grafana provides a migration tool that can import Prometheus-format rules, but complex expressions may need manual adjustment.
  • Routing tree translation. Alertmanager’s hierarchical routing tree maps to Grafana’s notification policies, but the semantics are not identical. Test the notification routing thoroughly — the continue flag behavior and default routes may differ.
  • Silence and inhibition migration. Active silences are ephemeral and do not need migration. Inhibition rules need to be recreated in Grafana’s format. Recurring maintenance windows should be converted to mute timings.
  • Run in parallel first. The safest migration strategy is to run both systems in parallel for two to four weeks, sending notifications from both, then cutting over when you have confidence in the Grafana setup. Accept the temporary noise of duplicate alerts — it is far cheaper than missing a critical page during migration.

Migrating from Grafana Alerting to Alertmanager

  • Datasource limitation. You can only migrate alerts that are based on Prometheus-compatible datasources. Alerts querying Loki, Elasticsearch, or SQL datasources have no equivalent in Alertmanager — you will need an alternative solution for those.
  • Rule export. Export Grafana alert rules and convert them to Prometheus-format rule files. The Grafana API (GET /api/v1/provisioning/alert-rules) provides structured output that can be transformed with a script (see the sketch after this list).
  • Contact point mapping. Map Grafana contact points to Alertmanager receivers. The configuration format is different, but the concepts are equivalent.
  • State loss. Alertmanager does not carry over Grafana’s alert evaluation history. You start fresh. Plan for a brief period where alerts may re-fire as Prometheus evaluates rules that were previously managed by Grafana.
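
A minimal sketch of that export step (the hostname and service-account token are placeholders for your Grafana instance):

# Dump all Grafana-managed alert rules as JSON
curl -s -H "Authorization: Bearer $GRAFANA_SA_TOKEN" \
  https://grafana.example.com/api/v1/provisioning/alert-rules \
  -o grafana-alert-rules.json

# A conversion script then maps each rule's query and labels into
# Prometheus rule-file 'groups:' entries for review.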

Decision Framework

If you want a quick decision path, use this framework:

Start here:
│
├── Do you alert on non-Prometheus datasources (Loki, ES, SQL, CloudWatch)?
│   ├── YES → Grafana Alerting (at least for those datasources)
│   └── NO ↓
│
├── Is GitOps/declarative config a hard requirement?
│   ├── YES → Alertmanager
│   └── NO ↓
│
├── Do you need multi-tenancy with RBAC?
│   ├── YES → Grafana Alerting (or Mimir Alertmanager)
│   └── NO ↓
│
├── Are you on Grafana Cloud?
│   ├── YES → Grafana Alerting (path of least resistance)
│   └── NO ↓
│
└── Default → Alertmanager (simpler, lighter, well-proven)

For many teams, the honest answer is “both” — Alertmanager for the Prometheus-native metric pipeline, Grafana Alerting for everything else. That is a valid architecture as long as the ownership boundaries are documented and the on-call team knows where to look.

Frequently Asked Questions

What is the difference between Alertmanager and Grafana Alerting?

Prometheus Alertmanager is a standalone notification routing engine that receives pre-evaluated alerts from Prometheus and delivers them to receivers like Slack, PagerDuty, or email. It does not evaluate alert rules itself. Grafana Alerting is an integrated alerting platform embedded in Grafana that both evaluates alert rules (querying any supported datasource) and handles notification routing. Alertmanager is configured entirely via YAML, while Grafana Alerting offers a UI, API, and file-based provisioning. The fundamental difference is scope: Alertmanager handles only the routing and notification phase, while Grafana Alerting handles the full lifecycle from query evaluation to notification.

Can Grafana Alerting replace Prometheus Alertmanager?

Yes, for many use cases. Grafana Alerting can evaluate PromQL rules directly against your Prometheus datasource, so you do not strictly need a separate Alertmanager instance. However, there are scenarios where Alertmanager remains the better choice: heavily GitOps-driven environments, teams that need Alertmanager’s mature inhibition rules, or architectures where Prometheus rule evaluation happens externally (Thanos Ruler, Mimir Ruler) and a dedicated Alertmanager is already in the pipeline. If your only datasource is Prometheus and you value declarative configuration, Alertmanager is still simpler and lighter.

Is Grafana Alertmanager the same as Prometheus Alertmanager?

Not exactly. Grafana Alerting uses a fork of the Prometheus Alertmanager code internally for its notification routing engine, but it is not the same product. The Grafana “Alertmanager” you see in the UI is a managed, embedded component with a different configuration interface (notification policies, contact points, mute timings) compared to the standalone Prometheus Alertmanager (routing tree, receivers, inhibition rules in YAML). Grafana can also connect to an external Prometheus Alertmanager as a datasource, which adds to the confusion. When people refer to “Grafana Alertmanager,” they usually mean the embedded routing engine inside Grafana Alerting.

What are the best alternatives to Prometheus Alertmanager?

The most direct alternative is Grafana Alerting, which can receive and route Prometheus alerts while also supporting other datasources. Beyond that, other options include: Grafana OnCall for on-call management and escalation (often used alongside Alertmanager rather than replacing it), PagerDuty or Opsgenie as managed incident response platforms that can receive alerts directly, Keep as an open-source AIOps alert management platform, and Mimir Alertmanager for multi-tenant environments running Grafana Mimir. The choice depends on whether you need an Alertmanager replacement (routing and notification) or a complementary tool for escalation and incident response.

Should I use Prometheus alerts or Grafana alerts for Kubernetes monitoring?

For Kubernetes monitoring specifically, the kube-prometheus-stack (which includes Prometheus, Alertmanager, and a comprehensive set of pre-built alerting rules) remains the industry standard. These rules are PromQL-based and are designed to work with Alertmanager. If you are deploying kube-prometheus-stack, using Alertmanager for metric-based alerts is the straightforward choice. Add Grafana Alerting on top if you also need to alert on logs (via Loki) or non-metric datasources. For Kubernetes-specific monitoring, the combination of Prometheus rules with Alertmanager for routing is the most mature and well-supported path.

Final Thoughts

The Alertmanager vs Grafana Alerting debate is not really about which tool is better — it is about which tool fits your operational context. Alertmanager is simpler, lighter, and more GitOps-friendly. Grafana Alerting is more versatile, more accessible to UI-oriented teams, and the only option if you need multi-datasource alerting. Running both is perfectly valid when the boundaries are clear.

The worst outcome is not picking the “wrong” tool. The worst outcome is running both accidentally, with overlapping coverage, duplicated notifications, and no clear ownership. Whatever you choose, document the decision, define the ownership boundaries, and make sure your on-call team knows exactly where to go when they need to silence an alert at 3 AM.

Building a Kubernetes Migration Framework: Lessons from Ingress-NGINX


The recent announcement regarding the deprecation of the Ingress-NGINX controller sent a ripple through the Kubernetes community. For many organizations, it’s the first major deprecation of a foundational, widely-adopted ecosystem component. While the immediate reaction is often tactical—”What do we replace it with?”—the more valuable long-term question is strategic: “How do we systematically manage this and future migrations?”

This event isn’t an anomaly; it’s a precedent. As Kubernetes matures, core add-ons, APIs, and patterns will evolve or sunset. Platform engineering teams need a repeatable, low-risk framework for navigating these changes. Drawing from the Ingress-NGINX transition and established deployment management principles, we can abstract a robust Kubernetes Migration Framework applicable to any major component, from service meshes to CSI drivers.

Why Ad-Hoc Migrations Fail in Production

Attempting a “big bang” replacement or a series of manual, one-off changes is a recipe for extended downtime, configuration drift, and undetected regression. Production Kubernetes environments are complex systems with deep dependencies:

  • Interdependent Workloads: Multiple applications often share the same ingress controller, relying on specific annotations, custom snippets, or behavioral quirks.
  • Automation and GitOps Dependencies: Helm charts, Kustomize overlays, and ArgoCD/Flux manifests are tightly coupled to the existing component’s API and schema.
  • Observability and Security Integration: Monitoring dashboards, logging parsers, and security policies are tuned for the current implementation.
  • Knowledge Silos: Tribal knowledge about workarounds and specific configurations isn’t documented.

A structured framework mitigates these risks by enforcing discipline, creating clear validation gates, and ensuring the capability to roll back at any point.

The Four-Phase Kubernetes Migration Framework

This framework decomposes the migration into four distinct phases: Assessment, Parallel Run, Cutover, and Decommission. Each phase has defined inputs, activities, and exit criteria.

Phase 1: Deep Assessment & Dependency Mapping

Before writing a single line of new configuration, understand the full scope. The goal is to move from “we use Ingress-NGINX” to a precise inventory of how it’s used.

  • Inventory All Ingress Resources: Use kubectl get ingress --all-namespaces as a starting point, but go deeper.
  • Analyze Annotation Usage: Script an analysis to catalog every annotation in use (e.g., nginx.ingress.kubernetes.io/rewrite-target, nginx.ingress.kubernetes.io/configuration-snippet). This reveals functional dependencies (a one-liner for this follows the list).
  • Map to Backend Services: For each Ingress, identify the backend Services and Namespaces. This highlights critical applications and potential blast radius.
  • Review Customizations: Document any custom ConfigMaps for main NGINX configuration, custom template patches, or modifications to the controller deployment itself.
  • Evaluate Alternatives: Based on the inventory, evaluate candidate replacements (e.g., Gateway API with a compatible implementation, another Ingress controller like Emissary-ingress or Traefik). The Google Cloud migration framework provides a useful decision tree for ingress-specific migrations.
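
A quick way to build that annotation inventory (assumes kubectl and jq are available on the workstation or CI runner):

kubectl get ingress --all-namespaces -o json \
  | jq -r '.items[].metadata.annotations // {} | keys[]' \
  | grep '^nginx.ingress.kubernetes.io' \
  | sort | uniq -c | sort -rn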

The output of this phase is a migration manifesto: a concrete list of what needs to be converted, grouped by complexity and criticality.

Phase 2: Phased Rollout & Parallel Run

This is the core of a low-risk migration. Instead of replacing, you run the new and old systems in parallel, shifting traffic gradually. For ingress, this often means installing the new controller alongside the old one.

  • Dual Installation: Deploy the new ingress controller in the same cluster, configured with a distinct ingress class (e.g., ingressClassName: gateway vs. nginx).
  • Create Canary Ingress Resources: For a low-risk application, create a parallel Ingress or Gateway resource pointing to the new controller. Use techniques like managed deployments with canary patterns to control exposure.
    # Example: A new Gateway API HTTPRoute for a canary service
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: app-canary
    spec:
      parentRefs:
        - name: company-gateway
      rules:
        - backendRefs:
            - name: app-service
              port: 8080
              weight: 10   # Start with 10% of traffic

  • Validate Equivalency: Use traffic mirroring (if supported) or direct synthetic testing against both ingress paths. Compare logs, response headers, latency, and error rates.
  • Iterate and Expand: Gradually increase traffic weight or add more applications to the new stack, group by group, based on the assessment from Phase 1.

This phase relies heavily on your observability stack. Dashboards comparing error rates, latency (p50, p99), and throughput between the old and new paths are essential.

Phase 3: Validation & Automated Cutover

The cutover is not a manual event. It’s the final step in a validation process.

  • Define Validation Tests: Create a suite of tests that must pass before full cutover. This includes:
    • Smoke tests for all critical user journeys.
    • Load tests to verify performance under expected traffic patterns.
    • Security scan validation (e.g., no unintended ports open).
    • Compliance checks (e.g., specific headers are present).
  • Automate the Switch: For each application, the cutover is ultimately a change in its Ingress or Gateway resource. This should be done via your GitOps pipeline. Update the source manifests (e.g., change the ingressClassName), merge, and let automation apply it. This ensures the state is declarative and recorded (a sketch follows this list).
  • Maintain Rollback Capacity: The old system must remain operational and routable (with reduced capacity) during this phase. The GitOps rollback is simply reverting the manifest change.
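
The cutover commit itself is typically a one-line change in the application's manifest (class names below are illustrative):

# ingress.yaml — applied through the GitOps pipeline
spec:
  ingressClassName: gateway   # previously: nginx — rollback is a git revert of this line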

Phase 4: Observability & Decommission

Once all traffic is successfully migrated and validated over a sustained period (e.g., 72 hours), you can decommission the old component.

  • Monitor Aggressively: Keep a close watch on all key metrics for at least one full business cycle (a week).
  • Remove Old Resources: Delete the old controller’s Deployment, Service, ConfigMaps, and CRDs (if no longer needed).
  • Clean Up Auxiliary Artifacts: Remove old RBAC bindings, service accounts, and any custom monitoring alerts or dashboards specific to the old component.
  • Document Lessons Learned: Update runbooks and architecture diagrams. Note any surprises, gaps in the process, or validation tests that were particularly valuable.

Key Principles for a Resilient Framework

Beyond the phases, these principles should guide your framework’s design:

  • Always Maintain Rollback Capability: Every step should be reversible with minimal disruption. This is a core tenet of managing Kubernetes deployments.
  • Leverage GitOps for State Management: All desired state changes (Ingress resources, controller deployments) must flow through version-controlled manifests. This provides an audit trail, consistency, and the simplest rollback mechanism (git revert).
  • Validate with Production Traffic Patterns: Synthetic tests are insufficient. Use canary weights and traffic mirroring to validate with real user traffic in a controlled manner.
  • Communicate Transparently: Platform teams should maintain a clear migration status page for internal stakeholders, showing which applications have been migrated, which are in progress, and the overall timeline.

Conclusion: Building a Migration-Capable Platform

The deprecation of Ingress-NGINX is a wake-up call. The next major change is a matter of “when,” not “if.” By investing in a structured migration framework now, platform teams transform a potential crisis into a manageable, repeatable operational procedure.

This framework—Assess, Run in Parallel, Validate, and Decommission—abstracts the specific lessons from the ingress migration into a generic pattern. It can be applied to migrating from PodSecurityPolicies to Pod Security Standards, from a deprecated CSI driver, or from one service mesh to another. The tools (GitOps, canary deployments, observability) are already in your stack. The value is in stitching them together into a disciplined process that ensures platform evolution doesn’t compromise platform stability.

Start by documenting this framework as a runbook template. Then, apply it to your next significant component update, even a minor one, to refine the process. When the next major deprecation announcement lands in your inbox, you’ll be ready.

Kubernetes Dashboard Alternatives in 2026: Best Web UI Options After Official Retirement


The Kubernetes Dashboard, once a staple tool for cluster visualization and management, has been officially archived and is no longer maintained. For many teams who relied on its straightforward web interface to monitor pods, deployments, and services, this retirement marks the end of an era. But it also signals something important: the Kubernetes ecosystem has evolved far beyond what the original dashboard was designed to handle.

Today’s Kubernetes environments are multi-cluster by default, driven by GitOps principles, guarded by strict RBAC policies, and operated by platform teams serving dozens or hundreds of developers. The operating model has simply outgrown the traditional dashboard’s capabilities.

So what comes next? If you’ve been using Kubernetes Dashboard and need to migrate to something more capable, or if you’re simply curious about modern alternatives, this guide will walk you through the best options available in 2026.

Why Kubernetes Dashboard Was Retired

The Kubernetes Dashboard served its purpose well in the early days of Kubernetes adoption. It provided a simple, browser-based interface for viewing cluster resources without needing to master kubectl commands. But as Kubernetes matured, several limitations became apparent:

  • Single-cluster focus: Most organizations now manage multiple clusters across different environments, but the dashboard was designed for viewing one cluster at a time
  • Limited RBAC capabilities: Modern platform teams need fine-grained access controls at the cluster, namespace, and workload levels
  • No GitOps integration: Contemporary workflows rely on declarative configuration and continuous deployment pipelines
  • Minimal observability: Beyond basic resource listing, the dashboard lacked advanced monitoring, alerting, and troubleshooting features
  • Security concerns: The dashboard’s architecture required careful configuration to avoid exposing cluster access

The community recognized these constraints, and the official recommendation now points toward Headlamp as the successor. But Headlamp isn’t the only option worth considering.

Top Kubernetes Dashboard Alternatives for 2026

1. Headlamp: The Official Successor

Headlamp is now the official recommendation from the Kubernetes SIG UI group. It’s a CNCF Sandbox project developed by Kinvolk (now part of Microsoft) that brings a modern approach to cluster visualization.

Key Features:

  • Clean, intuitive interface built with modern web technologies
  • Extensive plugin system for customization
  • Works both as an in-cluster deployment and desktop application
  • Uses your existing kubeconfig file for authentication
  • OpenID Connect (OIDC) support for enterprise SSO
  • Read and write operations based on RBAC permissions

Installation Options:

# Using Helm
helm repo add headlamp https://kubernetes-sigs.github.io/headlamp/
helm install my-headlamp headlamp/headlamp --namespace kube-system

# As Minikube addon
minikube addons enable headlamp
minikube service headlamp -n headlamp

Headlamp excels at providing a familiar dashboard experience while being extensible enough to grow with your needs. The plugin architecture means you can customize it for your specific workflows without waiting for upstream changes.

Best for: Teams transitioning from Kubernetes Dashboard who want a similar experience with modern features and official backing.

2. Portainer: Enterprise Multi-Cluster Management

Portainer has evolved from a Docker management tool into a comprehensive Kubernetes platform. It’s particularly strong when you need to manage multiple clusters from a single interface. We have already covered Portainer in detail in a separate post, so you can take a look at that as well.

Key Features:

  • Multi-cluster management dashboard
  • Enterprise-grade RBAC with fine-grained access controls
  • Visual workload deployment and scaling
  • GitOps integration support
  • Comprehensive audit logging
  • Support for both Kubernetes and Docker environments

Best for: Organizations managing multiple clusters across different environments who need enterprise RBAC and centralized control.

3. Skooner (formerly K8Dash): Lightweight and Fast

Skooner keeps things simple. If you appreciated the straightforward nature of the original Kubernetes Dashboard, Skooner delivers a similar philosophy with a cleaner, faster interface.

Key Features:

  • Fast, real-time updates
  • Clean and minimal interface
  • Easy installation with minimal configuration
  • Real-time metrics visualization
  • Built-in OIDC authentication

Best for: Teams that want a simple, no-frills dashboard without complex features or steep learning curves.

4. Devtron: Complete DevOps Platform

Devtron goes beyond simple cluster visualization to provide an entire application delivery platform built on Kubernetes.

Key Features:

  • Multi-cluster application deployment
  • Built-in CI/CD pipelines
  • Advanced security scanning and compliance
  • Application-centric view rather than resource-centric
  • Support for seven different SSO providers
  • Chart store for Helm deployments

Best for: Platform teams building internal developer platforms who need comprehensive deployment pipelines alongside cluster management.

5. KubeSphere: Full-Stack Container Platform

KubeSphere positions itself as a distributed operating system for cloud-native applications, using Kubernetes as its kernel.

Key Features:

  • Multi-tenant architecture
  • Integrated DevOps workflows
  • Service mesh integration (Istio)
  • Multi-cluster federation
  • Observability and monitoring built-in
  • Plug-and-play architecture for third-party integrations

Best for: Organizations building comprehensive container platforms who want an opinionated, batteries-included experience.

6. Rancher: Battle-Tested Enterprise Platform

Rancher from SUSE has been in the Kubernetes management space for years and offers one of the most mature platforms available.

Key Features:

  • Manage any Kubernetes cluster (EKS, GKE, AKS, on-premises)
  • Centralized authentication and RBAC
  • Built-in monitoring with Prometheus and Grafana
  • Application catalog with Helm charts
  • Policy management and security scanning

Best for: Enterprise organizations managing heterogeneous Kubernetes environments across multiple cloud providers.

7. Octant: Developer-Focused Cluster Exploration

Octant (originally developed by VMware) takes a developer-centric approach to cluster visualization with a focus on understanding application architecture.

Key Features:

  • Plugin-based extensibility
  • Resource relationship visualization
  • Port forwarding directly from the UI
  • Log streaming
  • Context-aware resource inspection

Best for: Application developers who need to understand how their applications run on Kubernetes without being cluster administrators.

Desktop and CLI Alternatives Worth Considering

While this article focuses on web-based dashboards, it’s worth noting that not everyone needs a browser interface. Some of the most powerful Kubernetes management tools work as desktop applications or terminal UIs.

If you’re considering client-side tools, desktop clients such as FreeLens and terminal UIs such as K9s are covered in dedicated articles on my blog.

These client tools offer advantages that web dashboards can’t match: offline access, better performance, and tighter integration with your local development workflow. FreeLens, in particular, has emerged as the lowest-risk choice for most organizations looking for a desktop Kubernetes IDE.

Choosing the Right Alternative for Your Team

With so many options available, how do you choose? Here’s a decision framework:

Choose Headlamp if:

  • You want the officially recommended path forward
  • You need a lightweight dashboard similar to what you had before
  • Plugin extensibility is important for future customization
  • You prefer CNCF-backed open source projects

Choose Portainer if:

  • You manage multiple Kubernetes clusters
  • Enterprise RBAC is a critical requirement
  • You also work with Docker environments
  • Visual deployment tools would benefit your team

Choose Skooner if:

  • You want the simplest possible alternative
  • Your needs are straightforward: view and manage resources
  • You don’t need advanced features or multi-cluster support

Choose Devtron or KubeSphere if:

  • You’re building an internal developer platform
  • You need integrated CI/CD pipelines
  • Application-centric workflows matter more than resource-centric views

Choose Rancher if:

  • You’re managing enterprise-scale, multi-cloud Kubernetes
  • You need battle-tested stability and vendor support
  • Policy management and compliance are critical

Consider desktop tools like FreeLens if:

  • You work primarily from a local development environment
  • You need offline access to cluster information
  • You prefer richer desktop application experiences

Migration Considerations

If you’re actively using Kubernetes Dashboard today, here’s what to think about when migrating:

  1. Authentication method: Most modern alternatives support OIDC/SSO, but verify your specific identity provider is supported
  2. RBAC policies: Review your existing ClusterRole and RoleBinding configurations to ensure they translate properly
  3. Custom workflows: If you’ve built automation around Dashboard URLs or specific features, you’ll need to adapt these
  4. User training: Even similar-looking alternatives have different UIs and workflows; budget time for team training
  5. Ingress configuration: If you expose your dashboard externally, you’ll need to reconfigure ingress rules

The Future of Kubernetes UI Management

The retirement of Kubernetes Dashboard isn’t a step backward—it’s recognition that the ecosystem has matured. Modern platforms need to handle multi-cluster management, GitOps workflows, comprehensive observability, and sophisticated RBAC out of the box.

The alternatives listed here represent different philosophies about what a Kubernetes interface should be:

  • Minimalist dashboards (Headlamp, Skooner) that stay close to the original vision
  • Enterprise platforms (Portainer, Rancher) that centralize multi-cluster management
  • Developer platforms (Devtron, KubeSphere) that integrate the entire application lifecycle
  • Desktop experiences (FreeLens, OpenLens) that bring IDE-like capabilities

The right choice depends on your team’s size, your infrastructure complexity, and whether you’re managing platforms or building applications. For most teams migrating from Kubernetes Dashboard, starting with Headlamp makes sense—it’s officially recommended, actively maintained, and provides a familiar experience. From there, you can evaluate whether you need to scale up to more comprehensive platforms.

Whatever you choose, the good news is that the Kubernetes ecosystem in 2026 offers more sophisticated, capable, and secure dashboard alternatives than ever before.

Frequently Asked Questions (FAQ)

Is Kubernetes Dashboard officially deprecated or just unmaintained?

The Kubernetes Dashboard has been officially archived by the Kubernetes project and is no longer actively maintained. While it may still run in existing clusters, it no longer receives security updates, bug fixes, or new features, making it unsuitable for production use in modern environments.

What is the official replacement for Kubernetes Dashboard?

Headlamp is the officially recommended successor by the Kubernetes SIG UI group. It provides a modern web interface, supports plugins, integrates with existing kubeconfig files, and aligns with current Kubernetes security and RBAC best practices.

Is Headlamp production-ready for enterprise environments?

Yes. Headlamp supports OIDC authentication, fine-grained RBAC, and can run either in-cluster or as a desktop application. While still evolving, it is actively maintained and suitable for many production use cases, especially when combined with proper access controls.

Are there lightweight alternatives similar to the old Kubernetes Dashboard?

Yes. Skooner is a lightweight, fast alternative that closely mirrors the simplicity of the original Kubernetes Dashboard while offering a cleaner UI and modern authentication options like OIDC.

Do I still need a web-based dashboard to manage Kubernetes?

Not necessarily. Many teams prefer desktop or CLI-based tools such as FreeLens, OpenLens, or K9s. These tools often provide better performance, offline access, and deeper integration with developer workflows compared to browser-based dashboards.

Is it safe to expose Kubernetes dashboards over the internet?

Exposing any Kubernetes dashboard publicly requires extreme caution. If external access is necessary, always use:

  • Strong authentication (OIDC / SSO)
  • Strict RBAC policies
  • Network restrictions (VPN, IP allowlists)
  • TLS termination and hardened ingress rules

In many cases, dashboards should only be accessible from internal networks.

Can these dashboards replace kubectl?

No. Dashboards are complementary tools, not replacements for kubectl. While they simplify visualization and some management tasks, advanced operations, automation, and troubleshooting still rely heavily on CLI tools and GitOps workflows.

What should I consider before migrating away from Kubernetes Dashboard?

Before migrating, review:

  • Authentication and identity provider compatibility
  • Existing RBAC roles and permissions
  • Multi-cluster requirements
  • GitOps and CI/CD integrations
  • Training needs for platform teams and developers

Starting with Headlamp is often the lowest-risk migration path.

Which Kubernetes dashboard is best for developers rather than platform teams?

Tools like Octant and Devtron are more developer-focused. They emphasize application-centric views, resource relationships, and deployment workflows, making them ideal for developers who want insight without managing cluster infrastructure directly.

Which Kubernetes dashboard is best for multi-cluster management?

For multi-cluster environments, Portainer, Rancher, and KubeSphere are strong options. These platforms are designed to manage multiple clusters from a single control plane and offer enterprise-grade RBAC, auditing, and centralized authentication.

MinIO Maintenance Mode Explained: Impact on Users & S3 Alternatives


Background: MinIO and the Maintenance Mode announcement

MinIO has long been one of the most popular self-hosted, S3-compatible object storage solutions for Kubernetes, in both on-premises and cloud-native environments. Its simplicity, performance, and API compatibility made it a common default choice for backups, artifacts, logs, and internal object storage.

In late 2025, MinIO marked its upstream repository as Maintenance Mode and clarified that the Community Edition would be distributed source-only, without official pre-built binaries or container images. This move triggered renewed discussion across the industry about sustainability, governance, and the risks of relying on a single-vendor-controlled “open core” storage layer.

A detailed industry analysis of this shift, including its broader ecosystem impact, can be found in the InfoQ article covering the announcement.

What exactly changed?

1. Maintenance Mode

Maintenance Mode means:

  • No new features
  • No roadmap-driven improvements
  • Limited fixes, typically only for critical issues
  • No active review of community pull requests

As highlighted by InfoQ, this effectively freezes MinIO Community as a stable but stagnant codebase, pushing innovation and evolution exclusively toward the commercial offerings.

2. Source-only distribution

Official binaries and container images are no longer published for the Community Edition. Users must:

  • Build MinIO from source
  • Maintain their own container images
  • Handle signing, scanning, and provenance themselves

This aligns with a broader industry pattern noted by InfoQ: infrastructure projects increasingly shifting operational burden back to users unless they adopt paid tiers.
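
For teams staying on the Community Edition, the build step itself is manageable — a minimal sketch (the tag is a placeholder; check the Go version and build flags required by the release you pin):

# Build a pinned MinIO release from source
git clone --depth 1 --branch <pinned-release-tag> https://github.com/minio/minio.git
cd minio
CGO_ENABLED=0 go build -o minio .

# Package, scan, and sign the resulting binary/image with your own pipeline
# (e.g. docker build, trivy/grype scan, cosign sign) before it reaches production.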

Direct implications for Community users

Security and patching

With no active upstream development:

  • Vulnerability response times may increase
  • Users must monitor security advisories independently
  • Regulated environments may find Community harder to justify

InfoQ emphasizes that this does not make MinIO insecure by default, but it changes the shared-responsibility model significantly.

Operational overhead

Teams now need to:

  • Pin commits or tags explicitly
  • Build and test their own releases
  • Maintain CI pipelines for a core storage dependency

This is a non-trivial cost for what was previously perceived as a “drop‑in” component.

Support and roadmap

The strategic message is clear: active development, roadmap influence, and predictable maintenance live behind the commercial subscription.

Impact on OEM and embedded use cases

The InfoQ analysis draws an important distinction between API consumers and technology embedders.

Using MinIO as an external S3 service

If your application simply consumes an S3 endpoint:

  • The impact is moderate
  • Migration is largely operational
  • Application code usually remains unchanged

Embedding or redistributing MinIO

If your product:

  • Ships MinIO internally
  • Builds gateways or features on MinIO internals
  • Depends on MinIO-specific operational tooling

Then the impact is high:

  • You inherit maintenance and security responsibility
  • Long-term internal forking becomes likely
  • Licensing (AGPL) implications must be reassessed carefully

For OEM vendors, this often forces a strategic re-evaluation rather than a tactical upgrade.

Forks and community reactions

At the time of writing:

  • Several community forks focus on preserving the MinIO Console / UI experience
  • No widely adopted, full replacement fork of the MinIO server exists
  • Community discussion, as summarized by InfoQ, reflects caution rather than rapid consolidation

The absence of a strong server-side fork suggests that most organizations are choosing migration over replacement-by-fork.

Best S3-Compatible Alternatives to MinIO: Ceph RGW, SeaweedFS, Garage

InfoQ highlights that the industry response is not about finding a single “new MinIO”, but about selecting storage systems whose governance and maintenance models better match long-term needs.

Ceph RGW

Best for: Enterprise-grade, highly available environments
Strengths: Mature ecosystem, large community, strong governance
Trade-offs: Operational complexity

SeaweedFS

Best for: Teams seeking simplicity and permissive licensing
Strengths: Apache-2.0 license, active development, integrated S3 API
Trade-offs: Partial S3 compatibility for advanced edge cases

Garage

Best for: Self-hosted and geo-distributed systems
Strengths: Resilience-first design, active open-source development
Trade-offs: AGPL license considerations

Zenko / CloudServer

Best for: Multi-cloud and Scality-aligned architectures
Strengths: Open-source S3 API implementation
Trade-offs: Different architectural assumptions than MinIO

Recommended strategies by scenario

If you need to reduce risk immediately

  • Freeze your current MinIO version
  • Build, scan, and sign your own images
  • Define and rehearse a migration path

If you operate Kubernetes on-prem with HA requirements

  • Ceph RGW is often the most future-proof option

If licensing flexibility is critical

  • Start evaluation with SeaweedFS

If operational UX matters

  • Shift toward automation-first workflows
  • Treat UI forks as secondary tooling, not core infrastructure

Conclusion

MinIO’s shift of the Community Edition into Maintenance Mode is less about short-term breakage and more about long-term sustainability and control.

As the InfoQ analysis makes clear, the real risk is not technical incompatibility but governance misalignment. Organizations that treat object storage as critical infrastructure should favor solutions with transparent roadmaps, active communities, and predictable maintenance models.

For many teams, this moment serves as a natural inflection point: either commit to self-maintaining MinIO, move to a commercially supported path, or migrate to a fully open-source alternative designed for the long run.


Frequently Asked Questions

What does Maintenance Mode mean for MinIO Community Edition?

Maintenance Mode for MinIO Community Edition means the upstream codebase is effectively frozen. There will be no new features, only critical bug fixes, and community pull requests will not be actively reviewed. Furthermore, official pre-built binaries and container images are no longer provided; users must build from source.

Is MinIO Community Edition still safe to use?

The code itself isn’t inherently insecure, but the shared-responsibility model changes. Security patching for non-critical issues will be slower or non-existent. For production, especially in regulated environments, you must now actively monitor advisories and maintain your own built and scanned images, which increases operational risk.

What is the best open-source alternative to MinIO for Kubernetes?

The best alternative depends on your needs. For enterprise-grade, high-availability setups, Ceph RGW is the most robust choice. For simplicity and Apache 2.0 licensing, SeaweedFS is excellent. For geo-distributed, resilience-first designs, evaluate Garage. There is no direct drop-in replacement; each requires evaluation.

Should I fork MinIO or migrate to another solution?

For most organizations, migration is preferable to forking. Maintaining a full fork of a complex storage server involves significant long-term commitment to security, bug fixes, and potential feature backporting. The industry trend, as noted, is toward migration to systems with active upstream development and clear governance.

How does this affect products that embed MinIO (OEM use)?

The impact on OEMs is high. You inherit full maintenance and security responsibility. Long-term internal forking becomes likely, and the AGPL licensing implications must be carefully reassessed. This often forces a strategic re-evaluation of your embedded storage layer rather than a simple tactical update.


Helm 4.0 Features, Breaking Changes & Migration Guide 2025

Helm is one of the core utilities in the Kubernetes ecosystem, so the release of a new major version such as Helm 4.0 matters: it will need to be analyzed, evaluated, and managed over the coming months.

Helm 4.0 represents a major milestone in Kubernetes package management.

Because of this, we will see plenty of commentary and articles around the release, so let us try to shed some light on what actually changed.

Helm 4.0 Key Features and Improvements

According to the project’s own announcement, Helm 4 introduces three major blocks of changes: a new plugin system, better integration with Kubernetes, and internal modernization of the SDK and performance.

New Plugin System (includes WebAssembly)

The plugin system has been completely redesigned with a strong focus on security: it introduces a new WebAssembly runtime that, while optional, is the recommended path because plugins run in a sandbox with clear limits and guarantees.

In any case, there is no need to worry excessively, as the “classic” plugins continue to work, but the message is clear: for security and extensibility, the direction is Wasm.

Server-Side Apply and Better Integration with Other Controllers

From this version, Helm 4 supports Server-Side Apply (SSA) through the --server-side flag. SSA has been stable since Kubernetes v1.22 and lets the API server merge object updates, avoiding conflicts between different controllers managing the same resources.

It also integrates kstatus to determine the actual state of a resource more reliably than the current behavior of the --wait parameter.

Other Additional Improvements

Additionally, there is a set of smaller improvements that, while narrower in scope, are meaningful quality-of-life changes (a few CLI sketches follow the list):

  • Installation by digest in OCI registries: (helm install myapp oci://...@sha256:<digest>)
  • Multi-document values: you can pass multiple YAML values in a single multi-doc file, facilitating complex environments/overlays.
  • New --set-json argument that allows for easily passing complex structures compared to the current solution using the --set parameter
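
A few of these in CLI form (chart names, registry, and values are illustrative):

# Install by immutable digest from an OCI registry
helm install myapp oci://registry.example.com/charts/myapp@sha256:<digest>

# A single values file may now contain several YAML documents separated by '---'
helm upgrade myapp ./chart -f values.multi.yaml

# Pass a complex structure as JSON instead of many --set flags
helm upgrade myapp ./chart --set-json 'resources={"limits":{"cpu":"500m","memory":"256Mi"}}'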

Why a Major (v4) and Not Another Minor of 3.x?

As explained in the official release post, there were features that the team could not introduce in v3 without breaking public SDK APIs and internal architecture:

  • Strong change in the plugin system (WebAssembly, new types, deep integration with the core).
  • Restructuring of Go packages and establishment of a stable SDK at helm.sh/helm/v4, code-incompatible with v3.
  • Introduction and future evolution of Charts v3, which require the SDK to support multiple versions of chart APIs.

With all this, continuing in the 3.x branch would have violated SemVer: the major number change is basically “paying” the accumulated technical debt to be able to move forward.

A further evolution of the chart format is also expected, moving from v2 to a future v3 that is not yet fully defined; for now, v2 charts run correctly in this new version.

Is Helm 4.0 Migration Required?

The short answer is: yes. And possibly the long answer is: yes, and quickly. In the official Helm 4 announcement, they specify the support schedule for Helm 3:

  • Helm 3 bug fixes until July 8, 2026.
  • Helm 3 security fixes until November 11, 2026.
  • No new features will be backported to Helm 3 during this period; only Kubernetes client libraries will be updated to support new K8s versions.

Practical translation:

  • Organizations have approximately 1 year to plan a smooth Helm 4.0 migration with continued bug support for Helm 3.
  • After November 2026, continuing to use Helm 3 will become increasingly risky from a security and compatibility standpoint.

Best Practices for Migration

When carrying out the migration, remember that both versions can coexist on the same machine or CI agent, so a gradual migration is perfectly feasible: the goal is to reach the end of Helm 3 support with everything already migrated. The following steps are recommended:

  • Analyze all Helm commands and usage across integration pipelines, upgrade scripts, and any in-house code that imports the Helm client libraries.
  • Especially carefully review all uses of --post-renderer, helm registry login, --atomic, --force.
  • After the analysis, start testing Helm 4 in non-production environments, reusing the same charts and values, and fall back to Helm 3 whenever a problem is detected until it is resolved.
  • If you have critical plugins, explicitly test them with Helm 4 before making the global change.

What are the main new features in Helm 4.0?

Helm 4.0 introduces three major improvements: a redesigned plugin system with WebAssembly support for enhanced security, Server-Side Apply (SSA) integration for better conflict resolution, and internal SDK modernization for improved performance. Additional features include OCI digest installation and multi-document values support.

When does Helm 3 support end?

Helm 3 bug fixes end July 8, 2026 and security fixes end November 11, 2026. No new features will be backported to Helm 3. Organizations should plan migration to Helm 4.0 before November 2026 to avoid security and compatibility risks.

Are Helm 3 charts compatible with Helm 4.0?

Yes, Helm Chart API v2 charts work correctly with Helm 4.0. However, the Go SDK has breaking changes, so applications using Helm libraries need code updates. The CLI commands remain largely compatible for most use cases.

Can I run Helm 3 and Helm 4 simultaneously?

Yes, both versions can be installed on the same machine, enabling gradual migration strategies. This allows teams to test Helm 4.0 in non-production environments while maintaining Helm 3 for critical workloads during the transition period.

What should I test before migrating to Helm 4.0?

Focus on testing critical plugins, post-renderers, and specific flags like --atomic, --force, and helm registry login. Test all charts and values in non-production environments first, and review any custom integrations using Helm SDK libraries.

What is Server-Side Apply in Helm 4.0?

Server-Side Apply (SSA) is enabled with the --server-side flag and handles resource updates on the Kubernetes API server side. This prevents conflicts between different controllers managing the same resources and has been stable since Kubernetes v1.22.