Kubernetes HPA Best Practices: When CPU Works, Why Memory Almost Never Does

Kubernetes HPA best practices diagram showing CPU vs memory autoscaling behavior

There is a configuration that appears in virtually every Kubernetes cluster: a HorizontalPodAutoscaler targeting 70% CPU utilization and 70% memory utilization. It looks reasonable. It follows the examples in the official documentation. And in many cases, it silently causes more harm than good.

The problems surface in predictable ways: workloads that do nothing get scaled up because their memory footprint is naturally high. Latency-sensitive APIs scale too slowly because the CPU spike is already over by the time new pods are ready. Batch jobs oscillate between scaling up and down during normal operation. And teams spend hours debugging autoscaling behavior that should have been straightforward.

This article explains why the default HPA configuration fails, the exact conditions under which memory-based HPA is appropriate (and when it is not), and which alternative metrics — custom metrics, event-driven triggers, and external signals — produce autoscaling behavior that actually matches workload demand.

How HPA Actually Decides to Scale

Before diagnosing the problems, it is worth understanding the mechanics precisely. HPA computes a desired replica count using this formula:

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))

For a CPU target of 70%, with 2 replicas currently consuming an average of 140% of their CPU request, HPA computes ceil(2 × (140 / 70)) = 4 replicas. This is conceptually simple but has a critical dependency that most configurations ignore: the metric value is expressed relative to the resource request, not the resource limit.

This distinction is fundamental to understanding every failure mode that follows. If a container has a CPU request of 100m and a limit of 2000m, and it is currently consuming 80m, HPA sees 80% utilization — even though the container is using only 4% of its allowed ceiling. Set an HPA threshold of 70% on a container with a CPU request of 100m and any nontrivial workload will trigger scaling immediately.
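A minimal illustration of that distinction, with hypothetical values — the HPA percentage is computed against requests.cpu, so the limit plays no role in the scaling math:

```yaml
# Hypothetical container spec: HPA utilization is measured
# against requests.cpu (100m), not limits.cpu (2000m).
resources:
  requests:
    cpu: 100m      # with a 70% target, HPA scales up at ~70m of actual usage
  limits:
    cpu: 2000m     # irrelevant to the HPA utilization calculation
```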

The HPA controller polls metrics every 15 seconds by default (--horizontal-pod-autoscaler-sync-period). Scale-up happens quickly — within one to three polling cycles when the threshold is consistently exceeded. Scale-down is deliberately slow: by default the controller waits 5 minutes (--horizontal-pod-autoscaler-downscale-stabilization) before reducing replicas, to avoid thrashing. This asymmetry matters when debugging oscillation.

CPU-Based HPA: When It Works and When It Doesn’t

CPU is a compressible resource. When a container hits its CPU limit, the kernel throttles it — the process slows down but does not crash or get evicted. This property makes CPU a reasonable proxy for load in many, but not all, scenarios.

Where CPU HPA Works Well

Stateless request-processing workloads are the sweet spot for CPU-based HPA. If your service does CPU-bound work per request — REST APIs performing data transformation, compute-heavy business logic, image processing — then CPU utilization correlates strongly with request volume. More requests means more CPU consumed, which means HPA adds replicas, which distributes the load.

The key prerequisites for CPU HPA to work correctly are:

  • Accurate CPU requests. Set requests to the actual sustained consumption of the workload under normal load, not a low placeholder. Use VPA in recommendation mode or historical Prometheus data to right-size requests before enabling HPA.
  • Reasonable request-to-limit ratio. A ratio of 1:4 or less keeps HPA thresholds meaningful. A container with request 100m and limit 4000m makes percentage-based thresholds nearly useless.
  • CPU consumption that tracks user load linearly. If your service does CPU-heavy background work independent of incoming requests, CPU utilization will trigger scaling regardless of actual demand.

Where CPU HPA Fails

Latency-sensitive services with sharp traffic spikes. HPA reacts to average CPU utilization measured over the polling window. For a service that handles traffic bursts — a flash sale, a cron-triggered batch of API calls, a notification broadcast — by the time the HPA controller detects the spike, queues new pods, and those pods pass readiness checks, the burst may already be over. The result is replicas added after the damage is done, with the added cost of a scale-down cycle afterward.

I/O-bound workloads. A service that spends most of its time waiting on database queries, external API calls, or message queue reads will show low CPU utilization even under heavy load. HPA will not add replicas while the service is degraded — it sees idle CPUs while goroutines or threads are blocked waiting on I/O.

Workloads with cold-start costs. If a new replica takes 30-60 seconds to warm up (loading ML models, establishing connection pools, populating caches), scaling decisions need to happen earlier — before CPU peaks — not in reaction to it.

Memory-Based HPA: Why It Almost Always Breaks

Memory is an incompressible resource. Unlike CPU — which can be throttled without killing a process — when a container exhausts its memory limit, the OOM killer terminates it. This single property cascades into a set of fundamental problems with using memory as an HPA trigger.

The Core Problem: Memory Doesn’t Naturally Correlate With Load

For most well-architected services, memory consumption is relatively stable. A Go service allocates memory at startup for its runtime structures, connection pools, and caches — and then maintains roughly that footprint regardless of traffic. A JVM application allocates a heap at startup and uses garbage collection to manage it. In both cases, memory usage under 10 requests per second and under 10,000 requests per second may be nearly identical.

This means a memory-based HPA with a 70% threshold will either:

  • Never trigger, because the workload’s memory is stable and always below the threshold — rendering the HPA useless.
  • Always trigger, because the workload’s baseline memory consumption is naturally above the threshold — causing the workload to scale out permanently and never scale back in.

Neither outcome corresponds to actual scaling need.

The Request Misconfiguration Trap

This is the most common cause of “my workload scales up for no reason.” Consider a Java service that needs 512Mi of heap to run normally. The team sets the memory request to 256Mi — too conservative, either to save cost or because the initial estimate was wrong. The service immediately consumes 200% of its memory request just by being alive. An HPA with a 70% memory target will scale this workload to maximum replicas within minutes of deployment, and it will stay there forever.

The fix is never “adjust the HPA threshold.” The fix is right-sizing the memory request. But this reveals the deeper issue: memory-based HPA is extremely sensitive to the accuracy of your resource requests, and most teams do not have accurate requests — especially for newer workloads or after code changes that alter memory footprint.
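Sketching the trap with the hypothetical numbers above — a ~512Mi-footprint JVM behind a 256Mi request reports roughly 200% utilization from the moment it starts:

```yaml
resources:
  requests:
    memory: 256Mi   # actual steady-state footprint is ~512Mi => ~200% "utilization"
  limits:
    memory: 1Gi
# With a 70% memory target, every sync computes
# desiredReplicas = ceil(current × 200 / 70) — the workload pins itself
# at maxReplicas and never scales back in.
```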

JVM and Go Runtime Memory Behavior

JVM workloads are particularly problematic. By default, the JVM allocates heap up to a maximum (-Xmx) and then holds that memory — it does not release heap back to the OS aggressively, even after garbage collection. A JVM service that handles one request per hour will show nearly the same memory footprint as one handling thousands of requests per minute. Furthermore, the JVM’s garbage collector introduces memory spikes during collection cycles that are unrelated to load.

In containerized JVM environments, you also need to account for container awareness (-XX:+UseContainerSupport, enabled by default since JDK 10), which affects how the JVM calculates its heap ceiling relative to the container memory limit. Without proper tuning, the JVM may allocate a heap that fills 80-90% of the container’s memory limit — immediately triggering any memory-based HPA.

Go workloads behave differently but also poorly with memory HPA. Go’s garbage collector is designed to maintain low latency rather than minimal memory use. The runtime may hold memory above what is strictly needed, and the memory footprint can vary based on GC tuning parameters (GOGC, GOMEMLIMIT) in ways that are not correlated with incoming request load.

When Memory HPA Is Actually Appropriate

There are narrow cases where memory-based HPA makes sense:

  • Workloads where memory consumption genuinely tracks with load linearly. Some data processing pipelines, in-memory caches that grow with request volume, or streaming applications that buffer data proportionally to throughput. If you can demonstrate from metrics that memory and load have a strong linear correlation, memory HPA is defensible.
  • As a safety valve alongside CPU HPA. Using memory as a secondary metric (not primary) to protect against memory leaks or runaway allocations in a service that normally scales on CPU. In this case, set the memory threshold high — 85-90% — so it only triggers in genuine overconsumption scenarios.
  • Caching services where eviction is not desirable. If a service uses memory as a performance cache and you want to scale out before memory pressure causes cache eviction, memory utilization can be a useful trigger — provided requests are accurately sized.

Outside these specific cases, removing memory from your HPA spec and relying on the signals below will produce better behavior in virtually every scenario.
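A sketch of the safety-valve pattern (names and thresholds are illustrative): CPU drives normal scaling at 60%, while memory only triggers at 85% — HPA v2 evaluates all listed metrics and scales to satisfy the most demanding one:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # primary scaling signal
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 85   # safety valve: only fires on genuine overconsumption
```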

Right-Sizing Requests Before You Add HPA

No HPA strategy works correctly without accurate resource requests. Before adding any autoscaler — CPU, memory, or custom metrics — run your workload under representative load and measure actual consumption. The easiest way to do this is with VPA in recommendation mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only — don't auto-apply

After 24-48 hours of traffic, check the VPA recommendations:

kubectl describe vpa my-service-vpa

The lowerBound, target, and upperBound values give you a data-driven baseline for setting requests. Set your requests at or near the VPA target value before configuring HPA. This single step eliminates the most common cause of HPA misbehavior.

Note that VPA and HPA cannot both manage the same resource metric simultaneously. If VPA is set to auto-update CPU or memory, and HPA is also scaling on those metrics, the two controllers will fight each other. The safe combination is: HPA on CPU/memory + VPA in recommendation-only mode, or HPA on custom metrics + VPA on CPU/memory in auto mode. See the Kubernetes VPA guide for the full details.

Better Signals: What to Scale On Instead

The fundamental shift is moving from resource consumption metrics (which describe the past) to demand metrics (which describe what the workload is being asked to do right now or will be asked to do in seconds).

Requests Per Second (RPS)

For HTTP services, requests per second per replica is usually the most accurate proxy for load. Unlike CPU, it measures demand directly — not a side-effect of demand. An HPA that maintains 500 RPS per replica will scale predictably as traffic grows, regardless of whether the service is CPU-bound, memory-bound, or I/O-bound.

RPS is available as a custom metric from your service mesh (Istio exposes it as istio_requests_total), from your ingress controller (NGINX exposes request rates via Prometheus), or from your application’s own Prometheus metrics. Configuring HPA on custom metrics requires the Prometheus Adapter or a compatible custom metrics API implementation.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"   # 500 RPS per replica
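For the HPA above to see http_requests_per_second, the Prometheus Adapter needs a rule that derives it from a counter. A sketch of such a rule, assuming an application counter named http_requests_total with namespace and pod labels — adjust the series and label names to your setup:

```yaml
# Fragment of the Prometheus Adapter's rules configuration.
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"      # exposes http_requests_per_second
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```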

Queue Depth and Lag

For consumer workloads — services reading from Kafka, RabbitMQ, SQS, or any message queue — the right scaling signal is consumer lag: how many messages are waiting to be processed. A lag of zero means consumers are keeping up; a growing lag means you need more consumers.

CPU will not give you this signal reliably. A consumer blocked on a slow database write will show low CPU but growing lag. An idle consumer will show low CPU even if the queue contains millions of unprocessed messages. Scaling on lag directly solves both problems.

This is precisely the use case that KEDA was built for. KEDA’s Kafka scaler, for example, reads consumer group lag directly and scales replicas to maintain a configurable lag threshold — no custom metrics pipeline required.
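A minimal KEDA ScaledObject for this pattern (broker address, consumer group, and threshold are hypothetical):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    name: orders-consumer        # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.kafka.svc:9092
      consumerGroup: orders-consumer-group
      topic: orders
      lagThreshold: "100"        # target consumer lag per replica
```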

Latency

P99 latency per replica is an excellent scaling signal for latency-sensitive services. If your SLO is a 200ms P99 response time and latency starts climbing toward 400ms, that is a direct signal that the service is overloaded — regardless of what CPU or memory shows.

Latency-based autoscaling requires custom metrics from your service mesh or APM tool, but the added complexity is often justified for user-facing APIs where latency directly impacts experience.
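Once a per-pod latency metric is exposed through the custom metrics API, the HPA metrics entry looks like this sketch — the metric name here is an assumption and depends on how your adapter names the P99 series:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_request_duration_seconds_p99   # assumed adapter-exposed metric
    target:
      type: AverageValue
      averageValue: "200m"   # 200m = 0.2s, i.e. a 200ms P99 target per replica
```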

Scheduled and Predictive Scaling

For workloads with predictable traffic patterns — business-hours services, weekly batch jobs, end-of-month processing peaks — proactive scaling outperforms reactive scaling by definition. Rather than waiting for CPU to spike and then scrambling to add replicas, you pre-scale before the expected load increase.

KEDA’s Cron scaler enables this pattern declaratively, defining scale rules based on time windows rather than observed metrics.
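A sketch of a Cron-scaled workload (timezone and schedule are illustrative): pre-scale to 10 replicas during weekday business hours, fall back to minReplicaCount otherwise:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler
spec:
  scaleTargetRef:
    name: my-service
  minReplicaCount: 2
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Berlin
      start: 0 8 * * 1-5       # scale up weekdays at 08:00
      end: 0 19 * * 1-5        # scale back down at 19:00
      desiredReplicas: "10"
```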

HPA Configuration Best Practices

Always Set minReplicas ≥ 2 for Production

A minReplicas: 1 HPA means your service runs with a single point of failure whenever HPA scales in. When HPA scales down to 1 replica and that pod is evicted for node maintenance, your service has zero available instances for the duration of the new pod’s startup time. For any production workload, set minReplicas: 2 as a baseline.

Tune Stabilization Windows

The default 5-minute scale-down stabilization window is too aggressive for many workloads. A service that processes jobs in 3-minute batches will show a predictable CPU trough between batches — HPA will attempt to scale down, only to scale back up when the next batch arrives. Increase the stabilization window to match your workload’s natural cycle:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # 10 minutes
      policies:
      - type: Percent
        value: 25                        # Scale down max 25% of replicas at once
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

The behavior block (available in HPA v2, GA since Kubernetes 1.23) gives you independent control over scale-up and scale-down behavior. Aggressive scale-up with conservative scale-down is the right default for most production services.

Use a Lower CPU Threshold Than You Think

A CPU target of 70% sounds like it leaves headroom, but it does not account for the time required to scale. If your service takes 45 seconds to pass readiness checks after a new pod starts, and you scale at 70% CPU, the existing pods will be at 100%+ CPU (throttled) for 45 seconds before relief arrives. Set CPU targets at 50-60% for services where scale-up latency matters. This keeps more headroom available during the scaling reaction window.

Combine HPA with PodDisruptionBudgets

HPA scale-down terminates pods. Without a PodDisruptionBudget, HPA can terminate multiple replicas simultaneously during a scale-down event, potentially taking your service below its minimum healthy instance count during cluster maintenance. Always pair an HPA with a PDB:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: my-service

Don’t Mix VPA Auto-Update with HPA on the Same Metric

If VPA is set to auto-update CPU or memory requests, and HPA is also scaling on CPU or memory utilization, you create a control loop conflict. VPA changes the request (the denominator of the utilization calculation), which immediately changes the apparent utilization, which triggers HPA to change replica count, which changes the per-pod load, which triggers VPA again. Use VPA in Off or Initial mode when HPA is managing the same workload on resource metrics.

Decision Framework: Which Autoscaler for Which Workload

Use this as a starting point when configuring autoscaling for a new workload:

Workload type | Recommended signal | Tool
Stateless HTTP API, CPU-bound | CPU utilization at 50-60% | HPA
Stateless HTTP API, I/O-bound | RPS per replica or P99 latency | HPA + custom metrics
Message queue consumer | Consumer lag / queue depth | KEDA
Event-driven / Kafka / SQS | Event rate or lag | KEDA
Predictable traffic pattern | Schedule (time-based) | KEDA Cron scaler
Workload with memory leak risk | CPU primary + memory at 85% secondary | HPA (v2 multi-metric)
Right-sizing before HPA | Historical CPU/memory recommendations | VPA recommendation mode

Going Beyond HPA: KEDA and Custom Metrics

Once you outgrow what HPA v2 can express — particularly for event-driven architectures, external system triggers, or composite scaling conditions — KEDA provides a Kubernetes-native autoscaling framework that extends the HPA model without replacing it.

KEDA works by implementing a custom metrics API that HPA can consume, plus its own ScaledObject CRD that abstracts the configuration of over 60 built-in scalers: Kafka, RabbitMQ, Azure Service Bus, AWS SQS, Prometheus queries, Datadog metrics, HTTP request rate, and more. The important architectural point is that KEDA does not replace HPA — it feeds it. Under the hood, KEDA creates and manages an HPA resource targeting the scaled deployment. You get HPA’s stabilization windows, replica bounds, and Kubernetes-native behavior, driven by signals that HPA itself cannot access natively.

For a detailed walkthrough of KEDA scalers and real-world event-driven patterns, see Event-Driven Autoscaling in Kubernetes with KEDA.

For workloads where the right scaling signal comes from a Prometheus metric — request rates, custom business metrics, queue sizes exposed via exporters — the Kubernetes Autoscaling 1.26 and HPA v2 article covers how the custom metrics API pipeline works and how changes in Kubernetes 1.26 affected KEDA behavior.

❓ FAQ

Can I use both CPU and memory in the same HPA?

Yes. HPA v2 supports multiple metrics simultaneously — it scales to satisfy the most demanding metric. If CPU is at 40% (below threshold) but memory is at 80% (above threshold), HPA will scale up. This multi-metric capability is useful for using memory as a safety valve while CPU drives normal scaling behavior. Set the CPU threshold at 60% and the memory threshold at 85% so memory only triggers in genuine overconsumption scenarios.

Why does my workload scale up immediately after deployment?

Almost always a resource request misconfiguration. Check kubectl top pods immediately after deployment and compare the actual consumption to the configured request. If the workload is consuming 200% of its request by simply being alive, the request is set too low. Use VPA in recommendation mode for 24 hours and adjust the request to match actual usage before re-enabling HPA.

Why does HPA scale down too aggressively and cause latency spikes?

Increase the scaleDown.stabilizationWindowSeconds in the HPA behavior block. The default 300 seconds is too short for workloads with cyclical load patterns. Also add a Percent policy to scale down at most 25% of replicas per minute, preventing simultaneous termination of multiple pods during a rapid scale-down event.

Should I set HPA on every deployment?

No. HPA is appropriate for workloads where replica count meaningfully affects capacity — stateless services, consumers, request handlers. It is not appropriate for stateful workloads (databases, caches) where scaling requires more than just adding replicas, for singleton controllers that should never have more than one replica, or for batch jobs that should run to completion without scaling. Adding HPA to every deployment creates operational noise and potential instability without benefit.

What is the minimum CPU request I should set to use HPA reliably?

There is no absolute minimum, but requests below 100m make percentage thresholds very coarse-grained. At 50m CPU request and a 70% threshold, HPA triggers when the pod consumes 35m CPU — essentially any non-trivial activity. In practice, if your workload genuinely needs less than 100m CPU under load, it probably should not be using CPU-based HPA at all. Consider RPS or custom metrics instead.

How do I debug HPA scaling decisions?

Start with kubectl describe hpa <name> — it shows the current metric values, the computed desired replica count, and the last scaling event reason. For deeper inspection, check HPA events with kubectl get events --field-selector involvedObject.kind=HorizontalPodAutoscaler. If using custom metrics, verify the metrics server is returning expected values with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1".


Further Reading

Istio ServiceEntry Explained: External Services, DNS, and Traffic Control


Every production Kubernetes cluster talks to the outside world. Your services call payment APIs, connect to managed databases, push events to SaaS analytics platforms, and reach legacy systems that will never run inside the mesh. By default, Istio lets all outbound traffic flow freely — or blocks it entirely if you flip outboundTrafficPolicy to REGISTRY_ONLY. Neither extreme gives you what you actually need: selective, observable, policy-controlled access to external services.

That is exactly what Istio ServiceEntry solves. It registers external endpoints in the mesh’s internal service registry so that Envoy sidecars can apply the same traffic management, security, and observability features to outbound calls that you already enjoy for east-west traffic. No new proxies, no egress gateways required for the basic case — just a YAML resource that tells the mesh “this external thing exists, and here is how to reach it.”

In this guide, I will walk through every field of the ServiceEntry spec, explain the four DNS resolution modes with real-world use cases, and show production-ready patterns for external APIs, databases, TCP services, and legacy workloads. We will also cover how to combine ServiceEntry with DestinationRule and VirtualService to get circuit breaking, retries, connection pooling, and even sticky sessions for external dependencies.

What Is a ServiceEntry

Istio maintains an internal service registry that merges Kubernetes Services with any additional entries you declare. When a sidecar proxy needs to decide how to route a request, it consults this registry. Services inside the mesh are automatically registered. Services outside the mesh are not — unless you create a ServiceEntry.

A ServiceEntry is a custom resource that adds an entry to the mesh’s service registry. Once registered, the external service becomes a first-class citizen: Envoy generates clusters, routes, and listeners for it, which means you get metrics (istio_requests_total), access logs, distributed traces, mTLS origination, retries, timeouts, circuit breaking — the full Istio feature set.

Without a ServiceEntry, outbound traffic to an external host either passes through as a raw TCP connection (in ALLOW_ANY mode) with no telemetry, or gets dropped with a 502/503 (in REGISTRY_ONLY mode). Both outcomes are undesirable in production. The ServiceEntry bridges that gap.
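For reference, the restrictive mode is set in the mesh configuration — a sketch via IstioOperator (field names per the Istio mesh config; apply this before relying on ServiceEntries for egress control):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    outboundTrafficPolicy:
      mode: REGISTRY_ONLY   # only hosts in the service registry are reachable
```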

ServiceEntry Anatomy: All Fields Explained

Let us look at a complete ServiceEntry and then break down each field.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: production
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  exportTo:
    - "."
    - "istio-system"

hosts

A list of hostnames associated with the service. For external services, this is typically the DNS name your application uses (e.g., api.stripe.com). For services using HTTP protocols, the hosts field is matched against the HTTP Host header. For non-HTTP protocols and services without a DNS name, you can use a synthetic hostname and pair it with addresses or static endpoints.

addresses

Optional virtual IP addresses associated with the service. Useful for TCP services where you want to assign a VIP that the sidecar will intercept. Not required for HTTP/HTTPS services that use hostname-based routing.

ports

The ports on which the external service is exposed. Each port needs a number, name, and protocol. The protocol matters: setting it to TLS tells Envoy to perform SNI-based routing without terminating TLS. Setting it to HTTPS means HTTP over TLS. For databases, you’ll typically use TCP.

location

MESH_EXTERNAL or MESH_INTERNAL. Use MESH_EXTERNAL for services outside your cluster (third-party APIs, managed databases). Use MESH_INTERNAL for services inside your infrastructure that are not part of the mesh — for example, VMs running in the same VPC that do not have a sidecar, or a Kubernetes Service in a namespace without injection enabled. The location affects how mTLS is applied and how metrics are labeled.

resolution

How the sidecar resolves the endpoint addresses. This is the most critical field and I will dedicate the next section to it. Options: NONE, STATIC, DNS, DNS_ROUND_ROBIN.

endpoints

An explicit list of network endpoints. Required when resolution is STATIC. Optional with DNS resolution to provide labels or locality information. Each endpoint can have an address, ports, labels, network, locality, and weight.

exportTo

Controls the visibility of this ServiceEntry across namespaces. Use "." for the current namespace only, "*" for all namespaces, or an explicit namespace name (as in the example above, which also exports to istio-system). In multi-team clusters, restrict exports to avoid namespace pollution.

Resolution Types: NONE vs STATIC vs DNS vs DNS_ROUND_ROBIN

The resolution field determines how Envoy discovers the IP addresses behind the service. Getting this wrong is the number one cause of ServiceEntry misconfigurations. Here is a clear breakdown.

Resolution | How It Works | Best For
NONE | Envoy uses the original destination IP from the connection; no DNS lookup by the proxy. | Wildcard entries, pass-through scenarios, services where the application already resolved the IP.
STATIC | Envoy routes to the IPs listed in the endpoints field; no DNS involved. | Services with stable, known IPs (e.g., on-prem databases, VMs with fixed IPs).
DNS | Envoy resolves the hostname at connection time and creates an endpoint per returned IP; uses async DNS with health checking per IP. | External APIs behind load balancers, managed databases with DNS endpoints (RDS, CloudSQL).
DNS_ROUND_ROBIN | Envoy resolves the hostname and uses a single logical endpoint, rotating across returned IPs; no per-IP health checking. | Simple external services, services where you do not need per-endpoint circuit breaking.

When to Use NONE

Use NONE when you want to register a range of external IPs or wildcard hosts without Envoy performing any address resolution. This is common for broad egress policies: “allow traffic to *.googleapis.com on port 443.” Envoy will simply forward traffic to whatever IP the application resolved via kube-dns. The downside: Envoy has limited ability to apply per-endpoint policies.
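A sketch of that broad egress policy as a ServiceEntry:

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: google-apis
spec:
  hosts:
    - "*.googleapis.com"
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE   # Envoy forwards to whatever IP the application resolved
```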

When to Use STATIC

Use STATIC when the external service has known, stable IP addresses that rarely change. This avoids DNS dependencies entirely. You define the IPs in the endpoints list. Classic use case: a legacy Oracle database on a fixed IP in your data center.

When to Use DNS

Use DNS for most external API integrations. Envoy performs asynchronous DNS resolution and creates a cluster endpoint for each returned IP address. This enables per-endpoint health checking and circuit breaking — critical for production reliability. This is the mode you want for services like api.stripe.com or your RDS instance endpoint.

When to Use DNS_ROUND_ROBIN

Use DNS_ROUND_ROBIN when the external hostname returns many IPs and you do not need per-IP circuit breaking. Envoy treats all resolved IPs as a single logical endpoint and round-robins across them. This is lighter weight than DNS mode and avoids creating a large number of endpoints in Envoy’s cluster configuration.

Practical Patterns

Pattern 1: External HTTP API (api.stripe.com)

The most common ServiceEntry pattern. Your application calls a third-party HTTPS API. You want Istio telemetry, and optionally retries and timeouts.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS

Note the protocol is TLS, not HTTPS. Since your application initiates the TLS handshake directly, Envoy handles this as opaque TLS using SNI-based routing. If you were terminating TLS at the sidecar and doing TLS origination via a DestinationRule, you would set the protocol to HTTP and handle the upgrade separately — but for most external APIs, let the application manage its own TLS.
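For completeness, a sketch of the sidecar TLS origination variant mentioned above, following the pattern in Istio's egress documentation (names are illustrative) — the application speaks plain HTTP on port 80 and the sidecar upgrades to TLS toward port 443:

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api-origination
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 80
      name: http
      protocol: HTTP
      targetPort: 443        # sidecar sends the upgraded traffic to 443
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: stripe-api-origination
spec:
  host: api.stripe.com
  trafficPolicy:
    portLevelSettings:
      - port:
          number: 80
        tls:
          mode: SIMPLE       # sidecar originates TLS to the external host
```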

Pattern 2: External Managed Database (RDS / CloudSQL)

Managed databases expose a DNS endpoint that resolves to one or more IPs. During failover, the DNS record changes. You need Envoy to respect DNS TTLs and route to the current primary.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: orders-database
  namespace: orders
spec:
  hosts:
    - orders-db.abc123.us-east-1.rds.amazonaws.com
  location: MESH_EXTERNAL
  ports:
    - number: 5432
      name: postgres
      protocol: TCP
  resolution: DNS

For TCP services, Envoy cannot use HTTP headers to route, so it relies on IP-based matching. The DNS resolution mode ensures Envoy periodically re-resolves the hostname and updates its endpoint list. This is critical for RDS multi-AZ failover scenarios where the DNS endpoint flips to a new IP.

Pattern 3: Legacy Internal Service Not in the Mesh

You have a monitoring service running on a set of VMs at known IP addresses inside your VPC. It is not part of the mesh, but your meshed services need to talk to it.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-monitoring
  namespace: observability
spec:
  hosts:
    - legacy-monitoring.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.5.10
    - address: 10.0.5.11
    - address: 10.0.5.12

Key differences: location is MESH_INTERNAL because the service lives inside your network, and resolution is STATIC because we know the IPs. The hostname legacy-monitoring.internal is synthetic — your application uses it, and Istio’s DNS proxy (or a CoreDNS entry) resolves it to one of the listed endpoints.

Pattern 4: TCP Services with Multiple Ports

Some external services expose multiple TCP ports — for example, an Elasticsearch cluster with both data (9200) and transport (9300) ports.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-elasticsearch
  namespace: search
spec:
  hosts:
    - es.example.com
  location: MESH_EXTERNAL
  ports:
    - number: 9200
      name: http
      protocol: HTTP
    - number: 9300
      name: transport
      protocol: TCP
  resolution: DNS

Each port gets its own Envoy listener configuration. The HTTP port benefits from full Layer 7 telemetry and traffic management. The TCP port gets Layer 4 metrics and connection-level policies.
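You can confirm the per-port split by inspecting the sidecar of a consuming pod; a diagnostic sketch (the pod name is a placeholder):

```shell
# One outbound listener per declared ServiceEntry port
istioctl proxy-config listener <pod-name> -n search --port 9200
istioctl proxy-config listener <pod-name> -n search --port 9300
```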

Combining ServiceEntry with DestinationRule

A ServiceEntry alone registers the external service. To apply traffic policies — connection pooling, circuit breaking, TLS origination, load balancing — you pair it with a DestinationRule. This is where things get powerful.

Connection Pooling and Circuit Breaking

External APIs have rate limits. Your managed database has a maximum connection count. Protecting these dependencies at the mesh level prevents cascading failures.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: stripe-api
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: stripe-api-dr
  namespace: payments
spec:
  host: api.stripe.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 50
        connectTimeout: 5s
      http:
        h2UpgradePolicy: DO_NOT_UPGRADE
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 100

This configuration caps outbound connections to Stripe at 50, sets a 5-second connection timeout, and ejects endpoints that return 3 consecutive 5xx errors. In production, this prevents a degraded third-party API from consuming all your connection slots and causing a domino effect across your services.
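When tuning these thresholds, it helps to watch the sidecar’s outlier-detection counters; a diagnostic sketch run against the calling pod (the pod name is a placeholder):

```shell
# Dump ejection counters from Envoy's stats endpoint via pilot-agent
kubectl exec <pod-name> -n payments -c istio-proxy -- \
  pilot-agent request GET stats | grep outlier_detection
```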

TLS Origination

Sometimes your application speaks plain HTTP, but the external service requires HTTPS. Instead of modifying application code, you can offload TLS origination to the sidecar.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-api
  namespace: default
spec:
  hosts:
    - api.external-service.com
  location: MESH_EXTERNAL
  ports:
    - number: 80
      name: http
      protocol: HTTP
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: external-api-tls
  namespace: default
spec:
  host: api.external-service.com
  trafficPolicy:
    portLevelSettings:
      - port:
          number: 443
        tls:
          mode: SIMPLE

Your application sends HTTP to port 80. A VirtualService matching port 80 routes that traffic to port 443, and the DestinationRule initiates TLS to the external endpoint. The application never knows TLS happened.
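The redirect itself is a small VirtualService; a sketch reusing the hostnames from the example above:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: external-api-redirect
  namespace: default
spec:
  hosts:
    - api.external-service.com
  http:
    - match:
        - port: 80
      route:
        - destination:
            host: api.external-service.com
            port:
              number: 443    # the DestinationRule upgrades this hop to TLS
```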

Combining ServiceEntry with VirtualService

VirtualService gives you Layer 7 traffic management for external services: retries, timeouts, fault injection, header-based routing, and traffic shifting. This is invaluable when you are migrating between API providers or need resilience policies for unreliable external dependencies.

Retries and Timeouts

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: stripe-api-vs
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  http:
    - route:
        - destination:
            host: api.stripe.com
            port:
              number: 443
      timeout: 10s
      retries:
        attempts: 3
        perTryTimeout: 3s
        retryOn: connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes
        retryRemoteLocalities: true

This applies a 10-second overall timeout with up to 3 retry attempts (3 seconds each) for specific failure conditions. Note that HTTP routes only apply when Envoy can see the HTTP layer, so this example assumes an HTTP-protocol ServiceEntry with TLS origination at the sidecar; it would not take effect against the pass-through TLS entry shown earlier. For TLS-protocol entries where Envoy cannot see the HTTP layer, you are limited to TCP-level connection settings configured via the DestinationRule.

Traffic Shifting Between External Providers

Migrating from one external API to another? Use weighted routing to shift traffic gradually.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: geocoding-primary
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: DNS
  endpoints:
    - address: api.old-geocoding-provider.com
      labels:
        provider: old
    - address: api.new-geocoding-provider.com
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: geocoding-dr
  namespace: geo
spec:
  host: geocoding.internal
  trafficPolicy:
    tls:
      mode: SIMPLE
  subsets:
    - name: old-provider
      labels:
        provider: old
    - name: new-provider
      labels:
        provider: new
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: geocoding-vs
  namespace: geo
spec:
  hosts:
    - geocoding.internal
  http:
    - route:
        - destination:
            host: geocoding.internal
            subset: old-provider
          weight: 80
        - destination:
            host: geocoding.internal
            subset: new-provider
          weight: 20

This sends 80% of geocoding traffic to the old provider and 20% to the new one. Adjust the weights as you gain confidence, and roll back at any time by setting the old provider back to 100%. One detail to watch: the endpoint addresses here are DNS names, so the ServiceEntry must use DNS resolution; STATIC resolution only accepts IP addresses.

DNS Resolution Patterns: Istio DNS Proxy vs kube-dns

DNS resolution for external services involves two layers: how your application resolves the hostname (kube-dns / CoreDNS), and how the sidecar resolves the hostname (Envoy’s async DNS or Istio’s DNS proxy). Understanding the interplay between the two is crucial for reliable behavior.

Default Flow (Without Istio DNS Proxy)

Your application calls api.stripe.com. kube-dns resolves it to an IP. The application opens a connection to that IP. The sidecar intercepts the connection and — if the ServiceEntry uses DNS resolution — Envoy independently resolves api.stripe.com to determine its endpoint list. Two separate DNS lookups happen, which can lead to inconsistencies if DNS records change between the two resolutions.

With Istio DNS Proxy (ISTIO_META_DNS_CAPTURE)

Istio’s sidecar includes a DNS proxy that intercepts DNS queries from the application. When enabled (via meshConfig.defaultConfig.proxyMetadata.ISTIO_META_DNS_CAPTURE and ISTIO_META_DNS_AUTO_ALLOCATE), the proxy can:

  • Auto-allocate virtual IPs for ServiceEntry hosts that do not have addresses defined, which is critical for TCP ServiceEntries that need IP-based matching.
  • Resolve ServiceEntry hosts directly, avoiding the round-trip to kube-dns for known mesh services.
  • Ensure consistency between the application’s DNS resolution and the sidecar’s endpoint resolution.
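Both switches can be set mesh-wide; a sketch using the IstioOperator API (the same proxyMetadata keys can also be applied per workload through the proxy.istio.io/config annotation):

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: dns-proxy-config
spec:
  meshConfig:
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"         # intercept DNS queries from the app
        ISTIO_META_DNS_AUTO_ALLOCATE: "true"   # assign VIPs to address-less entries
```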

DNS capture remains opt-in for sidecar deployments in current Istio releases. Verify whether your proxies have it enabled with:

istioctl proxy-config bootstrap <pod-name> -n <namespace> | grep -A2 "ISTIO_META_DNS"

When DNS Proxy Matters Most

The DNS proxy is especially important for TCP ServiceEntries without an explicit addresses field. Without a VIP, Envoy cannot match an incoming TCP connection to the correct ServiceEntry because there is no HTTP Host header to inspect. The DNS proxy solves this by auto-allocating a VIP from the 240.240.0.0/16 range and returning that VIP when the application resolves the hostname. The sidecar then intercepts traffic to that VIP and routes it to the correct external endpoint.

Sticky Sessions with ServiceEntry

Some external services require session affinity — for example, a legacy service that stores session state in memory, or a WebSocket endpoint that must maintain a persistent connection to the same backend. Istio supports sticky sessions for external services through consistent hashing in a DestinationRule.

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: legacy-session-service
  namespace: default
spec:
  hosts:
    - legacy-session.internal
  location: MESH_INTERNAL
  ports:
    - number: 8080
      name: http
      protocol: HTTP
  resolution: STATIC
  endpoints:
    - address: 10.0.1.10
    - address: 10.0.1.11
    - address: 10.0.1.12
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: legacy-session-dr
  namespace: default
spec:
  host: legacy-session.internal
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpCookie:
          name: SERVERID
          ttl: 3600s

This configuration hashes on an HTTP cookie named SERVERID. If the cookie does not exist, Envoy generates one and sets it on the response so that subsequent requests from the same client stick to the same endpoint. You can also hash on:

  • HTTP header: consistentHash.httpHeaderName: "x-user-id" — useful when your application sends a user identifier in every request.
  • Source IP: consistentHash.useSourceIp: true — simplest option but breaks in environments with NAT or shared egress IPs.
  • Query parameter: consistentHash.httpQueryParameterName: "session_id" — for REST APIs that include a session identifier in the URL.
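As a concrete example of the header variant, a minimal DestinationRule sketch (the header name is illustrative):

```yaml
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: legacy-session-header-dr
  namespace: default
spec:
  host: legacy-session.internal
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id   # every request must carry this header
```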

Sticky sessions with ServiceEntry work identically to in-mesh sticky sessions. The key requirement is that the ServiceEntry must use STATIC or DNS resolution (not NONE) so that Envoy has multiple endpoints to hash across. With DNS_ROUND_ROBIN, there is only one logical endpoint, so consistent hashing has no effect.

Troubleshooting Common Issues

503 Errors When Calling External Services

The most common ServiceEntry issue. Start with this diagnostic sequence:

# Check if the ServiceEntry is applied and visible to the proxy
istioctl proxy-config cluster <pod-name> -n <namespace> | grep <external-host>

# Check the listeners
istioctl proxy-config listener <pod-name> -n <namespace> --port <port>

# Look at Envoy access logs for the specific request
kubectl logs <pod-name> -n <namespace> -c istio-proxy | grep <external-host>

Common causes of 503 errors:

  • Wrong protocol: Setting protocol: HTTPS when your application initiates TLS. Use TLS for pass-through; use HTTP only if the sidecar does TLS origination.
  • Missing ServiceEntry in REGISTRY_ONLY mode: If outboundTrafficPolicy is REGISTRY_ONLY, any host without a ServiceEntry is blocked.
  • exportTo restriction: The ServiceEntry is in namespace A, exported only to ".", and the calling pod is in namespace B.
  • DNS resolution failure: Envoy cannot resolve the hostname. Check that the DNS servers are reachable from the pod.

DNS Resolution Failures

When Envoy’s async DNS resolver fails, you will see UH (no healthy upstream) or UF (upstream connection failure) response flags in access logs.

# Check whether Envoy resolved endpoints for the host, from inside the sidecar
# (pilot-agent proxies this to the Envoy admin API on localhost:15000)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  pilot-agent request GET clusters | grep <external-host>

# Check Envoy cluster health
istioctl proxy-config endpoint <pod-name> -n <namespace> | grep <external-host>

If the endpoint shows UNHEALTHY, Envoy resolved the DNS but the outlier detection ejected the host. If no endpoint appears at all, DNS resolution is failing. Common fix: ensure your pods can reach an external DNS server, or that CoreDNS is configured to forward queries for the external domain.

TLS Origination Not Working

If you configured TLS origination via a DestinationRule but traffic still fails:

  • Ensure the ServiceEntry port protocol is HTTP, not TLS. If you set it to TLS, Envoy treats the connection as opaque TLS pass-through and will not apply the DestinationRule’s TLS settings.
  • Verify the DestinationRule’s host field exactly matches the ServiceEntry’s hosts entry.
  • Check that the VirtualService (if used) routes to the correct port number.

TCP ServiceEntry Not Intercepting Traffic

For TCP-protocol ServiceEntries without the DNS proxy, Envoy cannot match traffic by hostname. You must either:

  • Set an explicit addresses field with a VIP that your application targets.
  • Enable Istio’s DNS proxy to auto-allocate VIPs.
  • Ensure the destination IP matches what the ServiceEntry resolves to.

Without one of these, TCP traffic goes through the PassthroughCluster and bypasses your ServiceEntry entirely.
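The explicit-addresses option looks like this; the hostname and VIP are illustrative, and the application must be pointed at the VIP (for example through a custom DNS record):

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: external-redis
  namespace: default
spec:
  hosts:
    - redis.example.com
  addresses:
    - 240.240.10.10     # VIP the application dials; Envoy matches on it
  location: MESH_EXTERNAL
  ports:
    - number: 6379
      name: tcp-redis
      protocol: TCP
  resolution: DNS
```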

Frequently Asked Questions

Do I need a ServiceEntry if outboundTrafficPolicy is set to ALLOW_ANY?

You do not need one for connectivity — your services can reach external hosts without it. But you should create ServiceEntries anyway. Without them, outbound traffic goes through the PassthroughCluster, which means no detailed metrics per destination, no access logging with the external hostname, no circuit breaking, no retries, and no timeout policies. A ServiceEntry is the difference between “it works” and “it works reliably with observability.”

What is the difference between protocol TLS and HTTPS in a ServiceEntry port?

TLS tells Envoy to treat the connection as opaque TLS. Envoy reads the SNI header to determine routing but does not decrypt the payload. Use this when your application initiates TLS directly. HTTPS tells Envoy the protocol is HTTP over TLS, which implies Envoy should handle TLS. In practice, for external services where the application manages its own TLS, use TLS. Use HTTP with a DestinationRule TLS origination when you want the sidecar to handle TLS.

Can I use wildcards in ServiceEntry hosts?

Yes, but with limitations. You can use *.example.com to match any subdomain of example.com. However, wildcard entries only work with resolution: NONE because Envoy cannot perform DNS lookups for wildcard hostnames. This means you lose the ability to apply per-endpoint traffic policies. Wildcard ServiceEntries are best used for broad egress access control rather than fine-grained traffic management.
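For reference, a minimal wildcard entry for broad egress allow-listing might look like this (the domain is illustrative):

```yaml
apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: allow-example-subdomains
  namespace: istio-system
spec:
  hosts:
    - "*.example.com"
  location: MESH_EXTERNAL
  ports:
    - number: 443
      name: tls
      protocol: TLS
  resolution: NONE      # wildcard hosts cannot be DNS-resolved
```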

How do I configure sticky sessions for an external service behind a ServiceEntry?

Create a ServiceEntry with STATIC or DNS resolution (so Envoy has multiple endpoints), then pair it with a DestinationRule that configures consistentHash under trafficPolicy.loadBalancer. You can hash on an HTTP cookie, header, source IP, or query parameter. The ServiceEntry must expose multiple endpoints for consistent hashing to have any effect. See the “Sticky Sessions with ServiceEntry” section above for a complete YAML example.

How does ServiceEntry interact with NetworkPolicy and Istio AuthorizationPolicy?

A ServiceEntry does not bypass Kubernetes NetworkPolicy. If a NetworkPolicy blocks egress to the external IP, traffic will be dropped at the CNI level before Envoy can route it. Istio AuthorizationPolicy can also restrict which workloads are allowed to call specific ServiceEntry hosts. For defense in depth, use ServiceEntry for traffic management and observability, AuthorizationPolicy for workload-level access control, and NetworkPolicy for network-level enforcement.

Wrapping Up

ServiceEntry is one of the most practical Istio resources you will use in production. It transforms opaque outbound connections into managed, observable, policy-controlled traffic — and it does so without requiring changes to your application code. Start with the basics: create a ServiceEntry for each external dependency, set the correct resolution type, and pair it with a DestinationRule for connection limits and circuit breaking. As you mature, add VirtualServices for retries and timeouts, configure sticky sessions where needed, and enable the DNS proxy for seamless TCP service integration.

The pattern is always the same: register the service, apply policies, observe the traffic. Every external dependency you formalize with a ServiceEntry is one fewer blind spot in your production mesh.