KEP-5679: Fallback for HPA on failure to retrieve metrics

KEP-5053: Fallback for HPA External Metrics on Retrieval Failure

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
  - Story 1: SaaS Application Scaling on Queue Depth
  - Story 2: E-commerce Site with Multiple External Metrics
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Horizontal Pod Autoscaler’s reliance on external metrics creates a dependency on systems outside the Kubernetes cluster’s control. These external systems (cloud provider APIs, third-party monitoring systems, message brokers, etc.) may experience:

Network connectivity issues
Rate limiting
Service outages
Authentication/authorization failures
Degraded performance

When external metrics become unavailable, the HPA cannot make informed scaling decisions, which can lead to:

Workloads stuck at insufficient scale during traffic spikes
Inability to respond to critical business metrics (e.g., queue depth, error rates)
Over-dependence on external system reliability

Unlike in-cluster resource metrics (CPU, memory) served by metrics-server, which are part of the cluster’s core infrastructure, external metrics are inherently less reliable and outside the cluster operator’s direct control.

Motivation

The Horizontal Pod Autoscaler (HPA) supports scaling workloads based on external metrics—metrics that originate from systems outside the Kubernetes cluster’s control. These external systems include:

Cloud provider APIs (e.g., AWS CloudWatch, Azure Monitor, GCP Monitoring)
Third-party monitoring systems (e.g., Datadog, New Relic, Prometheus running externally)
Message brokers and queues (e.g., AWS SQS, RabbitMQ, Kafka)
Application-specific metrics services

Unlike in-cluster resource metrics (CPU, memory served by metrics-server) or custom/object metrics (served by in-cluster custom metrics APIs), external metrics are inherently less reliable because they depend on systems outside the cluster operator’s direct control. These external systems may experience:

Network connectivity issues between the cluster and the external service
Rate limiting or throttling
Service outages or degraded performance
Authentication/authorization failures
Regional or availability zone failures

When external metrics become unavailable, the HPA cannot make informed scaling decisions. Currently, the HPA simply maintains the current replica count and waits for metrics to become available again. This behavior can lead to:

Workloads stuck at insufficient scale during traffic spikes when metrics are unavailable
Inability to respond to critical business events (e.g., growing queue depth, increasing error rates)
Production incidents caused by external metrics provider outages
Over-dependence on the reliability of external systems for critical autoscaling functionality

Other autoscalers in the ecosystem, such as KEDA, already provide fallback mechanisms for external metrics to mitigate these availability issues. By allowing users to configure fallback behavior for external metrics in HPA, this proposal aims to:

Reduce the criticality of external metrics providers on cluster workload scaling
Improve the overall robustness of autoscaling for workloads that depend on external signals
Enable users to define safe, conservative scaling actions when external metrics are temporarily unavailable
Maintain workload availability and performance during external metrics provider disruptions

Why Duration-Based Instead of Count-Based:

Different Kubernetes providers and configurations may poll external metrics at different frequencies. The HPA reconciliation loop typically runs every 15 seconds by default (configurable via --horizontal-pod-autoscaler-sync-period), but this can vary between clusters. A count-based threshold (e.g., “3 failures”) would result in inconsistent behavior:

In a cluster polling every 15s: 3 failures = 45 seconds
In a cluster polling every 30s: 3 failures = 90 seconds
If polling frequency changes, behavior changes unexpectedly

A duration-based threshold provides consistent, predictable behavior regardless of:

HPA controller reconciliation frequency
Kubernetes provider configurations
Cluster-specific settings

The duration is measured from the first consecutive failure, ensuring consistent and understandable semantics: “activate fallback if the metric has been failing for at least X minutes.”
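
The arithmetic behind the count-based inconsistency can be sketched as follows. This is an illustration only: timeToTrigger is a hypothetical helper, not part of the HPA controller, and the 15s and 30s sync periods are the two examples used above.

```go
package main

import (
	"fmt"
	"time"
)

// timeToTrigger is a hypothetical helper: it returns how long a metric must
// keep failing before a count-based threshold fires, for a given sync period.
func timeToTrigger(failures int, syncPeriod time.Duration) time.Duration {
	return time.Duration(failures) * syncPeriod
}

func main() {
	// The same "3 failures" threshold means different wall-clock delays
	// depending on how often the controller reconciles.
	for _, period := range []time.Duration{15 * time.Second, 30 * time.Second} {
		fmt.Printf("sync period %v: 3 failures = %v\n", period, timeToTrigger(3, period))
	}
}
```

A duration-based threshold removes the sync period from this calculation entirely, which is why the delay before fallback stays the same across clusters.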

This enhancement allows users to specify a desired replica count that the HPA should use once an external metric has been failing continuously for a configurable duration. The fallback replica count is treated as the desired replica count from that metric and combined with other metrics using the HPA’s standard multi-metric algorithm (taking the maximum). All configured constraints (min/max replicas, behavior policies, etc.) still apply, so scaling decisions remain predictable and safe even when external metrics are unavailable.

The community has previously expressed interest in addressing this limitation (#109214).

Goals

Allow users to optionally define a fallback, static pod replica count value when retrieval of external metrics fails
Provide per-metric failure tracking and fallback behavior
Maintain the HPA’s scaling algorithm and respect min/max replica constraints
Ensure users can determine which specific metrics are using fallback values

Non-Goals

Fallback for resource metrics (CPU, memory from metrics-server) - these are in-cluster and should be addressed at the infrastructure level if unavailable
Fallback for pods/object metrics - these use in-cluster APIs
Fallback for custom metrics - may be considered in future based on alpha feedback
Last-known-good metric value caching
Automatic fallback value calculation
Changing the HPA scaling algorithm

Proposal

Add optional fallback configuration to the existing ExternalMetricSource type by introducing a new fallback field, allowing users to specify:

A failure duration (how long the metric must be continuously failing before activating fallback)
A desired replica count to use when the failure duration threshold is exceeded

This approach:

Works with the HPA algorithm: Fallback provides a desired replica count for that metric, which is combined with other metrics using the standard HPA multi-metric approach (taking the maximum)
Is per-metric: Each external metric can have its own fallback configuration
Provides visibility: Status shows which metrics are in fallback state
Is conservative: Only applies to external metrics, which are inherently out-of-cluster
Is consistent: Duration-based thresholds behave the same across different Kubernetes configurations and reconciliation frequencies
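
A minimal sketch of how a fallback replica count could be combined with proposals from healthy metrics, assuming the max-then-clamp behavior described above. The function desiredReplicas is hypothetical and deliberately ignores behavior policies and stabilization windows, which the real controller also applies.

```go
package main

import "fmt"

// desiredReplicas is a hypothetical helper: it takes one proposed replica
// count per metric (a fallback value counts as that metric's proposal),
// picks the maximum, and clamps the result to [minReplicas, maxReplicas].
func desiredReplicas(proposals []int32, minReplicas, maxReplicas int32) int32 {
	result := minReplicas
	for _, p := range proposals {
		if p > result {
			result = p
		}
	}
	if result > maxReplicas {
		result = maxReplicas
	}
	return result
}

func main() {
	// A healthy metric proposes 4 replicas; a failing metric contributes
	// its fallback value of 10.
	fmt.Println(desiredReplicas([]int32{4, 10}, 2, 20)) // 10
	// The same proposals with maxReplicas=8 are capped at 8.
	fmt.Println(desiredReplicas([]int32{4, 10}, 2, 8)) // 8
}
```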

User Stories

Story 1: SaaS Application Scaling on Queue Depth

As an operator, I run a SaaS application that scales based on a cloud provider’s message queue depth (external metric). Occasionally, the cloud provider’s metrics API experiences brief outages (5-10 minutes). During these outages, I would like my HPA to fall back to a manually configured replica count.

When the external API fails, the HPA treats this metric as requesting 10 replicas, ensuring sufficient capacity to handle the presumed backlog safely.

Story 2: E-commerce Site with Multiple External Metrics

As an operator, I want to configure different fallback replica counts for each external metric so my e-commerce site can continue autoscaling when one monitoring provider fails.

Risks and Mitigations

Risk: Users configure inappropriate fallback replica counts
- Mitigation: Documentation with best practices; validation ensures replicas > 0; HPA min/max constraints still apply; users should consider peak load scenarios when setting fallback values
Risk: Users configure failureDurationSeconds too short, causing premature fallback activation
- Mitigation: Minimum value of 180 seconds (3 minutes) provides reasonable buffer; validation enforces minimum values; documentation recommends considering normal metric provider latency and transient failures
Risk: Users configure failureDurationSeconds too long, delaying necessary scaling during outages
- Mitigation: Documentation provides guidance on balancing between avoiding false positives and responding quickly to genuine outages; recommend 180-300 seconds (3-5 minutes) for most use cases
Risk: Complexity in understanding which metric is in fallback and why
- Mitigation: Per-metric status clearly shows fallback state, firstFailureTime timestamp, and current fallbackStatus value; events are generated when fallback activates with clear messaging including duration and timestamp

Design Details

Introduce a new ExternalMetricFallback type and add a new fallback field to the existing ExternalMetricSource struct. Additionally, add new fallbackStatus and firstFailureTime fields to the existing ExternalMetricStatus struct.

// ExternalMetricFallback defines fallback behavior when an external metric cannot be retrieved
type ExternalMetricFallback struct {
  // failureDurationSeconds is the duration in seconds for which the external metric must be
  // continuously failing before the fallback value is used. The duration is measured from the
  // first consecutive failure. Must be at least 180.
  // +optional
  // default=180
  // min=180
  FailureDurationSeconds *int64 `json:"failureDurationSeconds,omitempty"`

  // replicas is the desired replica count to use when the external metric cannot be retrieved.
  // This value is treated as the desired replica count from this metric.
  // When multiple metrics are configured, the HPA controller uses the maximum of all
  // desired replica counts (standard HPA multi-metric behavior).
  // Must be greater than 0.
  // +required
  Replicas int32 `json:"replicas"`
}

// ExternalMetricSource indicates how to scale on a metric not associated with
// any Kubernetes object (for example length of queue in cloud
// messaging service, or QPS from loadbalancer running outside of cluster).
type ExternalMetricSource struct {
  // metric identifies the target metric by name and selector
  Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"`

  // target specifies the target value for the given metric
  Target MetricTarget `json:"target" protobuf:"bytes,2,name=target"`

  // fallback defines the behavior when this external metric cannot be retrieved.
  // If not set, the HPA will not scale based on this metric when it's unavailable.
  // +optional
  Fallback *ExternalMetricFallback `json:"fallback,omitempty"`
}
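
The constraints on the new type could be checked along these lines. This is a sketch under stated assumptions, not the actual API validation code: validateFallback and minFailureDurationSeconds are hypothetical names, the struct is a local copy without serialization tags, and the 180-second minimum is taken from the min=180 marker above.

```go
package main

import "fmt"

// minFailureDurationSeconds mirrors the min=180 marker in the proposal.
const minFailureDurationSeconds int64 = 180

// ExternalMetricFallback is a local copy of the proposed type, without API tags.
type ExternalMetricFallback struct {
	FailureDurationSeconds *int64
	Replicas               int32
}

// validateFallback is a hypothetical helper that returns one error per
// violated constraint. A nil fallback is valid because the field is optional,
// and a nil FailureDurationSeconds is valid because it defaults to 180.
func validateFallback(fb *ExternalMetricFallback) []error {
	if fb == nil {
		return nil
	}
	var errs []error
	if fb.FailureDurationSeconds != nil && *fb.FailureDurationSeconds < minFailureDurationSeconds {
		errs = append(errs, fmt.Errorf("failureDurationSeconds: must be at least %d, got %d",
			minFailureDurationSeconds, *fb.FailureDurationSeconds))
	}
	if fb.Replicas <= 0 {
		errs = append(errs, fmt.Errorf("replicas: must be greater than 0, got %d", fb.Replicas))
	}
	return errs
}

func main() {
	tooShort := int64(60)
	bad := &ExternalMetricFallback{FailureDurationSeconds: &tooShort, Replicas: 0}
	for _, err := range validateFallback(bad) {
		fmt.Println(err)
	}
}
```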

Update ExternalMetricStatus to include per-metric fallback information:

// ExternalMetricStatus indicates the current value of a global metric not associated
// with any Kubernetes object.
type ExternalMetricStatus struct {
  // metric identifies the target metric by name and selector
  Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"`

  // current contains the current value for the given metric
  Current MetricValueStatus `json:"current" protobuf:"bytes,2,name=current"`

  // fallbackStatus indicates whether this metric is operating normally or in fallback mode.
  // Possible enum values:
  // - "Normal" indicates the metric is being retrieved successfully
  // - "Fallback" indicates the metric is using a fallback value due to retrieval failures
  // +optional
  FallbackStatus string `json:"fallbackStatus,omitempty"`

  // firstFailureTime is the timestamp of the first consecutive failure retrieving this metric.
  // Reset to nil on successful retrieval. Used to calculate if failureDurationSeconds has been exceeded.
  // +optional
  FirstFailureTime *metav1.Time `json:"firstFailureTime,omitempty"`
}

Add a new HorizontalPodAutoscalerConditionType:

const (
  // ExternalMetricFallbackActive indicates that one or more external metrics
  // are currently using fallback values due to retrieval failures.
  // Status will be:
  // - "True" if any external metric is in fallback state
  // - "False" if no external metrics are in fallback state
  // - "Unknown" if the controller cannot determine the state
  ExternalMetricFallbackActive HorizontalPodAutoscalerConditionType = "ExternalMetricFallbackActive"
)
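
One way the condition status could be derived from the per-metric fallbackStatus values is sketched below. The helper fallbackConditionStatus is hypothetical, and the "Unknown" case is left out because the proposal does not specify when the controller would be unable to determine the state.

```go
package main

import "fmt"

// The two fallbackStatus values defined in ExternalMetricStatus.
const (
	statusNormal   = "Normal"
	statusFallback = "Fallback"
)

// fallbackConditionStatus is a hypothetical helper: it returns "True" if any
// external metric reports "Fallback", and "False" otherwise.
func fallbackConditionStatus(fallbackStatuses []string) string {
	for _, s := range fallbackStatuses {
		if s == statusFallback {
			return "True"
		}
	}
	return "False"
}

func main() {
	fmt.Println(fallbackConditionStatus([]string{statusNormal, statusFallback})) // True
	fmt.Println(fallbackConditionStatus([]string{statusNormal}))                 // False
}
```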

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None required.

Unit tests

Tests for Fallback Configuration:
- Verify failureDurationSeconds validation (must be >= 180)
- Verify replicas validation (must be > 0)
Tests for Failure Tracking and Activation:
- Verify firstFailureTime is set on first failure and persists through consecutive failures
- Verify firstFailureTime is cleared on successful metric retrieval
- Verify fallback activates when current time exceeds firstFailureTime + failureDurationSeconds
- Verify fallbackStatus field updates correctly
Tests for Replica Calculation:
- Verify fallback returns the configured replica count when threshold is exceeded
- Verify fallback replica count is combined with other metrics using max() (standard multi-metric behavior)
- Verify replica calculations respect min/max constraints with fallback replica counts
- Verify correct behavior with multiple external metrics (independent failure tracking and max selection)
/pkg/controller/podautoscaler: 05 Nov 2025 - 89.1%
/pkg/controller/podautoscaler/metrics: 05 Nov 2025 - 89.9%

Integration tests

N/A, the feature is tested using unit tests and e2e tests.

e2e tests

We will add the following e2e autoscaling tests:

External metric failure triggers fallback after threshold is reached, using configured replica count
HPA status condition ExternalMetricFallbackActive is set to True when fallback activates
Success in retrieving external metric clears firstFailureTime and resumes normal scaling
HPA uses max() of healthy metric calculations and fallback replica counts
Fallback respects HPA min/max replica constraints
Status correctly reflects which metrics are in fallback state and shows firstFailureTime
With multiple external metrics in fallback, HPA uses the maximum fallback replica count

Graduation Criteria

Alpha

Feature implemented behind HPAExternalMetricFallback feature gate
Unit and e2e tests passed as designed in Test Plan.

Beta

Unit and e2e tests passed as designed in Test Plan.
Gather feedback from developers and surveys
All functionality completed
All security enforcement completed
All monitoring requirements completed
All testing requirements completed
All known pre-release issues and gaps resolved

GA

No negative feedback.
All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

Upgrade

When the feature gate is enabled:

Existing HPAs continue to work unchanged
External metrics without fallback configuration behave as they do today (no scaling when unavailable)
Users can add fallback configuration to external metrics in their HPAs
The controller begins tracking per-metric firstFailureTime for external metrics with fallback configured
- On the first failure, firstFailureTime is set to the current timestamp
- On subsequent failures, the timestamp is preserved to track failure duration
- On success, firstFailureTime is cleared (set to nil)
The fallbackStatus and firstFailureTime status fields are populated for external metrics with fallback configured
Fallback activates when (current time - firstFailureTime) >= failureDurationSeconds
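
The tracking rules above can be sketched as a small state update that runs once per reconciliation. This is an illustration under assumptions: metricState, observe, and firstActivation are hypothetical names, time.Time stands in for metav1.Time, and the state lives in a local struct rather than in the HPA status.

```go
package main

import (
	"fmt"
	"time"
)

// metricState holds the per-metric tracking state (hypothetical type).
type metricState struct {
	FirstFailureTime *time.Time
}

// observe applies the rules for one retrieval attempt and reports whether
// fallback is active: set the timestamp on the first failure, preserve it on
// later failures, clear it on success, and activate once
// (now - firstFailureTime) >= failureDuration.
func observe(s *metricState, succeeded bool, now time.Time, failureDuration time.Duration) bool {
	if succeeded {
		s.FirstFailureTime = nil
		return false
	}
	if s.FirstFailureTime == nil {
		t := now
		s.FirstFailureTime = &t
	}
	return now.Sub(*s.FirstFailureTime) >= failureDuration
}

// firstActivation simulates a reconcile loop and returns the index of the
// first reconciliation at which fallback activates, or -1 if it never does.
func firstActivation(outcomes []bool, syncPeriod, failureDuration time.Duration) int {
	var s metricState
	start := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	for i, ok := range outcomes {
		if observe(&s, ok, start.Add(time.Duration(i)*syncPeriod), failureDuration) {
			return i
		}
	}
	return -1
}

func main() {
	allFailing := make([]bool, 20) // false means the retrieval failed
	// With a 15s sync period and a 180s threshold, fallback activates on the
	// reconciliation that happens 180s after the first failure.
	fmt.Println(firstActivation(allFailing, 15*time.Second, 180*time.Second)) // 12
}
```

A single successful retrieval in the middle of an outage clears the timestamp, so the full duration must elapse again before fallback activates.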

Downgrade

When the feature gate is disabled:

The fallback field in ExternalMetricSource is ignored by the controller
The fallbackStatus and firstFailureTime status fields are not updated (remain at last values but are not used)
All external metrics revert to current behavior: HPA cannot scale based on them when they’re unavailable
Any HPAs currently using fallback values will:
- Maintain their current replica count
- Stop using fallback values
- Resume normal metric-based scaling when external metrics become available again
No disruption to running workloads (pods are not restarted)
The firstFailureTime timestamp remains in the status but is not evaluated or updated

All logic related to fallback evaluation, failure tracking, and status updates is gated by the HPAExternalMetricFallback feature gate.

Version Skew Strategy

kube-apiserver: More recent instances will accept and validate the new fallback field in ExternalMetricSource, while older instances will ignore it during validation and persist it as part of the HPA object.
kube-controller-manager: An older version could receive an HPA containing the new fallback field from a more recent API server, in which case it would ignore the field (i.e., continue with current behavior where external metrics that fail to retrieve prevent scaling).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: HPAExternalMetricFallback
- Components depending on the feature gate: kube-controller-manager and kube-apiserver

Does enabling the feature change any default behavior?

No. By default, HPAs will continue to behave as they do today. The feature only activates when users explicitly configure the fallback field on external metrics in their HPA specifications. External metrics without fallback configuration will continue to prevent scaling when unavailable, which is the current behavior. When fallback is configured and activated, the failing metric contributes its configured replica count to the HPA’s decision, which is then combined with other metrics using the standard max() approach.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. If the feature gate is disabled:

All fallback configurations in HPA specs are ignored by the controller
External metrics revert to current behavior: HPA cannot scale based on them when they’re unavailable
The fallbackStatus and firstFailureTime status fields stop being updated
- These fields remain in the HPA status at their last values but are not evaluated or modified
HPAs maintain their current replica count at the time of rollback
No pods are restarted or disrupted

To disable, restart kube-controller-manager and kube-apiserver with the feature gate set to false.

What happens if we reenable the feature if it was previously rolled back?

When the feature is re-enabled:

Any HPAs with fallback configured on external metrics will resume fallback behavior
The controller clears any stale firstFailureTime timestamps and starts fresh
If external metrics are failing at re-enablement:
- On the first failure, firstFailureTime is set to the current timestamp
- The failure duration is calculated as (current time - firstFailureTime)
- Once the configured failureDurationSeconds has elapsed, fallback values are used
- The fallbackStatus field is set to “Fallback” for affected metrics
HPAs resume using the configured fallback replicas value for scaling decisions when external metrics are unavailable and thresholds are exceeded

Existing HPAs without fallback configuration are not affected by re-enabling the feature and continue with default behavior.

Are there any tests for feature enablement/disablement?

Yes. Unit tests will verify that HPAs with and without the fallback field are properly validated both when the feature gate is enabled or disabled, and that the HPA controller correctly applies fallback behavior based on the feature gate status.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout failures are unlikely to impact running workloads. If enabled during external metrics failures, HPAs with fallback configured might change scaling decisions after failureDurationSeconds (default: 180s / 3m) has elapsed. This is mitigated by:

The HPA’s min/max replica constraints
The failureDurationSeconds buffer before activation
Gradual HPA scaling behavior
Scale-up/scale-down stabilization windows

On rollback, HPAs maintain their current replica count and stop using fallback values. No pods are restarted.

What specific metrics should inform a rollback?

Unexpected scaling events after enabling the feature
Increased error rate in horizontal_pod_autoscaler_controller_metric_computation_total
High percentage of HPAs showing fallbackStatus: “Fallback” unexpectedly
Increased latency in horizontal_pod_autoscaler_controller_reconciliation_duration_seconds

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No. This feature only adds a new optional field to the HPA API and doesn’t deprecate or remove any existing functionality. All current HPA behaviors remain unchanged unless users explicitly opt into the fallback mode.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The presence of the fallback field in ExternalMetricSource specifications indicates that the feature is in use.

How can someone using this feature know that it is working for their instance?

Users can confirm that the feature is active and functioning by inspecting the status fields exposed by the controller. Specifically:

Check whether the ExternalMetricFallbackActive condition on the HPA is True
Check .status.currentMetrics[].external.fallbackStatus to verify if fallback is currently active (value will be “Fallback”)
Check .status.currentMetrics[].external.firstFailureTime to see when failures started

Moreover, users can verify the feature is working properly through events on the HPA object:

When fallback activates: Normal ExternalMetricFallbackActivated “Fallback activated for external metric ‘queue_depth’ after 3m0s of consecutive failures, using fallback replica count: 10”
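
The event message above could be assembled as in the following sketch. The helper fallbackEventMessage is hypothetical, and the wording is copied from the example rather than from controller code; the "3m0s" form is what Go's time.Duration prints for 180 seconds.

```go
package main

import (
	"fmt"
	"time"
)

// fallbackEventMessage is a hypothetical helper that formats the activation
// event message from the fallback configuration.
func fallbackEventMessage(metricName string, failureDurationSeconds int64, replicas int32) string {
	elapsed := time.Duration(failureDurationSeconds) * time.Second
	return fmt.Sprintf(
		"Fallback activated for external metric '%s' after %v of consecutive failures, using fallback replica count: %d",
		metricName, elapsed, replicas)
}

func main() {
	fmt.Println(fallbackEventMessage("queue_depth", 180, 10))
}
```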

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature utilizes the existing HPA controller metrics:

horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
horizontal_pod_autoscaler_controller_metric_computation_total

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

This feature doesn’t fundamentally change how the HPA controller operates; it adds fallback handling when external metrics fail to be retrieved. Therefore, existing metrics for monitoring HPA controller health remain applicable:

horizontal_pod_autoscaler_controller_reconciliation_duration_seconds - monitors overall HPA reconciliation performance
horizontal_pod_autoscaler_controller_metric_computation_duration_seconds - tracks metric computation time including fallback evaluation
horizontal_pod_autoscaler_controller_metric_computation_total - counts metric computations with error status

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No. The feature only adds logic to the existing HPA reconciliation loop and doesn’t introduce new API calls. It tracks firstFailureTime and applies fallback logic during existing reconciliation cycles.

Will enabling / using this feature result in introducing new API types?

No. The feature only adds new fields to existing API types:

New ExternalMetricFallback struct within ExternalMetricSource
New status fields in ExternalMetricStatus

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, HorizontalPodAutoscaler objects will increase in size when fallback is configured:

Spec increase: ~50 bytes per external metric with fallback configured:
- failureDurationSeconds: ~30 bytes (field name + int64 value)
- replicas: ~20 bytes (int32)
Status increase: ~110 bytes per external metric:
- fallbackStatus: ~40 bytes (string field with “Normal” or “Fallback” value)
- firstFailureTime: ~70 bytes (timestamp field + RFC3339 string like “2024-01-15T10:23:45Z”)

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No. The feature adds minimal computational overhead to the existing HPA reconciliation loop. The fallback logic is integrated into the existing metric retrieval and evaluation process:

Attempt to retrieve external metric (already happens)
On failure: check/update firstFailureTime (new, minimal overhead)
Evaluate if fallback should activate (new, simple comparison)
Return either real metric or fallback replica count (already happens for other metric types)

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No. Memory increase in kube-controller-manager is ~100 bytes per external metric with fallback configured, used for failure tracking. For 1000 HPAs with 2 external metrics each: ~200 KB total, which is negligible.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to:

Retrieve HPA objects
Get external metrics (or any metrics)
Update HPA status (including fallbackStatus and firstFailureTime fields)
Apply scaling decisions

Therefore, no autoscaling decisions can be made during this period, regardless of whether fallback is configured. The feature itself doesn’t introduce any new failure modes with respect to API server or etcd availability - it’s dependent on these components being available just like the rest of the HPA controller’s functionality.

Once API server and etcd access is restored, the HPA controller will resume normal operation. Any in-memory failure tracking state will be reset. If external metrics are still failing and firstFailureTime is preserved in the HPA status, the controller will use that timestamp to calculate whether the fallback should remain active.

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Check horizontal_pod_autoscaler_controller_reconciliation_duration_seconds to identify if issues correlate with HPAs using fallback. If problems are observed:

Check if the issue only affects HPAs with fallback configured
Review HPA events: kubectl describe hpa to see fallback activation events
Check external metrics provider health and connectivity

For problematic HPAs, you can:

Temporarily remove the fallback field to revert to default behavior (HPA holds current scale on metric failure)
Adjust failureDurationSeconds to prevent premature fallback activation
Review and adjust fallback values if scaling behavior is inappropriate