KEP-5679: Fallback for HPA on failure to retrieve metrics
KEP-5053: Fallback for HPA External Metrics on Retrieval Failure
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The Horizontal Pod Autoscaler’s reliance on external metrics creates a dependency on systems outside the Kubernetes cluster’s control. These external systems (cloud provider APIs, third-party monitoring systems, message brokers, etc.) may experience:
- Network connectivity issues
- Rate limiting
- Service outages
- Authentication/authorization failures
- Degraded performance
When external metrics become unavailable, the HPA cannot make informed scaling decisions, which can lead to:
- Workloads stuck at insufficient scale during traffic spikes
- Inability to respond to critical business metrics (e.g., queue depth, error rates)
- Over-dependence on external system reliability
Unlike in-cluster resource metrics (CPU, memory) served by metrics-server, which are part of the cluster’s core infrastructure, external metrics are inherently less reliable and outside the cluster operator’s direct control.
Motivation
The Horizontal Pod Autoscaler (HPA) supports scaling workloads based on external metrics—metrics that originate from systems outside the Kubernetes cluster’s control. These external systems include:
- Cloud provider APIs (e.g., AWS CloudWatch, Azure Monitor, GCP Monitoring)
- Third-party monitoring systems (e.g., Datadog, New Relic, Prometheus running externally)
- Message brokers and queues (e.g., AWS SQS, RabbitMQ, Kafka)
- Application-specific metrics services
Unlike in-cluster resource metrics (CPU, memory served by metrics-server) or custom/object metrics (served by in-cluster custom metrics APIs), external metrics are inherently less reliable because they depend on systems outside the cluster operator’s direct control. These external systems may experience:
- Network connectivity issues between the cluster and the external service
- Rate limiting or throttling
- Service outages or degraded performance
- Authentication/authorization failures
- Regional or availability zone failures
When external metrics become unavailable, the HPA cannot make informed scaling decisions. Currently, the HPA simply maintains the current replica count and waits for metrics to become available again. This behavior can lead to:
- Workloads stuck at insufficient scale during traffic spikes when metrics are unavailable
- Inability to respond to critical business events (e.g., growing queue depth, increasing error rates)
- Production incidents caused by external metrics provider outages
- Over-dependence on the reliability of external systems for critical autoscaling functionality
Other autoscalers in the ecosystem, such as KEDA, already provide fallback mechanisms for external metrics to mitigate these availability issues. By allowing users to configure fallback behavior for external metrics in HPA, this proposal aims to:
- Reduce the criticality of external metrics providers on cluster workload scaling
- Improve the overall robustness of autoscaling for workloads that depend on external signals
- Enable users to define safe, conservative scaling actions when external metrics are temporarily unavailable
- Maintain workload availability and performance during external metrics provider disruptions
Why Duration-Based Instead of Count-Based:
Different Kubernetes providers and configurations may poll external metrics at different frequencies. The HPA reconciliation loop typically runs every 15 seconds by default (configurable via --horizontal-pod-autoscaler-sync-period), but this can vary between clusters. A count-based threshold (e.g., “3 failures”) would result in inconsistent behavior:
- In a cluster polling every 15s: 3 failures = 45 seconds
- In a cluster polling every 30s: 3 failures = 90 seconds
- If polling frequency changes, behavior changes unexpectedly
A duration-based threshold provides consistent, predictable behavior regardless of:
- HPA controller reconciliation frequency
- Kubernetes provider configurations
- Cluster-specific settings
The duration is measured from the first consecutive failure, ensuring consistent and understandable semantics: “activate fallback if the metric has been failing for at least X minutes.”
This enhancement allows users to specify a desired replica count that the HPA should use once an external metric has been failing continuously for a configurable duration. The fallback replica count is treated as the desired replica count from that metric and combined with other metrics using the HPA’s standard multi-metric algorithm (taking the maximum). All configured constraints (min/max replicas, behavior policies, etc.) still apply, so scaling decisions remain predictable and safe even when external metrics are unavailable.
The community has previously expressed interest in addressing this limitation in #109214.
Goals
- Allow users to optionally define a fallback, static pod replica count value when retrieval of external metrics fails
- Provide per-metric failure tracking and fallback behavior
- Maintain the HPA’s scaling algorithm and respect min/max replica constraints
- Ensure users can determine which specific metrics are using fallback values
Non-Goals
- Fallback for resource metrics (CPU, memory from metrics-server) - these are in-cluster and should be addressed at the infrastructure level if unavailable
- Fallback for pods/object metrics - these use in-cluster APIs
- Fallback for custom metrics - may be considered in future based on alpha feedback
- Last-known-good metric value caching
- Automatic fallback value calculation
- Changing the HPA scaling algorithm
Proposal
Add optional fallback configuration to the existing ExternalMetricSource
type by introducing a new fallback field, allowing users to specify:
- A failure duration (how long the metric must be continuously failing before activating fallback)
- A desired replica count to use when the failure duration threshold is exceeded
This approach:
- Works with the HPA algorithm: Fallback provides a desired replica count for that metric, which is combined with other metrics using the standard HPA multi-metric approach (taking the maximum)
- Is per-metric: Each external metric can have its own fallback configuration
- Provides visibility: Status shows which metrics are in fallback state
- Is conservative: Only applies to external metrics, which are inherently out-of-cluster
- Is consistent: Duration-based thresholds behave the same across different Kubernetes configurations and reconciliation frequencies
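As a rough illustration of the "taking the maximum" point above, the following sketch treats a fallback replica count like any other per-metric proposal and then applies the min/max bounds. The `proposal` type and `combine` function are hypothetical names; the sketch deliberately leaves out behavior policies, stabilization windows, and the controller's handling of metrics that fail without a fallback.

```go
package main

import "fmt"

// proposal is a hypothetical per-metric result for one reconciliation.
type proposal struct {
	metric   string
	replicas int32 // computed from the metric value, or the configured fallback count
}

// combine returns the largest proposed replica count, bounded by
// minReplicas and maxReplicas. Simplified sketch, not the controller's code.
func combine(proposals []proposal, minReplicas, maxReplicas int32) int32 {
	desired := minReplicas // starting here enforces the lower bound
	for _, p := range proposals {
		if p.replicas > desired {
			desired = p.replicas
		}
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	proposals := []proposal{
		{metric: "cpu", replicas: 4},
		{metric: "queue_depth (fallback)", replicas: 10},
	}
	fmt.Println(combine(proposals, 2, 20)) // fallback wins: 10
	fmt.Println(combine(proposals, 2, 8))  // capped by maxReplicas: 8
}
```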
User Stories
Story 1: SaaS Application Scaling on Queue Depth
As an operator, I run a SaaS application that scales based on a cloud provider’s message queue depth (external metric). Occasionally, the cloud provider’s metrics API experiences brief outages (5-10 minutes). During these outages, I would like my HPA to fall back to a manually configured replica count, ensuring sufficient capacity to handle the presumed backlog safely.
When the external API fails, the HPA treats this metric as requesting 10 replicas, ensuring sufficient capacity to handle the presumed backlog safely.
Story 2: E-commerce Site with Multiple External Metrics
As an operator, I want to configure different fallback replica counts for each external metric so my e-commerce site can continue autoscaling when one monitoring provider fails.
Risks and Mitigations
Risk: Users configure inappropriate fallback replica counts
- Mitigation: Documentation with best practices; validation ensures replicas > 0; HPA min/max constraints still apply; users should consider peak load scenarios when setting fallback values
Risk: Users configure failureDurationSeconds too short, causing premature fallback activation
- Mitigation: Minimum value of 180 seconds (3 minutes) provides reasonable buffer; validation enforces minimum values; documentation recommends considering normal metric provider latency and transient failures
Risk: Users configure failureDurationSeconds too long, delaying necessary scaling during outages
- Mitigation: Documentation provides guidance on balancing between avoiding false positives and responding quickly to genuine outages; recommend 180-300 seconds (3-5 minutes) for most use cases
Risk: Complexity in understanding which metric is in fallback and why
- Mitigation: Per-metric status clearly shows fallback state, the firstFailureTime timestamp, and the current fallbackStatus value; events are generated when fallback activates with clear messaging including duration and timestamp
Design Details
Introduce a new ExternalMetricFallback type and add a new fallback field to the existing ExternalMetricSource struct. Additionally, add new fallbackStatus and firstFailureTime fields to the existing ExternalMetricStatus struct.
// ExternalMetricFallback defines fallback behavior when an external metric cannot be retrieved
type ExternalMetricFallback struct {
// failureDurationSeconds is the duration in seconds for which the external metric must be
// continuously failing before the fallback value is used. The duration is measured from the
// first consecutive failure. Must be at least 180.
// +optional
// default=180
// min=180
FailureDurationSeconds *int64 `json:"failureDurationSeconds,omitempty"`
// replicas is the desired replica count to use when the external metric cannot be retrieved.
// This value is treated as the desired replica count from this metric.
// When multiple metrics are configured, the HPA controller uses the maximum of all
// desired replica counts (standard HPA multi-metric behavior).
// Must be greater than 0.
// +required
Replicas int32 `json:"replicas"`
}
// ExternalMetricSource indicates how to scale on a metric not associated with
// any Kubernetes object (for example length of queue in cloud
// messaging service, or QPS from loadbalancer running outside of cluster).
type ExternalMetricSource struct {
// metric identifies the target metric by name and selector
Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"`
// target specifies the target value for the given metric
Target MetricTarget `json:"target" protobuf:"bytes,2,name=target"`
// fallback defines the behavior when this external metric cannot be retrieved.
// If not set, the HPA will not scale based on this metric when it's unavailable.
// +optional
Fallback *ExternalMetricFallback `json:"fallback,omitempty"`
}
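The constraints described for these fields (a default and minimum of 180 seconds, and a replica count greater than 0) could be enforced along the following lines. This is a minimal sketch: `applyDefaults`, `validate`, and the two constants are hypothetical names, and the real change would presumably live in the existing autoscaling API defaulting and validation code rather than in standalone functions.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	defaultFailureDurationSeconds int64 = 180 // from the "default=180" marker
	minFailureDurationSeconds     int64 = 180 // from the "min=180" marker
)

// ExternalMetricFallback is a local copy of the proposed type.
type ExternalMetricFallback struct {
	FailureDurationSeconds *int64
	Replicas               int32
}

// applyDefaults fills in failureDurationSeconds when the user omitted it.
func applyDefaults(f *ExternalMetricFallback) {
	if f.FailureDurationSeconds == nil {
		d := defaultFailureDurationSeconds
		f.FailureDurationSeconds = &d
	}
}

// validate checks the documented constraints on the fallback configuration.
func validate(f ExternalMetricFallback) error {
	if f.Replicas <= 0 {
		return errors.New("replicas must be greater than 0")
	}
	if f.FailureDurationSeconds != nil && *f.FailureDurationSeconds < minFailureDurationSeconds {
		return fmt.Errorf("failureDurationSeconds must be at least %d", minFailureDurationSeconds)
	}
	return nil
}

func main() {
	tooShort := int64(60)
	fmt.Println(validate(ExternalMetricFallback{FailureDurationSeconds: &tooShort, Replicas: 10}))
	fmt.Println(validate(ExternalMetricFallback{Replicas: 0}))

	f := ExternalMetricFallback{Replicas: 10}
	applyDefaults(&f)
	fmt.Println(*f.FailureDurationSeconds, validate(f))
}
```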
Update ExternalMetricStatus to include per-metric fallback information:
// ExternalMetricStatus indicates the current value of a global metric not associated
// with any Kubernetes object.
type ExternalMetricStatus struct {
// metric identifies the target metric by name and selector
Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"`
// current contains the current value for the given metric
Current MetricValueStatus `json:"current" protobuf:"bytes,2,name=current"`
// fallbackStatus indicates whether this metric is operating normally or in fallback mode.
// Possible enum values:
// - "Normal" indicates the metric is being retrieved successfully
// - "Fallback" indicates the metric is using a fallback value due to retrieval failures
// +optional
FallbackStatus string `json:"fallbackStatus,omitempty"`
// firstFailureTime is the timestamp of the first consecutive failure retrieving this metric.
// Reset to nil on successful retrieval. Used to calculate if failureDurationSeconds has been exceeded.
// +optional
FirstFailureTime *metav1.Time `json:"firstFailureTime,omitempty"`
}
Add a new HorizontalPodAutoscalerConditionType:
const (
// ExternalMetricFallbackActive indicates that one or more external metrics
// are currently using fallback values due to retrieval failures.
// Status will be:
// - "True" if any external metric is in fallback state
// - "False" if no external metrics are in fallback state
// - "Unknown" if the controller cannot determine the state
ExternalMetricFallbackActive HorizontalPodAutoscalerConditionType = "ExternalMetricFallbackActive"
)
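One way to derive the condition status from the per-metric fallbackStatus values is sketched below. The function name `fallbackConditionStatus` is hypothetical, and the "Unknown" case is left out because the proposal does not spell out when the controller would be unable to determine the state.

```go
package main

import "fmt"

// fallbackConditionStatus returns "True" if any external metric reports
// fallbackStatus "Fallback", and "False" otherwise.
// Hypothetical helper; the "Unknown" status is intentionally not modeled.
func fallbackConditionStatus(fallbackStatuses []string) string {
	for _, s := range fallbackStatuses {
		if s == "Fallback" {
			return "True"
		}
	}
	return "False"
}

func main() {
	fmt.Println(fallbackConditionStatus([]string{"Normal", "Fallback"}))
	fmt.Println(fallbackConditionStatus([]string{"Normal", "Normal"}))
}
```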
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None required.
Unit tests
Tests for Fallback Configuration:
- Verify failureDurationSeconds validation (must be >= 180)
- Verify replicas validation (must be > 0)
Tests for Failure Tracking and Activation:
- Verify firstFailureTime is set on first failure and persists through consecutive failures
- Verify firstFailureTime is cleared on successful metric retrieval
- Verify fallback activates when current time exceeds firstFailureTime + failureDurationSeconds
- Verify fallbackStatus field updates correctly
Tests for Replica Calculation:
- Verify fallback returns the configured replica count when threshold is exceeded
- Verify fallback replica count is combined with other metrics using max() (standard multi-metric behavior)
- Verify replica calculations respect min/max constraints with fallback replica counts
- Verify correct behavior with multiple external metrics (independent failure tracking and max selection)
- /pkg/controller/podautoscaler: 05 Nov 2025 - 89.1%
- /pkg/controller/podautoscaler/metrics: 05 Nov 2025 - 89.9%
Integration tests
N/A, the feature is tested using unit tests and e2e tests.
e2e tests
We will add the following e2e autoscaling tests:
- External metric failure triggers fallback after threshold is reached, using configured replica count
- HPA status condition ExternalMetricFallbackActive is set to True when fallback activates
- Success in retrieving external metric resets failure tracking (clears firstFailureTime) and resumes normal scaling
- HPA uses max() of healthy metric calculations and fallback replica counts
- Fallback respects HPA min/max replica constraints
- Status correctly reflects which metrics are in fallback state and shows firstFailureTime
- With multiple external metrics in fallback, HPA uses the maximum fallback replica count
Graduation Criteria
Alpha
- Feature implemented behind the HPAExternalMetricFallback feature gate
- Unit and e2e tests passed as designed in the Test Plan.
Beta
- Unit and e2e tests passed as designed in the Test Plan.
- Gather feedback from developers and surveys
- All functionality completed
- All security enforcement completed
- All monitoring requirements completed
- All testing requirements completed
- All known pre-release issues and gaps resolved
GA
- No negative feedback.
- All issues and gaps identified as feedback during beta are resolved
Upgrade / Downgrade Strategy
Upgrade
When the feature gate is enabled:
- Existing HPAs continue to work unchanged
- External metrics without fallback configuration behave as they do today (no scaling when unavailable)
- Users can add fallback configuration to external metrics in their HPAs
- The controller begins tracking per-metric firstFailureTime for external metrics with fallback configured
  - On the first failure, firstFailureTime is set to the current timestamp
  - On subsequent failures, the timestamp is preserved to track failure duration
  - On success, firstFailureTime is cleared (set to nil)
- The fallbackStatus and firstFailureTime status fields are populated for external metrics with fallback configured
- Fallback activates when (current time - firstFailureTime) >= failureDurationSeconds
Downgrade
When the feature gate is disabled:
- The fallback field in ExternalMetricSource is ignored by the controller
- The fallbackStatus and firstFailureTime status fields are not updated (remain at last values but are not used)
- All external metrics revert to current behavior: HPA cannot scale based on them when they’re unavailable
- Any HPAs currently using fallback values will:
  - Maintain their current replica count
  - Stop using fallback values
  - Resume normal metric-based scaling when external metrics become available again
- No disruption to running workloads (pods are not restarted)
- The firstFailureTime timestamp remains in the status but is not evaluated or updated
All logic related to fallback evaluation, failure tracking, and status updates is gated by the HPAExternalMetricFallback feature gate.
Version Skew Strategy
- kube-apiserver: More recent instances will accept and validate the new fallback field in ExternalMetricSource, while older instances will ignore it during validation and persist it as part of the HPA object.
- kube-controller-manager: An older version could receive an HPA containing the new fallback field from a more recent API server, in which case it would ignore the field (i.e., continue with current behavior where external metrics that fail to retrieve prevent scaling).
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: HPAExternalMetricFallback
  - Components depending on the feature gate: kube-controller-manager and kube-apiserver
Does enabling the feature change any default behavior?
No. By default, HPAs will continue to behave as they do today. The feature only activates when users explicitly configure the fallback field on external metrics in their HPA specifications.
External metrics without fallback configuration will continue to prevent scaling when unavailable, which is the current behavior.
When fallback is configured and activated, the failing metric contributes its configured replica count to the HPA’s decision, which is then combined with other metrics using the standard max() approach.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. If the feature gate is disabled:
- All fallback configurations in HPA specs are ignored by the controller
- External metrics revert to current behavior: HPA cannot scale based on them when they’re unavailable
- The fallbackStatus and firstFailureTime status fields stop being updated
  - These fields remain in the HPA status at their last values but are not evaluated or modified
- HPAs maintain their current replica count at the time of rollback
- No pods are restarted or disrupted
To disable, restart kube-controller-manager and kube-apiserver with the feature gate set to false.
What happens if we reenable the feature if it was previously rolled back?
When the feature is re-enabled:
- Any HPAs with fallback configured on external metrics will resume fallback behavior
- The controller clears any stale firstFailureTime timestamps and starts fresh
- If external metrics are failing at re-enablement:
  - On the first failure, firstFailureTime is set to the current timestamp
  - The failure duration is calculated as (current time - firstFailureTime)
  - Once the configured failureDurationSeconds has elapsed, fallback values are used
  - The fallbackStatus field is set to “Fallback” for affected metrics
- HPAs resume using the static replicas stanza for scaling decisions when external metrics are unavailable and thresholds are exceeded
Existing HPAs without fallback configuration are not affected by re-enabling the feature and continue with default behavior.
Are there any tests for feature enablement/disablement?
Yes. Unit tests will verify that HPAs with and without the fallback field are properly validated both when the feature gate is enabled and when it is disabled, and that the HPA controller applies fallback behavior only when the feature gate is enabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout failures are unlikely to impact running workloads. If enabled during external metrics failures, HPAs with fallback configured might change scaling decisions after failureDurationSeconds (default: 180s / 3m) has elapsed. This is mitigated by:
- The HPA’s min/max replica constraints
- The failureDurationSeconds buffer before activation
- Gradual HPA scaling behavior
- Scale-up/scale-down stabilization windows
On rollback, HPAs maintain their current replica count and stop using fallback values. No pods are restarted.
What specific metrics should inform a rollback?
- Unexpected scaling events after enabling the feature
- Increased error rate in horizontal_pod_autoscaler_controller_metric_computation_total
- High percentage of HPAs showing fallbackStatus: “Fallback” unexpectedly
- Increased latency in horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. This feature only adds a new optional field to the HPA API and doesn’t deprecate or remove any existing functionality. All current HPA behaviors remain unchanged unless users explicitly opt into the fallback mode.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The presence of the fallback field in ExternalMetricSource specifications indicates that the feature is in use.
How can someone using this feature know that it is working for their instance?
Users can confirm that the feature is active and functioning by inspecting the status fields exposed by the controller. Specifically:
- Check the HPA condition to verify if ExternalMetricFallbackActive is currently active
- Check .status.currentMetrics[].external.fallbackStatus to verify if fallback is currently active (value will be “Fallback”)
- Check .status.currentMetrics[].external.firstFailureTime to see when failures started
Moreover, users can verify the feature is working properly through events on the HPA object:
- When fallback activates: Normal ExternalMetricFallbackActivated “Fallback activated for external metric ‘queue_depth’ after 3m0s of consecutive failures, using fallback replica count: 10”
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This feature utilizes the existing HPA controller metrics:
- horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
- horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
- horizontal_pod_autoscaler_controller_metric_computation_total
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
This feature doesn’t fundamentally change how the HPA controller operates; it adds fallback handling when external metrics fail to be retrieved. Therefore, existing metrics for monitoring HPA controller health remain applicable:
- horizontal_pod_autoscaler_controller_reconciliation_duration_seconds - monitors overall HPA reconciliation performance
- horizontal_pod_autoscaler_controller_metric_computation_duration_seconds - tracks metric computation time including fallback evaluation
- horizontal_pod_autoscaler_controller_metric_computation_total - counts metric computations with error status
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
No. The feature only adds logic to the existing HPA reconciliation loop. It doesn’t introduce new API calls. The feature tracks failure state and applies fallback logic in-memory during existing reconciliation cycles.
Will enabling / using this feature result in introducing new API types?
No. The feature only adds new fields to existing API types:
- New ExternalMetricFallback struct within ExternalMetricSource
- New status fields in ExternalMetricStatus
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, HorizontalPodAutoscaler objects will increase in size when fallback is configured:
- Spec increase: ~50 bytes per external metric with fallback configured:
  - failureDurationSeconds: ~30 bytes (field name + int64 value)
  - replicas: ~20 bytes (int32)
- Status increase: ~110 bytes per external metric:
  - fallbackStatus: ~40 bytes (string field with “Normal” or “Fallback” value)
  - firstFailureTime: ~70 bytes (timestamp field + RFC3339 string like “2024-01-15T10:23:45Z”)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. The feature adds minimal computational overhead to the existing HPA reconciliation loop. The fallback logic is integrated into the existing metric retrieval and evaluation process:
- Attempt to retrieve external metric (already happens)
- On failure: check/update firstFailureTime (new, minimal overhead)
- Evaluate if fallback should activate (new, simple comparison)
- Return either real metric or fallback replica count (already happens for other metric types)
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No. Memory increase in kube-controller-manager is ~100 bytes per external metric for failure tracking. For 1000 HPAs with 2 external metrics each: ~200 KB total, which is negligible.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to:
- Retrieve HPA objects
- Get external metrics (or any metrics)
- Update HPA status (including fallbackStatus and firstFailureTime fields)
- Apply scaling decisions
Therefore, no autoscaling decisions can be made during this period, regardless of whether fallback is configured. The feature itself doesn’t introduce any new failure modes with respect to API server or etcd availability - it’s dependent on these components being available just like the rest of the HPA controller’s functionality.
Once API server and etcd access is restored, the HPA controller will resume normal operation. Any in-memory failure tracking state will be reset. If external metrics are still failing and firstFailureTime is preserved, the controller will use that timestamp to calculate whether the fallback should remain active.
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Check horizontal_pod_autoscaler_controller_reconciliation_duration_seconds to identify if issues correlate with HPAs using fallback. If problems are observed:
- Check if the issue only affects HPAs with fallback configured
- Review HPA events: kubectl describe hpa to see fallback activation events
- Check external metrics provider health and connectivity
For problematic HPAs, you can:
- Temporarily remove the fallback field to revert to default behavior (HPA holds current scale on metric failure)
- Adjust failureDurationSeconds to prevent premature fallback activation
- Review and adjust fallback values if scaling behavior is inappropriate