KEP-4951: Configurable tolerance for HPA
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The Horizontal Pod Autoscaler (HPA) regularly estimates how many replicas a given Deployment (or other resource with a /scale subresource) should instantiate.
HPAs define one (or more) metrics (e.g. CPU utilization) on which autoscaling is based. The number of replicas is derived from the ratio between the current and desired value of this metric (Algorithm details).
For example, for a workload with 100 currentReplicas and a usage ratio
(currentMetricValue/desiredMetricValue) of 1.07, the calculated desiredReplicas
would be 107 (100 * 1.07).
However, to avoid flapping, scaling actions are skipped if the usage ratio is approximately 1, within a globally-configurable tolerance, set to 10% by default. In the example above, no scaling action would take place, since the ratio is within this tolerance.
This proposal adds a parameter to HPAs allowing users to configure this tolerance per HPA resource. For the example above, we could configure the tolerance in the workload’s HPA to 5%, which would allow the scale-up to 107 replicas to proceed.
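As a rough illustration of the arithmetic above, the following Go sketch reproduces the example with 100 replicas and a usage ratio of 1.07. The helper name desiredReplicas and its signature are assumptions made for this illustration; this is not the controller's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas is a hypothetical helper, not the controller's real code.
// It returns currentReplicas unchanged when the usage ratio is within
// tolerance of 1.0, and ceil(currentReplicas * usageRatio) otherwise.
func desiredReplicas(currentReplicas int32, usageRatio, tolerance float64) int32 {
	if math.Abs(1.0-usageRatio) <= tolerance {
		return currentReplicas // within tolerance: scaling is skipped
	}
	return int32(math.Ceil(float64(currentReplicas) * usageRatio))
}

func main() {
	// Default 10% tolerance: a ratio of 1.07 is within tolerance.
	fmt.Println(desiredReplicas(100, 1.07, 0.10))
	// Per-HPA 5% tolerance: the same ratio is now out of tolerance.
	fmt.Println(desiredReplicas(100, 1.07, 0.05))
}
```

With these inputs the first call keeps 100 replicas and the second returns 107, matching the example above.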
Motivation
Today the horizontal autoscaling tolerance is a cluster-wide parameter set using the kube-controller-manager --horizontal-pod-autoscaler-tolerance flag. It is set to 10% by default. While this value is often appropriate, it is considered too coarse-grained in a number of scenarios.
This issue has been raised multiple times (#116984, #125987, #62013, #aks-3068, #keda-1100), with users commenting that:
- For large deployments, a 10% tolerance translates into very significant resources (e.g. hundreds of pods).
- This tolerance can slow down scaling operations, hindering responsiveness in case of surges.
- Scale-ups are more of a problem than scale-downs, since pods are typically slower to initialize than to shut down, and since responding to a load increase is typically more critical than freeing resources.
Since appropriate tolerance values are workload-dependent, this KEP proposes to let users add custom tolerance values to HorizontalPodAutoscaler resources, overriding the existing default value when present.
This solution integrates seamlessly with the existing HPA API since it already allows users to fine-tune the autoscaler behavior. The exact API recommended here has been previously proposed in kep-853 (see here), but it was then decided to implement it separately.
Goals
- Allow users to optionally override the default workload autoscaling tolerance on a per-HPA basis.
Non-Goals
- Allow customizing the cluster-wide tolerance given by the kube-controller-manager --horizontal-pod-autoscaler-tolerance parameter.
Proposal
We propose to add a new field to the existing HPAScalingRules object:
tolerance: (float) the minimum change (from 1.0) in the current-to-desired metrics ratio for the horizontal pod autoscaler to consider scaling. Must be greater than or equal to 0.
The tolerance field is optional, and when not specified the HPA will continue to use the
value of the global --horizontal-pod-autoscaler-tolerance as the tolerance for scaling
calculations.
Since there are separate HPAScalingRules objects defined for an HPA’s
spec.behavior.scaleUp and spec.behavior.scaleDown, it is possible to specify different
tolerance values for scaling up vs. scaling down.
Risks and Mitigations
There should be minimal risk introduced by the proposed changes:
- The new field is optional, and its absence results in no changes to the current autoscaling behavior
- When specified, the new value doesn’t change the autoscaling algorithm used, but just overrides a single value used during the calculation. This value can already be changed via the --horizontal-pod-autoscaler-tolerance option of the kube-controller-manager.
- If a change to the new field results in undesirable behavior, the change can be reverted by deploying the previous version of the HPA resource, or removing the tolerance field entirely.
Design Details
The HorizontalPodAutoscaler API is updated to add a new tolerance field to the HPAScalingRules object:
type HPAScalingRules struct {
    // tolerance is the tolerance on the ratio between the current and desired
    // metric value under which no updates are made to the desired number of
    // replicas.
    // +optional
    Tolerance *resource.Quantity
    // Existing fields.
    StabilizationWindowSeconds *int32
    SelectPolicy *ScalingPolicySelect
    Policies []HPAScalingPolicy
}
This new tolerance will be used in the autoscaling controller replica_calculator.go. The current logic is:
if math.Abs(1.0-usageRatio) <= c.tolerance { /* ... */ }
It will be replaced by:
- if math.Abs(1.0-usageRatio) <= c.tolerance { /* ... */ }
+ // Down and Up scaling tolerances default to c.tolerance if unset.
+ downTolerance, upTolerance := c.tolerance, c.tolerance
+ if scaleDown.tolerance != nil {
+     downTolerance = scaleDown.tolerance.AsApproximateFloat64()
+ }
+ if scaleUp.tolerance != nil {
+     upTolerance = scaleUp.tolerance.AsApproximateFloat64()
+ }
+
+ if (1.0-downTolerance) <= usageRatio && usageRatio <= (1.0+upTolerance) { /* ... */ }
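The replacement logic can be exercised in isolation with a small, self-contained sketch. The types and names below (scalingRules, withinTolerance) are simplified stand-ins assumed for illustration; in particular, a *float64 replaces the *resource.Quantity used by the API so that the example needs only the standard library.

```go
package main

import "fmt"

// scalingRules is a simplified, hypothetical stand-in for HPAScalingRules.
// A *float64 replaces the *resource.Quantity used by the real API.
type scalingRules struct {
	tolerance *float64
}

// withinTolerance reports whether usageRatio falls inside the band
// [1-downTolerance, 1+upTolerance], in which case no scaling happens.
// Unset per-direction tolerances fall back to the global default.
func withinTolerance(usageRatio, global float64, scaleDown, scaleUp scalingRules) bool {
	downTolerance, upTolerance := global, global
	if scaleDown.tolerance != nil {
		downTolerance = *scaleDown.tolerance
	}
	if scaleUp.tolerance != nil {
		upTolerance = *scaleUp.tolerance
	}
	return (1.0-downTolerance) <= usageRatio && usageRatio <= (1.0+upTolerance)
}

func main() {
	up := 0.05 // hypothetical 5% scale-up tolerance
	rules := scalingRules{tolerance: &up}
	// Scale-up uses 5%; scale-down falls back to the 10% global default.
	fmt.Println(withinTolerance(1.07, 0.10, scalingRules{}, rules))
	fmt.Println(withinTolerance(0.93, 0.10, scalingRules{}, rules))
}
```

In this sketch a ratio of 1.07 falls outside the band (it exceeds 1.05), while 0.93 stays inside it because the scale-down side still uses the 10% default.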
Since the added field is optional and its omission does not change the existing
autoscaling behavior, this feature will only be added to the latest stable API
version pkg/apis/autoscaling/v2. Older versions (i.e. v1, v2beta1,
v2beta2) will not include the new field, but converters will be updated where
needed to comply with round-trip requirements.
The feature presented in this KEP only allows users to tune an existing parameter, and
as such doesn’t require any new HPA Events or modify any Status. The validation logic
will be updated to ensure that the tolerance field cannot be set to a negative value.
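A minimal sketch of such a validation check follows. The function name validateTolerance and the use of *float64 (instead of *resource.Quantity) are assumptions made to keep the example self-contained; the actual validation code may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// validateTolerance is a hypothetical validation helper. A nil value means
// the field is unset, which is always valid since the field is optional.
func validateTolerance(tolerance *float64) error {
	if tolerance != nil && *tolerance < 0 {
		return errors.New("tolerance must be greater than or equal to 0")
	}
	return nil
}

func main() {
	negative, zero := -0.1, 0.0
	fmt.Println(validateTolerance(nil))       // unset field
	fmt.Println(validateTolerance(&zero))     // lower bound is allowed
	fmt.Println(validateTolerance(&negative)) // rejected
}
```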
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- /apis/autoscaling/validation: 2024-11-13 - 95.6
- /pkg/controller/podautoscaler: 2024-11-13 - 96.4
Integration tests
N/A, the feature is tested using unit tests and e2e tests.
e2e tests
Existing e2e tests ensure the autoscaling behavior uses the default tolerance when no configurable tolerance is specified.
The new e2e autoscaling tests covering this feature are:
Before the graduation to beta, we will add an integration test verifying the autoscaling behavior when smaller and larger than default tolerances are set on an HPA.
Graduation Criteria
Alpha
- Feature implemented behind a HPAConfigurableTolerance feature flag
- Initial e2e tests completed and enabled
Beta
- All tests described in the e2e tests section are implemented and linked in this KEP.
- We have monitored for negative user feedback and addressed relevant concerns.
Upgrade / Downgrade Strategy
Upgrade
Existing HPAs will continue to work as they do today, using the global horizontal-pod-autoscaler-tolerance
value from the kube-controller-manager. Users can use the new feature by enabling the Feature
Gate (alpha only) and setting the new tolerance field on an HPA.
Downgrade
On downgrade, all HPAs will revert to using the global horizontal-pod-autoscaler-tolerance
value from the kube-controller-manager, regardless of any configured tolerance value on the HPA
itself.
Version Skew Strategy
- kube-apiserver: More recent instances will accept the new tolerance field, while older ones will ignore it.
- kube-controller-manager: An older version could receive an HPA containing the new tolerance field from a more recent API server, in which case it would ignore it (i.e. scale as if it was not present).
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: HPAConfigurableTolerance
  - Components depending on the feature gate: kube-controller-manager and kube-apiserver.
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The feature can be disabled by restarting the kube-controller-manager with the feature gate set to false.
Any tolerance values set on existing HPAs will be ignored by the
kube-controller-manager and kube-apiserver when the feature gate is off.
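One way to picture this rollback behavior is a gate check in front of the per-HPA value. The sketch below is an illustration under assumptions: effectiveTolerance and its parameters are hypothetical names, and the real controller may organize this logic differently.

```go
package main

import "fmt"

// effectiveTolerance is a hypothetical helper. It returns the per-HPA
// tolerance only when the feature gate is on and the field is set;
// otherwise the cluster-wide value applies.
func effectiveTolerance(gateEnabled bool, configured *float64, global float64) float64 {
	if gateEnabled && configured != nil {
		return *configured
	}
	return global
}

func main() {
	perHPA := 0.05 // hypothetical value stored on an existing HPA
	fmt.Println(effectiveTolerance(true, &perHPA, 0.10))  // gate on: per-HPA value
	fmt.Println(effectiveTolerance(false, &perHPA, 0.10)) // gate off: global value
}
```

Because the stored value is only ignored, not modified, by this check, turning the gate back on makes the same HPA use its configured tolerance again.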
What happens if we reenable the feature if it was previously rolled back?
When the feature is re-enabled, any HPAs with configured tolerance values will use those when calculating replica counts, rather than the global tolerance from the kube-controller-manager.
Are there any tests for feature enablement/disablement?
Unit tests have been added to verify that HPAs with and without the new fields are properly validated, both when the feature gate is enabled and when it is disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This feature does not introduce new failure modes: during rollout/rollback, some API servers will allow or disallow setting the new ’tolerance’ field. The new field is possibly ignored until the controller manager is fully updated.
What specific metrics should inform a rollback?
A high horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
metric can indicate a problem related to this feature.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
The upgrade→downgrade→upgrade testing was done manually using a 1.33 cluster with the following steps:
1. Start the cluster with the HPAConfigurableTolerance feature gate enabled:
   kind create cluster --name configurable-tolerance --image kindest/node:v1.33.0 --config config.yaml
   with the following config.yaml file content:
   kind: Cluster
   apiVersion: kind.x-k8s.io/v1alpha4
   featureGates:
     "HPAConfigurableTolerance": true
   nodes:
   - role: control-plane
   - role: worker
2. Install metrics-server:
   kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.7.2/components.yaml
   kubectl patch -n kube-system deployment metrics-server --type=json -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
3. Create a deployment starting Pods that consume 50% CPU utilization, and an associated HPA with a very large tolerance:
   kubectl apply -f configurable-tolerance-test.yaml
   with the following configurable-tolerance-test.yaml file content:
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: cpu-stress-deployment
     labels:
       app: cpu-stressor
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: cpu-stressor
     template:
       metadata:
         labels:
           app: cpu-stressor
       spec:
         containers:
         - name: cpu-stressor
           image: alpine:latest
           command: ["/bin/sh"]
           args:
           # Load: 1% (10 milliCPU)
           - "-c"
           - "apk add --no-cache stress-ng && stress-ng --cpu 1 --cpu-load 1 --cpu-method=crc16 --timeout 3600s"
           resources:
             requests:
               cpu: "20m"
   ---
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: cpu-stress-hpa
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: cpu-stress-deployment
     minReplicas: 1
     maxReplicas: 5
     metrics:
     - type: Resource
       resource:
         name: cpu
         target:
           type: Utilization
           averageUtilization: 10
     behavior:
       scaleUp:
         tolerance: 20. # 2000%
4. Check that, after 5 minutes, kubectl describe hpa cpu-stress-hpa displays ScalingLimited: False (i.e. the HPA doesn’t recommend scaling up because of the large tolerance).
5. Simulate a downgrade by disabling the feature for the API server and control plane (update the config.yaml file to set it to false). Follow the procedure described in step 1, and observe that this time kubectl describe hpa cpu-stress-hpa displays ScalingLimited: True.
6. Simulate an upgrade by re-enabling the feature for the API server and control plane. Follow the procedure described in step 1, and observe that the HPA description mentions ScalingLimited: False, demonstrating that the feature is working again.
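The numbers used in this manual test can be checked with a short calculation. The sketch below is only a worked example of the ratio arithmetic: usageRatio and scaleUpRecommended are hypothetical helper names, and the inputs are taken from the test manifest (10 milliCPU load, 20m request, 10% target) together with the 10% default tolerance.

```go
package main

import "fmt"

// usageRatio returns current utilization divided by target utilization,
// where utilization is CPU usage as a percentage of the CPU request.
func usageRatio(usageMilliCPU, requestMilliCPU, targetPercent float64) float64 {
	utilization := usageMilliCPU / requestMilliCPU * 100
	return utilization / targetPercent
}

// scaleUpRecommended reports whether the ratio exceeds the upper bound of
// the tolerance band, i.e. 1.0 + upTolerance.
func scaleUpRecommended(ratio, upTolerance float64) bool {
	return ratio > 1.0+upTolerance
}

func main() {
	// 10m of load on a 20m request is 50% utilization; the target is 10%.
	ratio := usageRatio(10, 20, 10)
	fmt.Println(ratio)
	// With the configured scale-up tolerance of 20 the upper bound is 21.
	fmt.Println(scaleUpRecommended(ratio, 20))
	// With the default tolerance of 0.1 the upper bound is 1.1.
	fmt.Println(scaleUpRecommended(ratio, 0.1))
}
```

The ratio of 5 stays below the upper bound when the per-HPA tolerance of 20 is honored, and exceeds it when the controller falls back to the default, which is why the observed HPA status differs between the feature-enabled and feature-disabled runs.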
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The presence of the new tolerance HPA field indicates that the feature is
used.
How can someone using this feature know that it is working for their instance?
- Events
  - Event Reason: SuccessfulRescale
The tolerance is applied on the ratio between the current and desired metric values. Users can get both values using kubectl describe and use them to verify that scaling events are triggered when their ratio is out of tolerance.
The controller-manager logs have been updated to help users understand the behavior of the autoscaler. The data added to the logs includes the tolerance used for each scaling decision.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Although the absolute value of the horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
metric depends on the HPA configuration, it should be unaffected by this feature. This metric should not vary
by more than 5%.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
This KEP is not expected to have any impact on SLIs/SLOs as it doesn’t introduce a new HPA behavior, but merely allows users to easily change the value of a parameter that’s otherwise difficult to update.
The standard HPA metric horizontal_pod_autoscaler_controller_metric_computation_duration_seconds can
be used to verify the HPA controller health.
Are there any missing metrics that would be useful to have to improve observability of this feature?
Users may want to see a signal that autoscaling isn’t happening because of the tolerance, but this is not directly related to this KEP (this problem already exists today with the cluster-wide tolerance, 10% by default), and taking this KEP as an opportunity to improve the situation is difficult (see this thread).
Dependencies
Does this feature depend on any specific services running in the cluster?
No, this feature does not depend on any specific service.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
- This feature adds two new optional fields to HorizontalPodAutoscaler v2 objects (one tolerance under scaleUp and one under scaleDown). Users should expect this object to increase in size (5 bytes) each time they set this new field.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
API server or etcd issues do not impact this feature.
What are other known failure modes?
We do not expect any new failure mode. (While setting tolerance below 10% can cause HPAs
to scale up and down as frequently as every 30s, and higher values might stop scaling altogether
if the metric remains within the tolerance band, the feature is still working as intended.
To make HPAs respond faster, decrease the tolerance value. Conversely, to make them respond
slower, increase the tolerance value.)
What steps should be taken if SLOs are not being met to determine the problem?
If possible, increase the log level for kube-controller-manager and check controller logs:
- Search for “Proposing desired replicas”, verify that the tolerance is set as expected, and check (using kubectl describe hpa) if the ratio between the current and desired metric values is in tolerance.
- Look for warnings and errors which might point to where the problem lies.
Implementation History
- 2025-01-21: KEP PR merged.
- 2025-03-24: Implementation PR merged.
- 2025-05-15: Kubernetes v1.33 released (includes this feature).
- 2025-05-16: This KEP updated for beta graduation.
Drawbacks
No major drawbacks have been identified.
Alternatives
On non-managed Kubernetes instances, users can update the cluster-wide
--horizontal-pod-autoscaler-tolerance parameter.
Infrastructure Needed (Optional)
N/A.