KEP-1610: Container Resource based Autoscaling
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The Horizontal Pod Autoscaler supports scaling of targets based on the resource usage of the pods in the target. The resource usage of a pod is calculated as the sum of the individual container usage values of the pod. This is unsuitable for workloads where the usage of the containers is not strongly correlated or does not change in lockstep. This KEP proposes that, when scaling based on resource usage, the HPA also provide an option to consider the usage of individual containers when making scaling decisions.
Motivation
An HPA is used to ensure that a scaling target is scaled up or down in such a way that the
specified current metric values are always maintained at a certain level. Resource based
autoscaling is the most basic approach to autoscaling and has been present in the HPA spec since v1.
In this mode the HPA controller fetches the current resource metrics for all the pods of a scaling
target and then computes how many pods should be added or removed based on the current usage to
achieve the target average usage.
For performance critical applications where the resource usage of individual containers needs to be configured individually the default behavior of the HPA controller may be unsuitable. When there are multiple containers in the pod their individual resource usages may not have a direct correlation or may grow at different rates as the load changes. There are several reasons for this:
- A sidecar container is only providing an auxiliary service such as log shipping. If the application does not log very frequently or does not produce logs in its hotpath then the usage of the log shipper will not grow.
- A sidecar container which provides authentication. Due to heavy caching the usage will only increase slightly when the load on the main container increases. In the current blended usage calculation approach this usually results in the HPA not scaling up the deployment because the blended usage is still low.
- A sidecar may be injected without resources set which prevents scaling based on utilization. In the current logic the HPA controller can only scale on absolute resource usage of the pod when the resource requests are not set.
The optimum usage of the containers may also be at different levels. Hence the HPA should offer a way to specify the target usage in a more fine grained manner.
Goals
- Make HPA scale based on individual container resources usage
Non-Goals
- Configurable aggregation for containers resources in pods.
- Optimization of the calls to the metrics-server
Proposal
Currently the HPA accepts multiple metric sources to calculate the number of replicas in the target,
one of which is called Resource. The Resource type represents the resource usage of the
pods in the scaling target. The resource metric source has the following structure:
type ResourceMetricSource struct {
    Name   v1.ResourceName
    Target MetricTarget
}
Here the Name is the name of the resource. Currently only cpu and memory are supported
for this field. The other field is used to specify the target at which the HPA should maintain
the resource usage by adding or removing pods. For instance if the target is 60% CPU utilization,
and the current average of the CPU resources across all the pods of the target is 70% then
the HPA will add pods to reduce the CPU utilization. If it’s less than 60% then the HPA will
remove pods to increase utilization.
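The 60% versus 70% example above can be worked through with the ratio rule given in the public HPA documentation. The following is a minimal sketch, not the controller's code: the function name is hypothetical, the 0.1 tolerance is the documented default and is treated here as an assumption, and unready pods, missing metrics and stabilization windows are ignored.

```go
package main

import (
	"fmt"
	"math"
)

// tolerance is the dead band around the target within which no scaling
// happens. 0.1 is the documented controller default; treat it as an
// assumption for this sketch.
const tolerance = 0.1

// desiredReplicas applies the ratio rule described in the HPA documentation:
// ceil(currentReplicas * currentUtilization / targetUtilization).
func desiredReplicas(currentReplicas int, currentUtilization, targetUtilization float64) int {
	ratio := currentUtilization / targetUtilization
	if math.Abs(ratio-1.0) <= tolerance {
		return currentReplicas // close enough to the target, do nothing
	}
	return int(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	fmt.Println(desiredReplicas(4, 70, 60)) // above target: pods are added
	fmt.Println(desiredReplicas(4, 40, 60)) // below target: pods are removed
	fmt.Println(desiredReplicas(4, 63, 60)) // within tolerance: unchanged
}
```

With 4 replicas, a 70% average against a 60% target yields 5 replicas, while 40% yields 3.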
It should be noted here that when a pod has multiple containers the HPA gets the resource
usage of all the containers and sums them to get the total usage. This is then divided
by the total requested resources to get the average utilization. For instance, if there is
a pod with 2 containers, application and log-shipper, each requesting 250m of
CPU, then the total requested resources of the pod as calculated by the HPA is 500m.
If the first container is currently using 200m and the second only 50m, then
the usage of the pod is 250m, which is a utilization of 50%, although individually
the utilization of the containers is 80% and 20%. In such a situation the performance
of the application container might be affected significantly. There is no way to specify
in the HPA that the utilization of the first container should be kept below a certain threshold. This also
affects memory resource based autoscaling.
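The arithmetic in this example can be checked with a short sketch. The type and function names below are hypothetical and exist only for illustration; values are CPU millicores taken from the paragraph above.

```go
package main

import "fmt"

// container holds the CPU request and current usage of one container,
// both in millicores.
type container struct {
	name         string
	requestMilli int64
	usageMilli   int64
}

// utilization returns usage as a percentage of the request.
func utilization(usageMilli, requestMilli int64) int64 {
	return usageMilli * 100 / requestMilli
}

// podUtilization blends all containers: total usage over total requests.
func podUtilization(containers []container) int64 {
	var usage, request int64
	for _, c := range containers {
		usage += c.usageMilli
		request += c.requestMilli
	}
	return utilization(usage, request)
}

func main() {
	pod := []container{
		{name: "application", requestMilli: 250, usageMilli: 200},
		{name: "log-shipper", requestMilli: 250, usageMilli: 50},
	}
	fmt.Printf("pod: %d%%\n", podUtilization(pod))
	for _, c := range pod {
		fmt.Printf("%s: %d%%\n", c.name, utilization(c.usageMilli, c.requestMilli))
	}
}
```

The blended figure is 50%, which hides the 80% utilization of the application container.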
We propose that a new metric source called ContainerResourceMetricSource be introduced
with the following structure:
type ContainerResourceMetricSource struct {
    Container string
    Name      v1.ResourceName
    Target    MetricTarget
}
The only new field is Container which is the name of the container for which the resource
usage should be tracked.
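As an illustration of what tracking a single container could look like, the sketch below selects the named container in every pod and computes its utilization across the scaling target. All types and names are hypothetical stand-ins rather than the controller's actual data structures; the calculation assumes utilization is total usage divided by total requests for that container.

```go
package main

import "fmt"

// containerMetrics is a simplified, hypothetical view of one container's
// CPU usage and request in millicores.
type containerMetrics struct {
	usageMilli   int64
	requestMilli int64
}

// podMetrics maps a container name to its metrics for one pod.
type podMetrics map[string]containerMetrics

// containerUtilization returns the utilization (percent of request) of the
// named container across all pods, ignoring every other container.
func containerUtilization(pods []podMetrics, name string) (int64, error) {
	var usage, request int64
	for i, p := range pods {
		m, ok := p[name]
		if !ok {
			return 0, fmt.Errorf("pod %d has no container %q", i, name)
		}
		usage += m.usageMilli
		request += m.requestMilli
	}
	if request == 0 {
		return 0, fmt.Errorf("container %q has no resource requests", name)
	}
	return usage * 100 / request, nil
}

func main() {
	pods := []podMetrics{
		{"application": {200, 250}, "log-shipper": {50, 250}},
		{"application": {150, 250}, "log-shipper": {30, 250}},
	}
	fmt.Println(containerUtilization(pods, "application"))
	fmt.Println(containerUtilization(pods, "log-shipper"))
	fmt.Println(containerUtilization(pods, "missing"))
}
```

A container name that is absent from a pod is reported as an error instead of being silently counted as zero usage.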
User Stories
Multiple containers with different scaling thresholds
Assume the user has a deployment with multiple pods, each of which has multiple containers: a main
container called application and 2 others called log-shipping and authnz-proxy. Two
of the containers are critical to provide the application functionality, application and
authnz-proxy. The user would like to prevent OOMKill of these containers and also keep
their CPU utilization low to ensure the highest performance. The other container
log-shipping is less critical and can tolerate failures and restarts. In this case the
user would create an HPA with the following configuration:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 30
  - type: ContainerResource
    containerResource:
      name: memory
      container: application
      target:
        type: Utilization
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: cpu
      container: authnz-proxy
      target:
        type: Utilization
        averageUtilization: 30
  - type: ContainerResource
    containerResource:
      name: memory
      container: authnz-proxy
      target:
        type: Utilization
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: cpu
      container: log-shipping
      target:
        type: Utilization
        averageUtilization: 80
The HPA specifies that the HPA controller should maintain the CPU utilization of the containers
application and authnz-proxy at 30% and the memory utilization at 80%. The log-shipping
container is scaled to keep the CPU utilization at 80% and is not scaled on memory.
Multiple containers but only scaling for one.
Assume the user has a deployment where the pod spec has multiple containers but scaling should be performed based only on the utilization of one of the containers. There could be several reasons for such a strategy: disruptions due to scaling of sidecars may be expensive and should be avoided, or the resource usage of the sidecars could be erratic because they have different workload characteristics from the main container.
In such a case the user creates an HPA as follows:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 30
The HPA controller will then completely ignore the resource usage in other containers.
Add container metrics to existing pod resource metric.
A user who is already using an HPA to scale their application can add the container metric source to the HPA in addition to the existing pod metric source. If there is a single container in the pod then the behavior will be exactly the same as before. If there are multiple containers in the application pods then the deployment might scale out more than before. This happens when the resource utilization of the specified container is higher than the blended utilization calculated by the pod metric source. In the unlikely case that the usage of all the containers in the pod changes in tandem by the same amount, the behavior will remain as before.
For example consider the HPA object which targets a Deployment with pods that have two containers application
and log-shipper:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
If the resource usage of the application container increases then the target would be scaled out even if
the usage of the log-shipper container does not increase much. If the resource usage of log-shipper container
increases then the deployment would only be scaled out if the combined resource usage of both containers increases
above the target.
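The interaction of the two metric sources follows from the HPA evaluating each metric separately and acting on the largest proposed replica count. The sketch below illustrates this under simplifying assumptions: both containers request the same amount of CPU (so pod utilization is the mean of the two), the tolerance band is ignored, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// proposal computes the replica count one metric asks for, using the ratio
// rule ceil(replicas * current / target). Tolerance is ignored for brevity.
func proposal(replicas int, current, target float64) int {
	return int(math.Ceil(float64(replicas) * current / target))
}

// combine returns the largest of the per-metric proposals.
func combine(proposals ...int) int {
	largest := 0
	for _, p := range proposals {
		if p > largest {
			largest = p
		}
	}
	return largest
}

// scale evaluates the container metric and the pod metric, both with a 50%
// target, assuming the two containers have equal CPU requests.
func scale(replicas int, applicationUtil, logShipperUtil float64) int {
	podUtil := (applicationUtil + logShipperUtil) / 2
	return combine(
		proposal(replicas, applicationUtil, 50), // ContainerResource metric
		proposal(replicas, podUtil, 50),         // pod-level Resource metric
	)
}

func main() {
	fmt.Println(scale(4, 90, 10)) // application busy, pod average at target
	fmt.Println(scale(4, 30, 90)) // log-shipper busy, pod average above target
}
```

In the first case the container metric alone drives the scale-out to 8 replicas; in the second the pod metric proposes 5 replicas because the combined utilization exceeds the target.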
Risks and Mitigations
Since the new field container in the container resource metric source is not validated against the target it is
possible that the user could specify an invalid value, i.e. a container name which is not part of the pod. The HPA
controller would treat this as invalid configuration and prevent scale down. However scale up would still be possible
based on recommendations from other metric sources.
A similar problem is possible when a container is renamed. To mitigate this, the recommended procedure is to list both the old and new container names in the HPA during the deployment. The old container name can be removed from the HPA when the migration is complete.
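The "scale up is still possible, scale down is prevented" rule can be sketched as follows. This is an illustrative reading of the behavior described above, not the controller's implementation; the types, the error value and the function name are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingContainer is a hypothetical error for a container name that
// does not exist in the target's pods.
var errMissingContainer = errors.New("container not found in pod")

// proposal is the outcome of evaluating one metric source.
type proposal struct {
	replicas int
	err      error
}

// decide picks the largest valid proposal. If any metric is invalid and the
// result would be a scale down (or no metric is valid at all), the current
// replica count is kept and the first error is returned.
func decide(current int, proposals []proposal) (int, error) {
	if len(proposals) == 0 {
		return current, errors.New("no metrics configured")
	}
	best, invalid := 0, 0
	var firstErr error
	for _, p := range proposals {
		if p.err != nil {
			invalid++
			if firstErr == nil {
				firstErr = p.err
			}
			continue
		}
		if p.replicas > best {
			best = p.replicas
		}
	}
	if invalid > 0 && (invalid == len(proposals) || best < current) {
		return current, firstErr
	}
	return best, nil
}

func main() {
	bad := proposal{err: errMissingContainer}
	fmt.Println(decide(5, []proposal{{replicas: 3}, bad})) // scale down blocked
	fmt.Println(decide(5, []proposal{{replicas: 8}, bad})) // scale up allowed
}
```

Holding the replica count when a metric is invalid errs on the side of over-provisioning, which is the safer failure mode for a misconfigured container name.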
Design Details
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Most of the tests will follow the same pattern as the tests for the pod resource metric source. The following unit tests will be added:
- Replica Calculator: Verify that the number of replicas calculated is based on the metrics of individual containers when a container metric source is specified.
- REST Metrics Client: Verify that the metrics returned from the REST metrics client are only those of the containers specified in the metric source.
- API server validation: Verify that only valid container metric sources are accepted.
- kubectl: Verify that the new metric sources are displayed correctly.
Unit tests
- k8s.io/kubernetes/pkg/controller/podautoscaler: 2023/02/02 - 87.8%
- k8s.io/kubernetes/pkg/controller/podautoscaler/metrics: 2023/02/02 - 90.2%
- k8s.io/kubernetes/pkg/apis/autoscaling/validation: 2023/02/02 - 95.2%
- k8s.io/kubectl/pkg/describe/describe.go: 2023/02/02 - 68.4%
Integration tests
N/A
The HPA behaviors are tested thoroughly in the e2e tests described below, and integration tests would not add extra value beyond those e2e tests.
e2e tests
k8s-triage
tests
- https://github.com/kubernetes/kubernetes/blob/d4750857760ae55802f69989dc2451feeb9a29e5/test/e2e/autoscaling/horizontal_pod_autoscaling.go#L61
- https://github.com/kubernetes/kubernetes/blob/d4750857760ae55802f69989dc2451feeb9a29e5/test/e2e/autoscaling/horizontal_pod_autoscaling.go#L163
- https://github.com/kubernetes/kubernetes/blob/d4750857760ae55802f69989dc2451feeb9a29e5/test/e2e/autoscaling/horizontal_pod_autoscaling.go#L120
- https://github.com/kubernetes/kubernetes/blob/d4750857760ae55802f69989dc2451feeb9a29e5/test/e2e/autoscaling/custom_metrics_stackdriver_autoscaling.go#L323
Graduation Criteria
Alpha
- Feature implemented behind a feature gate
- Initial e2e tests completed and enabled
Beta
- The feature gate is enabled by default.
- No negative feedback during alpha for a long-enough time.
- No bug issues reported during alpha.
- Implementing/exposing metrics in HPA so that users can monitor the HPA controller for this feature.
GA
- No negative feedback during beta for a long-enough time.
- No bug issues reported during beta.
Upgrade / Downgrade Strategy
Upgrade
The previous HPA behavior will not be broken. Users can continue to use their HPA specs as they are.
To use this enhancement, users need to:
- [alpha only] enable the feature gate HPAContainerMetrics
- add a ContainerResource type metric to their HPA.
Downgrade
For newly created HPAs, kube-apiserver will drop the ContainerResource metric,
and thus the HPA controller will do nothing with it.
For existing HPAs, the current implementation will continue to autoscale based on ContainerResource.
By beta, this behavior will be changed so that ContainerResource is ignored when the feature gate is disabled.
(issue)
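The downgrade handling for newly created HPAs can be pictured with the sketch below. It is a hypothetical illustration only: the types and function names are invented, and the choice to preserve ContainerResource entries when the stored object already uses them is an assumption based on the statement that existing HPAs keep working.

```go
package main

import "fmt"

// metricSpec is a simplified, hypothetical stand-in for one HPA metric entry.
type metricSpec struct {
	Type      string
	Container string
}

// usesContainerResource reports whether any entry is a ContainerResource.
func usesContainerResource(metrics []metricSpec) bool {
	for _, m := range metrics {
		if m.Type == "ContainerResource" {
			return true
		}
	}
	return false
}

// dropDisabledMetrics removes ContainerResource entries from an incoming HPA
// when the feature gate is off, unless the stored object already uses them
// (assumption: existing objects are left untouched).
func dropDisabledMetrics(gateEnabled bool, incoming, stored []metricSpec) []metricSpec {
	if gateEnabled || usesContainerResource(stored) {
		return incoming
	}
	kept := make([]metricSpec, 0, len(incoming))
	for _, m := range incoming {
		if m.Type != "ContainerResource" {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	incoming := []metricSpec{
		{Type: "Resource"},
		{Type: "ContainerResource", Container: "application"},
	}
	fmt.Println(len(dropDisabledMetrics(false, incoming, nil)))      // new HPA, gate off
	fmt.Println(len(dropDisabledMetrics(true, incoming, nil)))       // gate on
	fmt.Println(len(dropDisabledMetrics(false, incoming, incoming))) // existing HPA
}
```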
Version Skew Strategy
N/A
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: HPAContainerMetrics
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-controller-manager with the feature gate off.
As described in Upgrade / Downgrade Strategy, while the feature gate is off, all existing ContainerResource metrics will be ignored by the HPA controller.
In terms of Stable versions, users can choose to opt-out by not setting the
ContainerResource type metric in their HPA.
What happens if we reenable the feature if it was previously rolled back?
HPAs with a ContainerResource type metric can be created and will be handled by the HPA controller.
If HPAs with the ContainerResource type metric were created before the rollback,
those ContainerResource metrics are ignored while the feature gate is off, but will be handled by the HPA controller again after re-enabling.
Are there any tests for feature enablement/disablement?
No. But, the tests to confirm the behavior on switching the feature gate will be added by beta. (issue )
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
When a rollout fails, it shouldn’t impact already running HPAs because it’s an opt-in feature,
and users need to set a ContainerResource metric to use this feature.
When a rollback fails for kube-controller-manager, the HPA controller will continue to handle ContainerResource metrics in HPAs.
When a rollback fails for kube-apiserver but succeeds for kube-controller-manager,
the HPA controller will just ignore ContainerResource metrics in HPAs.
What specific metrics should inform a rollback?
- reconciliation_duration_seconds: The time (in seconds) that the HPA controller takes to reconcile once.
  - You should roll back if you see a significant increase in this duration, i.e. a degradation in the overall performance of the HPA controller.
- metric_computation_duration_seconds{metric_type=ContainerResource}: The time (in seconds) that the HPA controller takes to calculate one metric.
  - You should roll back if the container resource metric takes much longer to compute than other metrics.
- reconciliations_total{error=internal}: Number of internal errors in reconciliations of the HPA controller.
  - You should roll back if you see many errors occurring in reconciliation.
- metric_computation_total{error=internal, metric_type=ContainerResource}: Number of internal errors in the calculation of type: ContainerResource.
  - You should roll back if you see many errors occurring in the computation of container resource metrics.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Not yet. As described in Are there any tests for feature enablement/disablement?, the tests to confirm the behavior on switching the feature gate will be added. (issue )
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
- The operator can observe the execution of the computation for the container metrics through the first metric described in the What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? section.
- The operator can query HPAs with the hpa.spec.metrics.containerResource field set.
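One way to perform the second check is to scan a JSON listing of HPAs for entries whose metrics carry a containerResource field. The sketch below assumes the input has the shape produced by `kubectl get hpa --all-namespaces -o json`; the struct, function name and sample data are hypothetical and only decode the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hpaList decodes only the fields of an HPA list needed for this check.
type hpaList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Metrics []struct {
				Type              string `json:"type"`
				ContainerResource *struct {
					Name      string `json:"name"`
					Container string `json:"container"`
				} `json:"containerResource"`
			} `json:"metrics"`
		} `json:"spec"`
	} `json:"items"`
}

// hpasUsingContainerResource returns "namespace/name" for every HPA that
// has at least one metric with the containerResource field set.
func hpasUsingContainerResource(data []byte) ([]string, error) {
	var list hpaList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		for _, m := range item.Spec.Metrics {
			if m.ContainerResource != nil {
				names = append(names, item.Metadata.Namespace+"/"+item.Metadata.Name)
				break
			}
		}
	}
	return names, nil
}

// sample is made-up input for illustration.
const sample = `{"items": [
  {"metadata": {"namespace": "default", "name": "mission-critical"},
   "spec": {"metrics": [{"type": "ContainerResource",
     "containerResource": {"name": "cpu", "container": "application"}}]}},
  {"metadata": {"namespace": "default", "name": "plain"},
   "spec": {"metrics": [{"type": "Resource", "resource": {"name": "cpu"}}]}}
]}`

func main() {
	names, err := hpasUsingContainerResource([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```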
How can someone using this feature know that it is working for their instance?
- Events
  - SuccessfulRescale event with memory/cpu/etc resource utilization (percentage of request) above/below target
- API .status
  - When something is wrong with the container metrics, the ScalingActive condition will be false with the FailedGetContainerResourceMetric reason.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
N/A
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - metric_computation_duration_seconds: The time (in seconds) that the HPA controller takes to calculate one metric.
  - metric_computation_total: Number of metric computations.
  - reconciliations_total: Number of reconciliations of the HPA controller.
  - reconciliation_duration_seconds: The time (in seconds) that the HPA controller takes to reconcile once.
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes. We’re planning to implement the metrics described in What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? section.
Dependencies
Does this feature depend on any specific services running in the cluster?
Yes.
The HPA requires the metrics.k8s.io APIs to be available in the cluster to operate. This API is served by the Metrics Server;
without the Metrics Server, autoscaling on container resource metrics will not work.
If there are multiple metrics defined and one is not available, scale up will
continue but scale down will not (for safety).
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Autoscaling based on ContainerResource is unavailable
because the HPA controller cannot get the HPA object.
What are other known failure modes?
- Failed to get container resource metric.
  - Detection: ScalingActive: false condition with FailedGetContainerResourceMetric reason.
  - Mitigations: remove the failed ContainerResource metric from the HPAs.
  - Diagnostics: Related errors should be printed as the messages of ScalingActive: false.
  - Testing: https://github.com/kubernetes/kubernetes/blob/0e3818e02760afa8ed0bea74c6973f605ca4683c/pkg/controller/podautoscaler/replica_calculator_test.go#L451
What steps should be taken if SLOs are not being met to determine the problem?
Check metric_computation_duration_seconds or reconciliation_duration_seconds to see which metric encountered the latency issue.
If the latency problem is specific to type: ContainerResource,
you can opt out of this feature by removing the type: ContainerResource metric from the HPA(s).
Implementation History
- 2020-04-03 Initial KEP merged
- 2020-10-23 Implementation merged
Drawbacks
Alternatives
There’s an alternative way to scale on container-level metrics without introducing ContainerResource metrics.
Users can export resource consumption metrics from containers on their own to an external metrics source and then configure the HPA based on this external metric. However, this is cumbersome and results in delayed scaling decisions, as the external metrics path typically adds latency compared to the in-cluster resource metrics path.