KEP-1610: Container Resource based Autoscaling

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Horizontal Pod Autoscaler (HPA) supports scaling of targets based on the resource usage of the pods in the target. The resource usage of a pod is calculated as the sum of the usage values of its individual containers. This is unsuitable for workloads where the usage of the containers is not strongly correlated or does not change in lockstep. This KEP proposes that, when scaling based on resource usage, the HPA also provide an option to consider the usage of individual containers when making scaling decisions.

Motivation

An HPA is used to ensure that a scaling target is scaled up or down in such a way that the specified metric values are maintained at a certain level. Resource based autoscaling is the most basic approach to autoscaling and has been present in the HPA spec since v1. In this mode the HPA controller fetches the current resource metrics for all the pods of a scaling target and then computes how many pods should be added or removed, based on the current usage, to achieve the target average usage.

For performance critical applications where the resource usage of individual containers needs to be configured individually the default behavior of the HPA controller may be unsuitable. When there are multiple containers in the pod their individual resource usages may not have a direct correlation or may grow at different rates as the load changes. There are several reasons for this:

A sidecar container only provides an auxiliary service such as log shipping. If the application does not log very frequently or does not produce logs in its hot path, then the usage of the log shipper will not grow.
A sidecar container provides authentication. Due to heavy caching, its usage will only increase slightly when the load on the main container increases. In the current blended usage calculation approach this usually results in the HPA not scaling up the deployment because the blended usage is still low.
A sidecar may be injected without resources set, which prevents scaling based on utilization. In the current logic the HPA controller can only scale on absolute resource usage of the pod when the resource requests are not set.

The optimum usage of the containers may also be at different levels. Hence the HPA should offer a way to specify the target usage in a more fine-grained manner.

Goals

Make the HPA scale based on the resource usage of individual containers

Non-Goals

Configurable aggregation for containers resources in pods.
Optimization of the calls to the metrics-server

Proposal

Currently the HPA accepts multiple metric sources to calculate the number of replicas in the target, one of which is called Resource. The Resource type represents the resource usage of the pods in the scaling target. The resource metric source has the following structure:

type ResourceMetricSource struct {
	Name v1.ResourceName
	Target MetricTarget
}

Here the Name is the name of the resource. Currently only cpu and memory are supported for this field. The other field is used to specify the target at which the HPA should maintain the resource usage by adding or removing pods. For instance if the target is 60% CPU utilization, and the current average of the CPU resources across all the pods of the target is 70% then the HPA will add pods to reduce the CPU utilization. If it’s less than 60% then the HPA will remove pods to increase utilization.
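The scaling rule described above follows the ratio documented for the HPA: the desired replica count is the current count multiplied by the ratio of current to target metric value, rounded up. The following is a minimal sketch of that rule only. The function name desiredReplicas is hypothetical, and the tolerance band, min/max bounds, and stabilization applied by the real controller are deliberately omitted.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the documented HPA ratio rule:
// ceil(currentReplicas * currentUtilization / targetUtilization).
// The name is illustrative; tolerance, replica bounds, and
// stabilization are not modeled here.
func desiredReplicas(currentReplicas int, currentUtilization, targetUtilization float64) int {
	ratio := currentUtilization / targetUtilization
	return int(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	// 10 pods averaging 70% CPU against a 60% target: pods are added.
	fmt.Println(desiredReplicas(10, 70, 60))
	// 10 pods averaging 30% CPU against a 60% target: pods are removed.
	fmt.Println(desiredReplicas(10, 30, 60))
}
```

With a 60% target, an average of 70% across 10 pods yields 12 replicas, and an average of 30% yields 5.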

Note that when a pod has multiple containers, the HPA fetches the resource usage of every container and sums the values to get the total usage of the pod. This total is then divided by the total requested resources to get the average utilization. For instance, consider a pod with two containers, application and log-shipper, each requesting 250m of CPU: the total requested resources of the pod as calculated by the HPA is 500m. If the first container is currently using 200m and the second only 50m, then the usage of the pod is 250m, which is a utilization of 50%, although the individual utilizations of the containers are 80% and 20%. In such a situation the performance of the application container might be affected significantly, yet there is no way to tell the HPA to keep the utilization of the first container below a certain threshold. The same limitation applies to memory based autoscaling.
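The arithmetic in the example above can be reproduced with a short sketch. The types and function names below are hypothetical and exist only to illustrate the blended calculation; values are CPU millicores and requests are assumed to be non-zero.

```go
package main

import "fmt"

// container holds the CPU request and current usage of one container,
// both in millicores. The type is illustrative, not part of any API.
type container struct {
	name     string
	requestM int64
	usageM   int64
}

// utilizationPercent returns usage as an integer percentage of the request.
// It assumes requestM is non-zero.
func utilizationPercent(usageM, requestM int64) int64 {
	return usageM * 100 / requestM
}

// podUtilization mirrors the blended calculation: sum usage and requests
// over all containers, then divide.
func podUtilization(containers []container) int64 {
	var usage, request int64
	for _, c := range containers {
		usage += c.usageM
		request += c.requestM
	}
	return utilizationPercent(usage, request)
}

func main() {
	pod := []container{
		{name: "application", requestM: 250, usageM: 200},
		{name: "log-shipper", requestM: 250, usageM: 50},
	}
	fmt.Printf("pod: %d%%\n", podUtilization(pod))
	for _, c := range pod {
		fmt.Printf("%s: %d%%\n", c.name, utilizationPercent(c.usageM, c.requestM))
	}
}
```

The blended value is 50% even though the application container alone is at 80%.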

We propose that a new metric source called ContainerResourceMetricSource be introduced with the following structure:

type ContainerResourceMetricSource struct {
	Container string
	Name v1.ResourceName
	Target MetricTarget
}

The only new field is Container which is the name of the container for which the resource usage should be tracked.

User Stories

Multiple containers with different scaling thresholds

Assume the user has a deployment with multiple pods, each of which has multiple containers: a main container called application and two others called log-shipping and authnz-proxy. Two of the containers, application and authnz-proxy, are critical to providing the application functionality. The user would like to prevent OOMKill of these containers and also keep their CPU utilization low to ensure the highest performance. The other container, log-shipping, is less critical and can tolerate failures and restarts. In this case the user would create an HPA with the following configuration:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 30
  - type: ContainerResource
    containerResource:
      name: memory
      container: application
      target:
        type: Utilization
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: cpu
      container: authnz-proxy
      target:
        type: Utilization
        averageUtilization: 30
  - type: ContainerResource
    containerResource:
      name: memory
      container: authnz-proxy
      target:
        type: Utilization
        averageUtilization: 80
  - type: ContainerResource
    containerResource:
      name: cpu
      container: log-shipping
      target:
        type: Utilization
        averageUtilization: 80

This HPA specifies that the HPA controller should maintain the CPU utilization of the containers application and authnz-proxy at 30% and their memory utilization at 80%. The log-shipping container is scaled to keep its CPU utilization at 80% and is not scaled on memory.

Multiple containers but only scaling for one.

Assume the user has a deployment where the pod spec has multiple containers but scaling should be performed based only on the utilization of one of the containers. There could be several reasons for such a strategy: disruptions due to scaling of sidecars may be expensive and should be avoided, or the resource usage of the sidecars could be erratic because they have different work characteristics from the main container.

In such a case the user creates an HPA as follows:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 30

The HPA controller will then completely ignore the resource usage in other containers.
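To illustrate what ignoring the other containers means in practice, the sketch below averages the utilization of one named container across all pods and never reads the samples of any other container. The types podSample and containerSample and the function containerUtilization are hypothetical; they are not the controller's actual types.

```go
package main

import "fmt"

// containerSample holds usage and request of one container in millicores.
type containerSample struct {
	usageM   int64
	requestM int64
}

// podSample maps container name to its sample for a single pod.
type podSample map[string]containerSample

// containerUtilization returns the utilization, in percent, of the named
// container summed over all pods. Samples of other containers are never read.
// An error is returned when a pod lacks the container or no request is set.
func containerUtilization(pods []podSample, name string) (int64, error) {
	var usage, request int64
	for i, pod := range pods {
		sample, ok := pod[name]
		if !ok {
			return 0, fmt.Errorf("pod %d has no container %q", i, name)
		}
		usage += sample.usageM
		request += sample.requestM
	}
	if request == 0 {
		return 0, fmt.Errorf("no resource requests found for container %q", name)
	}
	return usage * 100 / request, nil
}

func main() {
	pods := []podSample{
		{"application": {usageM: 100, requestM: 250}, "log-shipper": {usageM: 240, requestM: 250}},
		{"application": {usageM: 140, requestM: 250}, "log-shipper": {usageM: 250, requestM: 250}},
	}
	utilization, err := containerUtilization(pods, "application")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("application: %d%%\n", utilization)
}
```

In this made-up sample the log-shipper containers are nearly saturated, but the value computed for application is 48% and is unaffected by them.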

Add container metrics to existing pod resource metric.

A user who is already using an HPA to scale their application can add the container metric source to the HPA in addition to the existing pod metric source. If there is a single container in the pod then the behavior will be exactly the same as before. If there are multiple containers in the application pods then the deployment might scale out more than before. This happens when the resource usage of the specified container is higher than the blended usage as calculated by the pod metric source. If, in the unlikely case, the usage of all the containers in the pod changes in tandem by the same amount, then the behavior will remain as before.

For example consider the HPA object which targets a Deployment with pods that have two containers application and log-shipper:


apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: mission-critical
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mission-critical
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

If the resource usage of the application container increases then the target would be scaled out even if the usage of the log-shipper container does not increase much. If the resource usage of log-shipper container increases then the deployment would only be scaled out if the combined resource usage of both containers increases above the target.
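When several metrics are configured, the HPA evaluates each one and uses the largest proposed replica count. The sketch below applies that rule to the two scenarios just described, assuming 4 current replicas and equal CPU requests for both containers. The function names are hypothetical and the per-metric calculation is the simplified ratio rule without tolerance or stabilization.

```go
package main

import (
	"fmt"
	"math"
)

// proposedReplicas is the simplified ratio rule for a single metric.
func proposedReplicas(current int, utilization, target float64) int {
	return int(math.Ceil(float64(current) * utilization / target))
}

// largest returns the highest of the per-metric proposals, which is how
// multiple metrics are combined.
func largest(proposals ...int) int {
	best := 0
	for _, p := range proposals {
		if p > best {
			best = p
		}
	}
	return best
}

func main() {
	const current, target = 4, 50.0

	// Scenario 1: application at 80%, log-shipper at 20%, so the pod is at 50%.
	fromContainer := proposedReplicas(current, 80, target)
	fromPod := proposedReplicas(current, 50, target)
	fmt.Println(largest(fromContainer, fromPod))

	// Scenario 2: application at 40%, log-shipper at 80%, so the pod is at 60%.
	fromContainer = proposedReplicas(current, 40, target)
	fromPod = proposedReplicas(current, 60, target)
	fmt.Println(largest(fromContainer, fromPod))
}
```

In the first scenario the container metric drives the scale-out to 7 replicas; in the second only the pod metric exceeds its target and the result is 5.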

Risks and Mitigations

Since the new field container in the container resource metric source is not validated against the target it is possible that the user could specify an invalid value, i.e. a container name which is not part of the pod. The HPA controller would treat this as invalid configuration and prevent scale down. However scale up would still be possible based on recommendations from other metric sources.

A similar problem is possible when renaming containers that are referenced in the HPA. To mitigate this, the recommended procedure is to list both the old and the new container names in the HPA during the deployment. The old container name can be removed from the HPA when the migration is complete.
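The scale-down protection described in this section can be sketched as follows: metrics that fail to compute are skipped, and if any metric failed while the remaining ones would reduce the replica count, the current count is kept. This is a simplified illustration under those assumptions, not the controller's code; metricResult, errContainerNotFound, and decideReplicas are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errContainerNotFound stands in for the failure raised when the named
// container is not part of the pod. It is an illustrative error value.
var errContainerNotFound = errors.New("container not found in pod")

// metricResult is the outcome of evaluating one metric source.
type metricResult struct {
	replicas int
	err      error
}

// decideReplicas combines per-metric results. Valid metrics vote with the
// largest proposal. If every metric failed, or some failed and the valid
// ones would scale down, the current replica count is kept.
func decideReplicas(current int, results []metricResult) int {
	best, invalid := 0, 0
	for _, r := range results {
		if r.err != nil {
			invalid++
			continue
		}
		if r.replicas > best {
			best = r.replicas
		}
	}
	if invalid == len(results) || (invalid > 0 && best < current) {
		return current
	}
	return best
}

func main() {
	broken := metricResult{err: errContainerNotFound}

	// A valid metric wants fewer pods, but one metric is invalid: hold at 5.
	fmt.Println(decideReplicas(5, []metricResult{broken, {replicas: 3}}))
	// A valid metric wants more pods: scale up despite the invalid metric.
	fmt.Println(decideReplicas(5, []metricResult{broken, {replicas: 8}}))
}
```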

Design Details

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Most of the tests will follow the same pattern as the tests for the pod resource metric source. The following unit tests will be added:

Replica Calculator: Verify that the number of replicas calculated is based on the metrics of individual containers when a container metric source is specified.
REST Metrics Client: Verify that the resources returned from the REST metric client are the metrics for only the containers specified in the metric source.
API server validation: Verify that only valid container metric sources are accepted.
kubectl: Verify that the new metric sources are displayed correctly.

Unit tests

k8s.io/kubernetes/pkg/controller/podautoscaler: 2023/02/02 - 87.8%
k8s.io/kubernetes/pkg/controller/podautoscaler/metrics: 2023/02/02 - 90.2%
k8s.io/kubernetes/pkg/apis/autoscaling/validation: 2023/02/02 - 95.2%
k8s.io/kubectl/pkg/describe/describe.go: 2023/02/02 - 68.4%

Integration tests

N/A

The HPA behaviors are tested thoroughly in the e2e tests described below, and integration tests would not add extra value beyond those e2e tests.

e2e tests

k8s-triage

https://storage.googleapis.com/k8s-triage/index.html?sig=autoscaling&job=ci-kubernetes-e2e-gci-gce-autoscaling&test=Container%20Resource

tests

Graduation Criteria

Alpha

Feature implemented behind a feature gate
Initial e2e tests completed and enabled

Beta

The feature gate is enabled by default.
No negative feedback during alpha for a long-enough time.
No bug issues reported during alpha.
Implementing/exposing metrics in HPA so that users can monitor the HPA controller for this feature.

GA

No negative feedback during beta for a long-enough time.
No bug issues reported during beta.

Upgrade / Downgrade Strategy

Upgrade

The previous HPA behavior will not be broken. Users can continue to use their HPA specs as they are.

To use this enhancement,

[only alpha] users need to enable the feature gate HPAContainerMetrics
users need to add a ContainerResource type metric to their HPA.
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#container-resource-metrics

Downgrade

For newly created HPAs, kube-apiserver will drop the ContainerResource metric, and thus the HPA controller will not act on it.

For existing HPAs, the current implementation will continue to autoscale based on ContainerResource. By beta, this behavior will be changed so that ContainerResource is ignored when the feature gate is disabled. (issue )
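The create-time behavior described above, where ContainerResource metrics are dropped while the feature gate is off, might look roughly like the following. This is a sketch under stated assumptions: metricSpec is a reduced stand-in for the real API type, dropDisabledMetrics is a hypothetical helper, and the gate is passed in as a plain boolean rather than read from the feature gate machinery.

```go
package main

import "fmt"

// metricSpec is a reduced, illustrative stand-in for an HPA metric entry.
type metricSpec struct {
	Type      string
	Container string
}

// dropDisabledMetrics returns the metrics that survive creation. While the
// gate is disabled, entries of type ContainerResource are removed; all
// other entries are kept in their original order.
func dropDisabledMetrics(metrics []metricSpec, gateEnabled bool) []metricSpec {
	if gateEnabled {
		return metrics
	}
	kept := make([]metricSpec, 0, len(metrics))
	for _, m := range metrics {
		if m.Type != "ContainerResource" {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	metrics := []metricSpec{
		{Type: "Resource"},
		{Type: "ContainerResource", Container: "application"},
	}
	fmt.Println(len(dropDisabledMetrics(metrics, false)))
	fmt.Println(len(dropDisabledMetrics(metrics, true)))
}
```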

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: HPAContainerMetrics
- Components depending on the feature gate: kube-apiserver, kube-controller-manager

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-controller-manager with the feature-gate off.

As described in Upgrade / Downgrade Strategy, while the feature gate is off, all existing ContainerResource metrics will be ignored by the HPA controller.

For Stable versions, users can opt out by not setting the ContainerResource type metric in their HPA.

What happens if we reenable the feature if it was previously rolled back?

HPAs with the ContainerResource type metric can be created again and will be handled by the HPA controller.

If HPAs with the ContainerResource type metric were created before the rollback, those ContainerResource metrics are ignored while the feature gate is off, but will be handled by the HPA controller again after reenabling.

Are there any tests for feature enablement/disablement?

No. However, tests that confirm the behavior when switching the feature gate will be added by beta. (issue )

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

When a rollout fails, it should not impact already running HPAs because this is an opt-in feature, and users need to set the ContainerResource metric to use it.

When a rollback fails for kube-controller-manager, the HPA controller will continue to handle the ContainerResource metric in HPAs. When a rollback fails for kube-apiserver but succeeds for kube-controller-manager, the HPA controller will simply ignore the ContainerResource metric in HPAs.

What specific metrics should inform a rollback?

reconciliation_duration_seconds: The time (in seconds) that the HPA controller takes to reconcile once.
- You should roll back if you see a degradation in the overall performance of the HPA controller, that is, an increase in this duration.
metric_computation_duration_seconds{metric_type=ContainerResource}: The time (in seconds) that the HPA controller takes to calculate one metric.
- You should roll back if you see that the container resource metric takes much longer to compute than other metrics.
reconciliations_total{error=internal}: Number of internal errors in reconciliation of the HPA controller.
- You should roll back if you see many errors occurring during reconciliation.
metric_computation_total{error=internal,metric_type=ContainerResource}: Number of internal errors in the calculation of type: ContainerResource.
- You should roll back if you see many errors occurring for the container resource metrics.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet. As described in Are there any tests for feature enablement/disablement?, tests that confirm the behavior when switching the feature gate will be added. (issue )

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator can observe the computation of container metrics through the first metric listed in the What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? section.
The operator can query for HPAs that have the hpa.spec.metrics.containerResource field set.
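As one possible way to run the second check, the sketch below scans the JSON list produced by kubectl get hpa --all-namespaces -o json and reports the HPAs that have at least one metric of type ContainerResource. Only the fields needed for the check are decoded. The function name hpasUsingContainerResource is hypothetical and the embedded sample list is made-up illustrative data, not output from a real cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hpaList decodes only the fields of an HPA list that this check needs.
type hpaList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Metrics []struct {
				Type string `json:"type"`
			} `json:"metrics"`
		} `json:"spec"`
	} `json:"items"`
}

// hpasUsingContainerResource returns "namespace/name" for every HPA in the
// list that has at least one metric of type ContainerResource.
func hpasUsingContainerResource(data []byte) ([]string, error) {
	var list hpaList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		for _, m := range item.Spec.Metrics {
			if m.Type == "ContainerResource" {
				names = append(names, item.Metadata.Namespace+"/"+item.Metadata.Name)
				break
			}
		}
	}
	return names, nil
}

// sampleList is illustrative input only.
const sampleList = `{"items":[
 {"metadata":{"namespace":"default","name":"mission-critical"},
  "spec":{"metrics":[{"type":"ContainerResource"}]}},
 {"metadata":{"namespace":"default","name":"batch"},
  "spec":{"metrics":[{"type":"Resource"}]}}
]}`

func main() {
	names, err := hpasUsingContainerResource([]byte(sampleList))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names)
}
```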

How can someone using this feature know that it is working for their instance?

Events
- SuccessfulRescale event with memory/cpu/etc resource utilization (percentage of request) above/below target
API .status
- When something is wrong with the container metrics, the ScalingActive condition will be false with the FailedGetContainerResourceMetric reason.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- metric_computation_duration_seconds: The time (in seconds) that the HPA controller takes to calculate one metric.
- metric_computation_total: Number of metric computations.
- reconciliations_total: Number of reconciliations of the HPA controller.
- reconciliation_duration_seconds: The time (in seconds) that the HPA controller takes to reconcile once.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes. We’re planning to implement the metrics described in What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? section.

Dependencies

Does this feature depend on any specific services running in the cluster?

Yes. The HPA requires the metrics.k8s.io APIs to be available in the cluster to operate. This API is served by the Metrics Server; without Metrics Server, autoscaling on container resource metrics will not work. If there are multiple metrics defined and one is not available, scale up will continue but scale down will not (for safety).

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Autoscaling based on ContainerResource is unavailable because the HPA controller cannot get the HPA objects.

What are other known failure modes?

Failed to get container resource metric.
- Detection: ScalingActive: false condition with FailedGetContainerResourceMetric reason.
- Mitigations: remove failed ContainerResource in HPAs.
- Diagnostics: Related errors should be printed as the messages of ScalingActive: false.
- Testing: https://github.com/kubernetes/kubernetes/blob/0e3818e02760afa8ed0bea74c6973f605ca4683c/pkg/controller/podautoscaler/replica_calculator_test.go#L451
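The detection step for this failure mode amounts to inspecting the conditions in the HPA status. A minimal sketch, assuming the conditions have already been read into a simple struct: the condition type and the function containerMetricFailure are hypothetical, while the condition name ScalingActive and the reason FailedGetContainerResourceMetric are the ones named above.

```go
package main

import "fmt"

// condition is a reduced, illustrative view of one HPA status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// containerMetricFailure reports whether the conditions show that scaling is
// inactive because a container resource metric could not be fetched. When it
// is, the condition message is returned for diagnostics.
func containerMetricFailure(conditions []condition) (string, bool) {
	for _, c := range conditions {
		if c.Type == "ScalingActive" && c.Status == "False" &&
			c.Reason == "FailedGetContainerResourceMetric" {
			return c.Message, true
		}
	}
	return "", false
}

func main() {
	// The message text below is made up for illustration.
	conditions := []condition{
		{Type: "AbleToScale", Status: "True", Reason: "SucceededGetScale"},
		{Type: "ScalingActive", Status: "False",
			Reason:  "FailedGetContainerResourceMetric",
			Message: "example message reported by the controller"},
	}
	if message, failed := containerMetricFailure(conditions); failed {
		fmt.Println("container metric failure:", message)
	}
}
```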

What steps should be taken if SLOs are not being met to determine the problem?

Check metric_computation_duration_seconds or reconciliation_duration_seconds to see which metric has the latency issue. If the latency problem is specific to type: ContainerResource, you can opt out of this feature by removing the type: ContainerResource metric from the affected HPA(s).

Implementation History

2020-04-03 Initial KEP merged
2020-10-23 Implementation merged

Drawbacks

Alternatives

There’s an alternative way to scale on container-level metrics without introducing ContainerResource metrics.

Users can export resource consumption metrics from containers on their own to an external metrics source and then configure the HPA based on this external metric. However, this is cumbersome and results in delayed scaling decisions, as the external metrics path typically adds latency compared to the in-cluster resource metrics path.
