KEP-5882: Deployment Pod Replacement Policy
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Deployments have inconsistent behavior in how they handle terminating pods, depending on the rollout
strategy and when scaling the Deployments. In some scenarios it may be advantageous to wait for
terminating pods to terminate before spinning new ones. In other scenarios it might be beneficial
to spin them as soon as possible. This KEP proposes to add a new field .spec.podReplacementPolicy
to Deployments to allow users to specify the desired behavior.
This KEP builds on KEP-3973, which introduced the .status.terminatingReplicas field
to Deployments and ReplicaSets.
Motivation
In certain cases, a Deployment can momentarily have more pods than described by the Deployment definition.
For example, during a rollout with a RollingUpdate deployment strategy the following inequality
should hold true:
(.spec.replicas - .spec.strategy.rollingUpdate.maxUnavailable <= .status.replicas <= .spec.replicas + .spec.strategy.rollingUpdate.maxSurge)
But the actual number of replicas (pods) can be higher due to the terminating (marked with a
deletionTimestamp) pods being present which are not accounted for in .status.replicas.
This happens not only in a rollout, but also in other cases where pods are deleted by an actor other than the deployment controller (e.g. eviction).
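As a minimal sketch of the gap described above, the following Go snippet checks the inequality and then compares the bound with the real pod count once terminating pods are included. The function names and the sample numbers are illustrative assumptions and do not come from the deployment controller.

```go
package main

import "fmt"

// withinRolloutBounds reports whether statusReplicas satisfies
// replicas - maxUnavailable <= statusReplicas <= replicas + maxSurge.
func withinRolloutBounds(replicas, maxUnavailable, maxSurge, statusReplicas int) bool {
	return replicas-maxUnavailable <= statusReplicas && statusReplicas <= replicas+maxSurge
}

// actualPods adds the terminating pods, which .status.replicas does not count.
func actualPods(statusReplicas, terminating int) int {
	return statusReplicas + terminating
}

func main() {
	// Hypothetical rollout: 10 replicas, maxUnavailable 2, maxSurge 3.
	replicas, maxUnavailable, maxSurge := 10, 2, 3
	statusReplicas, terminating := 13, 4

	// The status satisfies the inequality...
	fmt.Println(withinRolloutBounds(replicas, maxUnavailable, maxSurge, statusReplicas)) // true
	// ...while the pods actually present exceed replicas + maxSurge.
	fmt.Println(actualPods(statusReplicas, terminating) > replicas+maxSurge) // true
}
```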
Terminating pods can stay up for a considerable amount of time (driven by pod’s
.spec.terminationGracePeriodSeconds). Although terminating pods are not considered part of a
deployment and are not counted in its status, this can cause problems with resource usage and
scheduling:
- Unnecessary autoscaling of nodes in tight environments and driving up cloud costs. This can hurt especially if multiple deployments are rolled out at the same time, or if a large .spec.terminationGracePeriodSeconds value is requested. See the following issues for more details: kubernetes/kubernetes#95498, kubernetes/kubernetes#99513, kubernetes/kubernetes#41596, kubernetes/kubernetes#97227.
- A problem also arises in contentious environments where pods are fighting over resources. This can drive the exponential backoff of not-yet-started pods to large values and unnecessarily delay their start until they pop from the queue when there are computing resources to run them. This can slow down the deployment considerably. This is described in issue kubernetes/kubernetes#98656. In that issue, the resources were limited by a quota, but this can be due to other reasons as well. This can also occur in high availability scenarios where pods are expected to run only on certain nodes, and pod anti-affinity forbids running two pods on the same node.
- Terminating pods can still do useful work or hold old connections. Users would like to track this work through the deployment’s status. See kubernetes/kubernetes#110171 for more details.
Issue kubernetes/kubernetes#107920 covers this as well.
Goals
- Deployments should allow an option to either wait for its pods to terminate before creating new pods, or to create the pods immediately. This should take into consideration the Deployment strategy.
Non-Goals
Proposal
This KEP proposes to introduce a new .spec.podReplacementPolicy field (similar to Job’s
.spec.podReplacementPolicy in kubernetes/enhancements#3939)
that would control how many pods should be present at any given time.
The termination of a Deployment/ReplicaSet pod is always triggered by a pod deletion due to
an enforced pod field restartPolicy: Always.
We are distinguishing between terminating and terminated pods.
- Terminating pods are running pods with a deletionTimestamp.
- Terminated pods are pods with a deletionTimestamp that have reached the Succeeded or Failed phase and are subsequently removed from etcd.
Unfortunately, the current behavior is inconsistent with how we treat terminating and terminated pods in the deployment controller.
- The Recreate Deployment strategy waits for terminating pods to terminate before creating (scheduling) new pods.
- The RollingUpdate deployment strategy does not wait for terminating pods and creates (schedules) new pods immediately.
- Scaling up a Deployment also does not wait for terminating pods and creates (schedules) new pods right away.
Unfortunately, in Deployments with a Recreate strategy we can get mixed behavior. The
deployment will wait for old pods to terminate during a rollout, but will ignore the terminating
pods when scaling the pods. So it is still possible to end up with a larger number of pods than
.spec.replicas.
User Stories (Optional)
Story 1 (Optional)
As an application user, I would prefer predictable number of pods in my cluster to prevent any scheduling issues and unnecessary autoscaling of nodes. I would also like to achieve consistent allocation of other scarce resources to pods.
Story 2 (Optional)
As an application user, I would like to keep the old behavior of fast scaling of pods and do not mind the higher utilization of resources.
Notes/Constraints/Caveats (Optional)
Consideration for Other Controllers
This feature is not considered for standalone ReplicaSets. The reason for this is that ReplicaSet behavior is meant to be simple and used as a building block by other high-level controllers. If we included the PodReplacementPolicy in both ReplicaSets and Deployments, it would be hard to reconcile these fields because a ReplicaSet only has the local view of its own pods. The Deployment has the complete picture of all the pods (through ReplicaSet’s status) in its ReplicaSets and can make the correct balancing decision. Adding such a feature to ReplicaSets could also pose a threat to third-party controllers that embed ReplicaSets in their resource definitions, as this could alter their behavior.
This feature is also not desirable for StatefulSets and DaemonSets, because by design we wait until old pods terminate before creating new pods.
This feature is already implemented for Jobs (KEP-3939).
Risks and Mitigations
Feature Impact
Deployment rollouts might be slower when using the TerminationComplete PodReplacementPolicy.
Deployment rollouts might consume excessive resources when using the TerminationStarted PodReplacementPolicy.
This is mitigated by making this feature opt-in.
kubectl Skew
The deployment.kubernetes.io/replicaset-replicas-before-scale annotation should be removed during
deployment rollback when annotations are copied from the ReplicaSet to the Deployment. Support for
this removal will be added to kubectl in the same release as this feature. Therefore, rollback using
an older kubectl will not be supported until one minor release after the feature first reaches
alpha. The documentation for Deployments will include a notice about this.
If an older kubectl version is used, the impact should be minimal. The deployment may end up with an
unnecessary deployment.kubernetes.io/replicaset-replicas-before-scale annotation. The deployment
controller then synchronizes Deployment annotations back to the ReplicaSet. This is done by the
Deployment controller, which will ignore this new annotation if the feature gate is on.
The bug should be mainly visual (extra annotation in the Deployment), unless the feature is turned on and off in quick succession. In this case, incorrect annotations could end up on a ReplicaSet, which would affect the scaling proportions during a rollout.
Design Details
Deployment Behavior Changes
Recreate rollout logic:
- Terminating (TerminationStarted):
- Scale down old ReplicaSet(s) to 0.
- Wait until all the pods are at least terminating.
- Create new replica set.
- Terminated (TerminationComplete): Current behaviour.
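The Recreate steps above can be sketched as a single gate that decides when the new ReplicaSet may be created. This is a hedged illustration only: the `policy` type, `canCreateNewReplicaSet`, and the two counters are hypothetical names, and any policy other than TerminationStarted is treated as the current (TerminationComplete) rollout behavior.

```go
package main

import "fmt"

type policy string

const (
	terminationStarted  policy = "TerminationStarted"
	terminationComplete policy = "TerminationComplete"
)

// canCreateNewReplicaSet sketches the Recreate gate. oldRunning counts old pods
// without a deletionTimestamp; oldTerminating counts old pods that have one.
func canCreateNewReplicaSet(p policy, oldRunning, oldTerminating int) bool {
	if oldRunning > 0 {
		return false // some old pods are not yet terminating
	}
	if p == terminationStarted {
		return true // every old pod is at least terminating
	}
	return oldTerminating == 0 // wait until old pods are fully terminated
}

func main() {
	// Old ReplicaSets are scaled to 0, but 3 old pods are still terminating.
	fmt.Println(canCreateNewReplicaSet(terminationStarted, 0, 3))  // true
	fmt.Println(canCreateNewReplicaSet(terminationComplete, 0, 3)) // false
}
```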
RollingUpdate rollout logic:
- Terminating (TerminationStarted): Current behaviour.
- Terminated (TerminationComplete): When checking if a new replica set can be scaled up during a rollout, we should consider terminating pods of all ReplicaSets as well and not spawn an amount of replicas that would be higher than Deployment’s .spec.replicas + .spec.strategy.rollingUpdate.maxSurge. This will be implemented by checking ReplicaSet’s .spec.replicas, .status.replicas and .status.terminatingReplicas to determine the number of pods.
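A minimal sketch of the TerminationComplete check for rolling updates follows. The text does not say how the three ReplicaSet fields are combined, so counting each ReplicaSet as max(.spec.replicas, .status.replicas) plus .status.terminatingReplicas is an assumption made here for illustration; `replicaSetCounts` and `scaleUpBudget` are hypothetical names.

```go
package main

import "fmt"

// replicaSetCounts mirrors the three ReplicaSet fields named above.
type replicaSetCounts struct {
	specReplicas        int
	statusReplicas      int
	terminatingReplicas int
}

// scaleUpBudget returns how many pods may still be created without exceeding
// deploymentReplicas + maxSurge once terminating pods are counted.
// Assumption: a ReplicaSet holds max(spec, status) + terminating pods.
func scaleUpBudget(deploymentReplicas, maxSurge int, sets []replicaSetCounts) int {
	held := 0
	for _, rs := range sets {
		n := rs.specReplicas
		if rs.statusReplicas > n {
			n = rs.statusReplicas
		}
		held += n + rs.terminatingReplicas
	}
	budget := deploymentReplicas + maxSurge - held
	if budget < 0 {
		return 0
	}
	return budget
}

func main() {
	// Hypothetical rollout: 10 replicas, maxSurge 2.
	oldRS := replicaSetCounts{specReplicas: 8, statusReplicas: 8, terminatingReplicas: 2}
	newRS := replicaSetCounts{specReplicas: 2, statusReplicas: 2}
	fmt.Println(scaleUpBudget(10, 2, []replicaSetCounts{oldRS, newRS})) // 0

	oldRS.terminatingReplicas = 0 // the terminating pods are gone
	fmt.Println(scaleUpBudget(10, 2, []replicaSetCounts{oldRS, newRS})) // 2
}
```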
Scaling logic:
- Terminating (TerminationStarted): Current behaviour.
- Terminated (TerminationComplete):
- When scaling up across one or more ReplicaSets, we should consider terminating pods of all ReplicaSets as well and not spawn replicas that would be higher than Deployment’s .spec.replicas + .spec.strategy.rollingUpdate.maxSurge. This will be implemented by checking ReplicaSet’s .spec.replicas, .status.replicas and .status.terminatingReplicas to determine the number of pods. See Deployment Scaling Changes and a New Annotation for ReplicaSets for more details.
- Changing scaling down logic is not necessary, and we can scale down as many pods as we want because the policy does not affect this since we are not replacing the pods.
Deployment Completion and Progress Changes
Currently, when the latest ReplicaSet is fully saturated and all of its pods become available, the Deployment is declared complete. However, there may still be old terminating pods. These pods can still be ready and hold/accept connections, meaning that the transition to the latest revision is not fully complete.
To avoid unexpected behavior, we should not declare the deployment complete until all of its
terminating replicas have been fully terminated. We will therefore delay setting a NewRSAvailable
reason to the DeploymentProgressing condition, when TerminationComplete policy is used.
We will also update the LastUpdateTime of the DeploymentProgressing condition when the number of
terminating pods decreases to reset the progress deadline.
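The delayed completion can be sketched as an extra condition on top of the saturation check. The snippet below is an assumption-laden illustration: `deploymentStatus` and `rolloutComplete` are hypothetical names, and the saturation check is simplified (for example, it ignores the observed generation).

```go
package main

import "fmt"

// deploymentStatus holds the status counters relevant to completion.
type deploymentStatus struct {
	replicas            int
	updatedReplicas     int
	availableReplicas   int
	terminatingReplicas int
}

// rolloutComplete reports whether the Deployment may be declared complete.
// waitForTermination is true when the TerminationComplete policy is used.
func rolloutComplete(specReplicas int, s deploymentStatus, waitForTermination bool) bool {
	saturated := s.updatedReplicas == specReplicas &&
		s.replicas == specReplicas &&
		s.availableReplicas == specReplicas
	if !saturated {
		return false
	}
	// With TerminationComplete, old terminating pods still block completion.
	return !waitForTermination || s.terminatingReplicas == 0
}

func main() {
	s := deploymentStatus{replicas: 3, updatedReplicas: 3, availableReplicas: 3, terminatingReplicas: 1}
	fmt.Println(rolloutComplete(3, s, false)) // true
	fmt.Println(rolloutComplete(3, s, true))  // false
}
```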
Deployment Scaling Changes and a New Annotation for ReplicaSets
Currently, scaling is done proportionally over all ReplicaSets to mitigate the risk of losing availability during a rolling update.
To calculate the new ReplicaSet size, we need to know
- replicasBeforeScale: The .spec.replicas of the ReplicaSet before the scaling began.
- deploymentMaxReplicas: Equal to .spec.replicas + .spec.strategy.rollingUpdate.maxSurge of the current Deployment.
- deploymentMaxReplicasBeforeScale: Equal to .spec.replicas + .spec.strategy.rollingUpdate.maxSurge of the old Deployment. This information is stored in the deployment.kubernetes.io/max-replicas annotation in each ReplicaSet.
Then we can calculate a new size for each ReplicaSet proportionally as follows:
$$ newReplicaSetReplicas = replicasBeforeScale * \frac{deploymentMaxReplicas}{deploymentMaxReplicasBeforeScale} $$
This is currently done in the getReplicaSetFraction function. The leftover pods are added to the largest ReplicaSet (or newest if more than one ReplicaSet has the largest number of pods).
This results in the following scaling behavior.
The first scale operation occurs at T2 and the second scale at T3.
| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|---|---|---|---|---|---|---|---|---|
| T1 | any amount | 60 | 30 | 20 | 110 | 100 | 110 | - |
| T2 | any amount | 71 | 35 | 24 | 130 | 120 | 130 | 1.182 |
| T3 | any amount | 76 | 38 | 26 | 140 | 130 | 140 | 1.077 |
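The rows of the table above can be reproduced with a short sketch of the proportional formula. This is not the getReplicaSetFraction implementation: `scaleProportionally` is a hypothetical helper that rounds each ReplicaSet independently and gives the rounding leftover to the first largest ReplicaSet, ignoring the "newest wins" tie-break.

```go
package main

import (
	"fmt"
	"math"
)

// scaleProportionally applies
// newReplicaSetReplicas = replicasBeforeScale * maxNow / maxBefore
// to every ReplicaSet and adds the leftover to the largest one.
func scaleProportionally(replicasBeforeScale []int, maxBefore, maxNow int) []int {
	out := make([]int, len(replicasBeforeScale))
	total, largest := 0, 0
	for i, r := range replicasBeforeScale {
		out[i] = int(math.Round(float64(r) * float64(maxNow) / float64(maxBefore)))
		total += out[i]
		if r > replicasBeforeScale[largest] {
			largest = i
		}
	}
	if len(out) > 0 {
		out[largest] += maxNow - total // rounding leftover (may be zero)
	}
	return out
}

func main() {
	t2 := scaleProportionally([]int{60, 30, 20}, 110, 130)
	t3 := scaleProportionally(t2, 130, 140)
	fmt.Println(t2, t3) // [71 35 24] [76 38 26]
}
```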
With the TerminationComplete PodReplacementPolicy, scaling cannot proceed immediately if there
are terminating pods present, in order to adhere to the Deployment constraints. We need to scale
some ReplicaSets fully and some partially. And we have to postpone scaling to the future when
terminating pods disappear.
A single scale operation occurs at T2.
| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|---|---|---|---|---|---|---|---|---|
| T1 | 15 | 50 | 30 | 20 | 100 | 100 | 110 | - |
| T2 | 15 | 59 | 35 | 21 | 115 | 120 | 130 | 1.182 |
| T3 | 5 | 66 | 35 | 24 | 125 | 120 | 130 | - |
| T4 | 0 | 71 | 35 | 24 | 130 | 120 | 130 | - |
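One reading that is consistent with the table above is sketched below: the controller may only add pods while running plus terminating pods stay within .spec.replicas + maxSurge, ReplicaSets are filled toward their proportional targets in order, and any remaining budget goes to the largest ReplicaSet. The ordering rule, the `partialScale` helper, and the precomputed targets are assumptions made for illustration, not the controller's algorithm.

```go
package main

import "fmt"

// partialScale grows ReplicaSets toward their proportional targets without
// letting running plus terminating pods exceed maxReplicas. The slices are
// assumed to be ordered with the largest ReplicaSet first.
func partialScale(current, target []int, maxReplicas, terminating int) []int {
	out := append([]int(nil), current...)
	budget := maxReplicas - terminating
	for _, c := range current {
		budget -= c
	}
	if budget <= 0 || len(out) == 0 {
		return out // nothing can be added yet
	}
	for i := range out {
		add := target[i] - out[i]
		if add > budget {
			add = budget
		}
		if add > 0 {
			out[i] += add
			budget -= add
		}
	}
	out[0] += budget // leftover goes to the largest ReplicaSet
	return out
}

func main() {
	target := []int{59, 35, 24} // rounded replicasBeforeScale * 130/110
	t2 := partialScale([]int{50, 30, 20}, target, 130, 15)
	t3 := partialScale(t2, target, 130, 5)
	t4 := partialScale(t3, target, 130, 0)
	fmt.Println(t2, t3, t4) // [59 35 21] [66 35 24] [71 35 24]
}
```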
To proceed with the scaling in the future (T3), we need to remember both replicasBeforeScale and
deploymentMaxReplicasBeforeScale to calculate the original scale ratio. The terminating pods can
take a long time to terminate and there can be many steps and ReplicaSet updates between T2 and T3.
If we were to use the current number of ReplicaSet or Deployment replicas in any of these steps
(including T3), we would calculate an incorrect scale ratio.
- deploymentMaxReplicasBeforeScale is already stored in the deployment.kubernetes.io/max-replicas ReplicaSet annotation. The main change is that we need to keep the old Deployment max replicas value in the annotation until the partial scale for a ReplicaSet is complete.
- To remember replicasBeforeScale, we will introduce a new annotation called deployment.kubernetes.io/replicaset-replicas-before-scale, which will be added to the Deployment’s ReplicaSets that are being partially scaled. This annotation will be removed once the partial scaling is complete. This annotation will be added and managed by the deployment controller.
These two ReplicaSet annotations will be used to calculate the original scale ratio for the partial scaling.
The following example shows a first scale at T2 and a second scale at T3.
| Time | Terminating Pods | RS1 Replicas | RS2 Replicas | RS3 Replicas | All RS Total | Deployment .spec.replicas | Deployment .spec.replicas + MaxSurge | Scale ratio |
|---|---|---|---|---|---|---|---|---|
| T1 | 15 | 50 | 30 | 20 | 100 | 100 | 110 | - |
| T2 | 15 | 59 | 35 | 21 | 115 | 120 | 130 | 1.182 |
| T3 | 15 | 66 | 38 | 21 | 125 | 130 | 140 | 1.077 (1.273 from T1) |
| T4 | 5 | 72 | 38 | 25 | 135 | 130 | 140 | - |
| T5 | 0 | 77 | 38 | 25 | 140 | 130 | 140 | - |
- At T2, a full scale was done for RS1 with a ratio of 1.182. RS1 can then use the new scale ratio at T3 with a value of 1.077.
- RS2 has been partially scaled (1.182 ratio) and RS3 has not been scaled at all at T2 due to the terminating pods. When a new scale occurs at T3, RS2 and RS3 have not yet completed the first scale. So their annotations still point to the T1 state. A new ratio of 1.273 is calculated and used for the second scale.
As we can see, we will get a slightly different result when compared to the first table. This is due to the consecutive scales and the fact that the last scale is not yet fully completed.
The consecutive partial scaling behavior is a best effort. We still adhere to all deployment constraints and have a bias toward scaling the largest ReplicaSet. To implement this properly we would have to introduce a full scaling history, which is probably not worth the added complexity.
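The scale ratios quoted in the example, and the final sizes of RS2 and RS3, can be checked with a small sketch that reads the two remembered values as plain integers. `scaleRatio` and `targetFromAnnotations` are hypothetical helpers; parsing the actual annotations is omitted.

```go
package main

import (
	"fmt"
	"math"
)

// scaleRatio is deploymentMaxReplicas / deploymentMaxReplicasBeforeScale.
func scaleRatio(maxBeforeScale, maxNow int) float64 {
	return float64(maxNow) / float64(maxBeforeScale)
}

// targetFromAnnotations computes the proportional target of a ReplicaSet from
// the values remembered in its annotations (replicasBeforeScale and the old
// Deployment max replicas) and the current Deployment max replicas.
func targetFromAnnotations(replicasBeforeScale, maxBeforeScale, maxNow int) int {
	return int(math.Round(float64(replicasBeforeScale) * scaleRatio(maxBeforeScale, maxNow)))
}

func main() {
	// Ratios T1->T2, T2->T3, and T1->T3 from the table.
	fmt.Printf("%.3f %.3f %.3f\n", scaleRatio(110, 130), scaleRatio(130, 140), scaleRatio(110, 140))
	// prints: 1.182 1.077 1.273

	// RS2 and RS3 still carry the T1 state (30 and 20 replicas, max 110).
	fmt.Println(targetFromAnnotations(30, 110, 140), targetFromAnnotations(20, 110, 140)) // 38 25
}
```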
kubectl Changes
Similar to deployment.kubernetes.io/max-replicas, the
deployment.kubernetes.io/replicaset-replicas-before-scale annotation has to be handled through annotationsToSkip
so that it is removed when annotations are copied from the ReplicaSet to the Deployment during a rollback.
See kubectl Skew for more details.
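As a rough sketch of the rollback behavior, the snippet below drops the scaling annotations while copying ReplicaSet annotations to the Deployment. The `annotationsToSkip` map here is a local stand-in with only two entries, and `rollbackAnnotations` is a hypothetical helper; neither is the kubectl source.

```go
package main

import "fmt"

const (
	maxReplicasAnnotation         = "deployment.kubernetes.io/max-replicas"
	replicasBeforeScaleAnnotation = "deployment.kubernetes.io/replicaset-replicas-before-scale"
)

// annotationsToSkip lists annotations that must not be copied from a
// ReplicaSet to the Deployment during a rollback (illustrative subset).
var annotationsToSkip = map[string]bool{
	maxReplicasAnnotation:         true,
	replicasBeforeScaleAnnotation: true,
}

// rollbackAnnotations returns the ReplicaSet annotations that survive a rollback.
func rollbackAnnotations(rsAnnotations map[string]string) map[string]string {
	out := make(map[string]string, len(rsAnnotations))
	for k, v := range rsAnnotations {
		if !annotationsToSkip[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	rs := map[string]string{
		"example.com/owner":           "team-a", // hypothetical user annotation
		maxReplicasAnnotation:         "110",
		replicasBeforeScaleAnnotation: "20",
	}
	fmt.Println(rollbackAnnotations(rs)) // map[example.com/owner:team-a]
}
```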
API
// DeploymentPodReplacementPolicy specifies the policy for creating Deployment Pod replacements.
// Default is a mixed behavior depending on the DeploymentStrategy
// +enum
type DeploymentPodReplacementPolicy string

const (
	// TerminationStarted policy creates replacement Pods when the old Pods start
	// terminating (have a non-null .metadata.deletionTimestamp). The total number
	// of Deployment Pods can be greater than specified by the Deployment's
	// .spec.replicas and the DeploymentStrategy.
	TerminationStarted DeploymentPodReplacementPolicy = "TerminationStarted"
	// TerminationComplete policy creates replacement Pods only when the old Pods
	// are fully terminated (reach Succeeded or Failed phase). The old Pods are
	// subsequently removed. The total number of the Deployment Pods is
	// limited by the Deployment's .spec.replicas and the DeploymentStrategy.
	//
	// This policy will also delay declaring the deployment as complete until all
	// of its terminating replicas have been fully terminated.
	TerminationComplete DeploymentPodReplacementPolicy = "TerminationComplete"
)

type DeploymentSpec struct {
	...
	// podReplacementPolicy specifies when to create replacement Pods.
	// Possible values are:
	// - TerminationStarted policy creates replacement Pods when the old Pods start
	//   terminating (have a non-null .metadata.deletionTimestamp). The total number
	//   of Deployment Pods can be greater than specified by the Deployment's
	//   .spec.replicas and the DeploymentStrategy.
	// - TerminationComplete policy creates replacement Pods only when the old Pods
	//   are fully terminated (reach Succeeded or Failed phase). The old Pods are
	//   subsequently removed. The total number of the Deployment Pods is
	//   limited by the Deployment's .spec.replicas and the DeploymentStrategy.
	//   This policy will also delay declaring the deployment as complete until all
	//   of its terminating replicas have been fully terminated.
	//
	// The default behavior when the policy is not specified depends on the DeploymentStrategy:
	// - Recreate strategy uses TerminationComplete behavior when recreating the deployment,
	//   but uses TerminationStarted when scaling the deployment.
	// - RollingUpdate strategy uses TerminationStarted behavior for both rolling out and
	//   scaling the deployments.
	//
	// This is an alpha field. Enable DeploymentPodReplacementPolicy and
	// DeploymentReplicaSetTerminatingReplicas to be able to use this field.
	// +optional
	PodReplacementPolicy *DeploymentPodReplacementPolicy `json:"podReplacementPolicy,omitempty" protobuf:"bytes,10,opt,name=podReplacementPolicy,casttype=podReplacementPolicy"`
	...
}
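The default behavior described in the field comment can be sketched as a small resolution function. The lowercase types and `effectivePolicy` are hypothetical stand-ins for the API types above, used only to keep the example self-contained.

```go
package main

import "fmt"

type policy string
type strategy string

const (
	terminationStarted  policy = "TerminationStarted"
	terminationComplete policy = "TerminationComplete"

	recreate      strategy = "Recreate"
	rollingUpdate strategy = "RollingUpdate"
)

// effectivePolicy resolves the behavior for a rollout (scaling == false) or a
// scale operation (scaling == true). An explicit policy always wins; a nil
// policy falls back to the strategy-dependent default.
func effectivePolicy(p *policy, s strategy, scaling bool) policy {
	if p != nil {
		return *p
	}
	if s == recreate && !scaling {
		return terminationComplete // Recreate waits during a rollout
	}
	return terminationStarted // scaling, and all RollingUpdate operations
}

func main() {
	fmt.Println(effectivePolicy(nil, recreate, false))      // TerminationComplete
	fmt.Println(effectivePolicy(nil, recreate, true))       // TerminationStarted
	fmt.Println(effectivePolicy(nil, rollingUpdate, false)) // TerminationStarted

	explicit := terminationComplete
	fmt.Println(effectivePolicy(&explicit, rollingUpdate, true)) // TerminationComplete
}
```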
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
We assess that the deployment and replicaset controllers have adequate test coverage for places which might be impacted by this enhancement. Thus, no additional tests prior implementing this enhancement are needed.
Unit tests
Unit tests covering:
Deployment
- The current behavior remains unchanged when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gate is disabled or PodReplacementPolicy is nil. The .status.terminatingReplicas field should be 0 in that case.
- Add a test wrapper for any relevant tests, to ensure that they are run with all possible PodReplacementPolicy values correctly. The relevant tests are those that expect some behavior on Pod deletion, and are affected by this change.
- New unit tests should be added for any new helper functions.
- Test that the status is computed correctly.
- Test feature gate enablement and disablement.
The core packages (with their unit test coverage) which are going to be modified during the implementation:
- k8s.io/kubernetes/pkg/apis/apps/v1: 9 December 2023 - 71.4%
- k8s.io/kubernetes/pkg/apis/apps/validation: 9 December 2023 - 92.3%
- k8s.io/kubernetes/pkg/controller/deployment: 9 December 2023 - 61.7%
- k8s.io/kubernetes/pkg/controller/deployment/util: 9 December 2023 - 50.1%
- k8s.io/kubernetes/pkg/controller/replicaset: 9 December 2023 - 78.9%
- k8s.io/kubernetes/pkg/controller: 9 December 2023 - 71.2%
Integration tests
Deployment
- The current behavior remains unchanged when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gate is disabled or PodReplacementPolicy is nil.
- Add a test wrapper for any relevant tests, to ensure that they are run with all possible PodReplacementPolicy values correctly. The relevant tests are those that expect some behavior on Pod deletion, and are affected by this change.
- Add new tests that observe rollout and scaling transitions for all possible PodReplacementPolicy values and ensure that .status.terminatingReplicas is correctly counted when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gate is enabled.
- TestRecreateDeploymentForPodReplacement : https://storage.googleapis.com/k8s-triage/index.html?test=TestRecreateDeploymentForPodReplacement
- TestRollingUpdateAndProportionalScalingForDeploymentPodReplacement : https://storage.googleapis.com/k8s-triage/index.html?test=TestRollingUpdateAndProportionalScalingForDeploymentPodReplacement
e2e tests
- Test that a Deployment with RollingUpdate strategy and a TerminationComplete PodReplacementPolicy does not exceed the amount of pods specified by .spec.replicas + .spec.strategy.rollingUpdate.maxSurge when rolling out new revisions and/or scaling the deployment at any point in time.
- Test scaling of Deployments that are in the middle of a rollout (even with more than 2 revisions). Verify that scaling is done proportionally across all ReplicaSets when terminating pods are present. Scale these deployments in succession, even when the previous scale has not yet completed.
Graduation Criteria
Alpha
- Feature gates disabled by default.
- Unit, enablement/disablement, e2e, and integration tests implemented and passing.
- Document kubectl Skew for alpha.
Beta
- Feature gates enabled by default.
- .spec.podReplacementPolicy is nil by default and preserves the original behavior.
- Explore and try to resolve the “Proportional scaling in Deployments is not fully re-entrant” issue.
- E2e and integration tests are in Testgrid and linked in the KEP.
- Add new metrics to kube-state-metrics.
- Remove documentation for kubectl Skew that was introduced in alpha.
GA
- Every bug report is fixed.
- Confirm the stability of e2e and integration tests.
- DeploymentPodReplacementPolicy feature gate is ignored.
Upgrade / Downgrade Strategy
No changes required for existing cluster to use the enhancement.
Version Skew Strategy
We need to consider the version skew between kube-controller-manager and the apiserver.
If the feature is enabled on the apiserver, but not in the kube-controller-manager, then the .spec.podReplacementPolicy
field can be set, but the feature will not function.
If the feature is not enabled on the apiserver, and it is enabled in the kube-controller-manager, then
- The feature cannot be used for new workloads.
- Workloads that have the .spec.podReplacementPolicy field set will use the new behavior.
Also, as mentioned in kubectl Skew, kubectl skew is not supported in the alpha version.
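The apiserver side of this skew matches the usual pattern of dropping a gated field unless the stored object already uses it. The sketch below illustrates that pattern with a local `deploymentSpec` type and a hypothetical `dropDisabledPolicy` helper; it is an assumption about how the field would be handled, not code from this KEP.

```go
package main

import "fmt"

// deploymentSpec is a local stand-in that carries only the gated field.
type deploymentSpec struct {
	podReplacementPolicy *string
}

// dropDisabledPolicy clears the field on create or update when the feature
// gate is disabled, unless the existing object already has the field set.
func dropDisabledPolicy(gateEnabled bool, newSpec, oldSpec *deploymentSpec) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.podReplacementPolicy != nil {
		return // field already in use: keep it so existing workloads are unchanged
	}
	newSpec.podReplacementPolicy = nil
}

func main() {
	p := "TerminationComplete"

	// New workload while the gate is disabled: the field is dropped.
	created := &deploymentSpec{podReplacementPolicy: &p}
	dropDisabledPolicy(false, created, nil)
	fmt.Println(created.podReplacementPolicy == nil) // true

	// Existing workload that already has the field: the field is preserved.
	updated := &deploymentSpec{podReplacementPolicy: &p}
	dropDisabledPolicy(false, updated, &deploymentSpec{podReplacementPolicy: &p})
	fmt.Println(updated.podReplacementPolicy == nil) // false
}
```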
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: DeploymentPodReplacementPolicy
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager
Does enabling the feature change any default behavior?
No, the behavior is only changed when users specify the podReplacementPolicy in
the Deployment spec.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes.
By disabling the feature:
- Extra pods can appear during a deployment rollout or scaling. This can increase the number of pods that need to be scheduled, and it can have an impact on the resource consumption.
As mentioned in kubectl Skew, kubectl skew is not supported in alpha. If an older
unsupported version of kubectl was used, it is important to remove the
deployment.kubernetes.io/replicaset-replicas-before-scale annotation from all Deployments and
ReplicaSets after disabling this feature. This should prevent any unexpected behavior on the next
enablement.
What happens if we reenable the feature if it was previously rolled back?
The ReplicaSet and Deployment controllers will start reconciling and fulfilling the
.spec.podReplacementPolicy contract.
Similar to the section above, it is important to make sure that the
deployment.kubernetes.io/replicaset-replicas-before-scale annotation is removed from all
Deployments and ReplicaSets before the re-enablement.
Are there any tests for feature enablement/disablement?
Appropriate enablement/disablement tests will be added to the replicaset and deployment strategy_test.go
and unit tests in alpha.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
The rollout should not fail as the feature is hidden behind a feature gate and new optional field.
During a rollback, the .spec.podReplacementPolicy field will be ignored. This will cause
workloads that use this field to fall back to the original deployment rollout and scaling behaviour.
This can be problematic for workloads that are not expecting:
- excessive number of pods
- excessive resource consumption
- slower or faster deployment rollout or scaling speed
This can also affect other workloads, for example by exhausting resources on a node.
What specific metrics should inform a rollback?
kube-controller-manager’s deployment workqueue metrics such as workqueue_retries_total,
workqueue_depth, workqueue_work_duration_seconds_bucket can be observed. A sudden increase in
these metrics can indicate a problem with the DeploymentPodReplacementPolicy feature.
Deployment pods can be watched for incorrect number of pods during a deployment rollout or scaling.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
TBD: Manual upgrade->downgrade->upgrade path will be tested.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The operator can observe .status.terminatingReplicas on both ReplicaSets and Deployments.
The same field is being added as a metric and can be observed there as well:
kube_replicaset_status_terminating_replicas and kube_deployment_status_replicas_terminating.
How can someone using this feature know that it is working for their instance?
When using the TerminationComplete PodReplacementPolicy, the user should not see an excess of running and
terminating pods created that is greater than the deployment’s .spec.replicas and its deployment
strategy.
When using the TerminationStarted PodReplacementPolicy, the user should see an excess of running and
terminating pods created that is greater than the deployment’s .spec.replicas and its deployment
strategy. This will in turn make the deployment rollout faster.
The terminating pods can be observed in .status.terminatingReplicas on both ReplicaSets and Deployments.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
We do not propose any SLO/SLI for this feature.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: workqueue_retries_total can be used to see if there is a sudden increase in sync retries after enabling the feature
    - Aggregation method: name = “deployment”
    - Components exposing the metric: kube-controller-manager
  - Metric name: workqueue_depth can be used to see if there is a sudden increase in unprocessed deployment objects after enabling the feature
    - Aggregation method: name = “deployment”
    - Components exposing the metric: kube-controller-manager
  - Metric name: workqueue_work_duration_seconds_bucket can be used to see if there is a sudden increase in duration of syncing deployment objects after enabling the feature
    - Aggregation method: name = “deployment”
    - Components exposing the metric: kube-controller-manager
  - Metric name: kube_replicaset_status_terminating_replicas
    - Components exposing the metric: kube-state-metrics
  - Metric name: kube_deployment_status_replicas_terminating
    - Components exposing the metric: kube-state-metrics
Are there any missing metrics that would be useful to have to improve observability of this feature?
kube_replicaset_status_terminating_replicas and kube_deployment_status_replicas_terminating
were added by KEP-3973.
Dependencies
Does this feature depend on any specific services running in the cluster?
No
Scalability
Will enabling / using this feature result in any new API calls?
No, it will use the existing calls for creating and reconciling Deployments, ReplicaSets, and Pods.
The number of calls may be higher in some scenarios if a large spec.strategy.rollingUpdate.maxSurge
value is specified. But the maximum number of calls per deployment should be similar to when
spec.strategy.rollingUpdate.maxSurge value is set to 1 when the feature is disabled.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes.
- API: Deployment
Estimated increase in size:
- New field in Deployment spec about 11 bytes.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
No change in behavior. Deployment and ReplicaSet controllers might fail in reconciling their objects and in turn stop deployment rollout or scaling.
What are other known failure modes?
N/A
TBD
What steps should be taken if SLOs are not being met to determine the problem?
Inspecting the kube-controller-manager logs at an increased log level for any failures in
deployment and replicaset controllers.
Implementation History
- 2023-05-01: First version of the KEP opened (https://github.com/kubernetes/enhancements/pull/3974).
- 2023-12-12: Second version of the KEP opened (https://github.com/kubernetes/enhancements/pull/4357).
- 2024-05-29: Added a Deployment Scaling Changes and a New Annotation for ReplicaSets section (https://github.com/kubernetes/enhancements/pull/4670).
- 2024-11-22: Added a Deployment Completion and Progress Changes section (https://github.com/kubernetes/enhancements/pull/4976).
- 2025-04-01: Introduced DeploymentReplicaSetTerminatingReplicas FG to split .status.terminatingReplicas feature from DeploymentPodReplacementPolicy (https://github.com/kubernetes/kubernetes/pull/131088).
- 2025-06-11: Fixed ReplicationController reconciliation when the DeploymentReplicaSetTerminatingReplicas feature gate is enabled (https://github.com/kubernetes/kubernetes/issues/131821).
- 2026-02-03: KEP-3973 was split, creating KEP-5882, which focuses on the DeploymentPodReplacementPolicy feature.
Drawbacks
Deployment might be slower when using the TerminationComplete PodReplacementPolicy.
Deployment might consume excessive resources when using the TerminationStarted PodReplacementPolicy.
Alternatives
This feature could be implemented by a different controller that manages ReplicaSets. However, it needs to be implemented in the Deployment controller, because many existing workloads can benefit from this feature.