KEP-5882: Deployment Pod Replacement Policy

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Deployments handle terminating pods inconsistently, depending on the rollout strategy and on whether the Deployment is being scaled. In some scenarios it may be advantageous to wait for terminating pods to terminate before spinning up new ones. In other scenarios it might be beneficial to spin them up as soon as possible. This KEP proposes to add a new field .spec.podReplacementPolicy to Deployments to allow users to specify the desired behavior.

This KEP builds on KEP-3973, which introduced the .status.terminatingReplicas fields to Deployments and ReplicaSets.

Motivation

In certain cases, a Deployment can momentarily have more pods than its definition describes.

For example, during a rollout with a RollingUpdate deployment strategy, the following inequality should hold: (.spec.replicas - .spec.strategy.rollingUpdate.maxUnavailable <= .status.replicas <= .spec.replicas + .spec.strategy.rollingUpdate.maxSurge). However, the actual number of replicas (pods) can be higher, because terminating pods (marked with a deletionTimestamp) are present but are not accounted for in .status.replicas.
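The gap between the inequality and the pods that actually exist can be illustrated with a small sketch. It assumes that maxUnavailable and maxSurge have already been resolved to absolute numbers, and the function names rolloutBounds and podsPresent are hypothetical helpers, not part of the deployment controller.

```go
package main

import "fmt"

// rolloutBounds returns the range that .status.replicas should stay within
// during a RollingUpdate rollout. maxUnavailable and maxSurge are assumed to
// be absolute numbers here (percentages would have to be resolved first).
func rolloutBounds(specReplicas, maxUnavailable, maxSurge int) (lower, upper int) {
	return specReplicas - maxUnavailable, specReplicas + maxSurge
}

// podsPresent adds the terminating pods, which .status.replicas does not count.
func podsPresent(statusReplicas, terminating int) int {
	return statusReplicas + terminating
}

func main() {
	lower, upper := rolloutBounds(10, 2, 3)
	statusReplicas, terminating := 13, 4

	inBounds := statusReplicas >= lower && statusReplicas <= upper
	fmt.Printf("bounds %d..%d, .status.replicas=%d in bounds: %t\n",
		lower, upper, statusReplicas, inBounds)
	fmt.Printf("pods actually present: %d\n", podsPresent(statusReplicas, terminating))
}
```

With these example numbers the inequality holds for .status.replicas, yet the cluster has to accommodate more pods than the upper bound suggests.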

This happens not only in a rollout, but also in other cases where pods are deleted by an actor other than the deployment controller (e.g. eviction).

Terminating pods can stay up for a considerable amount of time (driven by pod’s .spec.terminationGracePeriodSeconds). Although terminating pods are not considered part of a deployment and are not counted in its status, this can cause problems with resource usage and scheduling:

Unnecessary autoscaling of nodes in resource-constrained environments, which drives up cloud costs. This can hurt especially if multiple deployments are rolled out at the same time, or if a large .spec.terminationGracePeriodSeconds value is requested. See the following issues for more details: kubernetes/kubernetes#95498, kubernetes/kubernetes#99513, kubernetes/kubernetes#41596, kubernetes/kubernetes#97227.
A problem also arises in contended environments where pods compete for resources. The exponential backoff of pods that have not yet started can grow large and unnecessarily delay their start until they leave the queue at a time when computing resources are available. This can slow down the deployment considerably. This is described in issue kubernetes/kubernetes#98656. In that issue, the resources were limited by a quota, but the limit can have other causes as well. It can also occur in high-availability scenarios where pods are expected to run only on certain nodes and pod anti-affinity forbids running two pods on the same node.
Terminating pods can still do useful work or hold old connections. Users would like to track this work through the deployment’s status. See kubernetes/kubernetes#110171 for more details.

Issue kubernetes/kubernetes#107920 covers this as well.

Goals

Deployments should offer an option to either wait for their pods to terminate before creating new pods, or to create the new pods immediately. This should take the Deployment strategy into consideration.

Non-Goals

Proposal

This KEP proposes to introduce a new .spec.podReplacementPolicy field (similar to Job’s .spec.podReplacementPolicy in kubernetes/enhancements#3939) that would control how many pods should be present at any given time.

The termination of a Deployment/ReplicaSet pod is always triggered by a pod deletion due to an enforced pod field restartPolicy: Always.

We are distinguishing between terminating and terminated pods.

Terminating pods are running pods with a deletionTimestamp.
Terminated pods are pods with a deletionTimestamp that have reached the Succeeded or Failed phase and are subsequently removed from etcd.
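The distinction above can be sketched as two predicates. The pod type below is a hypothetical, reduced stand-in for a Kubernetes Pod that keeps only the two fields the definitions mention. Treating every deleted pod that has not reached Succeeded or Failed as terminating is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// pod is a hypothetical, reduced stand-in for a Kubernetes Pod.
type pod struct {
	DeletionTimestamp *time.Time // .metadata.deletionTimestamp
	Phase             string     // .status.phase
}

// finished reports whether the pod reached a final phase.
func finished(p pod) bool {
	return p.Phase == "Succeeded" || p.Phase == "Failed"
}

// isTerminating: marked for deletion, but not yet in a final phase.
func isTerminating(p pod) bool {
	return p.DeletionTimestamp != nil && !finished(p)
}

// isTerminated: marked for deletion and in a final phase.
func isTerminated(p pod) bool {
	return p.DeletionTimestamp != nil && finished(p)
}

// deletedPod builds an example pod that already has a deletionTimestamp.
func deletedPod(phase string) pod {
	now := time.Now()
	return pod{DeletionTimestamp: &now, Phase: phase}
}

func main() {
	pods := []pod{{Phase: "Running"}, deletedPod("Running"), deletedPod("Succeeded")}
	for _, p := range pods {
		fmt.Printf("phase=%s deleted=%t terminating=%t terminated=%t\n",
			p.Phase, p.DeletionTimestamp != nil, isTerminating(p), isTerminated(p))
	}
}
```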

Unfortunately, the deployment controller is currently inconsistent in how it treats terminating and terminated pods.

The Recreate Deployment strategy waits for terminating pods to terminate before creating (scheduling) new pods.
The RollingUpdate deployment strategy does not wait for terminating pods and creates (schedules) new pods immediately.
Scaling up a Deployment also does not wait for terminating pods and creates (schedules) new pods right away.

Unfortunately, in Deployments with a Recreate strategy we can get mixed behavior. The deployment will wait for old pods to terminate during a rollout, but will ignore the terminating pods when scaling the pods. So it is still possible to end up with a larger number of pods than .spec.replicas.

User Stories (Optional)

Story 1 (Optional)

As an application user, I would prefer a predictable number of pods in my cluster to prevent any scheduling issues and unnecessary autoscaling of nodes. I would also like to achieve consistent allocation of other scarce resources to pods.

Story 2 (Optional)

As an application user, I would like to keep the old behavior of fast scaling of pods, and I do not mind the higher utilization of resources.

Notes/Constraints/Caveats (Optional)

Consideration for Other Controllers

This feature is not considered for standalone ReplicaSets. The reason for this is that ReplicaSet behavior is meant to be simple and used as a building block by other high-level controllers. If we included the PodReplacementPolicy in both ReplicaSets and Deployments, it would be hard to reconcile these fields because a ReplicaSet only has the local view of its own pods. The Deployment has the complete picture of all the pods (through ReplicaSet’s status) in its ReplicaSets and can make the correct balancing decision. Adding such a feature to ReplicaSets could also pose a threat to third-party controllers that embed ReplicaSets in their resource definitions, as this could alter their behavior.

This feature is also not desirable for StatefulSets and DaemonSets, because by design we wait until old pods terminate before creating new pods.

This feature is already implemented for Jobs (KEP-3939).

Risks and Mitigations

Feature Impact

Deployment rollouts might be slower when using the TerminationComplete PodReplacementPolicy.

Deployment rollouts might consume excessive resources when using the TerminationStarted PodReplacementPolicy.

This is mitigated by making this feature opt-in.

kubectl Skew

The deployment.kubernetes.io/replicaset-replicas-before-scale annotation should be removed during deployment rollback when annotations are copied from the ReplicaSet to the Deployment. Support for this removal will be added to kubectl in the same release as this feature. Therefore, rollback using an older kubectl will not be supported until one minor release after the feature first reaches alpha. The documentation for Deployments will include a notice about this.

If an older kubectl version is used, the impact should be minimal. The deployment may end up with an unnecessary deployment.kubernetes.io/replicaset-replicas-before-scale annotation. The deployment controller then synchronizes Deployment annotations back to the ReplicaSet, but it will ignore this new annotation if the feature gate is on.

The bug should be mainly visual (an extra annotation in the Deployment), unless the feature is turned on and off in succession. In this case, incorrect annotations could end up on a ReplicaSet, which would affect the scaling proportions during a rollout.

Design Details

Deployment Behavior Changes

Recreate rollout logic:

Terminating (TerminationStarted):
1. Scale down old ReplicaSet(s) to 0.
2. Wait until all the pods are at least terminating.
3. Create new replica set.
Terminated (TerminationComplete): Current behaviour.
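The two Recreate variants differ only in the condition that gates creation of the new ReplicaSet. The sketch below is one way to express that gate. The oldRS type and the canCreateNewReplicaSet function are hypothetical names for illustration, and the sketch relies on the statement from the Motivation section that .status.replicas does not count terminating pods.

```go
package main

import "fmt"

type policy string

const (
	terminationStarted  policy = "TerminationStarted"
	terminationComplete policy = "TerminationComplete"
)

// oldRS is a hypothetical, reduced view of an old ReplicaSet.
type oldRS struct {
	SpecReplicas        int // .spec.replicas
	StatusReplicas      int // .status.replicas (terminating pods not counted)
	TerminatingReplicas int // .status.terminatingReplicas
}

// canCreateNewReplicaSet reports whether a Recreate rollout may create the
// new ReplicaSet under the given policy.
func canCreateNewReplicaSet(p policy, olds []oldRS) bool {
	for _, rs := range olds {
		// Steps 1 and 2: scaled down to 0 and every pod at least terminating.
		if rs.SpecReplicas != 0 || rs.StatusReplicas != 0 {
			return false
		}
		// TerminationComplete additionally waits for terminating pods to go away.
		if p == terminationComplete && rs.TerminatingReplicas != 0 {
			return false
		}
	}
	return true
}

func main() {
	olds := []oldRS{{SpecReplicas: 0, StatusReplicas: 0, TerminatingReplicas: 3}}
	fmt.Println("TerminationStarted:", canCreateNewReplicaSet(terminationStarted, olds))
	fmt.Println("TerminationComplete:", canCreateNewReplicaSet(terminationComplete, olds))
}
```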

RollingUpdate rollout logic:

Terminating (TerminationStarted): Current behaviour.
Terminated (TerminationComplete): When checking if a new replica set can be scaled up during a rollout, we should consider terminating pods of all ReplicaSets as well and not spawn an amount of replicas that would be higher than Deployment’s .spec.replicas + .spec.strategy.rollingUpdate.maxSurge. This will be implemented by checking ReplicaSet’s .spec.replicas, .status.replicas and .status.terminatingReplicas to determine the number of pods.
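A minimal sketch of the TerminationComplete scale-up budget follows. The text names the three ReplicaSet fields that are consulted but not how they are combined, so taking the larger of .spec.replicas and .status.replicas and adding .status.terminatingReplicas is an assumption of this sketch. The rsCounts type and the allowedScaleUp function are hypothetical names.

```go
package main

import "fmt"

// rsCounts is a hypothetical, reduced view of a ReplicaSet.
type rsCounts struct {
	SpecReplicas        int // .spec.replicas
	StatusReplicas      int // .status.replicas
	TerminatingReplicas int // .status.terminatingReplicas
}

// podsOf estimates how many pods a ReplicaSet accounts for. Combining the
// fields this way is an assumption, not the controller's confirmed logic.
func podsOf(rs rsCounts) int {
	n := rs.SpecReplicas
	if rs.StatusReplicas > n {
		n = rs.StatusReplicas
	}
	return n + rs.TerminatingReplicas
}

// allowedScaleUp returns how many more pods may be created without exceeding
// the Deployment's .spec.replicas + maxSurge.
func allowedScaleUp(deploymentReplicas, maxSurge int, all []rsCounts) int {
	present := 0
	for _, rs := range all {
		present += podsOf(rs)
	}
	if budget := deploymentReplicas + maxSurge - present; budget > 0 {
		return budget
	}
	return 0
}

func main() {
	all := []rsCounts{
		{SpecReplicas: 50, StatusReplicas: 50, TerminatingReplicas: 15},
		{SpecReplicas: 30, StatusReplicas: 30},
		{SpecReplicas: 20, StatusReplicas: 20},
	}
	// Deployment with .spec.replicas=120 and maxSurge=10.
	fmt.Println("pods that may still be created:", allowedScaleUp(120, 10, all))
}
```

In this example the 15 terminating pods consume half of the 30 pods of headroom that would otherwise be available.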

Scaling logic:

Terminating (TerminationStarted): Current behaviour.
Terminated (TerminationComplete):
- When scaling up across one or more ReplicaSets, we should consider terminating pods of all ReplicaSets as well and not spawn replicas that would be higher than Deployment’s .spec.replicas + .spec.strategy.rollingUpdate.maxSurge. This will be implemented by checking ReplicaSet’s .spec.replicas, .status.replicas and .status.terminatingReplicas to determine the number of pods. See Deployment Scaling Changes and a New Annotation for ReplicaSets for more details.
- The scale-down logic does not need to change. We can scale down as many pods as we want, because no pods are being replaced and the policy therefore does not apply.

Deployment Completion and Progress Changes

Currently, when the latest ReplicaSet is fully saturated and all of its pods become available, the Deployment is declared complete. However, there may still be old terminating pods. These pods can still be ready and hold/accept connections, meaning that the transition to the latest revision is not fully complete.

To avoid unexpected behavior, we should not declare the deployment complete until all of its terminating replicas have been fully terminated. We will therefore delay setting the NewRSAvailable reason on the DeploymentProgressing condition when the TerminationComplete policy is used.

We will also update the LastUpdateTime of the DeploymentProgressing condition when the number of terminating pods decreases to reset the progress deadline.
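The completion and progress rules above can be sketched as two small predicates. The names rolloutComplete and madeProgress are hypothetical, and the sketch assumes that a nil policy keeps the current completion behavior.

```go
package main

import "fmt"

type policy string

const terminationComplete policy = "TerminationComplete"

// rolloutComplete reports whether the Deployment may be declared complete.
// newRSAvailable stands for "the latest ReplicaSet is fully saturated and all
// of its pods are available".
func rolloutComplete(p *policy, newRSAvailable bool, terminatingReplicas int) bool {
	if !newRSAvailable {
		return false
	}
	if p != nil && *p == terminationComplete {
		// Completion is delayed until no terminating replicas remain.
		return terminatingReplicas == 0
	}
	return true // current behavior: terminating pods are ignored
}

// madeProgress reports whether the number of terminating pods decreased,
// which is the event that would refresh LastUpdateTime.
func madeProgress(previousTerminating, currentTerminating int) bool {
	return currentTerminating < previousTerminating
}

func main() {
	p := terminationComplete
	fmt.Println("no policy, 3 terminating:", rolloutComplete(nil, true, 3))
	fmt.Println("TerminationComplete, 3 terminating:", rolloutComplete(&p, true, 3))
	fmt.Println("TerminationComplete, 0 terminating:", rolloutComplete(&p, true, 0))
	fmt.Println("terminating 3 -> 1 is progress:", madeProgress(3, 1))
}
```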

Deployment Scaling Changes and a New Annotation for ReplicaSets

Currently, scaling is done proportionally over all ReplicaSets to mitigate the risk of losing availability during a rolling update.

To calculate the new ReplicaSet size, we need to know

replicasBeforeScale: The .spec.replicas of the ReplicaSet before the scaling began.
deploymentMaxReplicas: Equal to .spec.replicas + .spec.strategy.rollingUpdate.maxSurge of the current Deployment.
deploymentMaxReplicasBeforeScale: Equal to .spec.replicas + .spec.strategy.rollingUpdate.maxSurge of the old Deployment. This information is stored in the deployment.kubernetes.io/max-replicas annotation in each ReplicaSet.

Then we can calculate a new size for each ReplicaSet proportionally as follows:

$$ newReplicaSetReplicas = replicasBeforeScale * \frac{deploymentMaxReplicas}{deploymentMaxReplicasBeforeScale} $$

This is currently done in the getReplicaSetFraction function. The leftover pods are added to the largest ReplicaSet (or newest if more than one ReplicaSet has the largest number of pods).
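The formula and the leftover rule can be sketched as follows. This is not the getReplicaSetFraction implementation. It is a simplified illustration that assumes rounding to the nearest integer, a target total equal to deploymentMaxReplicas, and ties for the largest ReplicaSet broken by position rather than by age. The function name proportionalSizes is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// proportionalSizes scales every ReplicaSet size by
// deploymentMaxReplicas / deploymentMaxReplicasBeforeScale and gives any
// leftover caused by rounding to the largest ReplicaSet.
func proportionalSizes(replicasBeforeScale []int, maxReplicas, maxReplicasBeforeScale int) []int {
	sizes := make([]int, len(replicasBeforeScale))
	if len(sizes) == 0 || maxReplicasBeforeScale <= 0 {
		return sizes
	}
	total, largest := 0, 0
	for i, before := range replicasBeforeScale {
		scaled := float64(before) * float64(maxReplicas) / float64(maxReplicasBeforeScale)
		sizes[i] = int(math.Round(scaled))
		total += sizes[i]
		if sizes[i] > sizes[largest] {
			largest = i
		}
	}
	// The leftover can be negative if rounding overshoots the target.
	sizes[largest] += maxReplicas - total
	return sizes
}

func main() {
	// Three ReplicaSets with 60, 30 and 20 replicas; the Deployment maximum
	// grows from 110 to 130 and then from 130 to 140.
	first := proportionalSizes([]int{60, 30, 20}, 130, 110)
	second := proportionalSizes(first, 140, 130)
	fmt.Println(first, second)
}
```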

This results in the following scaling behavior.

The first scale operation occurs at T2 and the second scale at T3.

Time	Terminating Pods	RS1 Replicas	RS2 Replicas	RS3 Replicas	All RS Total	Deployment .spec.replicas	Deployment .spec.replicas + MaxSurge	Scale ratio
T1	any amount	60	30	20	110	100	110	-
T2	any amount	71	35	24	130	120	130	1.182
T3	any amount	76	38	26	140	130	140	1.077

With the TerminationComplete PodReplacementPolicy, scaling cannot proceed immediately if there are terminating pods present, in order to adhere to the Deployment constraints. We need to scale some ReplicaSets fully and some partially, and postpone the rest of the scaling until the terminating pods disappear.

A single scale operation occurs at T2.

Time	Terminating Pods	RS1 Replicas	RS2 Replicas	RS3 Replicas	All RS Total	Deployment .spec.replicas	Deployment .spec.replicas + MaxSurge	Scale ratio
T1	15	50	30	20	100	100	110	-
T2	15	59	35	21	115	120	130	1.182
T3	5	66	35	24	125	120	130	-
T4	0	71	35	24	130	120	130	-

To proceed with the scaling in the future (T3), we need to remember both replicasBeforeScale and deploymentMaxReplicasBeforeScale to calculate the original scale ratio. The terminating pods can take a long time to terminate and there can be many steps and ReplicaSet updates between T2 and T3. If we were to use the current number of ReplicaSet or Deployment replicas in any of these steps (including T3), we would calculate an incorrect scale ratio.

deploymentMaxReplicasBeforeScale is already stored in the deployment.kubernetes.io/max-replicas ReplicaSet annotation. The main change is that we need to keep the old Deployment max replicas value in the annotation until the partial scale for a ReplicaSet is complete.
To remember replicasBeforeScale, we will introduce a new annotation called deployment.kubernetes.io/replicaset-replicas-before-scale, which will be added to the Deployment’s ReplicaSets that are being partially scaled. This annotation will be removed once the partial scaling is complete. This annotation will be added and managed by the deployment controller.

These two ReplicaSet annotations will be used to calculate the original scale ratio for the partial scaling.

The following example shows a first scale at T2 and a second scale at T3.

Time	Terminating Pods	RS1 Replicas	RS2 Replicas	RS3 Replicas	All RS Total	Deployment .spec.replicas	Deployment .spec.replicas + MaxSurge	Scale ratio
T1	15	50	30	20	100	100	110	-
T2	15	59	35	21	115	120	130	1.182
T3	15	66	38	21	125	130	140	1.077 (1.273 from T1)
T4	5	72	38	25	135	130	140	-
T5	0	77	38	25	140	130	140	-

At T2, a full scale was done for RS1 with a ratio of 1.182. RS1 can then use the new scale ratio at T3 with a value of 1.077.
RS2 has been partially scaled (1.182 ratio) and RS3 has not been scaled at all at T2 due to the terminating pods. When a new scale occurs at T3, RS2 and RS3 have not yet completed the first scale. So their annotations still point to the T1 state. A new ratio of 1.273 is calculated and used for the second scale.
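How the two annotations recover the original ratio can be sketched as below. The annotation keys are the ones named in this section. The targetSize function, its rounding to the nearest integer, and the fallback to the current .spec.replicas when the replicaset-replicas-before-scale annotation is absent are assumptions of this sketch. Leftover pods, which go to the largest ReplicaSet, are not handled here.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

const (
	maxReplicasAnnotation         = "deployment.kubernetes.io/max-replicas"
	replicasBeforeScaleAnnotation = "deployment.kubernetes.io/replicaset-replicas-before-scale"
)

// targetSize returns the size a ReplicaSet should eventually reach for the
// current Deployment maximum (.spec.replicas + maxSurge).
func targetSize(annotations map[string]string, specReplicas, deploymentMaxReplicas int) (int, error) {
	maxBefore, err := strconv.Atoi(annotations[maxReplicasAnnotation])
	if err != nil || maxBefore <= 0 {
		return 0, fmt.Errorf("invalid %s annotation: %q", maxReplicasAnnotation, annotations[maxReplicasAnnotation])
	}
	// A partially scaled ReplicaSet remembers its size from before the scale.
	before := specReplicas
	if raw, ok := annotations[replicasBeforeScaleAnnotation]; ok {
		if before, err = strconv.Atoi(raw); err != nil {
			return 0, fmt.Errorf("invalid %s annotation: %q", replicasBeforeScaleAnnotation, raw)
		}
	}
	scaled := float64(before) * float64(deploymentMaxReplicas) / float64(maxBefore)
	return int(math.Round(scaled)), nil
}

func main() {
	// RS2 and RS3 at T3: annotations still describe the T1 state (max 110).
	rs2 := map[string]string{maxReplicasAnnotation: "110", replicasBeforeScaleAnnotation: "30"}
	rs3 := map[string]string{maxReplicasAnnotation: "110", replicasBeforeScaleAnnotation: "20"}
	for _, c := range []struct {
		name string
		ann  map[string]string
		spec int
	}{{"RS2", rs2, 35}, {"RS3", rs3, 21}} {
		size, err := targetSize(c.ann, c.spec, 140)
		fmt.Println(c.name, size, err)
	}
}
```

With a new Deployment maximum of 140, the sketch yields the sizes that RS2 and RS3 reach in the example table (38 and 25), which corresponds to the 1.273 ratio computed from the T1 state.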

As we can see, we will get a slightly different result when compared to the first table. This is due to the consecutive scales and the fact that the last scale is not yet fully completed.

The consecutive partial scaling behavior is a best effort. We still adhere to all deployment constraints and have a bias toward scaling the largest ReplicaSet. To implement this properly we would have to introduce a full scaling history, which is probably not worth the added complexity.

kubectl Changes

Similar to deployment.kubernetes.io/max-replicas, the deployment.kubernetes.io/replicaset-replicas-before-scale annotation has to be added to annotationsToSkip, so that it is removed when annotations are copied from the ReplicaSet to the Deployment during a rollback. See kubectl Skew for more details.
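The effect of a skip list during rollback can be sketched as follows. The map below is a reduced illustration that contains only the two annotations named in this KEP. It is not a copy of the kubectl source, and copyAnnotationsForRollback is a hypothetical helper name.

```go
package main

import "fmt"

// annotationsToSkip is a reduced, illustrative skip list.
var annotationsToSkip = map[string]bool{
	"deployment.kubernetes.io/max-replicas":                     true,
	"deployment.kubernetes.io/replicaset-replicas-before-scale": true,
}

// copyAnnotationsForRollback returns the ReplicaSet annotations that should
// be copied to the Deployment, leaving out the controller-managed ones.
func copyAnnotationsForRollback(rsAnnotations map[string]string) map[string]string {
	out := make(map[string]string, len(rsAnnotations))
	for key, value := range rsAnnotations {
		if annotationsToSkip[key] {
			continue
		}
		out[key] = value
	}
	return out
}

func main() {
	rs := map[string]string{
		"example.com/owner":                     "team-a", // hypothetical user annotation
		"deployment.kubernetes.io/max-replicas": "110",
		"deployment.kubernetes.io/replicaset-replicas-before-scale": "30",
	}
	fmt.Println(copyAnnotationsForRollback(rs))
}
```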

API

// DeploymentPodReplacementPolicy specifies the policy for creating Deployment Pod replacements.
// Default is a mixed behavior depending on the DeploymentStrategy
// +enum
type DeploymentPodReplacementPolicy string

const (
	// TerminationStarted policy creates replacement Pods when the old Pods start
	// terminating (have a non-null .metadata.deletionTimestamp). The total number
	// of Deployment Pods can be greater than specified by the Deployment's
	// .spec.replicas and the DeploymentStrategy.
	TerminationStarted DeploymentPodReplacementPolicy = "TerminationStarted"
	// TerminationComplete policy creates replacement Pods only when the old Pods
	// are fully terminated (reach Succeeded or Failed phase). The old Pods are
	// subsequently removed. The total number of the Deployment Pods is
	// limited by the Deployment's .spec.replicas and the DeploymentStrategy.
	//
	// This policy will also delay declaring the deployment as complete until all
	// of its terminating replicas have been fully terminated.
	TerminationComplete DeploymentPodReplacementPolicy = "TerminationComplete"
)

type DeploymentSpec struct {
	...
	// podReplacementPolicy specifies when to create replacement Pods.
	// Possible values are:
	// - TerminationStarted policy creates replacement Pods when the old Pods start
	//   terminating (have a non-null .metadata.deletionTimestamp). The total number
	//   of Deployment Pods can be greater than specified by the Deployment's
	//   .spec.replicas and the DeploymentStrategy.
	// - TerminationComplete policy creates replacement Pods only when the old Pods
	//   are fully terminated (reach Succeeded or Failed phase). The old Pods are
	//   subsequently removed. The total number of the Deployment Pods is
	//   limited by the Deployment's .spec.replicas and the DeploymentStrategy.
	//   This policy will also delay declaring the deployment as complete until all
	//   of its terminating replicas have been fully terminated.
	//
	// The default behavior when the policy is not specified depends on the DeploymentStrategy:
	// - Recreate strategy uses TerminationComplete behavior when recreating the deployment,
	//   but uses TerminationStarted when scaling the deployment.
	// - RollingUpdate strategy uses TerminationStarted behavior for both rolling out and
	//   scaling the deployments.
	//
	// This is an alpha field. Enable DeploymentPodReplacementPolicy and
	// DeploymentReplicaSetTerminatingReplicas to be able to use this field.
	// +optional
	PodReplacementPolicy *DeploymentPodReplacementPolicy `json:"podReplacementPolicy,omitempty" protobuf:"bytes,10,opt,name=podReplacementPolicy,casttype=podReplacementPolicy"`
	...
}
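The defaulting rules in the field comment can be read as a small decision function. The sketch below is an illustration of that reading and uses hypothetical names (effectivePolicy, strategy, operation). It is not the controller's implementation.

```go
package main

import "fmt"

type policy string
type strategy string
type operation string

const (
	terminationStarted  policy = "TerminationStarted"
	terminationComplete policy = "TerminationComplete"

	recreate      strategy = "Recreate"
	rollingUpdate strategy = "RollingUpdate"

	rollout operation = "rollout"
	scale   operation = "scale"
)

// effectivePolicy returns the behavior that applies to an operation. An
// explicit policy always wins; nil falls back to the strategy-dependent
// default described in the field comment.
func effectivePolicy(p *policy, s strategy, op operation) policy {
	if p != nil {
		return *p
	}
	if s == recreate && op == rollout {
		return terminationComplete
	}
	return terminationStarted
}

func main() {
	for _, s := range []strategy{recreate, rollingUpdate} {
		for _, op := range []operation{rollout, scale} {
			fmt.Printf("%s/%s with nil policy: %s\n", s, op, effectivePolicy(nil, s, op))
		}
	}
}
```

Setting the field explicitly removes the mixed default of the Recreate strategy, because the same policy then applies to both rollouts and scaling.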

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

We assess that the deployment and replicaset controllers have adequate test coverage for places which might be impacted by this enhancement. Thus, no additional tests are needed prior to implementing this enhancement.

Unit tests

Unit tests covering:

Deployment

The current behavior remains unchanged when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gates are disabled or PodReplacementPolicy is nil. The .status.terminatingReplicas field should be 0 in that case.
Add a test wrapper for any relevant tests, to ensure that they are run with all possible PodReplacementPolicy values correctly. The relevant tests are those that expect some behavior on Pod deletion, and are affected by this change.
New unit tests should be added for any new helper functions.
Test that the status is computed correctly.
Test feature gate enablement and disablement.

The core packages (with their unit test coverage) which are going to be modified during the implementation:

k8s.io/kubernetes/pkg/apis/apps/v1: 9 December 2023 - 71.4%
k8s.io/kubernetes/pkg/apis/apps/validation: 9 December 2023 - 92.3%
k8s.io/kubernetes/pkg/controller/deployment: 9 December 2023 - 61.7%
k8s.io/kubernetes/pkg/controller/deployment/util: 9 December 2023 - 50.1%
k8s.io/kubernetes/pkg/controller/replicaset: 9 December 2023 - 78.9%
k8s.io/kubernetes/pkg/controller: 9 December 2023 - 71.2%

Integration tests

Deployment

The current behavior remains unchanged when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gates are disabled or PodReplacementPolicy is nil.
Add a test wrapper for any relevant tests, to ensure that they are run with all possible PodReplacementPolicy values correctly. The relevant tests are those that expect some behavior on Pod deletion, and are affected by this change.
Add new tests that observe rollout and scaling transitions for all possible PodReplacementPolicy values and ensure that .status.terminatingReplicas is correctly counted when the DeploymentReplicaSetTerminatingReplicas and DeploymentPodReplacementPolicy feature gates are enabled.

e2e tests

Test that a Deployment with RollingUpdate strategy and a TerminationComplete PodReplacementPolicy does not exceed the number of pods specified by .spec.replicas + .spec.strategy.rollingUpdate.maxSurge at any point in time when rolling out new revisions and/or scaling the deployment.
Test scaling of Deployments that are in the middle of a rollout (even with more than 2 revisions). Verify that scaling is done proportionally across all ReplicaSets when terminating pods are present. Scale these deployments in succession, even when the previous scale has not yet completed.

Graduation Criteria

Alpha

Feature gates disabled by default.
Unit, enablement/disablement, e2e, and integration tests implemented and passing.
Document kubectl Skew for alpha.

Beta

Feature gates enabled by default.
.spec.podReplacementPolicy is nil by default and preserves the original behavior.
Explore and try to resolve the "Proportional scaling in Deployments is not fully re-entrant" issue.
E2e and integration tests are in Testgrid and linked in the KEP.
Add new metrics to kube-state-metrics.
Remove documentation for kubectl Skew that was introduced in alpha.

GA

Every bug report is fixed.
Confirm the stability of e2e and integration tests.
DeploymentPodReplacementPolicy feature gate is ignored.

Upgrade / Downgrade Strategy

No changes are required for an existing cluster to use the enhancement.

Version Skew Strategy

We need to consider the version skew between kube-controller-manager and the apiserver.

If the feature is enabled on the apiserver, but not in the kube-controller-manager, then the .spec.podReplacementPolicy field can be set, but the feature will not function.

If the feature is not enabled on the apiserver, and it is enabled in the kube-controller-manager, then

The feature cannot be used for new workloads.
Workloads that have the .spec.podReplacementPolicy field set will use the new behavior.
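The skew cases above can be summarized with two predicates. This is a sketch with hypothetical function names. The alreadySet parameter reflects the assumption that the apiserver keeps a field that is already present on a stored object even when the gate is off, which is what allows existing workloads to keep using the new behavior.

```go
package main

import "fmt"

// fieldPersisted reports whether .spec.podReplacementPolicy survives a write
// to the apiserver. alreadySet is an assumption of this sketch: a value that
// is already stored is kept even when the gate is disabled.
func fieldPersisted(apiserverGate, alreadySet bool) bool {
	return apiserverGate || alreadySet
}

// newBehaviorUsed reports whether kube-controller-manager honors the policy.
func newBehaviorUsed(controllerGate, fieldSet bool) bool {
	return controllerGate && fieldSet
}

func main() {
	// apiserver enabled, kube-controller-manager disabled.
	fmt.Println("field can be set:", fieldPersisted(true, false),
		"feature functions:", newBehaviorUsed(false, true))
	// apiserver disabled, kube-controller-manager enabled.
	fmt.Println("new workload can set field:", fieldPersisted(false, false),
		"existing workload uses new behavior:", newBehaviorUsed(true, fieldPersisted(false, true)))
}
```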

Also, as mentioned in kubectl Skew, kubectl skew is not supported in the alpha version.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DeploymentPodReplacementPolicy
- Components depending on the feature gate:
  - kube-apiserver
  - kube-controller-manager

Does enabling the feature change any default behavior?

No, the behavior is only changed when users specify the podReplacementPolicy in the Deployment spec.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

By disabling the feature:

Extra pods can appear during a deployment rollout or scaling. This can increase the number of pods that need to be scheduled, and it can have an impact on the resource consumption.

As mentioned in kubectl Skew, kubectl skew is not supported in alpha. If an older unsupported version of kubectl was used, it is important to remove the deployment.kubernetes.io/replicaset-replicas-before-scale annotation from all Deployments and ReplicaSets after disabling this feature. This should prevent any unexpected behavior on the next enablement.

What happens if we reenable the feature if it was previously rolled back?

The ReplicaSet and Deployment controllers will start reconciling and fulfilling the .spec.podReplacementPolicy contract.

Similar to the section above, it is important to make sure that the deployment.kubernetes.io/replicaset-replicas-before-scale annotation is removed from all Deployments and ReplicaSets before the re-enablement.

Are there any tests for feature enablement/disablement?

Appropriate enablement/disablement tests will be added to the replicaset and deployment strategy_test.go and unit tests in alpha.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The rollout should not fail, as the feature is hidden behind a feature gate and a new optional field.

During a rollback, the .spec.podReplacementPolicy field will be ignored. This will cause workloads that use this field to fall back to the original deployment rollout and scaling behaviour. This can be problematic for workloads that are not expecting:

excessive number of pods
excessive resource consumption
slower or faster deployment rollout or scaling speed

This can also affect other workloads, for example by exhausting resources on a node.

What specific metrics should inform a rollback?

kube-controller-manager’s deployment workqueue metrics such as workqueue_retries_total, workqueue_depth, workqueue_work_duration_seconds_bucket can be observed. A sudden increase in these metrics can indicate a problem with the DeploymentPodReplacementPolicy feature.

Deployment pods can be watched for an incorrect number of pods during a deployment rollout or scaling.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

TBD: Manual upgrade->downgrade->upgrade path will be tested.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator can observe .status.terminatingReplicas on both ReplicaSets and Deployments. The same field is being added as a metric and can be observed there as well: kube_replicaset_status_terminating_replicas and kube_deployment_status_replicas_terminating.

How can someone using this feature know that it is working for their instance?

When using the TerminationComplete PodReplacementPolicy, the user should not see an excess of running and terminating pods created that is greater than the deployment’s .spec.replicas and its deployment strategy.

When using the TerminationStarted PodReplacementPolicy, the user should see an excess of running and terminating pods created that is greater than the deployment’s .spec.replicas and its deployment strategy. This will in turn make the deployment rollout faster.

The terminating pods can be observed in .status.terminatingReplicas on both ReplicaSets and Deployments.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We do not propose any SLO/SLI for this feature.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: workqueue_retries_total can be used to see if there is a sudden increase in sync retries after enabling the feature
  - Aggregation method: name = “deployment”
  - Components exposing the metric: kube-controller-manager
- Metric name: workqueue_depth can be used to see if there is a sudden increase in unprocessed deployment objects after enabling the feature
  - Aggregation method: name = “deployment”
  - Components exposing the metric: kube-controller-manager
- Metric name: workqueue_work_duration_seconds_bucket can be used to see if there is a sudden increase in duration of syncing deployment objects after enabling the feature
  - Aggregation method: name = “deployment”
  - Components exposing the metric: kube-controller-manager
- Metric name: kube_replicaset_status_terminating_replicas
  - Components exposing the metric: kube-state-metrics
- Metric name: kube_deployment_status_replicas_terminating
  - Components exposing the metric: kube-state-metrics

Are there any missing metrics that would be useful to have to improve observability of this feature?

kube_replicaset_status_terminating_replicas and kube_deployment_status_replicas_terminating were added by KEP-3973.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No, it will use the existing calls for creating and reconciling Deployments, ReplicaSets, and Pods. The number of calls may be higher in some scenarios if a large spec.strategy.rollingUpdate.maxSurge value is specified. But the maximum number of calls per deployment should be similar to when spec.strategy.rollingUpdate.maxSurge value is set to 1 when the feature is disabled.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes.

API: Deployment
Estimated increase in size:
- New field in Deployment spec: about 11 bytes.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change in behavior. Deployment and ReplicaSet controllers might fail in reconciling their objects and in turn stop deployment rollout or scaling.

What are other known failure modes?

N/A

TBD

What steps should be taken if SLOs are not being met to determine the problem?

Inspecting the kube-controller-manager logs at an increased log level for any failures in deployment and replicaset controllers.

Implementation History

2023-05-01: First version of the KEP opened (https://github.com/kubernetes/enhancements/pull/3974).
2023-12-12: Second version of the KEP opened (https://github.com/kubernetes/enhancements/pull/4357).
2024-05-29: Added a Deployment Scaling Changes and a New Annotation for ReplicaSets section (https://github.com/kubernetes/enhancements/pull/4670).
2024-11-22: Added a Deployment Completion and Progress Changes section (https://github.com/kubernetes/enhancements/pull/4976).
2025-04-01: Introduced DeploymentReplicaSetTerminatingReplicas FG to split .status.terminatingReplicas feature from DeploymentPodReplacementPolicy (https://github.com/kubernetes/kubernetes/pull/131088).
2025-06-11: Fixed ReplicationController reconciliation when the DeploymentReplicaSetTerminatingReplicas feature gate is enabled (https://github.com/kubernetes/kubernetes/issues/131821).
2026-02-03: KEP-3973 was split, and KEP-5882 now covers the DeploymentPodReplacementPolicy feature.

Drawbacks

Deployment rollouts might be slower when using the TerminationComplete PodReplacementPolicy.

Deployment rollouts might consume excessive resources when using the TerminationStarted PodReplacementPolicy.

Alternatives

This feature could be implemented by a different controller that manages ReplicaSets. However, there is a need to implement it in the Deployment controller, because many existing workloads can benefit from it.
