KEP-3939: Allow replacement of Pods in a Job when fully terminated

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, Jobs start replacement Pods as soon as previously created Pods are terminating (have a deletionTimestamp) or fail (phase=Failed). Terminating pods are currently counted as failed in the Job status. However, terminating pods are actually in a transitory state where they are neither active nor really fully terminated.
This KEP proposes a new field for the Job API that lets users specify whether they want replacement Pods as soon as the previous Pods are terminating (existing behavior) or only once the existing Pods are fully terminated (new behavior).

Motivation

Existing Issues:

Many common machine learning frameworks, such as Tensorflow and JAX, require unique pods per Index. Currently, if a pod enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created and immediately fails to start.

Having a replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods can take a long time to find resources and they may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, the replacement Pods might produce undesired scale ups.

On the other hand, if a replacement Pod is not immediately created, the Job status would show that the number of active pods doesn’t match the desired parallelism. To provide better visibility, the job status can have a new field to track the number of Pods currently terminating.

This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.

Goals

Job controller should allow for flexibility in waiting for pods to be fully terminated before creating replacement Pods
Job controller will have a new status field where we include the number of terminating pods.

Non-Goals

Other workload APIs are not included in this proposal.

Proposal

The Job controller gets a list of active pods. Active pods are pods that don’t have a terminal phase (Succeeded or Failed) and are not terminating (that is, they do not have a deletionTimestamp). In this KEP, we will consider terminating pods to be separate from active and failed.
As an opt-in behavior, the job controller can use the active and terminating pods to determine whether replacement Pods are needed.

We propose two new API fields:

A field in Spec that allows for opt-in behavior of whether to wait for terminating pods to finish before creating replacement pods.
A new field in Status for tracking the number of terminating pods.

User Stories (Optional)

Story 1

As a machine learning user, I run ML frameworks that schedule multiple Pods per Job.
By default, the Job controller does not wait for terminating Pods to be marked as failed before creating replacements.
Tensorflow and other ML frameworks may require that Pods are only started once the Pods they replace are fully terminated.

This case was added due to a bug discovered with running IndexedJobs with Tensorflow.
See "Jobs create replacement Pods as soon as a Pod is marked for deletion" for more details.

Story 2

As a cloud user, I want to guarantee that the number of running Pods is exactly the number that I specify.
Terminating Pods do not relinquish resources, so scarce compute resources remain allocated to them until they finish. Waiting for full termination before creating replacement Pods avoids unnecessary scale-ups.

Story 3

As a Job-level quota controller, I want to track the number of terminating pods, in addition to the active pods.

See "Kueue: Account for terminating pods when doing preemption" for an example of this.

Notes/Constraints/Caveats (Optional)

The default job controller behavior

Based on the proposed API below, the behavior of the job controller prior to this KEP is equivalent to podReplacementPolicy: TerminatingOrFailed.

This behavior has the following semantic problems:

A terminating Pod might gracefully terminate as Succeeded, but it counts towards .status.failed as soon as it’s terminating and it’s not reclassified upon termination.
When using podFailurePolicy, the controller might create a replacement Pod before being able to evaluate the terminal state of the Pod. The replacement Pod might be terminated due to the policy.

In a Job v2 API, we should consider having the default behavior equivalent to podReplacementPolicy: Failed, given the above problems. We could even consider removing the proposed field podReplacementPolicy.

But for backwards compatibility, in v1, we have to introduce a change of behavior as opt-in.

When Pods enter a terminating state

Pods can be marked for termination by several controllers, which we typically refer to as disruptions, such as: kubelet eviction, scheduler preemption, API eviction, etc.

The job controller itself can delete running Pods, in the following scenarios:

When a Job exceeds its activeDeadlineSeconds.
When the number of Pod failures reaches the backoffLimit.
When podFailurePolicy is in use and FailJob is set as the action.

In all these situations, the Pod initially gets a deletionTimestamp and we interpret the pod as “terminating”. Once the pod terminates, it gets a terminal phase (Succeeded or Failed).

Exponential Backoff for Pod Failures

The job controller implements backoff delays to prevent fast recreation of continuously failing Pods.

This behavior is internal (not configurable through the API) and it’s orthogonal to this KEP. The behavior will be preserved as follows:

When podReplacementPolicy: TerminatingOrFailed, the backoff period counts from the time the Pod is terminating or Failed.
When podReplacementPolicy: Failed, the backoff period counts from the time the Pod is Failed.

Risks and Mitigations

Pods are not guaranteed to transition to a terminal phase

One area of contention is how this KEP will work with 3329-retriable-and-non-retriable-failures.

In 3329, there was a decision to make kubelet transition pods to failed before deleting them. This behavior is guarded by the PodDisruptionConditions feature gate, which, in addition to setting the phase to Failed, adds a DisruptionTarget condition. This means that when this feature is turned on, the job controller is able to count pods as failed only when they are fully terminated, as it is guaranteed that all pods will reach a terminal state (Failed or Succeeded). Note that a terminating pod is not considered active either. If PodDisruptionConditions is turned off, then the job controller considers the pod as failed as soon as it is terminating (has a deletion timestamp), because there is no guarantee that the pod will transition to phase=Failed.

Another issue is described here. If PodDisruptionConditions is disabled, a pod bound to a no-longer-existing node may be stuck in the Running phase. As a consequence, it will never be replaced, so the whole job will be unable to make progress. When PodDisruptionConditions is enabled, the PodGC transitions the Pod to phase Failed in this scenario.

Due to the above issues, we propose the following mitigation:

If PodDisruptionConditions OR JobPodReplacementPolicy are enabled, set phase=Failed in kubelet and podGC before deleting a Pod.
If JobPodReplacementPolicy is enabled, but PodDisruptionConditions is disabled, the kubelet and podGC only set the phase, but do not add a DisruptionTarget condition.
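The mitigation amounts to a small truth table over the two feature gates. The Go sketch below spells it out; the `FeatureGates` struct and both helper functions are hypothetical names introduced for illustration and do not correspond to real kubelet or PodGC code.

```go
package main

import "fmt"

// FeatureGates is a hypothetical stand-in for the two gates discussed above.
type FeatureGates struct {
	PodDisruptionConditions bool
	JobPodReplacementPolicy bool
}

// setFailedPhaseBeforeDelete reports whether kubelet and PodGC move a pod to
// phase=Failed before deleting it.
func setFailedPhaseBeforeDelete(g FeatureGates) bool {
	return g.PodDisruptionConditions || g.JobPodReplacementPolicy
}

// addDisruptionTarget reports whether the DisruptionTarget condition is added
// in addition to setting the phase.
func addDisruptionTarget(g FeatureGates) bool {
	return g.PodDisruptionConditions
}

func main() {
	for _, pdc := range []bool{false, true} {
		for _, jprp := range []bool{false, true} {
			g := FeatureGates{PodDisruptionConditions: pdc, JobPodReplacementPolicy: jprp}
			fmt.Printf("PodDisruptionConditions=%v JobPodReplacementPolicy=%v -> setPhase=%v addCondition=%v\n",
				pdc, jprp, setFailedPhaseBeforeDelete(g), addDisruptionTarget(g))
		}
	}
}
```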

Design Details

Job API Definition

At the JobSpec level, we are adding a new enum field:

// PodReplacementPolicy controls when replacement pods are created.
// Without podFailurePolicy, the default is TerminatingOrFailed, i.e. pods are
// recreated when they are terminating or failed.
// +enum
type PodReplacementPolicy string
const (
 // TerminatingOrFailed is a policy that creates replacement pods when they are
 // marked as terminating (have a deletion timestamp) or reach the terminal
 // phase `Failed`.
 // Terminating pods count towards `.status.failed`, even if they later reach
 // the terminal phase `Succeeded`.
 TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
 // Failed is a policy that creates replacement Pods only when the previously
 // created Pods reach the terminal phase `Failed`.
 Failed PodReplacementPolicy = "Failed"
)

type JobSpec struct {
 // ... (existing fields elided)
 // podReplacementPolicy specifies when to create replacement Pods. Possible values are:
 // - TerminatingOrFailed means to create a replacement Pod when the previously
 //   created Pod is terminating or failed.
 // - Failed means to wait until a previously created Pod is fully terminated
 //   before creating a replacement Pod.
 //
 // When using podFailurePolicy, the default value is Failed and this is the
 // only allowed policy.
 // When not using podFailurePolicy, the default value is TerminatingOrFailed.
 // +optional
 PodReplacementPolicy *PodReplacementPolicy
}

In order to offer visibility of the number of terminating pods, we include a new field in the JobStatus.

type JobStatus struct {
  // ... (existing fields elided)
  // Number of terminating pods (exposed as .status.terminating)
  // +optional
  Terminating *int32
}

Defaulting and validation

Defaulting of podReplacementPolicy will depend on whether podFailurePolicy is in use:

when podFailurePolicy is in use, the default value is Failed.
when podFailurePolicy is not in use, the default value is TerminatingOrFailed.

When podFailurePolicy is in use, the only allowed value for podReplacementPolicy is Failed.
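The defaulting and validation rules above can be sketched in Go as follows. This is an illustration only: the reduced `JobSpec` (with a boolean `HasPodFailurePolicy` standing in for the real podFailurePolicy field) and the helpers `setDefaults` and `validate` are assumptions made for this sketch, not the API server's actual functions.

```go
package main

import (
	"errors"
	"fmt"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// JobSpec is a reduced, hypothetical stand-in for the real type.
type JobSpec struct {
	HasPodFailurePolicy  bool
	PodReplacementPolicy *PodReplacementPolicy
}

// setDefaults fills in podReplacementPolicy when the user left it unset.
func setDefaults(spec *JobSpec) {
	if spec.PodReplacementPolicy != nil {
		return
	}
	policy := TerminatingOrFailed
	if spec.HasPodFailurePolicy {
		policy = Failed
	}
	spec.PodReplacementPolicy = &policy
}

// validate rejects unknown values and any value other than Failed when
// podFailurePolicy is in use.
func validate(spec JobSpec) error {
	if spec.PodReplacementPolicy == nil {
		return nil
	}
	switch *spec.PodReplacementPolicy {
	case TerminatingOrFailed, Failed:
	default:
		return fmt.Errorf("unsupported podReplacementPolicy %q", *spec.PodReplacementPolicy)
	}
	if spec.HasPodFailurePolicy && *spec.PodReplacementPolicy != Failed {
		return errors.New("podReplacementPolicy must be Failed when podFailurePolicy is set")
	}
	return nil
}

func main() {
	withPolicy := JobSpec{HasPodFailurePolicy: true}
	setDefaults(&withPolicy)
	fmt.Println("default with podFailurePolicy:", *withPolicy.PodReplacementPolicy)

	invalid := TerminatingOrFailed
	err := validate(JobSpec{HasPodFailurePolicy: true, PodReplacementPolicy: &invalid})
	fmt.Println("validation error:", err)
}
```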

Tracking the terminating pods

To support quota management by Job-level controllers (Story 3), we introduced the .status.terminating field, which tracks the number of terminating pods. However, in the initial Beta implementation, the field stops tracking the number of terminating pods as soon as the Job is marked as Failed with the Failed condition (see [issue #123775](https://github.com/kubernetes/kubernetes/issues/123775)). The remaining pods may be occupying resources for an arbitrary amount of time.

In 1.31 we are going to fix this issue by delaying the addition of the Failed or Complete conditions until all pods are fully terminated. To indicate as soon as possible that a Job is doomed to fail or succeed, we extend the scope of the pre-existing conditions FailureTarget and SuccessCriteriaMet, respectively. See more details in Job API managed-by mechanism.
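The ordering of conditions described here can be sketched in Go. The helper `jobConditions` and its parameters are hypothetical; in particular, `remainingPods` is assumed to be the number of the Job's pods that have not yet reached a terminal phase (active plus terminating). The real controller works with full condition objects rather than plain strings.

```go
package main

import "fmt"

// jobConditions is a hypothetical helper that returns the condition types a
// Job would carry. The interim condition is added immediately; the terminal
// condition is added only once no pods remain.
func jobConditions(doomedToFail, successCriteriaMet bool, remainingPods int32) []string {
	var conds []string
	switch {
	case doomedToFail:
		conds = append(conds, "FailureTarget")
		if remainingPods == 0 {
			conds = append(conds, "Failed")
		}
	case successCriteriaMet:
		conds = append(conds, "SuccessCriteriaMet")
		if remainingPods == 0 {
			conds = append(conds, "Complete")
		}
	}
	return conds
}

func main() {
	// backoffLimit exceeded while two pods are still terminating.
	fmt.Println(jobConditions(true, false, 2))
	// The same Job once every pod has terminated.
	fmt.Println(jobConditions(true, false, 0))
	// Completions satisfied and no pods left.
	fmt.Println(jobConditions(false, true, 0))
}
```

A client watching the Job can therefore keep reading .status.terminating until the terminal condition appears.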

Implementation

As part of this KEP, we need to track pods that are terminating (deletionTimestamp != nil and phase is Pending or Running).

The following algorithm could be used:

Count the number of pods that are active and not terminating.
Count the number of terminating pods.
In manageJob we will count expected pods as:

when podReplacementPolicy: Failed then expectedPods = active + terminating.
when podReplacementPolicy: TerminatingOrFailed then expectedPods = active.

Use the expected number of pods to decide whether to recreate.
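The steps above can be sketched in Go. The function `expectedPods` follows the two formulas directly; `podsToCreate` is a hypothetical simplification that ignores completions, backoff and Indexed mode, which the real manageJob also has to take into account.

```go
package main

import "fmt"

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// expectedPods returns the number of pods treated as still occupying a slot.
func expectedPods(policy PodReplacementPolicy, active, terminating int32) int32 {
	if policy == Failed {
		return active + terminating
	}
	return active
}

// podsToCreate is a hypothetical helper: the number of replacement pods to
// start so that the expected pods reach the desired parallelism.
func podsToCreate(policy PodReplacementPolicy, parallelism, active, terminating int32) int32 {
	diff := parallelism - expectedPods(policy, active, terminating)
	if diff < 0 {
		return 0
	}
	return diff
}

func main() {
	// parallelism=3 with two active pods and one terminating pod.
	for _, policy := range []PodReplacementPolicy{TerminatingOrFailed, Failed} {
		fmt.Printf("%s: create %d pod(s)\n", policy, podsToCreate(policy, 3, 2, 1))
	}
}
```

In this example, TerminatingOrFailed creates one replacement right away, while Failed creates none until the terminating pod reaches a terminal phase.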

In Indexed completion mode, the tracking of pods is per index.

The controller updates the field Status.terminating with the number of terminating pods. For backwards compatibility, when podReplacementPolicy: TerminatingOrFailed, the number of failed pods includes the terminating pods.

The controller updates the terminating field in the same API call where it updates other counters, so it should not require any extra API calls.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

controller_utils: April 3rd 2023 - 56.6
- Adding tests to help determine if pods are terminating.
job: April 3rd 2023 - 90.4
- Verify that terminating pods are in fact counted in the status.
- Recreate pods only once pod is fully terminated (i.e. Failed).
- Verify existing behavior with TerminatingOrFailed.
- If feature is off, verify existing behavior.
- Count terminating pods even if terminating Pod considered failed when JobPodReplacementPolicy is disabled.
- Count terminating pods even if terminating Pod not considered failed when JobPodReplacementPolicy is enabled.
gc_controller.go: April 3rd 2023 - 82.4
- Set PodPhase to failed when JobPodReplacementPolicy true but PodDisruptionConditions is false.

The following scenarios related to tracking the terminating pods are covered:

Failed or Complete conditions are not added while there are still terminating pods
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
SuccessCriteriaMet is added when the completions are satisfied

Integration tests

We will add the following integration test for the Job controller:

Case with JobPodReplacementPolicy on and podReplacementPolicy: Failed

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is tracked
Verify that pod creation only occurs once pod is fully terminated.

Case with JobPodReplacementPolicy on and podReplacementPolicy: TerminatingOrFailed

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is tracked
Verify that pod creation only occurs once deletion happens.

Case With JobPodReplacementPolicy off

Job starts pods that take a while to terminate
Delete pods
Verify that terminating is not tracked
Verify that pod creation only occurs once deletion happens.

Case for disable and reenable JobPodReplacementPolicy

Create Job with podReplacementPolicy: Failed
Job starts pods that take a while to terminate
Restart controller and disable JobPodReplacementPolicy
Delete some pods
Verify that terminating pods count as failed and pods are recreated.
Restart controller and reenable JobPodReplacementPolicy
Terminate pods with phase Succeeded.
Verify that pods still count as failed.
Delete remaining Pods.
Verify that terminating is tracked.
Verify that pod creation only occurs once pod is fully terminated.
Verify that pod creation only occurs once deletion happens.

To cover cases with PodDisruptionConditions we really only need to worry about tracking the terminating field. Tests will verify counting of terminating pods regardless of PodDisruptionConditions being on or off.

The following scenarios related to tracking the terminating pods are covered:

Failed or Complete conditions are not added while there are still terminating pods
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
SuccessCriteriaMet is added when the completions are satisfied

The integration tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/integration/job/job_test.go . Most relevant test is TestJobPodReplacementPolicy.

e2e tests

Generally, the only e2e tests that are useful for this feature are those with podReplacementPolicy: Failed.
The test should create a Job whose Pods catch the SIGTERM signal and terminate gracefully, so that when we delete a Pod
we can first assert that no replacement Pod is created while the Pod is terminating and, finally, that a new Pod is created once it terminates.

We can use the default busybox image which is generally used in e2e tests and override the command field with something like:

_term(){  
  sleep 5
  exit 143
}  
trap _term SIGTERM
while true; do  
  sleep 1
done

An e2e test can verify that deletion will not trigger a new pod creation until the exiting pod is fully deleted.

If podReplacementPolicy: TerminatingOrFailed is specified we would test that pod creation happens closely after deletion.

The e2e tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/e2e/apps/job.go .

Test grid:

gce

Kubernetes e2e suite.[It] [sig-apps] Job should recreate pods only after they have failed if pod replacement policy is set to Failed

Graduation Criteria

Alpha

Job controller can consider terminating pods as active
Job controller counts terminating pods in JobStatus.
Unit Tests
Integration tests

Beta

Address reviews and bug reports from Alpha users
E2e tests are in Testgrid and linked in KEP
The feature flag is enabled by default
job_pods_creation_total metric is added.

GA

Address reviews and bug reports from Beta users
Allow Job API clients to track the number of terminating pods until all the resources are released (see tracking the terminating pods). Also, provide links for the relevant integration tests in the KEP.
Lock the JobPodReplacementPolicy feature-gate to true
Restore the .status.terminating assertion for JobSuccessPolicy Conformance Tests in the following:

Deprecation

Remove JobPodReplacementPolicy feature-gate in GA+3.

Upgrade / Downgrade Strategy

Upgrade

Set JobPodReplacementPolicy to true in apiserver and controller manager.

There are no other components required.

Jobs that want to replace pods once they are fully terminal can use PodReplacementPolicy: Failed.

If a Job is not using PodFailurePolicy, one can change PodReplacementPolicy to TerminatingOrFailed. This will revert Jobs to the behavior that exists with the feature off.

If one is using PodFailurePolicy, one will not be able to set the value to TerminatingOrFailed, as Failed is the only allowed value. In this case, the recommendation would be to disable the PodFailurePolicy feature also.

Downgrade

Set JobPodReplacementPolicy to false in apiserver and controller manager.

With downgrading, you will no longer see any side-effects of PodReplacementPolicy.

Version Skew Strategy

This feature is limited to control plane.

Note that kube-apiserver can be at the N+1 skew version relative to kube-controller-manager (see here). In that case, the Job controller operates on the version of the Job object that already supports the new Job API.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: JobPodReplacementPolicy
- Components depending on the feature gate:
  - kube-apiserver (for field control)
  - kube-controller-manager (for main functionality)
  - kubelet (for supporting functionality: transition to phase=Failed)

Does enabling the feature change any default behavior?

Yes:

- Count the number of terminating pods and populate it in JobStatus.
- Set phase=Failed in kubelet and pod-GC before deleting a Pod object (behavior also present when the related PodDisruptionConditions is enabled).
- As part of the closely related KEP-3329, we will default podReplacementPolicy to Failed if podFailurePolicy is set, which, as described above, will change the way terminating pods are handled.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

When the feature is disabled:

the apiserver:
- Discards the value of podReplacementPolicy for new objects.
- Preserves the value of podReplacementPolicy for existing objects.
the job controller:
- processes the Job as podReplacementPolicy: TerminatingOrFailed (the existing behavior)
- stops tracking terminating pods, sets the value of .status.terminating to nil in the next Job sync.
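The API server side of this follows the usual pattern of dropping a gated field unless it is already in use. The Go sketch below illustrates that pattern under stated assumptions: the reduced `JobSpec`, `dropDisabledFields` and `effectivePolicy` are hypothetical names for this sketch, and `effectivePolicy` ignores the podFailurePolicy-dependent default for brevity.

```go
package main

import "fmt"

// JobSpec is a reduced, hypothetical stand-in for the real type.
type JobSpec struct {
	PodReplacementPolicy *string
}

// dropDisabledFields clears the gated field on create or update when the gate
// is off, unless the stored object (oldSpec) already uses it. On create,
// oldSpec is nil.
func dropDisabledFields(gateEnabled bool, newSpec, oldSpec *JobSpec) {
	if gateEnabled {
		return
	}
	inUse := oldSpec != nil && oldSpec.PodReplacementPolicy != nil
	if !inUse {
		newSpec.PodReplacementPolicy = nil
	}
}

// effectivePolicy is how the job controller would treat the Job: with the
// gate off it always behaves as TerminatingOrFailed.
func effectivePolicy(gateEnabled bool, spec JobSpec) string {
	if !gateEnabled || spec.PodReplacementPolicy == nil {
		return "TerminatingOrFailed"
	}
	return *spec.PodReplacementPolicy
}

func main() {
	failed := "Failed"

	created := &JobSpec{PodReplacementPolicy: &failed}
	dropDisabledFields(false, created, nil)
	fmt.Println("new object keeps field:", created.PodReplacementPolicy != nil)

	existing := &JobSpec{PodReplacementPolicy: &failed}
	updated := &JobSpec{PodReplacementPolicy: &failed}
	dropDisabledFields(false, updated, existing)
	fmt.Println("existing object keeps field:", updated.PodReplacementPolicy != nil)
	fmt.Println("controller treats it as:", effectivePolicy(false, *updated))
}
```

Preserving the stored value means that reenabling the gate later restores the behavior the user originally requested.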

What happens if we reenable the feature if it was previously rolled back?

The job controller will respect the value of podReplacementPolicy for new events (new Pods becoming terminating or failed).

If podReplacementPolicy: Failed and there are currently terminating Pod(s) that were already considered Failed before reenabling the feature, they won’t be re-evaluated.

Are there any tests for feature enablement/disablement?

No, but we will add unit and integration tests for feature enablement and disablement.

An integration test verifies disable and reenable. See integration tests for details.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout or rollback will not fail, as rolling out this feature only entails turning on JobPodReplacementPolicy. Failure rates of the Jobs will not increase or decrease because of this feature. Pods will be marked as failed later (as we wait for the pods to be fully terminal).

This feature is opt-in for functional changes. We track terminating pods for observability reasons but we only use this data in the case of Failed.

If a user has set PodReplacementPolicy: Failed or has PodFailurePolicy set, then rolling back this feature would mean that terminating Pods will be recreated once they are deleted.

If a user rolls out this feature with PodFailurePolicy or PodReplacementPolicy set to Failed, then pods will only be recreated once they are fully terminal.
This will not impact failure counts as in both cases, they will get marked as failed eventually.

If a user rolls out this feature without PodFailurePolicy or PodReplacementPolicy set, then there will be no impact to existing workloads.

What specific metrics should inform a rollback?

job_syncs_total, exposed by kube-controller-manager
- If the number of syncs increases it could mean that we have an increased number of failures.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

In beta, we are working on adding an integration test for these cases.

In terms of a manual test for upgrade and rollback, we can use 1.28.

The Upgrade->downgrade->upgrade testing was done manually using the alpha version in 1.28 with the following steps:

Start the cluster with the JobPodReplacementPolicy enabled:

Create a KIND cluster with 1.28 and use the config below to turn this feature on.

using config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker

Then, create the job using .spec.podReplacementPolicy=Failed:

kubectl create -f job.yaml

using job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-prp
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: gcr.io/k8s-staging-perf-tests/sleep
        args: ["-termination-grace-period", "1m", "60s"]

Wait for the pods to be running and delete a pod:

kubectl delete pods -l job-name=job-prp

With the feature on and PodReplacementPolicy set to Failed, the replacement pod is created only once the deleted pod has fully terminated. While the pod is terminating, you can also see the status report a terminating pod.

kubectl get jobs -ljob-name=job-prp -oyaml

status:
  terminating: 1

Simulate downgrade by creating a new Kind cluster with the feature turned off.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": false
nodes:
- role: control-plane
- role: worker

Then, delete the pods of the job.

kubectl delete pods -l job-name=job-prp

There should also be no terminating pod status and a pod will be created before the other pod terminates. If you use the above case, you should see a terminating pod and a new pod created.

Simulate upgrade by creating a new Kind cluster with the feature turned on.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker

Deleting the pod will create a replacement pod once the pod is fully terminated. The status field will also state that the pod is terminating.

This demonstrates that the feature is working again for the job.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

During pod terminations, an operator can see that the terminating field is being set.

We will use a new metric:

job_pods_creation_total (new): the reason label indicates what triggered the creation (new, recreate_terminating_or_failed, recreate_failed)
and the status label indicates the status of the pod creation (succeeded, failed).
This can be used to get the number of pods that are being recreated with the reason recreate_failed. Otherwise, we would expect to see new or recreate_terminating_or_failed as the normal values.
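How the reason label could be chosen is sketched below in Go. This is an illustration under stated assumptions: `creationReason` and the map-based counter are hypothetical, and the actual metric is a labeled counter exposed by kube-controller-manager rather than an in-memory map.

```go
package main

import "fmt"

// creationReason is a hypothetical helper that picks the value of the reason
// label for one pod creation.
func creationReason(isReplacement bool, policy string) string {
	if !isReplacement {
		return "new"
	}
	if policy == "Failed" {
		return "recreate_failed"
	}
	return "recreate_terminating_or_failed"
}

// labels identifies one series of the counter by its two label values.
type labels struct {
	reason string
	status string
}

func main() {
	// An in-memory stand-in for the labeled counter.
	counter := map[labels]int{}

	counter[labels{creationReason(false, "Failed"), "succeeded"}]++
	counter[labels{creationReason(true, "Failed"), "succeeded"}]++
	counter[labels{creationReason(true, "TerminatingOrFailed"), "failed"}]++

	for l, n := range counter {
		fmt.Printf("job_pods_creation_total{reason=%q,status=%q} %d\n", l.reason, l.status, n)
	}
}
```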

How can someone using this feature know that it is working for their instance?

If a user terminates pods that are controlled by a job, then we should wait until the existing pods are terminated before starting new ones.

When feature is turned on, we will also include a terminating field in the Job Status if there are any terminating pods.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We did not propose any SLO/SLI for this feature.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - job_syncs_total (existing): can be used to see how much the feature enablement causes the number of syncs to increase.
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

In beta, we will add a new metric job_pods_creation_total.

Dependencies

In Risks and Mitigations we discuss the interaction with 3329-retriable-and-non-retriable-failures.
We will have to guard against cases where PodFailurePolicy is off while this feature is on.
PodFailurePolicy is stable and locked to true by default, but we should guard against cases where PodDisruptionConditions is turned off.

Does this feature depend on any specific services running in the cluster?

Scalability

Generally, enabling this will slow down pod creation if pods take a long time to terminate. We would wait to create new pods until the existing ones are terminated.

Will enabling / using this feature result in any new API calls?

In the job controller, we only update the Job.Status if any field in the Job.Status changes. With this feature on, we will track terminating pods in this status. It could be possible to see an increase in updating the status field of Jobs if a lot of the pods are being terminated. However, if pods are being terminated, we would also expect other fields to be getting updated also (active, failed, etc) so there should not be a large increase of API calls for patching.

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

For the Job API, we are adding an enum field named PodReplacementPolicy, which takes either TerminatingOrFailed or Failed.

API type(s): enum
Estimated increase in size: 8B

We also added a status field for tracking terminating pods.

API type(s): int32
Estimated increase in size: 4B

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No, the existing SLIs/SLOs do not include the time taken to create new pods when existing ones are terminated.
There is an existing one on pod creation but this will not impact that.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

N/A

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

N/A

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior of the Job controller.

What are other known failure modes?

There are no other failure modes.

What steps should be taken if SLOs are not being met to determine the problem?

If one wants to keep the feature on, one could suspend the Jobs that are using this feature. Setting suspend: true in a Job's spec halts the execution of that Job.

Implementation History

2023-04-03: Created KEP
2023-05-19: KEP Merged.
2023-07-16: Alpha PRs merged.
2023-09-29: KEP marked for beta promotion.
2023-10-24: Merged bugfix Fix tracking of terminating Pods when nothing else changes
2023-10-24: Merged adding a metric required for beta promotion feat: add job_pods_creation_total metric
2023-10-27: Merged Switch feature flag to beta for pod replacement policy and add e2e test #121491
2024-06-11: [v1.31] Merged Count terminating pods when deleting active pods for failed jobs #125175
2024-07-12: [v1.31] Merged Delay setting terminal Job conditions until all pods are terminal #125510

This feature was promoted to beta in v1.29, but important updates were implemented in v1.31. For additional info, check the PRs linked above with the tag [v1.31].

Drawbacks

Enabling this feature may make rollouts slower.

Alternatives

We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.