KEP-3939: Allow replacement of Pods in a Job when fully terminated
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Feature Enablement and Rollback
- Rollout, Upgrade and Rollback Planning
- How can a rollout or rollback fail? Can it impact already running workloads?
- What specific metrics should inform a rollback?
- Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
- Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Monitoring Requirements
- How can an operator determine if the feature is in use by workloads?
- How can someone using this feature know that it is working for their instance?
- What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Are there any missing metrics that would be useful to have to improve observability of this feature?
- Dependencies
- Scalability
- Will enabling / using this feature result in any new API calls?
- Will enabling / using this feature result in introducing new API types?
- Will enabling / using this feature result in any new calls to the cloud provider?
- Will enabling / using this feature result in increasing size or count of the existing API objects?
- Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
- Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
- Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
- Troubleshooting
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Currently, Jobs start replacement Pods as soon as previously created Pods are terminating (have a deletionTimestamp) or fail (phase=Failed).
Terminating pods are currently counted as failed in the Job status.
However, terminating pods are actually in a transitory state where they are neither active nor really fully terminated.
This KEP proposes a new field for the Job API that allows for users to specify if they want replacement Pods as soon as
the previous Pods are terminating (existing behavior) or only once the existing pods are fully terminated (new behavior).
Motivation
Existing Issues:
- Job Creates Replacement Pods as soon as Pod is marked for deletion
- Kueue: Account for terminating pods when doing preemption
Many common machine learning frameworks, such as TensorFlow and JAX, require unique pods per Index. Currently, if a pod enters a terminating state (due to preemption, eviction or other external factors), a replacement pod is created and immediately fails to start.
Having a replacement Pod before the previous one fully terminates can also cause problems in clusters with scarce resources or with tight budgets. These resources can be difficult to obtain so pods can take a long time to find resources and they may only be able to find nodes once the existing pods have been terminated. If cluster autoscaler is enabled, the replacement Pods might produce undesired scale ups.
On the other hand, if a replacement Pod is not immediately created, the Job status would show that the number of active pods doesn’t match the desired parallelism. To provide better visibility, the job status can have a new field to track the number of Pods currently terminating.
This new field can also be used by queueing controllers, such as Kueue, to track the number of terminating pods to calculate quotas.
Goals
- Job controller should allow for flexibility in waiting for pods to be fully terminated before creating replacement Pods
- Job controller will have a new status field where we include the number of terminating pods.
Non-Goals
- Other workload APIs are not included in this proposal.
Proposal
The Job controller gets a list of active pods. Active pods are pods that don’t
have a terminal phase (Succeeded or Failed) and are not terminating
(have a deletionTimestamp).
In this KEP, we will consider terminating pods to be separate from active and failed.
As an opt-in behavior, the job controller can use the active and terminating
pods to determine whether replacement Pods are needed.
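The distinction between active, terminating and terminal pods can be sketched as follows. This is only an illustration under stated assumptions: `Pod`, `PodState` and `Classify` are hypothetical names, and `Pod` is a trimmed-down stand-in for the real Pod object, not the job controller's code.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for a Pod object.
type Pod struct {
	Phase    string // "Pending", "Running", "Succeeded" or "Failed"
	Deleting bool   // true when the Pod has a deletionTimestamp
}

// PodState is a hypothetical name for the three categories used in this KEP.
type PodState string

const (
	Active      PodState = "active"
	Terminating PodState = "terminating"
	Terminal    PodState = "terminal"
)

// Classify places a Pod in exactly one category: a terminal phase wins,
// then a deletionTimestamp marks the Pod as terminating, otherwise it is active.
func Classify(p Pod) PodState {
	switch {
	case p.Phase == "Succeeded" || p.Phase == "Failed":
		return Terminal
	case p.Deleting:
		return Terminating
	default:
		return Active
	}
}

func main() {
	fmt.Println(Classify(Pod{Phase: "Running"}))
	fmt.Println(Classify(Pod{Phase: "Running", Deleting: true}))
	fmt.Println(Classify(Pod{Phase: "Failed", Deleting: true}))
}
```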
We propose two new API fields:
- A field in Spec that allows for opt-in behavior of whether to wait for terminating pods to finish before creating replacement pods.
- A new field in Status for tracking the number of terminating pods.
User Stories (Optional)
Story 1
As a machine learning user, I use ML frameworks that schedule multiple pods.
The Job controller does not typically wait for terminating pods to be marked as failed.
TensorFlow and other ML frameworks may require that Pods are only started once the other pods are fully terminated.
This case was added due to a bug discovered with running IndexedJobs with Tensorflow.
See Jobs create replacement Pods as soon as a Pod is marked for deletion
for more details.
Story 2
As a cloud user, I want to guarantee that the number of running pods is exactly the number that I specify.
Terminating pods do not relinquish resources, so scarce compute resources are still allocated to those pods.
Waiting for them to terminate ensures that replacement pods do not produce unnecessary scale ups.
Story 3
As a Job-level quota controller, I want to track the number of terminating pods, in addition to the active pods.
See Kueue: Account for terminating pods when doing preemption for an example of this.
Notes/Constraints/Caveats (Optional)
The default job controller behavior
Based on the proposed API below, the behavior of the job controller prior to this KEP is equivalent to podReplacementPolicy: TerminatingOrFailed.
This behavior has the following semantic problems:
- A terminating Pod might gracefully terminate as Succeeded, but it counts towards .status.failed as soon as it’s terminating and it’s not reclassified upon termination.
- When using podFailurePolicy, the controller might create a replacement Pod before being able to evaluate the terminal state of the Pod. The replacement Pod might be terminated due to the policy.
In a Job v2 API, we should consider having the default behavior equivalent to
podReplacementPolicy: Failed, given the above problems.
We could even consider removing the proposed field podReplacementPolicy.
But for backwards compatibility, in v1, we have to introduce a change of behavior as opt-in.
When Pods enter a terminating state
Pods can be marked for termination by several controllers, which we typically refer to as disruptions, such as: kubelet eviction, scheduler preemption, API eviction, etc.
The job controller itself can delete running Pods, in the following scenarios:
- A job is over the activeDeadlineSeconds.
- When the number of Pod failures reaches the backoffLimit.
- With PodFailurePolicy active and FailJob is set as the action.
In all these situations, the Pod initially gets a deletionTimestamp
and we interpret the pod as “terminating”. Once the pod terminates, it gets
a terminal phase (Succeeded or Failed).
Exponential Backoff for Pod Failures
The job controller implements backoff delays to prevent fast recreation of continuously failing Pods.
This behavior is internal (not configurable through the API) and it’s orthogonal to this KEP. The behavior will be preserved as follows:
- When podReplacementPolicy: TerminatingOrFailed, the backoff period counts from the time the Pod is terminating or Failed.
- When podReplacementPolicy: Failed, the backoff period counts from the time the Pod is Failed.
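As a rough illustration of the two rules above, the sketch below picks the instant from which the backoff period would count. It is a simplification under assumptions: `PodTimes`, `BackoffStart` and `ts` are hypothetical names, and the real controller's backoff bookkeeping is internal and more involved.

```go
package main

import (
	"fmt"
	"time"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// PodTimes is a hypothetical record of the two timestamps that matter here.
// A nil pointer means the event has not happened yet.
type PodTimes struct {
	DeletionRequested *time.Time // when the Pod got a deletionTimestamp
	ReachedFailed     *time.Time // when the Pod reached phase Failed
}

// BackoffStart returns the instant from which the backoff period counts,
// and false if the Pod does not yet count towards backoff under the policy.
func BackoffStart(policy PodReplacementPolicy, p PodTimes) (time.Time, bool) {
	if policy == TerminatingOrFailed && p.DeletionRequested != nil {
		// Terminating counts too, so use it when it happened first.
		if p.ReachedFailed == nil || p.DeletionRequested.Before(*p.ReachedFailed) {
			return *p.DeletionRequested, true
		}
	}
	if p.ReachedFailed != nil {
		return *p.ReachedFailed, true
	}
	return time.Time{}, false
}

// ts is a small helper that builds a timestamp from Unix seconds.
func ts(sec int64) *time.Time {
	t := time.Unix(sec, 0).UTC()
	return &t
}

func main() {
	p := PodTimes{DeletionRequested: ts(0), ReachedFailed: ts(30)}
	early, _ := BackoffStart(TerminatingOrFailed, p)
	late, _ := BackoffStart(Failed, p)
	// With the Failed policy the backoff starts counting later.
	fmt.Println(late.Sub(early))
}
```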
Risks and Mitigations
Pods are not guaranteed to transition to a terminal phase
One area of contention is how this KEP will work with 3329-retriable-and-non-retriable-failures .
In 3329, there was a decision to make kubelet transition pods to failed before deleting them.
This is guarded by the PodDisruptionConditions feature gate, which, in addition to
setting the phase to Failed, adds a DisruptionTarget condition.
This means that when this feature is turned on, the job controller is able to count pods as failed only when they are fully terminated, as it is guaranteed that all pods will reach a terminal state (Failed or Succeeded).
Note that a terminating pod is not considered active either.
If PodDisruptionCondition is turned off, then the job controller considers the pod as failed as soon as it is terminating (has a deletion timestamp), because there is no guarantee that the pod will transition to phase=Failed.
Another issue is described here . If PodDisruptionConditions is disabled, a pod bound to a no-longer-existing node may be stuck in the Running phase. As a consequence, it will never be replaced, so the whole job will be stuck from making progress. When PodDisruptionConditions is enabled, the PodGC transitions the Pod to phase Failed in this scenario.
Due to the above issues, we propose the following mitigation:
- If PodDisruptionConditions OR JobPodReplacementPolicy are enabled, set phase=Failed in kubelet and podGC before deleting a Pod.
- If JobPodReplacementPolicy is enabled, but PodDisruptionConditions is disabled, the kubelet and podGC only set the phase, but do not add a DisruptionTarget condition.
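The mitigation above boils down to two independent decisions driven by the two feature gates. A minimal sketch, assuming hypothetical `Gates`, `PreDeleteActions` and `PlanPreDelete` names (the kubelet and podGC do not expose such a helper):

```go
package main

import "fmt"

// Gates is a hypothetical snapshot of the two feature gates involved.
type Gates struct {
	PodDisruptionConditions bool
	JobPodReplacementPolicy bool
}

// PreDeleteActions describes what kubelet or podGC would do before deleting a Pod.
type PreDeleteActions struct {
	SetPhaseFailed      bool
	AddDisruptionTarget bool
}

// PlanPreDelete encodes the mitigation: either gate is enough to set
// phase=Failed, but only PodDisruptionConditions adds the DisruptionTarget condition.
func PlanPreDelete(g Gates) PreDeleteActions {
	return PreDeleteActions{
		SetPhaseFailed:      g.PodDisruptionConditions || g.JobPodReplacementPolicy,
		AddDisruptionTarget: g.PodDisruptionConditions,
	}
}

func main() {
	fmt.Printf("%+v\n", PlanPreDelete(Gates{JobPodReplacementPolicy: true}))
	fmt.Printf("%+v\n", PlanPreDelete(Gates{PodDisruptionConditions: true}))
}
```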
Design Details
Job API Definition
At the JobSpec level, we are adding a new enum field:
// PodReplacementPolicy controls when we recreate pods.
// Default will be TerminatingOrFailed, i.e. recreate pods when they are
// terminating or failed.
// +enum
type PodReplacementPolicy string

const (
	// TerminatingOrFailed is a policy that creates replacement pods when they are
	// marked as terminating (have a deletion timestamp) or reach the terminal
	// phase `Failed`.
	// Terminating pods count towards `.status.failed`, even if they later reach
	// the terminal phase `Succeeded`.
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	// Failed is a policy that creates replacement Pods only when the previously
	// created Pods reach the terminal phase `Failed`.
	Failed PodReplacementPolicy = "Failed"
)

type JobSpec struct {
	...
	// podReplacementPolicy specifies when to create replacement Pods. Possible values are:
	// - TerminatingOrFailed means to create a replacement Pod when the previously
	//   created Pod is terminating or failed.
	// - Failed means to wait until a previously created Pod is fully terminated
	//   before creating a replacement Pod.
	//
	// When using podFailurePolicy, the default value is Failed and this is the
	// only allowed policy.
	// When not using podFailurePolicy, the default value is TerminatingOrFailed.
	// +optional
	PodReplacementPolicy *PodReplacementPolicy
}
In order to offer visibility of the number of terminating pods, we include a new field in the JobStatus.
type JobStatus struct {
	...
	// Number of terminating pods
	// +optional
	Terminating *int32
}
Defaulting and validation
Defaulting of podReplacementPolicy will depend on whether podFailurePolicy
is in use:
- when podFailurePolicy is in use, the default value is Failed.
- when podFailurePolicy is not in use, the default value is TerminatingOrFailed.
When podFailurePolicy is in use, the only allowed value for podReplacementPolicy
is Failed.
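The defaulting and validation rules can be sketched as below. This is an illustration, not the apiserver's code: `JobSpec` is a hypothetical trimmed type in which `UsesPodFailurePolicy` stands in for "podFailurePolicy is set", and `SetDefaults` and `Validate` are assumed helper names.

```go
package main

import (
	"errors"
	"fmt"
)

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// JobSpec is a hypothetical, trimmed-down spec used only for this sketch.
type JobSpec struct {
	UsesPodFailurePolicy bool // stands in for podFailurePolicy != nil
	PodReplacementPolicy *PodReplacementPolicy
}

// SetDefaults fills in podReplacementPolicy when the user left it unset.
func SetDefaults(s *JobSpec) {
	if s.PodReplacementPolicy != nil {
		return
	}
	p := TerminatingOrFailed
	if s.UsesPodFailurePolicy {
		p = Failed
	}
	s.PodReplacementPolicy = &p
}

// Validate rejects unknown values and the disallowed combination of
// podFailurePolicy with a policy other than Failed.
func Validate(s JobSpec) error {
	if s.PodReplacementPolicy == nil {
		return nil
	}
	switch *s.PodReplacementPolicy {
	case TerminatingOrFailed, Failed:
	default:
		return fmt.Errorf("unsupported podReplacementPolicy %q", *s.PodReplacementPolicy)
	}
	if s.UsesPodFailurePolicy && *s.PodReplacementPolicy != Failed {
		return errors.New("podReplacementPolicy must be Failed when podFailurePolicy is used")
	}
	return nil
}

func main() {
	s := JobSpec{UsesPodFailurePolicy: true}
	SetDefaults(&s)
	fmt.Println(*s.PodReplacementPolicy, Validate(s))
}
```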
Tracking the terminating pods
In order to allow quota management by Job-level controllers (see Story 3),
we introduced the .status.terminating field, which tracks the number of
terminating pods. However, in the initial Beta implementation the field stops
tracking the number of terminating pods as soon as the Job is marked as Failed
with the Failed condition (see issue #123775: https://github.com/kubernetes/kubernetes/issues/123775).
The remaining pods may be occupying resources for an arbitrary amount of time.
In 1.31 we are going to fix this issue by delaying the
addition of the Failed or Complete conditions until all pods are fully
terminated. To indicate that a Job is doomed to fail or succeed, as soon as
possible, we extend the scope of pre-existing conditions: FailureTarget, and
SuccessCriteriaMet, respectively. See more details in
Job API managed-by mechanism.
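The intended ordering of conditions can be sketched as follows: the interim condition is added as soon as the outcome is known, and the terminal condition only once no pods remain to terminate. `Outcome` and `JobConditions` are hypothetical names, and the sketch ignores every other detail of how the controller manages conditions.

```go
package main

import "fmt"

// Outcome is a hypothetical summary of what the controller has decided so far.
type Outcome int

const (
	Undecided Outcome = iota
	WillFail
	WillSucceed
)

// JobConditions returns the condition types the Job would carry, given the
// decided outcome and the number of pods that are not yet fully terminated.
func JobConditions(o Outcome, nonTerminalPods int32) []string {
	var conds []string
	switch o {
	case WillFail:
		conds = append(conds, "FailureTarget")
		if nonTerminalPods == 0 {
			conds = append(conds, "Failed")
		}
	case WillSucceed:
		conds = append(conds, "SuccessCriteriaMet")
		if nonTerminalPods == 0 {
			conds = append(conds, "Complete")
		}
	}
	return conds
}

func main() {
	fmt.Println(JobConditions(WillFail, 2))    // pods still terminating
	fmt.Println(JobConditions(WillFail, 0))    // all pods terminated
	fmt.Println(JobConditions(WillSucceed, 0)) // all pods terminated
}
```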
Implementation
As part of this KEP, we need to track pods that are terminating (deletionTimestamp != nil and phase is Pending or Running).
The following algorithm could be used:
- Count the number of pods that are active and not terminating.
- Count the number of terminating pods.
- In manageJob we will count expected pods as:
  - when podReplacementPolicy: Failed then expectedPods = active + terminating.
  - when podReplacementPolicy: TerminatingOrFailed then expectedPods = active.
- Use the expected number of pods to decide whether to recreate.
In Indexed completion mode, the tracking of pods is per index.
The controller updates the field Status.terminating with the number of terminating pods.
For backwards compatibility, when podReplacementPolicy: TerminatingOrFailed,
the number of failed pods includes the terminating pods.
The controller updates the terminating field in the same API call where it updates other counters, so it should not require any extra API calls.
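The counting rule from the algorithm above can be sketched as below. `ExpectedPods` and `PodsToCreate` are hypothetical helpers, and `wanted` is an assumed stand-in for the number of pods the Job should currently be running; the real manageJob also accounts for completions, backoff and per-index tracking.

```go
package main

import "fmt"

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// ExpectedPods returns how many pods are treated as still occupying a slot.
// With the Failed policy, terminating pods keep their slot until they finish.
func ExpectedPods(policy PodReplacementPolicy, active, terminating int32) int32 {
	if policy == Failed {
		return active + terminating
	}
	return active
}

// PodsToCreate returns how many replacement pods to create right now.
func PodsToCreate(policy PodReplacementPolicy, wanted, active, terminating int32) int32 {
	diff := wanted - ExpectedPods(policy, active, terminating)
	if diff < 0 {
		return 0
	}
	return diff
}

func main() {
	// Three pods wanted, one still active, two terminating.
	fmt.Println(PodsToCreate(TerminatingOrFailed, 3, 1, 2))
	fmt.Println(PodsToCreate(Failed, 3, 1, 2))
}
```

In this example the TerminatingOrFailed policy creates two replacements immediately, while the Failed policy creates none until the terminating pods are gone.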
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- controller_utils: April 3rd 2023 - 56.6
  - Adding tests to help determine if pods are terminating.
- job: April 3rd 2023 - 90.4
  a. Verify that terminating pods are in fact counted in the status.
  b. Recreate pods only once pod is fully terminated (i.e. Failed)
  c. Verify existing behavior with TerminatingOrFailed
  d. If feature is off verify existing behavior
  e. Count terminating pods even if terminating Pod considered failed when JobPodReplacementPolicy is disabled
  f. Count terminating pods even if terminating Pod not considered failed when JobPodReplacementPolicy is enabled
- gc_controller.go: April 3rd 2023 - 82.4
  a. Set PodPhase to failed when JobPodReplacementPolicy true but PodDisruptionConditions is false
The following scenarios related to tracking the terminating pods are covered:
- Failed or Complete conditions are not added while there are still terminating pods
- FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
- SuccessCriteriaMet is added when the completions are satisfied
Integration tests
We will add the following integration test for the Job controller:
Case with JobPodReplacementPolicy on and podReplacementPolicy: Failed
- Job starts pods that take a while to terminate
- Delete pods
- Verify that terminating is tracked
- Verify that pod creation only occurs once pod is fully terminated.
Case with JobPodReplacementPolicy on and podReplacementPolicy: TerminatingOrFailed
- Job starts pods that take a while to terminate
- Delete pods
- Verify that terminating is tracked
- Verify that pod creation only occurs once deletion happens.
Case With JobPodReplacementPolicy off
- Job starts pods that take a while to terminate
- Delete pods
- Verify that terminating is not tracked
- Verify that pod creation only occurs once deletion happens.
Case for disable and reenable JobPodReplacementPolicy
- Create Job with podReplacementPolicy: Failed
- Job starts pods that take a while to terminate
- Restart controller and disable JobPodReplacementPolicy
- Delete some pods
- Verify that terminating pods count as failed and pods are recreated.
- Restart controller and reenable JobPodReplacementPolicy
- Terminate pods with phase Succeeded.
- Verify that pods still count as failed.
- Delete remaining Pods.
- Verify that terminating is tracked.
- Verify that pod creation only occurs once pod is fully terminated.
- Verify that pod creation only occurs once deletion happens.
To cover cases with PodDisruptionCondition we really only need to worry about tracking terminating fields.
Tests will verify counting of terminating fields regardless of PodDisruptionCondition being on or off.
The following scenarios related to tracking the terminating pods are covered:
- Failed or Complete conditions are not added while there are still terminating pods
- FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
- SuccessCriteriaMet is added when the completions are satisfied
The integration tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/integration/job/job_test.go.
Most relevant test is TestJobPodReplacementPolicy.
e2e tests
Generally the only tests that are useful for this feature are when PodReplacementPolicy: Failed.
The test should create a Job whose Pods catch a SIGTERM signal and terminate gracefully, so that when we delete a Pod
we can first assert that no replacement Pod is created while the Pod is terminating and finally, once it terminates, that a new Pod is created.
We can use the default busybox image which is generally used in e2e tests and override the command field with something like:
_term(){
  sleep 5
  exit 143
}
trap _term SIGTERM
while true; do
  sleep 1
done
An e2e test can verify that deletion will not trigger a new pod creation until the exiting pod is fully deleted.
If podReplacementPolicy: TerminatingOrFailed is specified we would test that pod creation happens closely after deletion.
The e2e tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.31.0/test/e2e/apps/job.go.
Test grid:
Kubernetes e2e suite.[It] [sig-apps] Job should recreate pods only after they have failed if pod replacement policy is set to Failed
Graduation Criteria
Alpha
- Job controller can consider terminating pods as active
- Job controller counts terminating pods in JobStatus.
- Unit Tests
- Integration tests
Beta
- Address reviews and bug reports from Alpha users
- E2e tests are in Testgrid and linked in KEP
- The feature flag is enabled by default
- job_pods_creation_total metric is added.
GA
- Address reviews and bug reports from Beta users
- Allow Job API clients tracking the number of the terminating pods until all the resources are released (see tracking the terminating pods ). Also, provide links for the relevant integration tests in the KEP.
- Lock the JobPodReplacementPolicy feature-gate to true
- Restore the .status.terminating assertion for JobSuccessPolicy Conformance Tests in the following:
  - https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L514-L515
  - https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L556-L557
  - https://github.com/kubernetes/kubernetes/blob/44c230bf5c321056e8bc89300b37c497f464f113/test/e2e/apps/job.go#L597-L598
Deprecation
- Remove JobPodReplacementPolicy feature-gate in GA+3.
Upgrade / Downgrade Strategy
Upgrade
Set JobPodReplacementPolicy to true in apiserver and controller manager.
There are no other components required.
Jobs that want to replace pods only once they are fully terminal can use PodReplacementPolicy: Failed.
If a Job is not using PodFailurePolicy, one can change PodReplacementPolicy to TerminatingOrFailed. This will revert the Job to the existing behavior with the feature off.
If one is using PodFailurePolicy, one will not be able to set the value to TerminatingOrFailed, as Failed is the only allowed value.
In this case, the recommendation would be to disable the PodFailurePolicy feature also.
Downgrade
Set JobPodReplacementPolicy to false in apiserver and controller manager.
With downgrading, you will no longer see any side-effects of PodReplacementPolicy.
Version Skew Strategy
This feature is limited to control plane.
Note that, kube-apiserver can be in the N+1 skew version relative to the kube-controller-manager (see here ). In that case, the Job controller operates on the version of the Job object that already supports the new Job API.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: JobPodReplacementPolicy
  - Components depending on the feature gate:
    - kube-apiserver (for field control)
    - kube-controller-manager (for main functionality)
    - kubelet (for supporting functionality: transition to phase=Failed)
Does enabling the feature change any default behavior?
Yes,
a. Count the number of terminating pods and populate in JobStatus
b. Set phase=Failed in kubelet and pod-GC before deleting a Pod object
(behavior also present when related PodDisruptionConditions is enabled)
c. As part of closely related KEP-3329, we will default podReplacementPolicy
to Failed if podFailurePolicy is set which, as described above, will change
the way of handling terminating pods.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes.
When the feature is disabled:
- the apiserver:
  - Discards the value of podReplacementPolicy for new objects.
  - Preserves the value of podReplacementPolicy for existing objects.
- the job controller:
  - processes the Job as podReplacementPolicy: TerminatingOrFailed (the existing behavior)
  - stops tracking terminating pods, sets the value of .status.terminating to nil in the next Job sync.
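The behavior with the feature disabled can be sketched as follows. The field handling follows the usual "drop disabled fields unless already in use" pattern; `DropDisabledFields` and `EffectivePolicy` are hypothetical names for this sketch, `JobSpec` is a trimmed stand-in, and treating an unset policy as TerminatingOrFailed is an assumption.

```go
package main

import "fmt"

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// JobSpec is a hypothetical, trimmed-down spec used only for this sketch.
type JobSpec struct {
	PodReplacementPolicy *PodReplacementPolicy
}

// DropDisabledFields mimics the apiserver side: with the gate off, the field
// is discarded for new objects and preserved when the stored object (oldSpec)
// already uses it. oldSpec is nil on create.
func DropDisabledFields(gateEnabled bool, newSpec, oldSpec *JobSpec) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.PodReplacementPolicy != nil {
		return
	}
	newSpec.PodReplacementPolicy = nil
}

// EffectivePolicy mimics the controller side: with the gate off, every Job is
// processed as TerminatingOrFailed regardless of the stored value.
func EffectivePolicy(gateEnabled bool, spec JobSpec) PodReplacementPolicy {
	if !gateEnabled || spec.PodReplacementPolicy == nil {
		return TerminatingOrFailed
	}
	return *spec.PodReplacementPolicy
}

func main() {
	p := Failed
	created := JobSpec{PodReplacementPolicy: &p}
	DropDisabledFields(false, &created, nil)
	fmt.Println(created.PodReplacementPolicy == nil)

	stored := JobSpec{PodReplacementPolicy: &p}
	fmt.Println(EffectivePolicy(false, stored), EffectivePolicy(true, stored))
}
```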
What happens if we reenable the feature if it was previously rolled back?
The job controller will respect the value of podReplacementPolicy for new
events (new Pods becoming terminating or failed).
If podReplacementPolicy: Failed and there are currently terminating Pod(s) that
were already considered Failed before reenabling the feature, they won’t be
re-evaluated.
Are there any tests for feature enablement/disablement?
No, but we will add unit and integration tests for feature enablement and disablement.
An integration test verifies disable and reenable. See integration tests for details.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A rollout or rollback will not fail as rolling out this feature entails turning on JobPodReplacementPolicy.
Failure rates of the Jobs will not increase or decrease with this feature. Pods will be marked as failed later (as we wait for the pods to be fully terminal).
This feature is opt-in for functional changes. We track terminating pods for observability reasons but we only use this data in the case of Failed.
If a user has set PodReplacementPolicy: Failed or has PodFailurePolicy set, then
rolling back this feature means that replacements for terminating Pods will be created as soon as the Pods are deleted.
If a user rolls out this feature with PodFailurePolicy set or PodReplacementPolicy set to Failed,
then pods will only be recreated once they are fully terminal.
This will not impact failure counts as in both cases, they will get marked as failed eventually.
If a user rolls out this feature without PodFailurePolicy or PodReplacementPolicy set, then there will be no impact to existing workloads.
What specific metrics should inform a rollback?
- job_syncs_total, exposed by kube-controller-manager
- If the number of syncs increases it could mean that we have an increased number of failures.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
In beta, we are working on adding an integration test for these cases.
In terms of a manual test for upgrade and rollback, we can use 1.28.
The Upgrade->downgrade->upgrade testing was done manually using the alpha
version in 1.28 with the following steps:
- Start the cluster with the JobPodReplacementPolicy enabled:
Create a KIND cluster with 1.28 and use the config below to turn this feature on.
using config.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker
Then, create the job using .spec.podReplacementPolicy=Failed:
kubectl create -f job.yaml
using job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-prp
spec:
  completions: 1
  parallelism: 1
  backoffLimit: 2
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: gcr.io/k8s-staging-perf-tests/sleep
        args: ["-termination-grace-period", "1m", "60s"]
Wait for the pods to be running and delete a pod:
kubectl delete pods -l job-name=job-prp
With feature on and PodReplacementPolicy set to Failed, the replacement pod will be recreated once the pod was fully terminated.
While the pod is terminating you can also see the status report a terminating pod.
kubectl get job job-prp -oyaml
status:
  terminating: 1
- Simulate downgrade by creating a new Kind cluster with the feature turned off.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": false
nodes:
- role: control-plane
- role: worker
Then, delete the pods of the job.
kubectl delete pods -l job-name=job-prp
There should also be no terminating pod status and a pod will be created before the other pod terminates. If you use the above case, you should see a terminating pod and a new pod created.
- Simulate upgrade by creating a new Kind cluster with the feature turned on.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobPodReplacementPolicy": true
nodes:
- role: control-plane
- role: worker
Deleting the pod will create a replacement pod once the pod is fully terminated. The status field will also state that the pod is terminating.
This demonstrates that the feature is working again for the job.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
During pod terminations, an operator can see that the terminating field is being set.
We will use a new metric:
- job_pods_creation_total (new): the reason label will mention what triggers creation (new, recreate_terminating_or_failed, recreate_failed)
and the status label will mention the status of the pod creation (succeeded, failed).
This can be used to get the number of pods that are being recreated due to recreate_failed. Otherwise, we would expect to see new or recreate_terminating_or_failed as the normal values.
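One plausible reading of how the reason label relates to the policy is sketched below. The label values are the ones listed above; the mapping itself, the `CreationReason` helper and its `isReplacement` parameter are assumptions made for illustration, not the controller's actual code.

```go
package main

import "fmt"

type PodReplacementPolicy string

const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed              PodReplacementPolicy = "Failed"
)

// CreationReason returns the assumed value of the reason label for a pod
// creation: "new" for first-time pods, otherwise a value derived from the
// replacement policy that triggered the recreation.
func CreationReason(isReplacement bool, policy PodReplacementPolicy) string {
	if !isReplacement {
		return "new"
	}
	if policy == Failed {
		return "recreate_failed"
	}
	return "recreate_terminating_or_failed"
}

func main() {
	fmt.Println(CreationReason(false, Failed))
	fmt.Println(CreationReason(true, Failed))
	fmt.Println(CreationReason(true, TerminatingOrFailed))
}
```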
How can someone using this feature know that it is working for their instance?
If a user terminates pods that are controlled by a job, then we should wait until the existing pods are terminated before starting new ones.
When feature is turned on, we will also include a terminating field in the Job Status if there are any terminating pods.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
We did not propose any SLO/SLI for this feature.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: job_syncs_total (existing): can be used to see how much the feature enablement causes the number of syncs to increase.
  - Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?
In beta, we will add a new metric job_pods_creation_total.
Dependencies
In Risks and Mitigations we discuss the interaction with 3329-retriable-and-non-retriable-failures.
We will have to guard against cases where PodFailurePolicy is off while this feature is on. PodFailurePolicy is stable and locked to true by default, but we should guard against cases where PodDisruptionConditions is turned off.
Does this feature depend on any specific services running in the cluster?
No
Scalability
Generally, enabling this will slow down pod creation if pods take a long time to terminate. We would wait to create new pods until the existing ones are terminated.
Will enabling / using this feature result in any new API calls?
In the job controller, we only update the Job.Status if any field in the Job.Status changes. With this feature on, we will track terminating pods in this status.
It could be possible to see an increase in updating the status field of Jobs if a lot of the pods are being terminated.
However, if pods are being terminated, we would also expect other fields to be getting updated also (active, failed, etc) so there should not be a large increase of API calls for patching.
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
For Job API, we are adding an enum field named PodReplacementPolicy which takes
either a TerminatingOrFailed or Failed
- API type(s): enum
- Estimated increase in size: 8B
We also added a status field for tracking terminating pods.
- API type(s): int32
- Estimated increase in size: 4B
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No, the existing SLIs/SLOs do not include the time taken to create new pods when existing ones are terminated.
There is an existing one on pod creation but this will not impact that.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
N/A
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
N/A
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
No change from existing behavior of the Job controller.
What are other known failure modes?
There are no other failure modes.
What steps should be taken if SLOs are not being met to determine the problem?
If one wants to keep the feature on, one can suspend the Jobs that are using this feature.
Setting suspend: true in the JobSpec halts the execution of that Job.
Implementation History
- 2023-04-03: Created KEP
- 2023-05-19: KEP Merged.
- 2023-07-16: Alpha PRs merged.
- 2023-09-29: KEP marked for beta promotion.
- 2023-10-24: Merged bugfix Fix tracking of terminating Pods when nothing else changes
- 2023-10-24: Merged adding a metric required for beta promotion feat: add job_pods_creation_total metric
- 2023-10-27: Merged Switch feature flag to beta for pod replacement policy and add e2e test #121491
- 2024-06-11: [v1.31] Merged Count terminating pods when deleting active pods for failed jobs #125175
- 2024-07-12: [v1.31] Merged Delay setting terminal Job conditions until all pods are terminal #125510
This feature was promoted to beta in v1.29, but important updates were implemented in v1.31.
For additional info, check the PRs linked above with the tag [v1.31].
Drawbacks
Enabling this feature may make rollouts slower.
Alternatives
We discussed having this under the PodFailurePolicy but this is a more general idea than the PodFailurePolicy.
Infrastructure Needed (Optional)
NA