KEP-3329: Retriable and non-retriable Pod failures for Jobs
KEP-3329: Retriable and non-retriable Pod failures for Jobs
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- User Stories (Optional)
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as
implementable - (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP extends Kubernetes to configure a job policy for handling pod failures.
In particular, the extension allows determining some of pod failures as caused
by infrastructure errors and to retry them without incrementing the counter
towards backoffLimit.
Additionally, the extension allows determining some pod failures as caused by software bugs and to terminate the associated job early. This is needed to save time and computational resources wasted due to unnecessary retries of containers destined to fail due to software bugs.
Motivation
Running a large computational workload, comprising thousands of pods on thousands of nodes requires usage of pod restart policies in order to account for infrastructure failures.
Currently, kubernetes Job API offers a way to account for infrastructure
failures by setting .backoffLimit > 0. However, this mechanism instructs the
job controller to restart all failed pods - regardless of the root cause
of the failures. Thus, in some scenarios this leads to unnecessary
restarts of many pods, resulting in a waste of time and computational
resources. What makes the restarts more expensive is the fact that the
failures may be encountered late in the execution time of a program.
Sometimes it can be determined from containers exit codes that the root cause of a failure is in the executable and the job is destined to fail regardless of the number of retries. However, since the large workloads are often scheduled to run over night or over the weekend, there is no human assistance to terminate such a job early.
The need for solving the problem has been emphasized by the kubernetes community in the issues, see: #17244 and #31147 .
Some third-party frameworks have implemented retry policies for their pods:
Additionally, some pod failures are not linked with the container execution,
but rather with the internal kubernetes cluster management (see:
Scheduling, Preemption and Eviction
.
Such pod failures should be recognized as infrastructure failures and it
should be possible to ignore them from the counter towards backoffLimit.
Goals
- Extension of Job API with user-friendly syntax to terminate jobs based on the end state of the failed pod.
Non-Goals
- Implementation of other indicators of non-retriable jobs such as termination logs.
- Modification of the semantics for job termination. In particular, allowing for all indexes of an indexed-job to execute when only one or a few indexes fail #109712 .
- Similar termination policies for other workload controllers such as Deployments or StatefulSets.
- Handling of Pod configuration errors resulting in pods stuck in the
Pendingstate (value ofstatus.phase) rather thanFailed(such as incorrect image name, non-matching configMap references, incorrect PVC references). - Adding pod conditions to indicate admission failures (see: active deadline timeout exceeded ) or exceeding of the active deadline timeout (see: admission failures . Also, adding pod conditions to indicate failures tue to exhausting the resource (memory or ephemeral storage) (see: Resource limits exceeded ).
- Adding of the disruption condition for graceful node shutdown on Windows, as there is some ground work required first to support Pod eviction due to graceful node shutdown on Windows.
Proposal
Extension of the Job API with a new field which allows to configure the set of conditions and associated actions which determine how a pod failure is handled. The extended Job API supports discrimination of pod failures based on the container exit codes as well as based on the end state of a failed pod.
In order to support discrimination of pod failures based on their end state
we use the already existing status.conditions field to append a dedicated Pod
Condition indicating (by its type) that the pod is being terminated by an
internal kubernetes component. Moreover, we modify the internal kubernetes
components to send an API call to append the dedicated Pod condition along with
sending the associated Pod delete request. In particluar, the following
kubernetes components will be modified:
- kube-controller-manager (Pod deletion by Taint Manager or PodGC)
- kube-scheduler (Pod deletion due to
Preemption) - kube-apiserver (API-initiated eviction)
- kubelet (Pod eviction due to exceeded resource limits, node shutdown, node pressure etc.)
We use the job controller’s main loop to detect and categorize the pod failures with respect to the configuration. For each failed pod, one of the following actions is applied:
- terminate the job (non-retriable failure),
- ignore the failure (retriable failure) - restart the pod and do not increment
the counter for
backoffLimit, - increment the
backoffLimitcounter and restart the pod if the limit is not reached (current behaviour).
User Stories (Optional)
Story 1
As a machine learning researcher, I run jobs comprising thousands
of long-running pods on a cluster comprising thousands of nodes. The jobs often
run at night or over weekend without any human monitoring. In order to account
for random infrastructure failures we define .backoffLimit: 6 for the job.
However, a significant portion of the failures happen due to bugs in code.
Moreover, the failures may happen late during the program execution time. In
such case, restarting such a pod results in wasting a lot of computational time.
We would like to be able to automatically detect and terminate the jobs which are failing due to the bugs in code of the executable, so that the computation resources can be saved.
Occasionally, our executable fails, but it can be safely restarted with a good chance of succeeding the next time. In such known retriable situations our executable exits with a dedicated exit code in the 40-42 range. All remaining exit codes indicate a software bug and should result in an early job termination.
The following Job configuration could be a good starting point to satisfy my needs:
apiVersion: v1
kind: Job
spec:
template:
spec:
containers:
- name: job-container
image: job-image
command: ["./program"]
backoffLimit: 6
podFailurePolicy:
rules:
- action: FailJob
onExitCodes:
operator: NotIn
values: [40,41,42]
Note that, when no rule specified in podFailurePolicy matches the pod failure
the default handling of pod failures applies - the counter of pod failures
is incremented and checked against the backoffLimit
(see: JobSpec API
]).
Story 2
As a service provider that offers computational resources to researchers I would like to have a mechanism which terminates jobs for which pods are failing due to user errors, but allows infinite retries for pod failures caused by cluster-management events (such as preemption). I do not have knowledge or influence over the executable that researchers run, so I don’t know beforehand which exit codes they might return.
Additionally, I would like to avoid unnecessary retries of Pods which exceed the configured memory or ephemeral-storage limits.
The following Job configuration could be a good starting point to satisfy my needs:
apiVersion: v1
kind: Job
spec:
template:
spec:
containers:
- name: main-job-container
image: job-image
command: ["./program"]
resources:
requests:
memory: "10Gi"
ephemeral-storage: "10Gi"
limits:
memory: "10Gi"
ephemeral-storage: "100Gi"
backoffLimit: 3
podFailurePolicy:
rules:
- action: Ignore
onPodConditions:
- type: DisruptionTarget
Note that, in this case the user supplies a list of Pod condition type values. This approach is likely to require an iterative process to review and extend of the list.
Story 3
As a service provider, similarly as in Story 2 , I would like to have a mechanism which terminates jobs for which pods are failing due to user errors. However, I’m concerned that jobs running on a cluster with a broken configuration might result in costly never-ending Pod retries.
The following Job configuration could be a good starting point to satisfy my needs:
apiVersion: v1
kind: Job
spec:
template:
spec:
containers:
- name: main-job-container
image: job-image
command: ["./program"]
resources:
requests:
memory: "10Gi"
ephemeral-storage: "10Gi"
limits:
memory: "10Gi"
ephemeral-storage: "100Gi"
backoffLimit: 3
podFailurePolicy:
rules:
- action: Count
onPodConditions:
- type: DisruptionTarget
- action: FailJob
onExitCodes:
operator: NotIn
values: [0]
Here we count all disruptions in the counter towards .spec.backoffLimit. All
Pod failures caused by exceeding the configured limits or with non-zero exit code
are categorized as user errors and result in termination of the entire job.
Notes/Constraints/Caveats (Optional)
Job-level vs. pod-level spec
We considered introduction of this feature in the pod spec, allowing to account for
container restarts within a running pod. However, we consider handling of
job-level failures or termination (for example, a Preemption) as an integral part of
this proposal. Also, when pod’s spec.restartPolicy is specified as Never, then the
failures can’t be handled by kubelet and need to be handled at job-level anyway.
Also, we consider this proposal as a natural extension of the already exiting
mechanism for job-level restarts of failed pods based on the Job’s
spec.backoffLimit configuration. In particular, this proposal aims to fix the
issue, in this mechanism, of unnecessary restarts when a Job can be determined
to fail despite retries.
We believe this feature can co-exist with other pod-level features providing container restart policies. However, if we establish there is a problematic interaction of this feature with another such feature, then we will consider additional validation of the JobSpec configuration to avoid the situation.
If, in the future, we introduce failure handling within the Pod spec, it would be limited to restartPolicy=OnFailure. Only one of the Pod spec or Job spec APIs will be allowed to be used at a time.
Relationship with Pod.spec.restartPolicy
We limit this feature by disallowing the use of restartPolicy=OnFailure when
the pod failure policy is specified by the podFailurePolicy field. Since the
Job controller only supports restartPolicy=Never and restartPolicy=OnFailure,
it effectively means that the use of job’s pod failure policy requires
restartPolicy=Never.
This is in order to avoid the problematic race-conditions between Kubelet and
Job controller. For example, Kubelet could restart a failed container before the
Job controller decides to terminate the corresponding job due to a rule using
onExitCodes.
The scope of the FailureTarget condition
As part of this KEP we introduced the FailureTarget condition scoped to the failures due to pod failure policy.
However, we are going to extend the scope of the condition to all Job failure scenarios (covering also backoffLimit exceeded and ActiveDeadlineSeconds exceeded), as part of fixing (issue #123775)[https://github.com/kubernetes/kubernetes/issues/123775].
See more details in the Job API managed-by mechanism .
Current state review
Here we review the current state of kubernetes (version 1.24) regarding its handling pod failures.
The list below contains scenarios which we have reproduced in order to investigate which pod fields could be used as indicators if a pod failure should or should not be retried.
The results demonstrate that there is no universal indicator (like a pod or container field) currently that discriminates pod failures which should be retried from those which should not be retried.
Preemption
- Reproduction: We run two long-running jobs. The second has higher priority pod which preempts the lower priority pod
- Comments: controlled by kube-scheduler in
scheduler/framework/preemption/preemption.go - Pod status:
- status: Terminating
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=137reason=Error
Taint-based eviction
- Reproduction: We run a long-running job. Then, we taint the node with
NoExecute - Comments: controlled by kube-scheduler in
controller/nodelifecycle/scheduler/taint_manager.go - Pod status:
- status: Terminating
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=137reason=Error
Node drain
- Reproduction: We run a job with a long-running pod, then drain the node
with the
kubectl draincommand - Comments: performed by Eviction API, controlled by kube-apiserver in
registry/core/pod/storage/eviction.go - Pod status:
- status: Terminating
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=137reason=Error
Node-pressure eviction
Memory-pressure eviction:
- Reproduction: We run a job with a pod which attempts to allocate more memory than available on the node
- Comments: controlled by kubelet in
kubelet/eviction/eviction_manager.go - Pod status:
- status: ContainerStatusUnknown
phase=Failedreason=Evictedmessage=The node was low on resource: memory. (...)
- Container status:
state=TerminatedexitCode=137reason=ContainerStatusUnknown
Disk-pressure eviction:
- Reproduction: We run a job with a pod which attempts to write more data than the disk space available on the node
- Comments: controlled by kubelet in
kubelet/eviction/eviction_manager.go - Pod status:
- status: Error
phase=Failedreason=Evictedmessage=The node was low on resource: ephemeral-storage. (...)
- Container status:
state=TerminatedexitCode=137reason=Error
Container memory limit exceeded
Linux:
- Reproduction: We run a job with a pod which attempts to allocate more
memory than constrained in the container spec by
resources.limits.memory - Comments: handled by kubelet in
kubelet/kubelet_pods.gowhich merges-in the status update (setting of the reason field) by the container runtime - Pod status:
- status: OOMKilled
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=137reason=OOMKilled
Windows:
- Reproduction: We run a job with a pod which attempts to allocate more
memory than constrained in the container spec by
resources.limits.memory - Comments: there is not clear indication that the container failed due to exceeding memory limit
- Pod status:
- status: Error
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=1reason=Error
Container ephemeral-storage limit exceeded
- Reproduction: We run a job with a pod which attempts to consume more disk space
than constrained in the container spec by
resources.limits.ephemeral-storage - Comments: handled by kubelet in
kubelet/eviction/eviction_manager.go - Pod status:
- status: Error
phase=Failedreason=Evictedmessage=Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
- Container status:
state=TerminatedexitCode=137reason=Error
Graceful node shutdown
Container does not have a dedicated SIGTERM handling and exits with status 137:
- Reproduction: The node needs to be started with positive
shutdownGracePeriod, for exampleshutdownGracePeriod=30s. We run a job with a long-running pod, then we prepare the node for shutdown by the command:dbus-send --system --type=signal /org/freedesktop/login1 org.freedesktop.login1.Manager.PrepareForShutdown boolean:"true", which similulates graceful nodes shutdown for maintenance purposes. - Comments: handled by kubelet in
kubelet/nodeshutdown/nodeshutdown_manager_linux.go - Pod status:
- status: Error
phase=Failedreason=Terminatedmessage=Pod was terminated in response to imminent node shutdown.
- Container status:
state=TerminatedexitCode=137reason=Error
Container handles SIGTERM and exits with status 0:
- Reproduction: As above, but the container handles SIGTERM and exits with status 0.
- Comments: handled by kubelet in
kubelet/nodeshutdown/nodeshutdown_manager_linux.go - Pod status:
- status: Completed
phase=Succeededreason=message=
- Container status:
state=TerminatedexitCode=0
Pod admission error
Admission error due to disk pressure:
- Reproduction: We run a pod on a node which is under disk pressure
(tainted with
node.kubernetes.io/disk-pressure:NoSchedule). In order to schedule the pod we untaint the node by command line. The Pod is scheduled but fails admission by Kubelet as the taint is re-added shortly after its manual removal. - Comments: controlled by kubelet in
kubelet/kubelet.go - Pod status:
- status: Evicted
phase=Failedreason=Evictedmessage=Pod The node had condition: [DiskPressure].
- No containers created
Note that, admission errors may occur due to various other reasons, resulting in different messages for pods.
Disconnected node
- Reproduction: We run a job with a long-running pod with finalizer (for example
created by the Job controller), then disconnect the node
and delete it by the
kubectl deletecommand - Comments: handled by Pod Garbage collector in:
controller/podgc/gc_controller.go. However, the pod phase remainsRunning. - Pod status:
- status: Terminating
phase=Runningreason=message=
- Container status:
state=RunningexitCode=reason=
Disconnected node when taint-manager is disabled
- Reproduction: Run kube-controller-manager with disabled taint-manager (with the
flag
--enable-taint-manager=false). Then, run a job with a long-running pod and disconnect the node - Comments: handled by node lifecycle controller in:
controller/nodelifecycle/node_lifecycle_controller.go. However, the pod phase remainsRunning. - Pod status:
- status: Unknown
phase=Runningreason=NodeLostmessage=Node mycluster-worker which was running pod play-longrun-f28ls is unresponsive
- Container status:
state=RunningexitCode=reason=
Direct container kill
- Reproduction: We run a job with a long-running pod, then we kill the container
by the
crictl stopcommand - Comments: handled by Kubelet
- Pod status:
- status: Error
phase=Failedreason=message=
- Container status:
state=TerminatedexitCode=137reason=Error
Termination initiated by Kubelet
In Alpha, there is no support for Pod conditions for failures or disruptions initiated by kubelet.
For Beta we introduce handling of Pod failures initiated by Kubelet by adding the pod disruption condition (introduced in Alpha) in case of disruptions initiated by Kubelet (see Design details ).
Kubelet can also evict a pod in some scenarios which are not covered with adding a pod failure condition:
Active deadline timeout exceeded
Kubelet can also evict a pod due to exceeded active deadline timeout (configured
by pod’s .spec.activeDeadlineSeconds field). On one hand, exceeding the timeout,
may suggest a software bug due to which the pod executes longer than expected.
On the other hand, it might be due node CPU pressure caused by other processes
on the node. Thus, in order to give users freedom of handling this situation we
should introduce a dedicated pod condition type, such as ActiveDeadlineExceeded.
However, as the feature focuses on scenarios which can be naturally interpreted
in terms of retriability and evolving Pod condition types
(see evolving condition types
) are a concern we decide
to do not add any pod condition in this case. It should be re-considered in the
future if there is a good motivating use-case.
The reported issue which could be addressed by the new condition for exceeding the active deadline timeout: Pod Failure Policy Edge Case: Job Retries When Pod Finishes Successfully .
Admission failures
In some scenarios a pod admission failure could result in a successful pod restart on another
node (for example a pod scheduled to a node with resource pressure, see:
Pod admission error
). However, in other situations it won’t be as clear,
since the failure can be caused by incompatible pod and node configurations. Node configurations
are often the same within a cluster, so it is likely that the pod would fail if restarted on any
other node in the cluster. In that case, adding DisruptionTarget condition could cause
a never-ending loop of retries, if the pod failure policy was configured to ignore such failures.
Given the above, we decide not to add any pod condition for such failures. If there is a sufficient
motivating use-case, a dedicated pod condition might be introduced to annotate some of the
admission failure scenarios.
Resource limits exceeded
A Pod failure initiated by Kubelet can be caused by exceeding pod’s
(or container’s) resource (memory or ephemeral-storage) limits. We have considered
(and prepared an initial implementation, see PR
Add ResourceExhausted pod condition for oom killer and exceeding of local storage limits
)
introduction of a dedicated Pod failure condition ResourceExhausted to
annotate pod failures due to the above scenarios.
However, it turned out, that there are complications with detection of exceeding memory limits:
- the approach we considered is to rely on the Out-Of-Memory (OOM) killer.
In particular, we could detect that a pod was terminated due to OOM killer based
on the container’s
reasonfield being equal toOOMKilled. This value is set on Linux by the leading container runtime implementations: containerd (see here for event handling and here for the constant definition) and CRI-O (see here ). - setting the
reasonfield toOOMKilledis not standardized, either. During the Beta phase implementation we discussed the standardization issue within the community (involving CNCF Technical Advisory Group for Runtime and SIG-node). We have also started an effort to standardize the handling of OOM killed containers (see: Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer ). However, in the process it turned out that in some configurations (for example the CRI-O with cgroupv2, see: Add e2e_node test for oom killed container reason ), the container’sreasonfield is not set toOOMKilled. - OOM killer might get invoked not only when container’s limits are exceeded,
but also when the system is running low on memory. In such scenario there
can be race conditions in which both the
DisruptionTargetcondition and theResourceExhaustedcould be added.
Thus, we’ve decided not to annotate the scenarios with the ResourceExhausted
condition in this KEP. Handling of exceeded limits might be done as a
follow up KEP once the ground work of introducing pod failure conditions is done.
While there are not known issues with detection of the exceeding of
Pod’s ephemeral storage limits, we prefer to avoid future extension of the
semantics of the new condition type. Alternatively, we could introduce a pair of
dedicated pod condition types: OOMKilled and EphemeralStorageLimitExceeded.
This approach, however, could create an unnecessary proliferation of the pod
condition types.
Finally, we would like to first hear user feedback on the preferred approach and also on how important it is to cover the resource limits exceeded scenarios.
JobSpec API alternatives
Alternative versions of the JobSpec API to define requirements on exit codes and on pod end state have been proposed and discussed (see: Alternatives ). The outcome of the discussions as well as the experience gained during the Alpha implementation may influence the final API.
Failing delete after a condition is added
Here we consider a scenario when a component fails (for example its container dies) between appending a pod condition and deleting the pod.
In particular, scheduler can possibly decide to preempt a different pod the next time (or none). This would leave a pod with a condition that it was preempted, when it actually wasn’t. This in turn could lead to improper handling of the pod by the job controller.
As a solution we implement a worker, part of the disruption
controller, which clears the pod condition added if DeletionTimestamp is
not added to the pod for a long enough time (for example 2 minutes).
Marking pods as Failed
When matching a failed pod against Job pod failure policy, it is important that
the pod is actually in the terminal phase (Failed), to ensure their state is
not modified while Job controller matches them against the pod failure policy.
Additionally, it is necessary to avoid the creation of a replacement Pod if the
previously created Pod becomes terminating (has a deletionTimestamp but is
not Failed nor Succeeded yet), or we might create replacement Pods that
wouldn’t be created if the pod failure policy was applied against the terminated Pod.
There are scenarios in which a pod gets stuck in a non-terminal phase,
but is doomed to be failed, as it is terminating (has deletionTimestamp set, also
known as the DELETING state, see:
The API Object Lifecycle
).
In order to workaround this issue, Job controller, when pod failure policy is
disabled, considers any terminating pod that is in a non-terminal phase as failed.
Note that, it is important that when Job controller considers such pods as failed
so that it removes their finalizers and thus allows the API server to complete
their deletion.
In order to ensure consistency in behavior when handling pod failures when pod failure policy is used or not, we need to make sure that all pods which are terminating (were previously considered as failed by the Job controller), are transitioned to the failed state eventually.
Thus, we review scenarios in which a pod is considered by Job controller as failed, but may get stuck in a non-terminal. Note that, the following scenarios are not only problematic to the Job controller (due to its use of finalizers), but they can be considered as bugs in their own right, as every pod should end up in a terminal phase, even if not started (see discussion ).
- Orphan pods (solved in Alpha)
This pod state is characterized by the following:
- scheduled (the Pod’s
.spec.nodeNameis set), but the node no longer exists RunningorPendingphase
Note that, as PodGC sends DELETE request for such pod (DELETE request was sent
prior to this KEP, but without setting the phase to Failed), it becomes also
terminating (has deletionTimestamp set).
Example steps leading to the Pod’s state:
- pod scheduled to a node, might be
RunningorPending - node deleted (see: Disconnected node )
- Pending, terminating and unscheduled (solved in Alpha)
This pod state is characterized by the following:
- unscheduled (the Pod’s
.spec.nodeNameis nil), so no Kubelet is assigned Pendingphase- the Pod’s
deletionTimestampis set by a DELETE request (terminating)
Example steps leading to the Pod’s state:
- unschedulable pod (for example due to unsatisfiable requests)
- DELETE request sent by a user or another k8s component
Note that, the point about the Pod being terminating is important here. As long as the pod is not terminating it can be scheduled if resources are increased to satisfy its requests.
- Pending, terminating, and scheduled (planned for second Beta)
This pod state is characterized by the following:
- the Pod’s
.spec.nodeNameis set, but the node no longer exists Pendingphase- the Pod’s
deletionTimestampis set by a DELETE request (terminating)
Example steps leading to the Pod’s state:
- Pod scheduled to a node, but remains in
Pending(e.g. due to an invalid image reference, invalid config map, one of the containers failing to start) - DELETE request sent by a user or another k8s component
Note that, the point about the Pod being terminating is important here. As long as the pod is not terminating, it could be a transient issue and Kubelet can make progress after retrying.
Review of example steps for the 3rd scenario
Below we present example steps leading to the 3rd scenario on k8s 1.26.
Invalid image reference, pod deleted by a user
- create a pod with invalid image reference, example
yaml:
apiVersion: batch/v1
kind: Job
metadata:
name: invalid-image
spec:
template:
spec:
restartPolicy: Never
containers:
- name: invalid-image
image: non_existing_image_name:102
command: ["bash"]
args: ["-c", 'echo "Hello world"']
podFailurePolicy: # this is just to prevent considering the pod as failed until in terminal phase
rules: []
backoffLimit: 0
- delete the pod with
kubectl delete pods -l job-name=invalid-image
The relevant fields of the pod:
metadata:
deletionTimestamp: "2023-02-03T13:48:14Z"
finalizers:
- batch.kubernetes.io/job-tracking
status:
conditions:
- status: "True"
type: Initialized
- status: "False"
type: Ready
- status: "False"
type: ContainersReady
- status: "True"
type: PodScheduled
containerStatuses:
- image: non_existing_image_name:102
state:
waiting:
message: Back-off pulling image "non_existing_image_name:102"
reason: ImagePullBackOff
phase: Pending
The pod is stuck in the Pending and Terminating state. The finalizer
is not deleted by Job controller as it requires the pod to be in terminal
phase when podFailurePolicy is defined. The pod remains in the state even
if the image reference is fixed manually.
Invalid image reference, pod deleted by scheduler
- Create a pod similar to the one above, but extend it with requests to ensure it will be preempted by another pod with a higher priority
- Create another pod with the critical priority (e.g.
system-node-critical) on the same node so that Kube-scheduler preempts the first pod.
The status looks like in the previous example, but contains the
DisruptionTarget=True condition.
As above, the pod is stuck in the Pending and Terminating state.
Invalid configMap reference, pod deleted by a user
- create a pod with invalid configMap reference, example
yaml:
apiVersion: batch/v1
kind: Job
metadata:
name: invalid-configmap-ref
spec:
template:
spec:
restartPolicy: Never
volumes:
- name: volume-name
configMap:
name: invalid-config-name-name
containers:
- name: invalid-configmap-ref
image: centos:7
command: ["bash"]
args: ["-c", 'echo "Hello world"']
volumeMounts:
- mountPath: /script_path
name: volume-name
podFailurePolicy:
rules: []
backoffLimit: 0
- delete the pod with
kubectl delete pods -l job-name=invalid-configmap-ref
The relevant fields of the pod:
metadata:
deletionTimestamp: "2023-02-03T13:48:14Z"
finalizers:
- batch.kubernetes.io/job-tracking
status:
conditions:
- status: "True"
type: Initialized
- status: "False"
type: Ready
- status: "False"
type: ContainersReady
- status: "True"
type: PodScheduled
containerStatuses:
- image: centos:7
state:
terminated:
exitCode: 137
message: The container could not be located when the pod was terminated
reason: ContainerStatusUnknown
phase: Pending
As above, the pod is stuck in the Pending and Terminating state.
Correct config, but image takes long to download, pod deleted by a user in the meanwhile
This scenario is a little bit different, the config is correct, but the
pod is deleted by a user while in the Pending phase. In that case, the
pods transition into the Running phase and fail soon after. With the proposed
change the transition will happen earlier, thus saving resources.
- create a pod using a huge image to be in the
Pendingphase for long, exampleyaml:
apiVersion: batch/v1
kind: Job
metadata:
name: huge-image
spec:
template:
spec:
restartPolicy: Never
containers:
- name: huge-image
image: sagemathinc/cocalc # this is around 20GB
command: ["bash"]
args: ["-c", 'sleep 60 && echo "Hello world"']
podFailurePolicy:
rules: []
backoffLimit: 0
- delete the pod with
kubectl delete pods -l job-name=huge-image
The relevant fields of the pod:
status:
conditions:
- status: "True"
type: Initialized
- status: "False"
type: Ready
- reason: PodFailed
status: "False"
type: ContainersReady
- status: "True"
type: PodScheduled
containerStatuses:
- image: docker.io/sagemathinc/cocalc:latest
state:
terminated:
exitCode: 137
reason: Error
phase: Failed
Here, the pod is not stuck, however it transitions to Running and fails
soon after, making the interim transition to Running unnecessary. Also, there
is a race condition, if the container succeeds before the graceful period for
pod termination (if not for the sleep 60 in the example above) the running pod may complete with the
Succeeded status before its containers are killed (and it transitions in the
Failed phase). This is already problematic for the Job controller, which might
count the pod as failed, despite the pod eventually succeeding. With the proposed
change, in the scenario, the pod transitions directly from the Pending phase
to Failed.
Proposed solution for the 3rd scenario
We plan to fix the 3rd scenario by
modifying Kubelet to transition such pods (Pending, terminating, and scheduled)
into the Failed phase, allowing the Job
controller to count them, match against the (optional) pod failure policy and
remove their Job finalizers, allowing API server to complete deletion.
In the final update transitioning the pod to the Failed phase Kubelet will
also set the reason and the message field. We propose the values as
Deleted and Deleted while pending, respectively, in order to reflect the
direct reason for the pod to transitioned to the Failed phase.
As the reasons for deleting a Pod while it is in the Pending phase might be
various, it should be up to the controller or a user to add an adequate pod
condition to reflect the reason. An example scenario when Kube-scheduler adds
the DisruptionTarget is presented above.
Note that, the transitioning such pods to Failed phase requires an additional PATCH request from Kubelet to the API server. However, the situation should be rare as it requires DELETE which needs to be send by a user or another component for the Pod to be terminating, so it should have a limited impact on performance. An analogous decision has been made for PodGC.
A prototype implementation is prepared here: Mark Pending Terminating pods as Failed by Kubelet .
Implementation progress and plan
For Alpha, we implemented a fix for scenarios 1. and 2. by setting the pod phase
as Failed in PodGC. Note that, in both scenarios there is no Kubelet to
transition the pod, thus it is done by PodGC.
As the 3rd scenario wasn’t fixed in the first iteration of Beta (1.26), we only
require pods to be in terminal phase when podFailurePolicy is specified (and
only in that case), but this creates an inconsistency with handling Jobs
with and without pod failure policies
(see Issue: Job controller should wait for Pods to terminate to match the failure policy
).
For the second iteration of Beta (1.27), we plan to fix the 3rd scenario as suggested above, see: Proposed solution for the 3rd scenario .
Also, for the second iteration of Beta (1.27) we are going to expand the feature documentation to explain this change. In particular, we are going to provide a list of example scenarios impacted by this change, including: invalid image reference, invalid config map reference.
We considered to simplify Job controller to count as failed only pods which
are in terminal phase regardless of the fact if the podFailurePolicy.
However, we will not do it as discussed on the dedicated issue
Job controller should wait for Pods to be in a terminal phase before considering them failed or succeeded
,
because this would not only be a cleanup, but also change of the current semantic
when pod failure policy is not used. The current semantic matches the expectations.
Risks and Mitigations
Garbage collected pods
The Pod status (which includes the conditions field and the container exit
codes) could be lost if the failed pod is garbage collected.
Losing Pod’s status before it is interpreted by Job Controller can be prevented by using the feature of job tracking with finalizers (see more about the design details section: Interim FailureTarget condition ).
Evolving condition types
The list of available pod condition types field will be evolving with new values being added and potentially some values becoming obsolete. This can make it difficult to maintain a valid list of pod condition types enumerated in the Job configuration.
In order to mitigate this risk we are going to define (along with documentation) the new condition types as constants in the already existing list defined in the k8s.io/apis/core/v1 package. Thus, every addition of a new condition type will require an API review. The constants will allow users of the package to reduce the risk of typing mistakes.
Additionally, we introduce a generic condition type DisruptionTarget which
indicates pod disruption (due to e.g. preemption, API-initiated eviction or
taint-based eviction).
A more detailed information about the condition (containing the kubernetes
component name which initiated the disruption) is conveyed by the reason and
message fields of the added pod condition.
Finally, we are going to cover the handling of pod failures associated with the new pod condition types in integration tests.
Stale DisruptionTarget condition which is not cleaned up
It is possible that a stale disruption condition (see: Failing delete after a condition is added ) is not clean up by the disruption controller before the Pod completes. The stale Pod condition can misguide the Job controller.
First, the scenario is unlikely as it requires that a Pod deletion call fails
right after the Pod status update succeeded, and that conditions in the cluster
change (for example the NoExecute taint is removed or kube-scheduler decides
to preempt another Pod) so that the deletion is not re-attempted. Additionally,
the Pod needs to fail within 2min (before the disruption controller cleans it
up) for a reason which would not result in adding DisruptionTarget, so the
stale condition is inaccurate when inspected by the Job controller.
Second, the negative consequence of the scenario is limited. For jobs which are configured to ignore the disruption errors it results in an unnecessary Pod retry.
Given the factors above we assess this is an acceptable risk.
Design Details
As our review shows there is currently no convenient indicator, in the pod end state, if the pod should be retried or should not. Thus, we introduce a set of dedicated Pod conditions which can be used for this reason.
New PodConditions
A new condition type, called DisruptionTarget, is introduced to indicate
a pod failure caused by a disruption. In order to account for different
reasons for pod termination we add the following reason types based on the
invocation context (we focus on covering these scenarios were the new
condition makes it easier to determine if a failed pod should be restarted):
- PreemptionByKubeScheduler (Pod preempted by kube-scheduler)
- DeletionByTaintManager (Pod evicted by kube-controller-manager due to taints)
- EvictionByEvictionAPI (Pod deleted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by pod GC)
- TerminationByKubelet (Pod terminated due to graceful node shutdown, node resource pressure, or Kubelet preemption for critical pods).
The already existing status.conditions field in Pod will be used by kubernetes
components to append a dedicated condition.
When the failure is initiated by a component which deletes the pod, then the API call to append the condition will be issued as a pod status update call before the Pod delete request (not necessarily as the last update request before the actual delete). For Kubelet, which does not delete the pod itself, the pod condition is added in the same API request as the phase change to failed. This way the Job controller will be able to see the condition, and match it against the pod failure policy, when handling a failed pod.
During the implementation process we are going to review the places where the
pod delete requests are issued to modify the code to also append a meaningful
condition with dedicated type, reason and message fields based on the
invocation context.
We list the reason constants above in order to gain the community consensus on
informative names, but they are purely informational. It is not supported to use the reason or message fields
when defining a pod failure policy (see PodFailurePolicyOnPodConditionsPattern
in JobSpec API
). Supporting of matching by the reason field
is a possible extension of the feature which can be implemented once there is
use case to motivate it. However, it may create a risk of breaking compatibility
with evolving set of reasons in use, similar to the risk of
evolving condition types
.
Interim FailureTarget Job condition
There is a risk of losing the Pod status information due to PodGC, which could prevent Job Controller to react to a pod failure with respect to the configured pod failure policy rules (see also: Garbage collected pods ).
In order to make sure all pods are checked against the rules we require the feature of job tracking with finalizers to be enabled.
Additionally, before we actually remove the finalizers from the pods
(allowing them to be deleted by PodGC) we record the determined job failure
message (if any rule with JobFail matched) in an interim job condition, called
FailureTarget. Once the pod finalizers are removed we update the job status
with the final Failed job condition. This strategy eliminates a possible
race condition that we could lose the information about the job failure if
Job Controller crashed between removing the pod finalizers are updating the final
Failed condition in the job status.
JobSpec API
We extend the Job API in order to allow to apply different actions depending on the conditions associated with the pod failure.
// PodFailurePolicyAction specifies how a Pod failure is handled.
// +enum
type PodFailurePolicyAction string
const (
// This is an action which might be taken on a pod failure - mark the
// pod's job as Failed and terminate all running pods.
PodFailurePolicyActionFailJob PodFailurePolicyAction = "FailJob"
// This is an action which might be taken on a pod failure - the counter towards
// .backoffLimit, represented by the job's .status.failed field, is not
// incremented and a replacement pod is created.
PodFailurePolicyActionIgnore PodFailurePolicyAction = "Ignore"
// This is an action which might be taken on a pod failure - the pod failure
// is handled in the default way - the counter towards .backoffLimit,
// represented by the job's .status.failed field, is incremented.
PodFailurePolicyActionCount PodFailurePolicyAction = "Count"
)
// +enum
type PodFailurePolicyOnExitCodesOperator string
const (
PodFailurePolicyOnExitCodesOpIn PodFailurePolicyOnExitCodesOperator = "In"
PodFailurePolicyOnExitCodesOpNotIn PodFailurePolicyOnExitCodesOperator = "NotIn"
)
// PodFailurePolicyOnExitCodesRequirement describes the requirement for handling
// a failed pod based on its container exit codes. In particular, it lookups the
// .state.terminated.exitCode for each app container and init container status,
// represented by the .status.containerStatuses and .status.initContainerStatuses
// fields in the Pod status, respectively. Containers completed with success
// (exit code 0) are excluded from the requirement check.
type PodFailurePolicyOnExitCodesRequirement struct {
// Restricts the check for exit codes to the container with the
// specified name. When null, the rule applies to all containers.
// When specified, it should match one the container or initContainer
// names in the pod template.
// +optional
ContainerName *string
// Represents the relationship between the container exit code(s) and the
// specified values. Containers completed with success (exit code 0) are
// excluded from the requirement check. Possible values are:
// - In: the requirement is satisfied if at least one container exit code
// (might be multiple if there are multiple containers not restricted
// by the 'containerName' field) is in the set of specified values.
// - NotIn: the requirement is satisfied if at least one container exit code
// (might be multiple if there are multiple containers not restricted
// by the 'containerName' field) is not in the set of specified values.
// Additional values are considered to be added in the future. Clients should
// react to an unknown operator by assuming the requirement is not satisfied.
Operator PodFailurePolicyOnExitCodesOperator
// Specifies the set of values. Each returned container exit code (might be
// multiple in case of multiple containers) is checked against this set of
// values with respect to the operator. The list of values must be ordered
// and must not contain duplicates. Value '0' cannot be used for the In operator.
// At least one element is required. At most 255 elements are allowed.
// +listType=set
Values []int32
}
// PodFailurePolicyOnPodConditionsPattern describes a pattern for matching
// an actual pod condition type.
type PodFailurePolicyOnPodConditionsPattern struct {
// Specifies the required Pod condition type. To match a pod condition
// it is required that specified type equals the pod condition type.
Type api.PodConditionType
// Specifies the required Pod condition status. To match a pod condition
// it is required that the specified status equals the pod condition status.
// Defaults to True.
Status api.ConditionStatus
}
// PodFailurePolicyRule describes how a pod failure is handled when the requirements are met.
// One of OnExitCodes and onPodConditions, but not both, can be used in each rule.
type PodFailurePolicyRule struct {
// Specifies the action taken on a pod failure when the requirements are satisfied.
// Possible values are:
// - FailJob: indicates that the pod's job is marked as Failed and all
// running pods are terminated.
// - Ignore: indicates that the counter towards the .backoffLimit is not
// incremented and a replacement pod is created.
// - Count: indicates that the pod is handled in the default way - the
// counter towards the .backoffLimit is incremented.
// Additional values are considered to be added in the future. Clients should
// react to an unknown action by skipping the rule.
Action PodFailurePolicyAction
// Represents the requirement on the container exit codes.
// +optional
OnExitCodes *PodFailurePolicyOnExitCodesRequirement
// Represents the requirement on the pod conditions. The requirement is represented
// as a list of pod condition patterns. The requirement is satisfied if at
// least one pattern matches an actual pod condition. At most 20 elements are allowed.
// +listType=atomic
OnPodConditions []PodFailurePolicyOnPodConditionsPattern
}
// podFailurePolicy describes how failed pods are accounted. In particular,
// how they influence the backoffLimit.
// When using podFailurePolicy, terminating Pods (have a `deletionTimestamp`)
// are not immediately replaced and don't count as failed until they reach
// a terminal phase (`Failed` or `Succeeded`).
type PodFailurePolicy struct {
// A list of pod failure policy rules. The rules are evaluated in order.
// Once a rule matches a Pod failure, the remaining of the rules are ignored.
// When no rule matches the Pod failure, the default handling applies - the
// counter of pod failures is incremented and it is checked against
// the backoffLimit. At most 20 elements are allowed.
// +listType=atomic
Rules []PodFailurePolicyRule
}
// JobSpec describes how the job execution will look like.
type JobSpec struct {
...
// Specifies the policy of handling failed pods. In particular, it allows to
// specify the set of actions and conditions which need to be
// satisfied to take the associated action.
// If empty, the default behaviour applies - the counter of failed pods,
// represented by the jobs's .status.failed field, is incremented and it is
// checked against the backoffLimit. This field cannot be used in combination
// with .spec.podTemplate.spec.restartPolicy=OnFailure.
//
// This field is alpha-level. To use this field, you must enable the
// `JobPodFailurePolicy` feature gate (disabled by default).
// +optional
PodFailurePolicy *PodFailurePolicy
...
Additionally, we validate the following constraints for each instance of PodFailurePolicyRule:
- exactly one of the fields
onExitCodesandOnPodConditionsis specified for a requirement - the specified
containerNamematches name of a configurated container
Here is an example Job configuration which uses this API:
apiVersion: v1
kind: Job
spec:
template:
spec:
containers:
- name: main-job-container
image: job-image
command: ["./program"]
resources:
limits:
memory: "128Mi"
ephemeral-storage: "1Gi"
- name: monitoring-job-container
image: job-monitoring
command: ["./monitoring"]
resources:
limits:
memory: "128Mi"
ephemeral-storage: "1Gi"
backoffLimit: 3
podFailurePolicy:
rules:
- action: FailJob
onExitCodes:
containerName: main-job-container
operator: In
values: [1,2,3]
- action: Ignore
onPodConditions:
- type: DisruptionTarget
Evaluation
We use the syncJob function of the Job controller to evaluate the specified
podFailurePolicy rules against the failed pods.
Since terminating Pods (have deletionTimestamp and are not Failed or
Succeeded) don’t have an exit code yet and might actually succeed, the
controller will not evaluate them against the podFailurePolicy.
The job controller will also not create a replacement Pod until they reach the
Failed phase. This behavior is the same as
podReplacementPolicy: Failed
.
When evaluating Failed Pods against the podFailurePolicy, it is only the first
rule with matching requirements which is applied as the rules are evaluated in order.
If the pod failure does not match any of the specified rules, then default
handling of failed pods applies.
If we limit this feature to use onExitCodes only when restartPolicy=Never
(see: limiting this feature
), then the rules using
onExitCodes are evaluated only against the exit codes in the state field
(under terminated.exitCode) of pod.status.containerStatuses and
pod.status.initContainerStatuses. We may also need to check for the exit codes
in lastTerminatedState if we decide to support onExitCodes when
restartPolicy=OnFailure.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
We assess that the Job controller (which is where the most complicated changes will be done) has adequate test coverage for places which might be impacted by this enhancement. Thus, no additional tests prior implementing this enhancement are needed.
Unit tests
Unit tests will be added along with any new code introduced. In particular, the following scenarios will be covered with unit tests:
- handling or ignoring of
spec.podFailurePolicyby the Job controller when the feature gate is enabled or disabled, respectively, - validation of a job configuration with respect to
spec.podFailurePolicyby kube-apiserver - handling of a pod failure, in accordance with the specified
spec.podFailurePolicy, when the failure is associated with- a failed container with non-zero exit code,
- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the
DisruptionTargetby Kubelet in case of:- eviction due to graceful node shutdown
- eviction due to node pressure
The core packages (with their unit test coverage) which are going to be modified during the implementation:
k8s.io/kubernetes/pkg/controller/job:13 June 2022-88%k8s.io/kubernetes/pkg/apis/batch/validation:13 June 2022-94.4%k8s.io/kubernetes/pkg/apis/batch/v1:13 June 2022-83.6%k8s.io/kubernetes/pkg/controller/podgc:4 June 2024-81.0%k8s.io/kubernetes/pkg/controller/tainteviction:4 June 2024-81.8%k8s.io/kubernetes/pkg/registry/core/pod/storage:4 June 2024-78.8%k8s.io/kubernetes/pkg/controller/disruption:4 June 2024-79.3%k8s.io/kubernetes/pkg/scheduler/framework/preemption:4 June 2024-30.1%
The kubelet packages (with their unit test coverage) which are going to be modified during implementation:
k8s.io/kubernetes/pkg/kubelet/nodeshutdown:13 Sep 2022-74.9%k8s.io/kubernetes/pkg/kubelet/eviction:13 Sep 2022-67.7%k8s.io/kubernetes/pkg/kubelet/preemption:4 June 2024-73.7%
Integration tests
The following scenarios will be covered with integration tests:
enabling, disabling and re-enabling of the feature gate link
pod failure is triggered by a delete API request along with appending a Pod condition indicating termination originated by a kubernetes component (we aim to cover all such scenarios)
pod failure is caused by a failed container with a non-zero exit code link
cleanup of a stale DisruptionTarget condition link
More integration tests might be added to ensure good code coverage based on the actual implementation.
e2e tests
The following scenario are covered with e2e tests:
- sig-apps#gce
:
- Job Using a pod failure policy to not count some failures towards the backoffLimit Ignore DisruptionTarget condition
- Job Using a pod failure policy to not count some failures towards the backoffLimit Ignore exit code 137
- Job should allow to use the pod failure policy on exit code to fail the job early
- Job should allow to use the pod failure policy to not count the failure towards the backoffLimit
- sig-scheduling#gce-serial
:
- SchedulerPreemption [Serial] validates pod disruption condition is added to the preempted pod
- sig-release-master-informing#gce-cos-master-serial
:
- NoExecuteTaintManager Single Pod [Serial] pods evicted from tainted nodes have pod disruption condition
The following scenarios are covered with node e2e tests (sig-node-presubmits#pr-kubelet-gce-e2e-pod-disruption-conditions and sig-node-presubmits#pr-node-kubelet-serial-containerd ):
- GracefulNodeShutdown [Serial] [NodeFeature:GracefulNodeShutdown] [NodeFeature:GracefulNodeShutdownBasedOnPodPriority] graceful node shutdown when PodDisruptionConditions are enabled [NodeFeature:PodDisruptionConditions] should add the DisruptionTarget pod failure condition to the evicted pods
- PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive][NodeFeature:Eviction] when we run containers that should cause PIDPressure; PodDisruptionConditions enabled [NodeFeature:PodDisruptionConditions] should eventually evict all of the correct pods
- CriticalPod [Serial] [Disruptive] [NodeFeature:CriticalPod] when we need to admit a critical pod should add DisruptionTarget condition to the preempted pod [NodeFeature:PodDisruptionConditions]
More e2e test scenarios might be considered during implementation if practical.
Graduation Criteria
Alpha
- Implementation:
- handling of failed pods with respect to
spec.podFailurePolicyby Job controller - appending of a dedicated Pod condition (when the Pod termination is initiated by a kubernetes control plane component) to the list of Pod conditions along with sending the Pod delete request
- define as a constant and document the new Pod condition Type
- the feature is limited by disallowing of the use of
onExitCodeswhenrestartPolicy=OnFailure
- handling of failed pods with respect to
- The feature flag disabled by default
- Tests: unit and integration
Beta
- Address reviews and bug reports from Alpha users
- E2e tests are in Testgrid and linked in KEP
- implementation of extending the existing Job controller’s metrics:
job_finished_totalby thereasonfield; and introduction of thepod_failures_handled_by_failure_policy_totalmetric with theactionlabel (see also here ) - implementation of adding pod disruption conditions (
DisruptionTarget) by Kubelet when terminating a Pod (see: Termination initiated by Kubelet ) - Refactor adding of pod conditions with the use of SSA client.
- The feature flag enabled by default
Second iteration (1.27):
- Extend Kubelet to mark as failed pending terminating pods (see: Marking pods as Failed ).
- Extend the feature documentation to explain transitioning of pending and
terminating pods into
Failedphase.
Third iteration (1.28):
- Add
DisruptionTargetcondition for pods which are preempted by Kubelet to make room for critical pods. Also, backport this fix to 1.26 and 1.27 release branches, and update the user-facing documentation to reflect this change. - Avoid creation of replacement Pods for terminating Pods until they reach the terminal phase. Update user-facing documentation. It was back-ported to 1.27 .
Fourth iteration (1.29):
- Fix the Pod Garbage collector fails to clean up PODs from nodes that are not running anymore
.
by withdrawing from SSA in the k8s controllers which were adding the
DisruptionTargetcondition. We will reconsider returning to SSA if the issue is fixed, but we consider the transition as a technical detail, not impacting the API, which can be done independently of the KEP graduation cycles. The fix was back-ported to 1.28 , 1.27 , and 1.26 .
GA
- Address reviews and bug reports from Beta users
- Improved tests coverage:
- unit test for preemption by kube-scheduler, if feasible
- Write a blog post about the feature
- Graduate e2e tests as conformance tests
- Lock the
PodDisruptionConditionsandJobPodFailurePolicyfeature-gates - Declare deprecation of the
PodDisruptionConditionsandJobPodFailurePolicyfeature-gates in documentation - Modify the code to ignore the
PodDisruptionConditionsandJobPodFailurePolicyfeature gates
Deprecation
In GA+2 release:
- Remove the
PodDisruptionConditionsandJobPodFailurePolicyfeature gates
Upgrade / Downgrade Strategy
Upgrade
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying spec.podFailurePolicy. The
only noticeable difference in behavior, without specifying spec.podFailurePolicy,
is that Pods terminated by kubernetes components will have an additional
condition appended to status.conditions.
Downgrade
A downgrade to a version which does not support this feature should not require
any additional configuration changes. Jobs which specified
spec.podFailurePolicy (to make use of this feature) will be handled in a
default way.
Version Skew Strategy
This feature uses an additional API call between kubernetes components to append a Pod condition when terminating a pod. However, this API call uses pre-existing API so the version skew does not introduce runtime compatibility issues.
We use the feature gate strategy for coordination of the feature enablement between components.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in
kep.yaml)- Feature gate name: PodDisruptionConditions
- Components depending on the feature gate:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kubelet
- Components depending on the feature gate:
- Feature gate name: JobPodFailurePolicy
- Components depending on the feature gate:
- kube-apiserver
- kube-controller-manager
- Components depending on the feature gate:
- Feature gate name: PodDisruptionConditions
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
Yes. The kubernetes components (kubelet, kube-apiserver, kube-scheduler and kube-controller-manager) will append a pod condition along with the request pod delete request.
However, the part of the feature responsible for handling of the failed pods
is opt-in with .spec.podFailurePolicy.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Using the feature gate is the recommended way. When the feature is disabled
the Job controller manager handles pod failures in the default way even if
spec.podFailurePolicy is specified. Additionally, the dedicated Pod Conditions
are no longer appended along with delete requests.
What happens if we reenable the feature if it was previously rolled back?
The Job controller starts to handle pod failures according to the specified
spec.podFailurePolicy. Additionally, again, along with the delete requests, the
dedicated Pod Conditions are appended to Pod’s status.condition.
Are there any tests for feature enablement/disablement?
Yes, unit and integration test for the feature enabled, disabled and transitions.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
If any component has not yet rolled out, or fails to rollout, the existing default behavior will continue to apply, but there is no downtime during partial rollout or rollback.
What specific metrics should inform a rollback?
A substantial increase in the job_sync_duration_seconds metric may suggest the
processing of the configured job pod failure policy rules consumes too much time.
An operator can also observe job_pods_finished_total to check if the reason count
of taken actions (FailJob, Count or Ignore) correlates with the expected
changes based on the Job workload specificity.
Additionally, an operator should check if the terminated pods (due to reasons
listed in design details
) have the appropriate pod condition
added. The addition of the pod conditions can be checked by standard tools such
as the kubectl describe command or a watch
kubectl get pods -o yaml -w --output-watch-events.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
- Deploy k8s 1.25 with
PodDisruptionConditionsandJobPodFailurePolicyfeature gates disabled - Enable the
PodDisruptionConditionsandJobPodFailurePolicyfeature gates for the control plane components - Test the scenarios (described in Handling retriable and non-retriable pod failures with Pod failure policy :
- Scenario 1:
- Create a job with container failing with
42exit code. The job hasbackoffLimit>0and pod failure policy with aFailJobrule matching the exit codes. - Verify that the job fails fast without retries.
- Create a job with container failing with
- Scenario 2:
- Create a job with a long running containers and
backoffLimit=0. - Verify that the job continues after the node in uncordoned
- Create a job with a long running containers and
- Disable the feature gates. Verify that the above scenarios result in default behavior:
- In scenario 1: the job restarts pods failed with exit code
42 - In scenario 2: the job is failed due to exceeding the
backoffLimitas the failed pod failed during the draining
- Re-enable the feature gates
- Verify the above described scenarios work as after the first enablement
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
We use the metrics-based approach based on the following metrics (exposed by kube-controller-manager):
job_finished_total(existing, extended by a label): the newreasonlabel indicates the reason for the job termination. Possible values arePodFailurePolicy,BackoffLimitExceededandDeadlineExceeded. It can be used to determine what is the relative frequency of job terminations due to different reasons. For example, if jobs are terminated often due toBackoffLimitExceededit may suggest that the pod failure policy should be extended with new rules to terminate jobs early more oftenpod_failures_handled_by_failure_policy_total(new): theactionlabel tracks the number of failed pods that are handled by a specific failure policy action. Possible values are:FailJob,IgnoreandCount. This metric can be used to assess the coverage of pod failure scenarios withspec.podFailurePolicyrules.
How can someone using this feature know that it is working for their instance?
- Pod .status
- Condition type:
DisruptionTargetwhen a Pod is terminated due to a reason listed in design details .
- Condition type:
- Job .status
- Condition reason:
PodFailurePolicyfor the jobFailedcondition if the job was terminated due to the matchingFailJobrule.
- Condition reason:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- 99% percentile over day for Job syncs is <= 15s for a client-side 50 QPS limit.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
job_sync_duration_seconds(existing): can be used to see how much the feature enablement increases the time spent in the sync job
- Components exposing the metric: kube-controller-manager
- Metric name:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
Yes. A PATCH API call to append a Pod condition when deleting the Pod. Also,
one PATCH API call to set Pod’s phase as Failed in scenarios (2nd and 3rd)
described under Marking pods as Failed
. Note that,
in the 1st scenario the phase is set along with adding the Pod condition.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes.
When the feature is enabled, Pods will be added a new Pod condition on termination.
- API type: Pod
- Estimated increase in size: 100B
- No new Pod objects
The size of the new Pod condition we estimate by adding the estimated sizes of the fields (in bytes):
type: 20status: 5reason: 30message: 50lastProbeTime: 8LastTransitionTime: 8.
When the feature is enabled users will be able to configure the Job’s pod failure policy.
- API type: Job
- Estimated increase in size: 22KB
- No new Job objects
We estimate the size of the new podFailurePolicy field in Job as max_number_of_rules * max(est_onExitCodes_size, est_onPodConditions_size), where (in bytes):
max_number_of_rules: 20est_onExitCodes_size: 1120 (255*4 for exit code values + 100 forcontainerName)est_onPodConditions_size: 500 (max 20 patterns * (5 forstatus+ 20 fortype)).
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
The additional CPU and memory increase in kube-controller-manager related to
handling of failed pods is negligible and only limited to these jobs which
specify spec.podFailurePolicy.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. This feature does not introduce any resource exhaustive operations.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
No change from existing behavior of the Job controller.
What are other known failure modes?
- When
PodDisruptionConditionsenabled pods are not terminated based on the NoExecute taint- Known bug in 1.25.x (fixed in master)
- Bugs: Fix handling of NoExecute taint when PodDisruptionConditions is enabled
- Detection: Observe that the pods are not deleted when a node is tainted with
NoExecute - Mitigations: disable
PodDisruptionConditions - Testing: Discovered bugs are covered by unit and integration tests.
DisruptionTargetcondition is not added to pods preempted by Kubelet when scheduling a critical pod. As a consequence there is no way to handle such pod failures with pod failure policy.- Known bug in 1.26.0-5 and 1.27.0-2
- Bugs: described in Add DisruptionTarget condition when preempting for critical pod
- Detection: Observe failed pods with reason
Preempting, and messagePreempted in order to admit critical pod, but withoutDisruptionTargetcondition. - Mitigations: upgrade to a fixed version (1.26.6+, 1.27.3+ or 1.28+). Alternatively, set higher
backoffLimitfor Jobs. - Testing: Discovered bug is covered by an integration test.
- When
PodDisruptionConditionsand pods with duplicated env. names or container ports are used, then pods cannot be deleted by PodGC and other core k8s controllers.- Known bug in 1.26.0-10, 1.27.0-7, 1.28.0-3
- Bugs: Pod Garbage collector fails to clean up PODs from nodes that are not running anymore
- Detection: Pods expected to be deleted are stuck terminating. The logs show a message similar to the following:
'failed to create manager for existing fields: failed to convert new object (app-b/app-b-5894548cb-7tssd; /v1, Kind=Pod) to smd typed: .spec.containers[name="app-b"].ports: duplicate entries for key [containerPort=8082,protocol="TCP"]' - Mitigations: upgrade to a fixed version (1.26.11+, 1.27.8+, or 1.28.4+). Alternatively, make sure pods with duplicated keys for env. variables or container pods are not created. Also, update the existing pods to cleanup the problematic fields.
- Testing: PodGC integration test reproduced the issue before withdrawing from SSA in PodGC in the PR #121103 .
What steps should be taken if SLOs are not being met to determine the problem?
If pods terminate without the expected pod failure conditions (this part of the feature does not depend on the Job controller, the standard troubleshooting technics apply):
- Check reachability between kubernetes components.
- Consider increasing the logging level to trace when the issue occurs.
If a job failure policy isn’t respected (this part of the feature depends on the Job controller, the standard and Job controller-specific troubleshooting technics apply):
- Inspect manually if the job’s terminated pods have containers with exit codes matching the configured rules.
- Inspect manually if the job’s terminated pods have conditions expected according to the design details .
- Inspect the Job controller’s
job_sync_duration_secondsmetric to see if there is an increase of the Job controller processing time. - Inspect the Job controller’s
job_pods_finished_totalmetric for the to check if the numbers of pod failures handled by specific actions (counted by thefailure_policy_actionlabel) agree with the expectations. For example, if a user configures job failure policy withIgnoreaction for theDisruptionTargetcondition, then a node drain is expected to increase the metric forfailure_policy_action=Ignore. - Consider increasing the logging level of kube-controller-manager to trace the job_controller logs.
Implementation History
- 2022-06-23: Initial KEP merged
- 2022-07-12: Preparatory PR “Refactor gc_controller to do not use the deletePod stub” merged
- 2022-07-14: Preparatory PR “Refactor taint_manager to do not use getPod and getNode stubs” merged
- 2022-07-20: Preparatory PR “Add integration test for podgc” merged
- 2022-07-28: KEP updates merged
- 2022-08-01: Additional KEP updates merged
- 2022-08-02: PR “Append new pod conditions when deleting pods to indicate the reason for pod deletion” merged
- 2022-08-02: PR “Add worker to clean up stale DisruptionTarget condition” merged
- 2022-08-04: PR “Support handling of pod failures with respect to the configured rules” merged
- 2022-09-09: Bugfix PR for test “Fix the TestRoundTripTypes by adding default to the fuzzer” merged
- 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
- proposal to extend kubelet to add the following pod conditions when evicting a pod (see Design details
):
- DisruptionTarget for evictions due graceful node shutdown, admission errors, node pressure or Pod admission errors
- ResourceExhausted for evictions due to OOM killer and exceeding Pod’s ephemeral-storage limits
- extended the review of pod eviction scenarios by kubelet-initiated pod evictions:
- added a Risk and Mitigations sections:
- updated names of the proposed metrics fields:
PodFailurePolicyRule->PodFailurePolicyandJobTerminated->JobFailed(see here ) - added Story 3 to demonstrate how to use the API to ensure there are no infinite Pod retries
- updated Graduation Criteria for Beta
- updated of kep.yaml and PRR questionnaire to prepare the KEP for Beta
- proposal to extend kubelet to add the following pod conditions when evicting a pod (see Design details
):
- 2022-10-27: PR “Use SSA to add pod failure conditions” (link )
- 2022-10-31: PR “Extend metrics with the new labels” (link )
- 2022-11-03: PR “Fix disruption controller permissions to allow patching pod’s status” (link )
- 2022-11-08: KEP update for Beta (link
) with main changes:
- do not introduce the
ResourceExhaustedcondition (it was planned to be used for pods killed due to OOM killer or exceeding ephemeral storage limits) - do not add
DisruptionTargetcondition in case of admission failures
- do not introduce the
- 2022-11-11: PR “Fix match onExitCodes when Pod is not terminated” (link )
- 2022-11-11: PR “Wait for Pods to finish before considering Failed in Job” (link )
- 2022-11-15: PR “Add e2e test to ignore failures with 137 exit code” (link )
- 2023-01-03: PR “Fix clearing of rate-limiter for the queue of checks for cleaning stale pod disruption conditions” (link )
- 2023-01-09: PR “Adjust DisruptionTarget condition message to do not include preemptor pod metadata” (link )
- 2023-01-13: PR “PodGC should not add DisruptionTarget condition for pods which are in terminal phase” (link )
- 2023-03-17: PR “Give terminal phase correctly to all pods that will not be restarted” (link )
- 2023-03-18: PR “API-initiated eviction: handle deleteOptions correctly” (link )
- 2023-05-23: PR “Add DisruptionTarget condition when preempting for critical pod” (link )
- 2023-10-19: PR “Use Patch instead of SSA for Pod Disruption condition” (link )
- 2024-06-18: PR “scheduler: Test that the DisruptionTarget condition is added at preemption time” (link )
- 2024-07-09: PR “Graduate PodDisruptionConditions to stable” (link )
- 2024-07-12: PR “Graduate JobPodFailurePolicy to stable” (link )
- 2024-07-12: PR “Use omitempty for optional fields in Job Pod Failure Policy” (link )
- 2024-07-17: PR “Promote JobPodFailurePolicy and PodDisruptionConditions e2e tests to Conformance” (link )
- 2024-07-17: PR “clean up codes after PodDisruptionConditions was promoted to GA” (link )
- 2024-07-18: PR “cleanup after JobPodFailurePolicy is promoted to GA” (link )
- 2024-08-14: PR “Fix a scheduler preemption issue where the victim isn’t properly patched, leading to preemption not functioning as expected” (link )
Drawbacks
Alternatives
Only support for exit codes
We considered supporting just exit codes when defining the policy for handling pod failures. However, this approach alone would not be sufficient to distinguish pod failures caused by infrastructure issues. A special handling of such failures is important in some use cases (see: Story 2 ).
Using Pod status.reason field
We considered using of the pod’s status.reason field to determine
the reason for a pod failure. This field would be set based on the DeleteOptions
reason field associated with the delete API requests. However, this approach is
problematic as then the field would be used to set by multiple components
leading to race-conditions. Also reasons could be arbitrary strings, making it
hard for users to know which reasons to look for in each version.
Using of various PodCondition types
We considered introducing a set of dedicated PodCondition types corresponding to different components or reasons in which a pod deletion is triggered. However, this could be problematic as the list of available PodCondition types field would be evolving with new values being added and potentially some values becoming obsolete. This could make it difficult to maintain a valid list of PodCondition types enumerated in the Job configuration.
More nodeAffinity-like JobSpec API
Along with introduction of the set of PodCondition types we also considered
a more nodeAffinity-like JobSpec API being able to match against multiple
condition types. It would also support the key field for constraining the
status field (and potentially other fields). This is an example Job
spec using such API:
podFailurePolicy:
rules:
- action: Ignore
- onPodConditions:
- key: Type
operator: In
values:
- Evicted
- Preempted
- key: Status
operator: In
values:
- True
Such API, while more flexible, might be harder to use in practice. Thus, in the first iteration of the feature, we intend to provide a user-friendly API targeting the known use-cases. A more flexible API can be considered as a future improvement.
Possible future extensions
As one possible direction of extending the feature is adding pod failure conditions in the following scenarios (see links for discussions on the factors that made us not to cover the scenarios in Beta):
We are going to re-evaluate the decisions based on the user feedback after users
start to use the feature - using job failure policies based on the DisruptionTarget condition
and container exit codes.