KEP-5593: Configure the max CrashLoopBackOff delay
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Per node config
- Implementing with KubeletConfiguration
- Refactor of recovery threshold
- Kubelet overhead analysis
- Relationship with Job API podFailurePolicy and backoffLimit
- Relationship with ImagePullBackOff
- Relationship with k/k#123602
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Appendix A
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
CrashLoopBackoff is designed to slow the speed at which failing containers
restart, preventing the starvation of kubelet by a misbehaving container.
Currently it is a subjectively conservative, fixed behavior regardless of
container failure type: when a Pod has a restart policy besides Never, after
containers in a Pod exit, the kubelet restarts them with an exponential back-off
delay (10s, 20s, 40s, …), that is capped at five minutes. The delay for restarts
will stay at 5 minutes until a container has executed for 2x the maximum backoff
– that is, 10 minutes – without any problems, in which case the kubelet resets
the backoff count for that container and further crash loops start again at the
beginning of the delay curve (ref).
Both the decay to 5 minute back-off delay, and the 10 minute recovery threshold, are considered too conservative, especially in cases where the exit code was 0 (Success) and the pod is transitioned into a “Completed” state or the expected length of the pod run is less than 10 minutes.
This KEP proposes the following changes:
- Provide a knob for cluster operators to configure the maximum backoff at the node level, down to a minimum of 1s
- Formally split backoff counter reset threshold for container restart backoff behavior and maintain the current 10 minute recovery threshold
Originally, this KEP was part of the larger KEP-4603 - Tune CrashLoopBackoff
Motivation
Kubernetes#57291, with over 250 positive reactions and in the top five upvoted issues in k/k, covers a range of suggestions to change the rate of decay for the backoff delay or the criteria to restart the backoff counter, in some cases requesting to make this behavior tunable per node, per container, and/or per exit code. Anecdotally, there are use cases representative of some of Kubernetes’ most rapidly growing workload types like gaming and AI/ML that would benefit from this behavior being different for varying types of containers. Application-based workarounds using init containers or startup wrapper scripts, or custom operators like kube_remediator and descheduler are used by the community to anticipate crashloopbackoff behavior, prune pods with nonzero backoff counters, or otherwise “pretend” the pod did not exit recently to force a faster restart from kubelet. Discussions with early Kubernetes contributors indicate that the current behavior was not designed beyond the motivation to throttle misbehaving containers, and is open to reinterpretation in light of user experiences, empirical evidence, and emerging workloads.
This KEP will allow pods to restart faster and more often than the current status quo; let it be known that such a change is desired. It is also the intention of the author that, to some degree, this change happens without the need to reconfigure workloads or expose extensive API surfaces, as experience shows that makes changes difficult to adopt, increases the risk for misconfiguration, and can make the system overly complex to reason about.
A large number of alternatives have been discussed over the 5+ years the canonical tracking issue has been open, some of which imply high levels of sophistication for kubelet to make different decisions based on system state, workload specific factors, or by detecting anomalous workload behavior. While this KEP does not rule out those directions in the future, the proposal herein focuses on a simpler, easily modeled change designed to address the most common issues observed today.
Goals
- Improve pod restart backoff logic to better match the actual load it creates and meet emerging use cases
- Provide a simple UX that does not require changes for the majority of workloads
- Must work for Jobs and sidecar containers
Non-Goals
- This effort is NOT intending to support fully user-specified configuration, to cap risk to node stability
- This effort is purposefully NOT implementing more complex heuristics by kubelet (e.g. system state, workload specific factors, or by detecting anomalous workload behavior) to focus on better benchmarking and observability and address common use cases with easily modelled changes first
- This effort is NOT changing the independent backoff curve for image pulls
Proposal
This KEP proposes providing an option to cluster operators to configure a lower maximum backoff for all containers on a specific Node, down to 1s. By default, the backoff delay starts at 10s and doubles with each retry up to a maximum of 5 minutes (300s). Configuring a maximum backoff less than the 10s initial delay will lower the initial delay to match the maximum, allowing for more rapid container restarts.
The upper limit of the configuration will be 5 minutes (300s). With this value, existing behavior remains unchanged.
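As a rough illustration of the curve described above, the following Python sketch lists the delays for a given per-node maximum. The function name `backoff_delays` and the constant names are hypothetical and chosen for illustration only; the kubelet itself implements this logic in Go. The sketch models the advertised curve and ignores the requeue bug discussed under Relationship with k/k#123602.

```python
DEFAULT_INITIAL_SECONDS = 10   # initial delay stated in this KEP
DEFAULT_MAX_SECONDS = 300      # upper limit stated in this KEP


def backoff_delays(max_seconds=DEFAULT_MAX_SECONDS, restarts=8):
    """Return the delay (in seconds) before each of the first `restarts` restarts.

    Illustrative model only: the initial delay is lowered to the configured
    maximum when that maximum is below 10s, and the delay doubles on each
    retry until it reaches the maximum.
    """
    if not 1 <= max_seconds <= DEFAULT_MAX_SECONDS:
        raise ValueError("max_seconds must be between 1 and 300")
    delay = min(DEFAULT_INITIAL_SECONDS, max_seconds)
    delays = []
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, max_seconds)
    return delays


print(backoff_delays())               # default curve
print(backoff_delays(max_seconds=4))  # node configured with a 4s maximum
```

With the default maximum the delays grow from 10s up to the 300s cap; with a 4s maximum every restart waits 4s, because the initial delay is clamped to the maximum.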
This proposal will NOT change:
- backoff behavior for Pods transitioning from the “Success” state differently from those transitioning from a “Failed” state – see here in Alternatives Considered
- the ImagePullBackoff – out of scope, see Design Details
- changes that address “late recovery”, or modifications to backoff behavior once the maximum backoff has been reached – see Alternatives

While the complete information is saved for Design Details, it’s expedient to see the exact config proposed here:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
  maxContainerRestartPeriod: 4
Refactor and flat rate to 10 minutes for the backoff counter reset threshold
Finally, this KEP proposes factoring out the backoff counter reset threshold for CrashLoopBackoff behavior induced restarts into a new constant. Instead of being proportional to the configured maximum backoff, it will instead be a flat rate equivalent to the current implementation, 10 minutes. This will only apply to backoff counter resets from CrashLoopBackoff behavior.
User Stories
Task isolation
By design the container wants to exit so it will be recreated and restarted to start a new “session” or “task”. It may be a new gaming session or a new isolated task processing an item from the queue.
This is not possible to do by creating the wrapper of a container that will restart the process in a loop because those tasks or gaming sessions desire to start from scratch. In some cases - they may even want to download a new container image.
For these cases, it is important that the startup latency of the full rescheduling of a new pod is avoided.
Fast restart on failure
There are AI/ML scenarios where an entire workload is delayed if one of the Pods is not running. Some Pods in this scenario may fail with a recoverable error – for example some dependency failure, or be killed by infrastructure (e.g. exec probe failures on containerd slow restart). In such cases, it is desirable to restart the container as fast as possible and not wait for 5 minutes of the max crashloop backoff timeout. With the existing max crashloop backoff timeout, a failure of a single pod in a highly-coupled workload can cause a cascade in the workload leading to an overall delay of (much) greater than the 5 minute backoff.
The typical pattern here today is to be quick to restart the container, but with an intermittently failing dependency (e.g. the network is down for some time, or there is an intermittent issue with a GPU), the container fails repeatedly and eventually reaches the maximum backoff. When the dependency goes back to green, the container is not restarted immediately, because it has already reached the maximum crash loop backoff duration of 5 minutes.
Sidecar containers fast restart
There are cases when the Pod consists of a user container implementing business logic and a sidecar providing networking access (Istio), logging (opentelemetry collector), or orchestration features. In some cases sidecar containers are critical for the user container functioning as they provide a basic infrastructure for the user container to run in. In such cases it is considered beneficial to not apply the exponential backoff to the sidecar container restarts and keep it at a low constant value.
This is especially true for cases when the sidecar is killed by infrastructure (e.g. OOMKill) as it may have happened for reasons independent from the sidecar functionality.
Notes/Constraints/Caveats (Optional)
Risks and Mitigations
The biggest risk of this proposal is giving operators a knob which could potentially compromise node stability, risking the kubelet component to become too slow to respond and the pod lifecycle to increase in latency. While not default, it is by design allowing a more severe reduction in the decay behavior. In the worst case, it could cause nodes to fully saturate with near-instantly restarting pods that will never recover, risking taking down nodes or at least nontrivially slowing kubelet, or increasing the API requests to store backoff state so significantly that the central API server is unresponsive and the cluster fails.
With nearly trivial (when compared to pod startup metrics) max allowable backoffs of 1s, more risk to node stability is expected. In this case, for a hypothetical node with the default maximum of 110 pods, each with one crashing container and all stuck in a simultaneous 1s maximum CrashLoopBackoff, at its most efficient this would result in a new restart for each pod every second, and therefore the API requests to change the state transition would be expected to increase from ~550 requests/10s to 5500 requests/10s, or 10x. In addition, since the maximum backoff would be lowered, an ideal pod would continue to restart more often than today’s behavior, adding 305 excess restarts within the first 5 minutes and 310 excess restarts every 5 minutes after that; each crashing pod would be contributing an excess of ~1550 pod state transition API requests every five minutes, and a fully saturated node with a full 110 crashing pods would be adding 170,500 new pod transition API requests every five minutes, which is an excess of ~568 requests/s (~5,683 requests/10s). <<[UNRESOLVED non blocking: kubernetes default for the kubelet client rate limit and how this changes by machine size]>> <<[/UNRESOLVED]>>
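The arithmetic in the paragraph above can be re-derived with a few lines of Python. The figure of 5 API requests per restart is not stated explicitly; it is an assumption inferred from the ~550 requests/10s baseline (110 pods, one restart each per 10s). The 310 excess restarts per 5 minutes is taken as given from the text rather than derived.

```python
PODS_PER_NODE = 110          # default maximum pods per node, from the text
REQUESTS_PER_RESTART = 5     # assumption: inferred from ~550 requests/10s for 110 pods
WINDOW_SECONDS = 10


def requests_per_window(backoff_seconds):
    """API requests per 10s window if every pod restarts once per backoff period."""
    restarts_per_pod = WINDOW_SECONDS / backoff_seconds
    return PODS_PER_NODE * restarts_per_pod * REQUESTS_PER_RESTART


baseline = requests_per_window(10)   # today's initial 10s delay
worst_case = requests_per_window(1)  # 1s maximum backoff

EXCESS_RESTARTS_PER_5_MIN = 310      # taken as given from the text
excess_per_pod = EXCESS_RESTARTS_PER_5_MIN * REQUESTS_PER_RESTART
excess_per_node = excess_per_pod * PODS_PER_NODE
excess_per_second = excess_per_node / 300  # five minutes = 300 seconds

print(baseline, worst_case)
print(excess_per_pod, excess_per_node, round(excess_per_second))
```

Under these assumptions, 170,500 excess requests spread over 300 seconds works out to roughly 568 requests per second.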
The first line of defense is that the enhancement is not usable by default and must be opted into using a feature gate.
Beyond the feature gate, this configuration can only be modified by users with the permissions to modify the kubelet configuration – in other words, a cluster operator persona.
Design Details
Per node config
For some users in Kubernetes#57291 , any delay over 1 minute at any point is just too slow, even if it is legitimately crashing. A common refrain is that for independently recoverable errors, especially system infrastructure events or recovered external dependencies, or for absolutely nonnegotiably critical sidecar pods, users would rather poll more often or more intelligently to reduce the amount of time a workload has to wait to try again after a failure. In the extreme cases, users want to be able to configure (by container, node, or exit code) the backoff to close to 0 seconds. This KEP considers it out of scope to implement fully user-customizable behavior, and too risky without full and complete benchmarking to node stability to allow legitimately crashing workloads to have a backoff of 0, but it is in scope for the first alpha to provide users a way to opt in to a faster restart behavior.
So why opt in by node? In fact, the initial proposal of this KEP for 1.31 was to
opt in by Pod, to minimize the blast radius of a given Pod’s worst case restart
behavior. In 1.31 this was proposed using a new restartPolicy value in the Pod
API, described in Alternatives Considered here. Concerns
with this approach fell into two buckets: 1. technical issues with the API
(which could have been resolved by a different API approach), and 2. design
issues with exposing this kind of configuration to users without holistic
insight into cluster operations, for example, to users who might have pod
manifest permissions in their namespace but not for other namespaces in the same
cluster and which might be dependent on the same kubelet. For 1.32, we were
looking to address the second issue by moving the configuration somewhere we
could better guarantee a cluster operator type persona would have exclusive
access to. In addition, initial manual stress testing and benchmarking indicated
that even in the unlikely case of mass pathologically crashing and instantly
restarting pods across an entire node, cluster operations proceeded with
acceptable latency, disk, cpu and memory. Worker polling loops, context
timeouts, the interaction between various other backoffs, as well as API server
rate limiting made up the gap to the stability of the system. Therefore, to
simplify both the implementation and the API surface, this 1.32 proposal puts
forth that the opt-in will be configured per node via kubelet configuration.
Implementing with KubeletConfiguration
Kubelet configuration is governed by two main input points, 1) command-line flag
based config and 2) configuration following the API specification of the
kubelet.config.k8s.io/v1beta1 KubeletConfiguration Kind, which is passed to
kubelet as a config file or, beta as of Kubernetes 1.30, a config directory
(ref).
Since this is a per-node configuration that likely will be set on a subset of nodes, or potentially even differently per node, it’s important that it can be manipulated per node. Expected use cases of this type of heterogeneity in configuration include:
- Dedicated node pool for workloads that are expected to rapidly restart
- Config aligned with node labels/pod affinity labels for workloads that are expected to rapidly restart
- Machine size adjusted config
By default KubeletConfiguration is intended to be shared
between nodes, but the beta feature for drop-in configuration files in a
colocated config directory circumvents this. In addition, KubeletConfiguration
drops fields unrecognized by the current kubelet’s schema, making it a good
choice to circumvent compatibility issues with n-3 kubelets. While there is an
argument that this could be better manipulated with a command-line flag, so
lifecycle tooling that configures nodes can expose it more transparently, that
was an acceptable design change given the introduction of KubeletConfiguration
in the first place. In any case, the advantages to backwards and forward
compatibility by far outweigh this consideration for the alpha period and can be
revisited before beta.
The proposed configuration explicitly looks like this:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
crashloopbackoff:
  maxContainerRestartPeriod: 4
Refactor of recovery threshold
A simple change to maximum backoff would naturally come with a modification of the backoff counter reset threshold, as it is currently calculated as 2x the maximum backoff. Without any other modification, as a result of this KEP, containers on a node configured with a 1 minute maximum would be “rewarded” by having their backoff counter set back to 0 after running successfully for 2*1 minute = 2 minutes (instead of 2*5 minutes = 10 minutes as it is today); containers on nodes configured with the minimum allowable backoff of 1s could be rewarded for running successfully for as little as 2 seconds.
From a technical perspective, granularity of the associated worker polling loops governing restart behavior is between 1 and 10 seconds, so a reset value under 10 seconds is effectively meaningless (until and unless those loops increase in speed or we move to evented PLEG). From a user perspective, it does not seem that there is any particular end user value in artificially preserving the current 10 minute recovery threshold as part of this implementation, since it was an arbitrary value in the first place. However, since late recovery as a category of problem space is expressly a non-goal of this KEP, and in the interest of reducing the number of changed variables during the alpha period to better observe the ones previously enumerated, this proposal intends to maintain that 10 minute recovery threshold anyways.
Forecasting that the recovery threshold for CrashLoopBackOff may be better
served by being configurable in the future, or at least separated in the code
from all other uses of client_go.Backoff for whatever future enhancements
address the late recovery problem space, the mechanism proposed here is to
redefine client_go.Backoff to accept alternate functions for
client_go.Backoff.hasExpired, and configure the client_go.Backoff object
created for use by the kube runtime manager for container restart backoff with
a function that compares to a flat rate of 600 seconds (10 minutes).
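A minimal sketch of that mechanism, written in Python for brevity rather than the Go used by the kubelet. The `Backoff` class, its `has_expired` parameter, and the `flat_reset` helper are hypothetical stand-ins and are not the actual client-go API; the sketch only shows how a caller could swap the default "2x the maximum" reset rule for a flat threshold. The 600 second constant is an assumption matching the 10 minute recovery threshold this KEP aims to maintain.

```python
class Backoff:
    """Toy per-key backoff tracker with a pluggable counter-reset rule."""

    def __init__(self, initial, maximum, has_expired=None):
        self.initial = initial
        self.maximum = maximum
        # Default rule mirrors the behavior described above: reset once more
        # than 2x the maximum backoff has elapsed since the last update.
        self.has_expired = has_expired or (
            lambda elapsed: elapsed > 2 * self.maximum
        )
        self.entries = {}  # key -> (current delay, time of last update)

    def next(self, key, now):
        """Record a restart at time `now` (seconds) and return the delay to apply."""
        entry = self.entries.get(key)
        if entry is None or self.has_expired(now - entry[1]):
            delay = self.initial  # counter reset: back to the start of the curve
        else:
            delay = min(entry[0] * 2, self.maximum)
        self.entries[key] = (delay, now)
        return delay


FLAT_RESET_SECONDS = 600  # assumption: the 10 minute recovery threshold


def flat_reset(elapsed):
    """Reset rule that ignores the configured maximum backoff."""
    return elapsed > FLAT_RESET_SECONDS


proportional = Backoff(initial=1, maximum=4)
flat = Backoff(initial=1, maximum=4, has_expired=flat_reset)
for t in (0, 1, 3, 20):
    print(t, proportional.next("c", t), flat.next("c", t))
```

With a 4s maximum, the proportional rule resets the counter after a 17 second quiet period, while the flat rule keeps the container at the 4s maximum until more than 600 seconds have passed.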
Kubelet overhead analysis
As it’s intended that, after this KEP, pods will restart more often than in
current Kubernetes, it’s important to understand what the kubelet does during
pod restarts. The most informative code path for that is through (all links to
1.31) kubelet/kubelet.go SyncPod,
kubelet/kuberuntime/kuberuntime_manager.go SyncPod,
kubelet/kuberuntime/kuberuntime_container.go startContainer
and kubelet/kubelet.go SyncTerminatingPod,
kubelet/kuberuntime/kuberuntime_container.go killContainer,
kubelet/kuberuntime/kuberuntime_manager.go killPodWithSyncResult,
kubelet/kubelet.go SyncTerminatingPod, and kubelet/kubelet.go SyncTerminatedPod.

As you might imagine this is a very critical code path with hooks to many in-flight features; Appendix A includes a more complete list (yet still not as exhaustive as the source code), but the following are selected as the most important behaviors of kubelet during a restart to know about, that are not currently behind a feature gate.
After a Pod is in Terminating phase, kubelet:
- Clears up old containers using container runtime
- Stops probes and pod sandbox, and unmounts volumes and unregisters secrets/configmaps (since Pod was in a terminal phase)
- While still in the backoff window, wait for network / attach volumes / register pod to secret and configmap managers / re-download image secrets if necessary
Once the current backoff window has passed, kubelet:
- Potentially re-downloads the image (utilizing network + IO and blocking other image downloads) if image pull policy specifies it (ref).
- Recreates the pod sandbox and probe workers
- Recreates the containers using container runtime
- Runs user configured prestart and poststart hooks for each container
- Runs startup probes until containers have started (startup probes may be more expensive than the readiness probes as they are often configured to run more frequently)
- Redownloads all secrets and configmaps, as the pod has been unregistered and reregistered to the managers, while computing container environment/EnvVarFrom
- Application runs through its own initialization logic (typically utilizing more IO)
- Logs information about all container operations (utilizing disk IO and “spamming” logs)
The following diagram showcases these same highlights more visually and in context of the responsible API surface (Kubelet or Runtime aka CRI).

<<[UNRESOLVED non blocking answer these questions from original PR or make new bugs]>>
>Does this [old container cleanup using containerd] include cleaning up the image filesystem? There might be room for some optimization here, if we can reuse the RO layers.
to answer question: looks like it is per runtime. need to check about leasees. also part of the value of this is to restart the sandbox.
Relationship with Job API podFailurePolicy and backoffLimit
Job API provides its own API surface for describing alternative restart
behaviors, from KEP-3329: Retriable and non-retriable Pod failures for
Jobs, in beta as of Kubernetes 1.30. The following example from that KEP shows the new
configuration options: backoffLimit, which controls for number of retries on
failure, and podFailurePolicy, which controls for types of workload exit codes
or kube system events to ignore against that backoffLimit.
apiVersion: v1
kind: Job
spec:
  [ . . . ]
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main-job-container
        operator: In
        values: [1,2,3]
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
The implementation of KEP-3329 is entirely in the Job controller, and the
restarts are not handled by kubelet at all; in fact, use of this API is only
available if the restartPolicy is set to Never (though
kubernetes#125677
wants to relax this validation to allow it to be used with other restartPolicy
values). As a result, to expose the new backoff curve to Jobs using this
feature, the updated backoff curve must also be implemented in the Job
controller. This is currently considered out of scope of the first alpha
implementation of this design.
Relationship with ImagePullBackOff
ImagePullBackoff is used, as the name suggests, only when a container needs to pull a new image. If the image pull fails, a backoff decay is used to make later retries on the image download wait longer and longer. This is configured internally independently (here) from the backoff for container restarts (here).
This KEP considers changes to ImagePullBackoff as out of scope, so during implementation this will keep the same backoff. This is because the problem space of ImagePullBackoff could likely be handled by a completely different pattern, as unlike with CrashLoopBackoff the types of errors with ImagePullBackoff are less variable and better interpretable by the infrastructure as recoverable or non-recoverable (i.e. 404s).
Relationship with k/k#123602
It was observed that there is a bug with the current requeue behavior, described in https://github.com/kubernetes/kubernetes/issues/123602. The first restart will have 0 seconds delay instead of the advertised initial value delay, because the backoff object will not be initialized until a key is generated, which doesn’t happen until after the first restart of a pod (see ref). That first restart will also not impact future backoff calculations, so the observed behavior is closer to:
- 0 seconds delay for first restart
- 10 seconds for second restart
- 10 * 2^(restart_count - 2) for subsequent restarts
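The observed sequence above can be written as a small function. This is an illustrative Python sketch, not kubelet code: `observed_delay` is a hypothetical name, restarts are counted from 1, and the 300s cap is carried over from the five minute maximum described in the Summary.

```python
MAX_BACKOFF_SECONDS = 300  # five minute cap described in the Summary


def observed_delay(restart_count):
    """Delay in seconds before the given restart (1-based), as observed."""
    if restart_count < 1:
        raise ValueError("restart_count is 1-based")
    if restart_count == 1:
        return 0  # the first restart is effectively free
    return min(10 * 2 ** (restart_count - 2), MAX_BACKOFF_SECONDS)


print([observed_delay(n) for n in range(1, 9)])
```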
By watching a crashing pod, we can observe that it does not enter a CrashLoopBackoff state or behave as advertised until after that first “free” restart:
# a pod that crashes every 10s
thing 0/1 Pending 0 29s
thing 0/1 Pending 0 97s
thing 0/1 ContainerCreating 0 97s
thing 1/1 Running 0 110s
thing 0/1 Completed 0 116s # first run completes
thing 1/1 Running 1 (4s ago) 118s # no crashloopbackoff observed, 1 restart tracked
thing 0/1 Completed 1 (10s ago) 2m4s # second run completes
thing 0/1 CrashLoopBackOff 1 (17s ago) 2m19s # crashloopbackoff observed
thing 1/1 Running 2 (18s ago) 2m20s # third run starts
thing 0/1 Completed 2 (23s ago) 2m25s
thing 0/1 CrashLoopBackOff 2 (14s ago) 2m38s # crashloopbackoff observed again
thing 1/1 Running 3 (27s ago) 2m51s
thing 0/1 Completed 3 (32s ago) 2m56s
Being able to predict the exact number of restarts, or remedying up to 10 seconds of delay relative to the advertised behavior, is not the ultimate goal of this KEP, though certain assumptions were made when calculating risks, mitigations, and analyzing existing behavior that are affected by this bug. Since the behavior is already changing as part of this KEP, and similar code paths will be changed, it is within scope of this KEP to address this bug if it is a blocker to implementation for alpha; it can wait until beta otherwise. This is represented below in the Graduation Criteria.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
- Test coverage of proper requeue behavior; see https://github.com/kubernetes/kubernetes/issues/123602
Unit tests
Integration tests
Unit and E2E tests are expected to provide sufficient coverage.
e2e tests
- Crashlooping container that restarts some number of times (e.g. 10 times); timestamp the logs, read them back in the test, and expect the difference between consecutive timestamps to be at least the backoff, with a healthy timeout
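One way such a check could look, sketched in Python rather than the Go e2e framework. The function name `gaps_respect_backoff` and the sample timestamps are hypothetical; a real test would read the timestamps from the container logs and allow for a timeout.

```python
from datetime import datetime


def gaps_respect_backoff(timestamps, expected_delays):
    """Check that each gap between consecutive starts is at least the expected delay.

    `timestamps` are container start times parsed from logs (ISO 8601 strings);
    `expected_delays` holds the minimum delay in seconds before each restart.
    """
    times = [datetime.fromisoformat(ts) for ts in timestamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return all(gap >= want for gap, want in zip(gaps, expected_delays))


# Hypothetical sample: three restarts on a node with a 4s maximum backoff.
starts = [
    "2025-01-01T00:00:00+00:00",
    "2025-01-01T00:00:05+00:00",
    "2025-01-01T00:00:10+00:00",
    "2025-01-01T00:00:15+00:00",
]
print(gaps_respect_backoff(starts, [4, 4, 4]))
```

Comparing against a lower bound only, rather than an exact delay, keeps the check tolerant of scheduling and polling jitter.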
Graduation Criteria
Alpha
- New int32 crashloopbackoff.maxContainerRestartPeriod field in KubeletConfiguration API, validated to a minimum of 1 and a maximum of 300, used when KubeletCrashLoopBackOffMax feature flag enabled, to customize CrashLoopBackOff per node
- Maintain current 10 minute recovery threshold by refactoring backoff counter reset threshold and explicitly implementing container restart backoff behavior at the current 10 minute recovery threshold
- Initial e2e tests setup and enabled
- Initial unit tests covering new behavior
- Especially confirming the backoff object is set properly depending on the feature gates set as per the Conflict Resolution policy
- Test proving KubeletConfiguration objects will silently drop unrecognized fields in the config.validation_test package (ref).
  - <<[UNRESOLVED non blocking]>> Is this also the expected behavior when the feature gate is disabled? <<[/UNRESOLVED]>>
- Test coverage of proper requeue behavior; see https://github.com/kubernetes/kubernetes/issues/123602
- Actually fix https://github.com/kubernetes/kubernetes/issues/123602 if this blocks the implementation, otherwise beta criteria
Beta
- Feature Enabled by Default: The KubeletCrashLoopBackOffMax feature gate is enabled by default.
GA
- 2 Kubernetes releases soak in beta
- Remove the feature flag code
- Conformance test added for per-node configuration
- Benchmarking of extreme conditions (very low max backoff, large number of crashing pods) to better characterize the worst-case scenarios
Upgrade / Downgrade Strategy
For an existing cluster, no changes are required to configuration, invocations or API objects to make an upgrade.
To make use of this enhancement, on cluster upgrade, the
KubeletCrashLoopBackOffMax feature gate must first be turned on for the
cluster. Then, if any nodes need to use a different backoff curve, their kubelet
must be completely redeployed either in the same upgrade or after that upgrade
with the crashloopbackoff.maxContainerRestartPeriod KubeletConfiguration
set.
To stop use of this enhancement, there are two options.
On a per-node basis, nodes can be completely redeployed with
crashloopbackoff.maxContainerRestartPeriod KubeletConfiguration unset.
Since kubelet does not cache the backoff object, on kubelet restart they will
start from the beginning of their backoff curve (10s).
Or, the entire cluster can be restarted with the
KubeletCrashLoopBackOffMax feature gate turned off. In this case, any
Node configured with a different backoff curve will instead use
the default backoff curve. Again, since the cluster was restarted and Pods were
redeployed, they will not maintain prior state and will start at the beginning
of their backoff curve.
Conflict resolution
If on a given node at a given time, the per-node configured maximum backoff is
lower than the initial value, the initial value for that node will instead be
set to the configured maximum. For example, if KubeletCrashLoopBackOffMax is
turned on and a given node is configured to a maximum of 1s, then the
initial value for that node will be configured to 1s. In other words, operator-
invoked configuration will have precedence over the default if it is faster.
If on a given node at a given time, the per-node configured maximum backoff is lower than 1 second or higher than 300s, validation will fail and the kubelet will crash/be unable to start, like it does with other invalid kubelet configuration today.
If on a given node at a given time, the per-node configured maximum backoff is higher than the current initial value, but within validation limits (no higher than 300s), it will be honored. In other words, operator-invoked configuration will have precedence over the default, even if it is slower, as long as it is valid.
If crashloopbackoff.maxContainerRestartPeriod KubeletConfiguration exists
but KubeletCrashLoopBackOffMax is off, kubelet will log a warning but will
not honor the crashloopbackoff.maxContainerRestartPeriod
KubeletConfiguration. In other words, operator-invoked per node configuration
will not be honored if the overall feature gate is turned off.
| scenario | KubeletCrashLoopBackOffMax (per-node maximum when enabled) | Effective initial value |
|---|---|---|
| today’s behavior | disabled | 10s |
| faster per node config | 2s | 2s |
| slower per node config | 10s | 10s |
| " | 11s | 10s |
| " | 300s | 10s |
| invalid per node config | 301s | kubelet crashes |
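The conflict-resolution rules described in the prose above can be sketched as a small function. This is an illustrative Python model, not the kubelet's Go implementation: `effective_backoff` is a hypothetical name, and running validation regardless of the feature gate state is an assumption, since the text does not say whether an invalid value is rejected when the gate is off.

```python
DEFAULT_INITIAL_SECONDS = 10
DEFAULT_MAX_SECONDS = 300


def effective_backoff(feature_gate_enabled, configured_max=None):
    """Return (initial, maximum) backoff in seconds for a node.

    Raises ValueError for an out-of-range value, standing in for the kubelet
    failing validation at startup.
    """
    # Assumption: validation runs whether or not the feature gate is enabled.
    if configured_max is not None and not 1 <= configured_max <= DEFAULT_MAX_SECONDS:
        raise ValueError("maxContainerRestartPeriod must be between 1 and 300")
    if not feature_gate_enabled or configured_max is None:
        # Gate off or nothing configured: defaults apply. With the gate off,
        # the kubelet would log a warning and ignore the per-node value.
        return DEFAULT_INITIAL_SECONDS, DEFAULT_MAX_SECONDS
    # A maximum below the 10s initial value pulls the initial value down with it.
    return min(DEFAULT_INITIAL_SECONDS, configured_max), configured_max


print(effective_backoff(True, 2))   # faster per-node config
print(effective_backoff(False, 2))  # gate off: per-node value is not honored
```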
Version Skew Strategy
No coordination needs to be done between the control plane and the nodes; all behavior changes are local to the kubelet component and its start up configuration. An n-3 kube-proxy, n-1 kube-controller-manager, or n-1 kube-scheduler without this feature available is not affected when this feature is used, nor will different CSI or CNI implementations be. Code paths that will be touched are exclusively in the kubelet component.
An n-3 kubelet without this feature available will behave like normal, with the
original CrashLoopBackOff behavior. It will drop unrecognized fields in
KubeletConfiguration by default for per node config so if one is specified on
start up in a past kubelet version, it will not break kubelet (though the
behavior will, of course, not change).
While the CRI is a consumer of the result of this change (as it will receive more requests to start containers), it does not need to be updated at all to take advantage of this feature as the restart logic is entirely in process of the kubelet component.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: KubeletCrashLoopBackOffMax
  - Components depending on the feature gate: kubelet
Does enabling the feature change any default behavior?
No, by default the max crashloop backoff value will remain unchanged, even when
the feature gate is enabled. Cluster operators must explicitly opt-in to
behavioral changes by configuring the
crashloopbackoff.maxContainerRestartPeriod KubeletConfiguration.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, disabling is supported. If KubeletCrashLoopBackOffMax is disabled, once
kubelet is restarted it will initialize the initial and maximum backoff
values to the global defaults of 10s and 300s, respectively.
What happens if we reenable the feature if it was previously rolled back?
Since the backoff values are latched when kubelet is started, reenabling the feature will pull in whatever configured values are set.
Are there any tests for feature enablement/disablement?
At minimum, unit tests will be included confirming the backoff object is set properly.
In this version of the proposal, there are no API schema changes or conversions
necessary. However, it is worth noting there is one addition to an API object,
which is a new field in the KubeletConfiguration Kind. Based on manual tests
by the author, adding an unknown field to KubeletConfiguration is safe and the
unknown config field is dropped before addition to the
kube-system/kubelet-config object which is its final destination (for example,
in the case of n-3 kubelets facing a configuration introduced by this KEP).
Ultimately this is supported by the configuration of a given Kind’s
fieldValidation strategy in API machinery (ref)
which, in 1.31+, is set to “warn” by default and is only valid for API objects;
it turns out it is not explicitly set as strict for
KubeletConfiguration objects, so they ultimately bypass this (ref).
This is not currently tested as far as I can tell in the tests for
KubeletConfiguration (neither in the most likely location, validation_test,
nor in other tests in the config package),
and discussions with other contributors indicate that while little in core
Kubernetes does strict parsing, it’s not well tested. At minimum, as part of this
implementation a test covering this for KubeletConfiguration objects will be
included in the config.validation_test package.
pkg/kubelet/apis/config/validation/validation_test.go
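The difference between lenient and strict parsing described above can be illustrated with a minimal sketch. It uses Go's standard encoding/json package as a stand-in, which is an assumption made for illustration: the kubelet decodes its configuration through its own codec, not this code. The type oldKubeletConfig, the helper decode, and the sample document are hypothetical, and the key spelling of the new field is illustrative only.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// oldKubeletConfig is a hypothetical stand-in for the configuration type
// compiled into an older kubelet that predates this KEP. It has no field
// for the new max restart period.
type oldKubeletConfig struct {
	Kind    string `json:"kind"`
	MaxPods int    `json:"maxPods"`
}

// decode parses raw into an oldKubeletConfig. With strict=false, unknown
// fields are silently dropped; with strict=true, they cause an error.
func decode(raw []byte, strict bool) (oldKubeletConfig, error) {
	var cfg oldKubeletConfig
	dec := json.NewDecoder(bytes.NewReader(raw))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&cfg)
	return cfg, err
}

// sampleConfig carries a field that the old type does not know about.
var sampleConfig = []byte(`{
  "kind": "KubeletConfiguration",
  "maxPods": 110,
  "crashloopbackoff": {"maxContainerRestartPeriod": "3s"}
}`)

func main() {
	cfg, err := decode(sampleConfig, false)
	fmt.Printf("lenient: maxPods=%d err=%v\n", cfg.MaxPods, err)

	_, err = decode(sampleConfig, true)
	fmt.Printf("strict:  err=%v\n", err)
}
```

A test for the real KubeletConfiguration would assert the lenient behavior, since that is what the version skew story relies on.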
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
The primary risk of this proposal is giving operators a knob that could compromise node stability: the kubelet component could become too slow to respond, and pod lifecycle latency could increase.
All behavior changes are local to the kubelet component and its start up configuration, so a mix of different (or unset) max backoff durations will not cause issues to running workloads.
Rolling back is straightforward: the operator needs to revert or update the
crashloopbackoff.maxContainerRestartPeriod KubeletConfiguration and restart
kubelet.
What specific metrics should inform a rollback?
The biggest bottleneck is expected to be the kubelet, as it will receive more restart requests and will have to trigger the overhead discussed in Design Details more often. Cluster operators should closely watch these existing metrics:
- Kubelet component CPU and memory
- kubelet_http_inflight_requests
- kubelet_http_requests_duration_seconds
- kubelet_http_requests_total
- kubelet_pod_worker_duration_seconds
- kubelet_runtime_operations_duration_seconds
Most important to the perception of the end user is the kubelet’s actual ability to create pods, which we measure as the latency between a pod’s creation timestamp and the pod actually starting. The following existing metrics cover all pods, not just ones that are restarting, but at a certain saturation of restarting pods they would be expected to degrade and must be watched to determine rollback:
- kubelet_pod_start_duration_seconds
- kubelet_pod_start_sli_duration_seconds
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Since the configuration is local to the kubelet, tests showing the feature gate enabled and disabled are sufficient for verifying correct behavior. pkg/kubelet/kubelet_test.go was added for this verification.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
N/A
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Users will be able to observe a different restart behavior for containers of crashing pods. Since the maximum duration can only be set lower than the default 5 minutes, users will observe crashloopbackoff capping out at the new maximum value. If the maximum value is configured lower than the initial delay value (currently 10 seconds), users will observe many more (and more frequent) container restarts.
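To make the observable behavior concrete, the following sketch prints the sequence of restart delays for a few maximum values. It assumes the delay doubles on each restart until it hits the cap, and that the initial delay is clamped down to the cap when the cap is lower than 10 seconds. The function restartDelays and the constants are hypothetical names used only for illustration and are not the kubelet's implementation.

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultInitialDelay = 10 * time.Second  // today's initial delay
	defaultMaxDelay     = 300 * time.Second // today's 5 minute cap
)

// restartDelays returns the delays before the first n restarts for a given
// maximum backoff. The initial delay is clamped down to the cap, and each
// subsequent delay doubles until it reaches the cap.
func restartDelays(maxDelay time.Duration, n int) []time.Duration {
	delay := defaultInitialDelay
	if maxDelay < delay {
		delay = maxDelay
	}
	delays := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		delays = append(delays, delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return delays
}

func main() {
	caps := []time.Duration{defaultMaxDelay, 30 * time.Second, 2 * time.Second}
	for _, maxDelay := range caps {
		fmt.Printf("max=%v delays=%v\n", maxDelay, restartDelays(maxDelay, 7))
	}
}
```

With a cap below the initial delay, every restart waits exactly the cap, which is the "many more (and more frequent) container restarts" case described above.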
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
The core risk of this feature is degradation of kubelet performance. As such, the focus should be on ensuring the kubelet remains healthy and performant.
The time required for any pod to start up should not perceptibly increase after enabling this feature. The overall pod startup experience on a node must remain consistent with its established performance baseline.
The kubelet’s consumption of node resources, like CPU and memory, must remain stable and within reasonable bounds. It should not grow excessively or threaten the stability of other workloads, even when managing a high rate of pod restarts.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- kubelet_http_inflight_requests
- kubelet_http_requests_duration_seconds
- kubelet_http_requests_total
- kubelet_pod_worker_duration_seconds
- kubelet_runtime_operations_duration_seconds
- kubelet_pod_start_duration_seconds
- kubelet_pod_start_sli_duration_seconds
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
Dependencies
Does this feature depend on any specific services running in the cluster?
The feature described is entirely self-contained within the kubelet component. There are no other dependencies.
Scalability
Will enabling / using this feature result in any new API calls?
It will not result in new API calls, but it will result in more API calls. See the Risks and Mitigations section for the back-of-the-napkin math on the increase, especially in /pods API endpoint calls; initial benchmarking of an aggressive case (110 instantly restarting single-container pods) showed these reaching 5 QPS before slowing down to 2 QPS.
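As a rough companion to that estimate, the sketch below computes a steady-state upper bound on container restart attempts per node. It assumes every container crashes instantly and has already reached the cap, so each one restarts once per cap period. It counts restart attempts, not API requests, and ignores the time the kubelet needs to actually start a container, so it is not a prediction of the measured QPS figures. The helper maxRestartsPerSecond is a hypothetical name.

```go
package main

import "fmt"

// maxRestartsPerSecond returns an upper bound on restart attempts per second
// for a node where crashingContainers containers all crash instantly and have
// reached a backoff cap of maxDelaySeconds. This is a simplification: real
// restarts also take time to execute.
func maxRestartsPerSecond(crashingContainers int, maxDelaySeconds float64) float64 {
	return float64(crashingContainers) / maxDelaySeconds
}

func main() {
	for _, capSeconds := range []float64{300, 60, 10, 1} {
		fmt.Printf("cap=%4.0fs upper bound=%.2f restarts/s\n",
			capSeconds, maxRestartsPerSecond(110, capSeconds))
	}
}
```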
Will enabling / using this feature result in introducing new API types?
No, this KEP will not result in any new API types.
Will enabling / using this feature result in any new calls to the cloud provider?
No, this KEP will not result in any new calls to the cloud provider.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No, this KEP will not result in increasing size or count of the existing API objects.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Maybe! As containers could be restarting more, this may affect “Startup latency of schedulable stateless pods” and “Startup latency of schedulable stateful pods”. In manual benchmarking experiments on a node saturated with 110 instantly crashing single-container pods and a max backoff value of 0s, the queries-per-second load on the API server increased up to 30x. No tested scenarios caused the API server to become completely unresponsive or affected normal operation. However, more comprehensive testing of extreme conditions will be required before this feature can graduate to GA.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
This depends on the configured max backoff value. We expect more CPU usage of kubelet as it processes more restarts, when the max backoff value is set lower than the initial delay value. In initial manual benchmarking tests, CPU usage of kubelet increased 2x on nodes saturated with 110 instantly crashing single-container pods and a max backoff value of 0s (which is invalid in the final implementation).
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Based on the initial benchmarking, no. That benchmarking consisted of manual tests on nodes saturated with 110 instantly crashing single-container pods. However, more “normal” cases (with a lower percentage of crashing pods) and even more pathological cases (with higher container-density Pods, sidecars, network traffic, and large image downloads) have not been tested with the most aggressive restart characteristics.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The feature continues to function normally if the API server is unavailable.
The CrashLoopBackOff logic is handled entirely by the kubelet locally on the node and doesn’t need the API server to operate. The only impact is that the kubelet cannot report pod status updates until the API server is back online, but that is outside the scope of this feature.
What are other known failure modes?
Kubelet could fail to start if an invalid configuration value is supplied.
Setting an extremely low backoff period could cause cascading failures leading to node instability.
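The first failure mode can be sketched as a startup-time range check. The upper bound of 300s follows from the default maximum; the lower bound of 1s is an assumption made for this example, since this section only states that 0s is invalid. The function validateMaxContainerRestartPeriod and the constant names are hypothetical and do not claim to match the kubelet's validation code.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// minAllowedRestartPeriod is an ASSUMED lower bound for illustration.
	minAllowedRestartPeriod = 1 * time.Second
	// maxAllowedRestartPeriod is the existing 5 minute cap.
	maxAllowedRestartPeriod = 300 * time.Second
)

// validateMaxContainerRestartPeriod returns an error when the configured
// value falls outside the allowed range. A caller would treat the error as
// fatal at startup, which is why an invalid value stops the kubelet.
func validateMaxContainerRestartPeriod(d time.Duration) error {
	if d < minAllowedRestartPeriod || d > maxAllowedRestartPeriod {
		return fmt.Errorf("maxContainerRestartPeriod %v must be between %v and %v",
			d, minAllowedRestartPeriod, maxAllowedRestartPeriod)
	}
	return nil
}

func main() {
	for _, d := range []time.Duration{0, 2 * time.Second, 300 * time.Second, 301 * time.Second} {
		fmt.Printf("%v -> %v\n", d, validateMaxContainerRestartPeriod(d))
	}
}
```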
What steps should be taken if SLOs are not being met to determine the problem?
Revert the change to the configured max backoff and restart kubelet.
Implementation History
- 04-23-2024: Problem lead opted in by SIG-Node for 1.31 target (enhancements#4603)
- 06-04-2024: KEP proposed to SIG-Node focused on providing limited alpha changes to baseline backoff curve, addition of opt-in Rapid curve, and change to constant backoff for Succeeded Pods
- 06-06-2024: Removal of constant backoff for Succeeded Pods
- 09-09-2024: Removal of RestartPolicy: Rapid in proposal, removal of PRR, in order to merge a provisional and address the new 1.32 design in a cleaner PR
- 09-20-2024: Rewrite for 1.32 design focused on per-node config in place of RestartPolicy: Rapid
- 10-02-2024: PRR added for 1.32 design
- 10-29-2025: Split KEP-4603 - Tune CrashLoopBackoff into two separate proposals.
Drawbacks
CrashLoopBackoff behavior has been stable and untouched for most of the Kubernetes lifetime. It could be argued that it “isn’t broken”, that most people are ok with it or have sufficient and architecturally well placed workarounds using third party reaper processes or application code based solutions, and changing it just invites high risk to the platform as a whole instead of individual end user deployments. However, per the Motivation section, there are emerging workload use cases and a long history of a vocal minority in favor of changes to this behavior, so trying to change it now is timely. Obviously we could still decide not to graduate the change out of alpha if the risks are determined to be too high or the feedback is not positive.
Though the issue is highly upvoted, on an analysis of the comments presented in
the canonical tracking issue Kubernetes#57291, 22
unique commenters were requesting a constant or instant backoff for Succeeded
Pods, 19 for earlier recovery tries, and 6 for better late recovery behavior;
the latter is arguably even more highly requested when also considering related
issue Kubernetes#50375.
Though an early version of this KEP also addressed the Success case, in its
current version this KEP really only addresses the early recovery case, which by
our quantitative data is actually the least requested option. That being said,
other use cases described in User Stories that don’t have
quantitative counts are also driving forces on why we should address the early
recovery cases now. On top of that, compared to the late recovery cases, early
recovery is more approachable and easily modelable and improving benchmarking
and insight can help us improve late recovery later on (see also the related
discussion in Alternatives here and here).
Alternatives
Global override
Allow an override of the global constant of a maximum backoff period (or other settings) in kubelet configuration.
Per exit code configuration
One alternative is for new container spec values that allow individual containers to respect overrides on the global timeout behavior depending on the exit reason. These overrides will exist for the following reasons:
- image download failures
- workload crash: any non-0 exit code from the workload itself
- infrastructure events: terminated by the system, e.g. exec probe failures, OOMKill, or other kubelet runtime errors
- success: a 0 exit code from the workload itself
These had been selected because there are known use cases where changed restart behavior would benefit workloads experiencing these categories of failures.
RestartPolicy: Rapid
In the 1.31 version of this proposal, this KEP proposed a two-pronged approach to revisiting the CrashLoopBackoff behaviors for common use cases:
1. modifying the standard backoff delay to start faster but decay to the same 5m threshold
2. allowing Pods to opt-in to an even faster backoff curve with a lower max cap
For step (2), the method to allow the Pods to opt-in was by a new enum value,
Rapid, for a Pod’s RestartPolicy. In this case, Pods and restartable init
(aka sidecar) containers would be able to set a new OneOf value, restartPolicy: Rapid, to opt in to an exponential backoff decay that starts at a lower initial
value and maximizes to a lower cap. This proposal suggested we start with a new
initial value of 250ms and cap of 1 minute, and analyze its impact on
infrastructure during alpha.

Why not?: There was still a general community consensus that even though this was opt-in, giving the power to reduce the backoff curve to users in control of the pod manifest – who as a persona are not necessarily users with cluster-wide or at least node-wide visibility into load and scheduling – was too risky to global node stability.
In addition, overriding an existing Pod spec enum value, while convenient, required detailed management of the version skew period, at minimum across 3 kubelet versions per the API policy for new enum values in existing fields. In practice this meant the API server and kubelets across all nodes must be coordinated.
Firstly, Rapid must be a valid option to the restartPolicy in the API server
(which would only be possible if/when the API server was updated), and secondly,
the Rapid value must be interpretable by all kubelets on every node.
Unfortunately, it is not possible for the API server to be aware of what version
each kubelet is on, so it cannot serve Rapid as Always preferentially to
each kubelet depending on its version. Instead, each kubelet must be able to
handle this value properly, both at n-3 kubelet version and – more easily – at
its contemporary kubelet version. For updated kubelet versions, each kubelet
would be able to detect if it has the feature gate on, and if so, interpret
Rapid to use the new rapid backoff curve; and if the feature gate is off,
interpret it instead as Always. But at earlier kubelet versions, Rapid must
be ignored in favor of Always. Unfortunately for this KEP, the default value
for restartPolicy is Never, though even more unfortunately, it looks like
different code paths use a different default value (thank you @tallclair!!;
1 defaults to Always, 2 defaults to OnFailure, 3 defaults to Always, and
4 defaults to Never), so if kubelet drops unexpected enum values for
restartPolicy, a Pod with Rapid will be misconfigured by an old kubelet.
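The version skew problem described above can be sketched as two interpretation functions. This is an illustration of the rejected design only: the Rapid value was never adopted, and the names interpretRestartPolicy, legacyInterpret, and the restartPolicy type here are hypothetical. The legacy function models a code path that falls back to its own default when it sees a value it does not recognize.

```go
package main

import "fmt"

type restartPolicy string

const (
	policyAlways    restartPolicy = "Always"
	policyOnFailure restartPolicy = "OnFailure"
	policyNever     restartPolicy = "Never"
	policyRapid     restartPolicy = "Rapid" // proposed in the 1.31 design, not adopted
)

// interpretRestartPolicy models an updated kubelet: Rapid is honored only
// when the feature gate is on, and is otherwise treated as Always.
func interpretRestartPolicy(p restartPolicy, rapidGateEnabled bool) restartPolicy {
	if p == policyRapid && !rapidGateEnabled {
		return policyAlways
	}
	return p
}

// legacyInterpret models an older kubelet code path that does not know
// Rapid and substitutes whatever default that code path happens to use.
func legacyInterpret(p, codePathDefault restartPolicy) restartPolicy {
	switch p {
	case policyAlways, policyOnFailure, policyNever:
		return p
	default:
		return codePathDefault
	}
}

func main() {
	fmt.Println("updated, gate on: ", interpretRestartPolicy(policyRapid, true))
	fmt.Println("updated, gate off:", interpretRestartPolicy(policyRapid, false))
	// The desired legacy result is Always, but it depends on the code path.
	for _, def := range []restartPolicy{policyAlways, policyOnFailure, policyNever} {
		fmt.Printf("legacy, default %s: %s\n", def, legacyInterpret(policyRapid, def))
	}
}
```

Only the code paths whose default is Always produce the intended fallback; the others silently change the Pod's restart semantics.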
Flat-rate restarts for Succeeded Pods
We start from the assumption that the “Succeeded” phase of a Pod in Kubernetes means that all workloads completed as expected. Most often this is colloquially referred to as an exit code 0, as this exit code is what is caught by Kubernetes for Linux containers.
The wording of the public documentation (ref)
and the naming of the CrashLoopBackOff state itself imply that it is a
remedy for a container not exiting as intended, but the restart delay decay
curve is applied to even successful pods if their restartPolicy = Always. On the
canonical tracking issue for this problem, a significant number of requests
focus on how an exponential decay curve is inappropriate for workloads
completing as expected, and unnecessarily delays healthy workloads from being
rescheduled.
This alternative would vastly simplify and cut down on the restart delays for workloads completing as expected, as detectable by their transition through the “Succeeded” phase in Kubernetes. The target is to get as close to the capacity of kubelet to instantly restart as possible, anticipated to be somewhere within 0-10s flat rate + jitter delay for each restart, pending benchmarking in alpha.
Fundamentally, this change is taking a stand that a successful exit of a
workload is intentional by the end user – and by extension, if it has been
configured with restartPolicy = Always, that its impact on the Kubernetes
infrastructure when restarting is by end user design. This is in contrast to the
prevailing Kubernetes assumption that on its own, the Pod API best models
long-running containers that rarely or never exit themselves with “Success”;
features like autoscaling, rolling updates, and enhanced workload types like
StatefulSets assume this, while other workload types like those implemented with
the Job and CronJob API better model workloads that do exit themselves, running
until Success or at predictable intervals. If this alternative was pursued, we
would instead interpret an end user’s choice to run a relatively fast exiting
Pod (under 10 minutes) with both a successful exit code and configured to
restartPolicy: Always, as their intention to restart the pod indefinitely
without penalty.

Why not?: This provides a workaround (and therefore, opportunity for abuse), where application developers could catch any number of internal errors of their workload in their application code, but exit successfully, forcing extra fast restart behavior in a way that is opaque to kubelet or the cluster operator. Something similar is already being taken advantage of by application developers via wrapper scripts, but this causes no extra strain on kubelet as it simply causes the container to run indefinitely and uses no kubelet overhead for restarts.
On Success and the 10 minute recovery threshold
The original version of this proposal included a change specific to Pods transitioning through the “Succeeded” phase to have flat rate restarts. On further discussion, this was determined to be both too risky and a non-goal for Kubernetes architecturally, and moved into the Alternatives section. The risk for bad actors overloading the kubelet is described in the Alternatives section and is somewhat obvious. The larger point of it being a non-goal within the design framework of Kubernetes as a whole is less transparent and discussed here.
After discussion with early Kubernetes contributors and members of SIG-Node, it’s become more clear to the author that the prevailing Kubernetes assumption is that on its own, the Pod API best models long-running containers that rarely or never exit themselves with “Success”; features like autoscaling, rolling updates, and enhanced workload types like StatefulSets assume this, while other workload types like those implemented with the Job and CronJob API better model workloads that do exit themselves, running until Success or at predictable intervals. In line with this assumption, Pods that run “for a while” (longer than 10 minutes) are the ones that are “rewarded” with a reset backoff counter – not Pods that exit with Success. Ultimately, non-Job Pods are not intended to exit Successfully in any meaningful way to the infrastructure, and quick rerun behavior of any application code is considered to be an application level concern instead.
Therefore, even though it is widely desired by commenters on Kubernetes#57291, this KEP is not pursuing a different backoff curve for Pods exiting with Success any longer.
For Pods that are today intended to rerun after Success, it is instead suggested to
- exec the application logic with an init script or shell that reruns it indefinitely, like that described in Kubernetes#57291#issuecomment-377505620:
#!/bin/bash
while true; do
python /code/app.py
done
- or, if a shell in particular is not desired, implement the application such that it starts and monitors the restarting process inline, or as a subprocess/separate thread/routine
The author is aware that these solutions still do not address use cases where users have taken advantage of the “cleaner” state “guarantees” of a restarted pod to alleviate security or privacy concerns between sequenced Pod runs.
This decision here does not disallow the possibility that this is solved in other ways, for example:
- the Job API, which better models applications with meaningful Success states, introducing a variant that models fast-restarting apps by infrastructure configuration instead of by their code, i.e. Jobs with restartPolicy: Always and/or with no completion count target
- support restart on exit 0 as a directive in the container runtime or as a common independent tool, e.g. RESTARTABLE CMD mycommand or restart-on-exit-0 -- mycommand -arg -arg -arg
- formalized reaper behavior such as discussed in Kubernetes#50375
However, there will always need to be some throttling or quota for restarts to protect node stability, so even if these alternatives are pursued separately, they will depend on the analysis and benchmarking implementation during this KEP’s alpha stage to stay within node stability boundaries.
Related: API opt-in for flat rate/quick restarts when transitioning from Succeeded phase
Workloads must opt-in with restartPolicy: FastOnSuccess, as a
foil to restartPolicy: OnFailure. In this case, existing workloads with
restartPolicy: Always or ones not determined to be in the critical path would
use the new, yet still relatively slower, front-loaded decay curve and only
those updated with FastOnSuccess would get truer fast restart behavior.
However, then it becomes impossible for a workload to opt into both
restartPolicy: FastOnSuccess and restartPolicy: Rapid.
Related: Succeeded vs Rapidly failing: who’s getting the better deal?
When both a flat rate Succeeded and a Rapid implementation were combined in
this proposal, depending on the variation of the initial value, the first few
restarts of a failed container would be faster than a successful container,
which at first look seems backwards.

However, based on the use cases, this is still correct because the goal of restarting failed containers is to take maximum advantage of quickly recoverable situations, while the goal of restarting successful containers is only to get them to run again sometime and not penalize them with longer waits later when they’re behaving as expected.
Exposing per-node config as command-line flags
Command-line configuration is more easily and transparently exposed in tooling
used to bootstrap nodes via templating. Exposed command-line configuration
options for Kubelet are defined by struct KubeletFlags and merged with
configuration files in kubelet/server.go NewKubeletCommand.
To expose configuration as a command-line flag, a new field would be added to
the KubeletFlags struct to be validated in kubelet at runtime.
Why not?: Per comments in the code at this location,
it seems we don’t want to continue to expose configuration at this location. In
addition, since config directories for kubelet config are now in beta (ref),
there is a reasonable alternative to per-node configuration using the
KubeletConfiguration API object instead. By using the API, we can take
advantage of API machinery level lifecycle, validation and guarantees, including
that unrecognized fields will be dropped in older versions of kubelet, which is
valuable for version skew requirements we must meet back to kubelet n-3.
Late recovery
There are many use cases not covered in this KEP’s target User Stories that share the common properties of being concerned with the recovery timeline of Pods that have already reached their max cap for their backoff. Today, some of these Pods will have their backoff counters reset once they have run successfully for 10 minutes. However, user stories exist where
- the Pod will never successfully run for 10 minutes by design
- the user wants to be able to force the decay curve to restart (Kubernetes#50375 )
- the application knows what to wait for and could communicate that to the system (like a restart probe)
As discussed here in Alternatives Considered, the first case is unlikely to be addressed by Kubernetes.
The latter two are considered out of scope for this KEP, as the most common use cases are regarding the initial recovery period. If there is still sufficient appetite after this KEP reaches beta to specifically address late recovery scenarios, then that would be a good time to address them without the noise and change of this KEP.
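For orientation, the reset rule that these late recovery stories run into can be sketched as follows. This is a simplified model under stated assumptions, not the kubelet's bookkeeping: it assumes the delay doubles per restart up to the default 300s cap, and that a container which ran for at least 10 minutes before exiting starts again from the 10s initial delay. The function nextDelay and the constant names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay      = 10 * time.Second
	maxDelay          = 300 * time.Second
	recoveryThreshold = 10 * time.Minute // run time that resets the backoff
)

// nextDelay returns the delay before the next restart, given the previous
// delay and how long the container ran before it exited. A previous delay
// of zero means the container has not been restarted yet.
func nextDelay(previous, ranFor time.Duration) time.Duration {
	if previous == 0 || ranFor >= recoveryThreshold {
		return initialDelay
	}
	next := 2 * previous
	if next > maxDelay {
		next = maxDelay
	}
	return next
}

func main() {
	// A Pod that by design exits after 9 minutes stays at the cap forever.
	fmt.Println(nextDelay(maxDelay, 9*time.Minute))
	// A Pod that ran for 10 minutes gets its backoff reset.
	fmt.Println(nextDelay(maxDelay, 10*time.Minute))
}
```

The first user story in the list above corresponds to the first call: a workload that never reaches the threshold never sees its delay reset.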
More complex heuristics
The following alternatives are all considered by the author to be in the category of “more complex heuristics”, meaning solutions predicated on kubelet making runtime decisions on a variety of system or workload states or trends. These approaches all share the common negatives of being:
- harder to reason about
- of unknown return on investment for use cases relative to the investment to implement
- expensive to benchmark and test
That being said, after this initial KEP reaches beta and beyond, it is entirely possible that the community will desire more sophisticated behavior based on or inspired by some of these considered alternatives. As mentioned above, the observability and benchmarking work done within the scope of this KEP can help users provide empirical support for further enhancements, and the following review may be useful to such efforts in the future.
- Expose podFailurePolicy to nonJob Pods
- Subsidize successful running time/readinessProbe/livenessProbe seconds in current backoff delay
- Detect anomalous workload crashes
Appendix A
Kubelet SyncPod
- Update API status
- Wait for network ready
- Register pod to secrets and configmap managers
- Create/update QOS cgroups for a restarting, unkilled pod if enabled
- Create mirror pod for static pod if deleted
- Make pod data directories
- Wait for volumes to attach/mount, up to 2 minutes per sync loop
- Get image pull secrets
- Add pod to probe manager (start probe workers)
- Vertical pod scaling; potentially do a resize on podmanager/statusmanager
Runtime SyncPod
- Fix/create pod sandbox if necessary
- Start ephemeral containers
- Start init/sidecar containers
- Start normal containers
- Start =
- Get image volumes
- If still in backoff, error; will come back next round
- Error caught by kubelet, isTerminal=False
- Pull image volumes if enabled
- Pull image
- Generate container config, including generating env and pulling secrets
- Create container
- User Pre start hook
- Start container
- User Post start hook
Kubelet SyncTerminatingPod + runtime killPod
- Update Pod status
- Stop probes
- Kill pod (with grace if enabled)
- Send kill signals and wait
- Stop sandboxes
- update QOS cgroups if enabled
- Remove probes
- Deallocate DRA resources, if enabled
- Set the Pod status again with exit codes
Kubelet SyncTerminatedPod
- Update Pod status
- Unmount volumes (up to 2 minutes 3 seconds per sync loop)
- Unregister secret and configmap
- Remove cgroups for qos if enabled
- Release user namespace (if enabled, needs I/O)
- Update Pod status again