KEP-5532: Restart All Containers on Container Exits
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes an extension to the container restart rules introduced in KEP-5307 to allow a container’s exit to trigger a restart of the entire pod. This is part of the Pod / Container Restart roadmap planned earlier (see discussion ). This “in-place” pod restart will terminate and then restart all of the pod’s containers (including init and sidecar containers) while preserving the pod’s sandbox, UID, network namespace, attached devices, and IP address. This provides a more efficient way to reset a pod’s state compared to deleting and recreating the pod, which is particularly beneficial for workloads like AI/ML training, where rescheduling is costly.
Motivation
While KEP-5307 introduces container-level restart policies, there are scenarios where restarting the entire pod is more desirable than restarting a single container. The benefits of restarting the whole pod in-place include the following.
Re-run with Init Containers
Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.
Another scenario is when an init container “takes the next item from the queue” and the main container exits with an indication that it wants a “new item” to process. See also
- https://github.com/kubernetes/enhancements/issues/3676
- https://github.com/kubernetes/enhancements/issues/3759#issuecomment-2389506153
Handles Sidecar Container Failures
Sidecar container failures sometimes render the main container not ready, and restarting the sidecar is insufficient. For example, if a sidecar that manages the remote volume fails and restarts, the main container may be trying to access an outdated volume. With RestartAllContainers action, the sidecar could force the main container to restart as well for a clean environment.
An example is gcsfuse (https://github.com/GoogleCloudPlatform/gcsfuse), whose architecture pairs a CSI driver with a sidecar working together.
Efficient In-Place Restart
Deleting and recreating a pod is a heavy operation involving the scheduler, node resource allocation, and re-initialization of networking and storage. An in-place restart, which preserves the pod sandbox and its associated resources (UID, IP, devices), is significantly faster and reduces resource churn.
This is especially helpful for ML training workloads, where computation resources are expensive and in-place restarts improve resource usage efficiency. It also helps when the workload runs for only seconds, in which case restarting in place is much more efficient than rescheduling. See also
- https://github.com/kubernetes-sigs/jobset/issues/467
- https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465
Separates Watcher-sidecars from worker containers
In ML training workloads we have setups with a watcher process that listens for failures, and restarts the worker from the previous checkpoint if needed. Without a RestartAllContainers action, the watcher process and worker process have to be coupled in a single container, increasing complexity and decreasing cohesion. The RestartAllContainers action eliminates this coupling. See also
- https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465
- https://github.com/kubernetes/enhancements/issues/4438
Improved Predictability and Debugging
Restarting all containers together brings the entire pod to a known good state. This is often easier to reason about and debug than a state where some containers are running while others have been restarted independently.
Goals
- Introduce a RestartAllContainers action to the ContainerRestartRule API.
- Implement the kubelet logic to perform an in-place pod restart, which includes:
- Terminating and removing all containers (not including prestop hooks and graceful termination).
- Preserving the pod sandbox, UID, IP, network namespace, user namespace and mappings.
- Preserving all volumes, including emptyDir and mounted volumes.
- Re-running init containers.
- Restarting all regular and sidecar containers.
- Introduce a new PodCondition to make the pod restart process observable.
- Restart the pod within 1 minute after the CRI detects the container terminated with matching exit code and rule.
Non-Goals
- Introducing triggers for pod restart other than container exits (e.g., via a direct API call). This could be a future enhancement.
- Tearing down and recreating all pod resources during the restart. The focus is on an efficient “in-place” restart of the containers that preserves the environment.
Proposal
This proposal extends the API defined in KEP-5307 by adding a new action, RestartAllContainers, to ContainerRestartRuleAction. When a container exits, the kubelet will evaluate the restartPolicyRules. If a rule with the RestartAllContainers action matches the exit condition (e.g., a specific exit code), the kubelet will initiate an in-place restart of the pod.
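The rule evaluation described above could be sketched roughly as follows. This is a minimal illustration, not the kubelet's code: the `rule` struct and `matchAction` function are hypothetical stand-ins for the real API types, and the `NotIn` operator is assumed from KEP-5307.

```go
package main

import "fmt"

// rule is a simplified, hypothetical stand-in for one restartPolicyRules entry.
type rule struct {
	Action   string // "Restart" or "RestartAllContainers"
	Operator string // "In" or "NotIn" (NotIn assumed from KEP-5307)
	Values   []int32
}

// matchAction returns the action of the first rule whose exit-code condition
// matches, or "" when no rule matches and the restartPolicy applies instead.
func matchAction(rules []rule, exitCode int32) string {
	for _, r := range rules {
		found := false
		for _, v := range r.Values {
			if v == exitCode {
				found = true
				break
			}
		}
		if (r.Operator == "In" && found) || (r.Operator == "NotIn" && !found) {
			return r.Action
		}
	}
	return ""
}

func main() {
	rules := []rule{{Action: "RestartAllContainers", Operator: "In", Values: []int32{88}}}
	fmt.Printf("%q\n", matchAction(rules, 88))
	fmt.Printf("%q\n", matchAction(rules, 1))
}
```

With the rule from this sketch, exit code 88 selects the RestartAllContainers action, while any other exit code yields an empty result and leaves the decision to the restart policy.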
User Stories
Story 1: Rerun with init containers
As a developer, I have a pod where an init container is responsible for setting up a resource, like mounting a volume or preparing a configuration file, that the main container depends on. If the main application container fails in a way that corrupts this resource’s state, I want the entire pod to restart. This ensures the init container runs again to provide a clean setup before the application container starts. I can configure the main container to exit with a specific code that triggers the RestartAllContainers action.
Story 2: Efficient in-place restart
As an ML engineer, I run distributed training jobs where a sidecar container monitors the main training container. If the training process encounters a specific, retriable error, the sidecar detects it and needs to restart the whole worker pod from the last checkpoint. With this feature, I can program the sidecar to simply exit with a specific code. This triggers the RestartAllContainers action, which efficiently resets the worker without involving the Job controller for a full pod recreation or needing complex communication between the sidecar and the main container.
See details in https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465
Story 3: RestartAllContainers with init container providing items from a queue
As a developer, I want a pod with an init container and a main container. The init container takes the next item from the queue, and the main container processes the item. The main container should be able to exit and indicate that it wants a “new item” to process.
Story 4: Restart main container on sidecar failures
As a developer, I have a pod with a sidecar container that provides resources to the main container. If the sidecar fails and restarts, the main container would be trying to access an outdated resource. I want to be able to restart all containers if the sidecar fails. This helps to keep my main container up-to-date with the sidecar containers.
Risks and Mitigations
Unintended Pod Restart Loops
A container might persistently exit with an exit code that triggers a RestartAllContainers action, causing the entire pod to enter a restart loop. This could consume significant node resources and mask the underlying problem.
Mitigation: The kubelet already implements an exponential backoff for container restarts. This same backoff mechanism will be applied to pod restarts triggered by this feature. This will introduce increasing delays between restart attempts, preventing rapid, resource-intensive restart loops and giving operators time to diagnose the issue.
Design Details
API
The proposal is to extend the ContainerRestartRuleAction enum with RestartAllContainers.
type ContainerRestartRuleAction string

const (
	// Restarts the container that exited.
	ContainerRestartRuleActionRestart ContainerRestartRuleAction = "Restart"
	// Restarts the entire pod.
	ContainerRestartRuleActionRestartAllContainers ContainerRestartRuleAction = "RestartAllContainers"
)
Example usage in a Pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: my-ml-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: setup-envs
    image: setup
  - name: watcher-sidecar
    image: watcher
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      onExit:
        exitCodes:
          operator: In
          values: [88] # A specific exit code indicating the pod should be restarted.
  containers:
  - name: main-container
    image: training-app
The history of the container statuses will be preserved. The restart count of all containers and of the pod will increment as well. This will be covered by unit tests and e2e tests, including tests that exercise the feature together with the JobSet APIs.
If the pod restart policy is “Never” and an init container fails after the RestartAllContainers action is requested, the Pod will be marked as Failed.
Restart Phases
The pod restart can be split into two phases.
The first phase is pod termination. The kubelet compares the containerStatuses with restart rules and decides to terminate the pod. The kubelet sets the AllContainersRestarting=True pod condition to the API. The SyncLoop will try to 1) kill all running containers, 2) remove all init and regular containers from the container runtime. The sandbox is preserved to keep the pod IP, UID, devices, and network namespace. The API endpoint slice is also kept.
Steps to terminate the pod include:
- Add pod condition AllContainersRestarting
- Kill all running containers
  - No ordering during the kill
  - Best-effort: prestop hooks
  - Termination grace periods are not respected.
- Remove all init and regular containers from container runtime
  - ContainerStatuses are kept in the API
  - Exited containers on the runtime are removed
    - Necessary for a clean restart; otherwise kubelet cannot tell if a container exited before the restart (expected) or after the restart (a new failure).
- No changes to probes
- No changes to other pod resources, such as sandbox, IP, network namespace, devices, volumes, etc.
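The termination steps above can be sketched against a fake runtime. Everything here is illustrative: `container`, `fakeRuntime`, and `terminateForRestart` are hypothetical names, the real kubelet talks to the CRI instead of recording strings, and leaving ephemeral containers running (rather than only leaving them in the runtime) is an assumption.

```go
package main

import "fmt"

// container is a hypothetical, simplified view of a container known to the runtime.
type container struct {
	Name      string
	Running   bool
	Ephemeral bool
}

// fakeRuntime stands in for the CRI and records the calls it would receive.
type fakeRuntime struct {
	containers []container
	calls      []string
}

// terminateForRestart kills running containers (no ordering, no grace period)
// and then removes every non-ephemeral container. The sandbox is never touched.
func terminateForRestart(r *fakeRuntime) {
	for _, c := range r.containers {
		if c.Running && !c.Ephemeral {
			r.calls = append(r.calls, "kill "+c.Name)
		}
	}
	var kept []container
	for _, c := range r.containers {
		if c.Ephemeral {
			kept = append(kept, c) // assumption: ephemeral containers are left alone
			continue
		}
		r.calls = append(r.calls, "remove "+c.Name)
	}
	r.containers = kept
}

func main() {
	r := &fakeRuntime{containers: []container{
		{Name: "setup-envs"},
		{Name: "watcher-sidecar", Running: true},
		{Name: "main-container", Running: true},
		{Name: "debugger", Running: true, Ephemeral: true},
	}}
	terminateForRestart(r)
	fmt.Println(r.calls)
	fmt.Println(len(r.containers), "container(s) left in the runtime")
}
```

Removing exited containers as well as running ones is what lets the startup phase treat an empty runtime as "nothing has run yet".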
The second phase is pod startup. With all containers terminated and removed, the kubelet unsets the AllContainersRestarting pod condition in the API. Because the kubelet sees no containers in the container runtime, it can proceed with the normal pod startup actions in the SyncLoop. This follows the regular pod startup flow, except that the sandbox already exists.
This includes the following steps:
- The pod resources (sandbox, IP, devices, volumes, etc.) already exist; kubelet will skip recreating those resources.
- Running init containers in sequence
  - Any new failures will be handled according to restartPolicy, e.g. fail pod if restartPolicy=Never
  - Only proceeds after success
- Running all sidecar containers in sequence
  - Only proceeds after startupProbe succeeds
- Running regular containers
  - poststart hooks
- Probes become active again
Termination Grace Periods
The TerminationGracePeriodSeconds is not respected. In many cases, best-effort cleanups and termination grace periods are desired for real terminations, such as a pod being deleted or evicted. However, they might not be expected for quick in-place restarts. Because the container will restart in place relatively quickly, there shouldn’t be much concern about skipping the cleanup. The termination grace periods will still be respected if the pod is terminating (as opposed to restarting in place).
This provides “graceful termination” for real terminations and “fast and nongraceful termination” for in-place restarts.
Rejected Alternative: Respect pod.Spec.terminationGracePeriodSeconds
An alternative would be to respect the terminationGracePeriod on the pod level; all containers would use the same pod-level termination grace period. This gives containers the opportunity to perform graceful termination even during restarts. However, this could cause “unexpected cleanup” to be performed during the pod restart, and could slow down the restart process.
This provides “graceful termination” for real terminations as well as “slow and graceful termination” for in-place restarts.
Potential future improvement: Customizable TerminationGracePeriod for RestartAllContainers
Another alternative is to allow users to specify a separate terminationGracePeriod for the RestartAllContainers action. With this setup, containers can have appropriate time to clean up for real terminations, and shorter (or even no) periods for in-place restarts. This is similar to the probe-level termination grace period, which overrides the pod-level termination grace period.
This does add extra complexity to the API and implementation. It can be extended in the future if there are feature requests for RestartAllContainers specific termination grace periods.
Prestop Hooks
Because termination grace periods are not respected, the prestop hooks will not be executed. If prestop hook execution is desired for in-place restarts, it could potentially be included with the Customizable TerminationGracePeriod for RestartAllContainers improvement.
Containers in Runtime
Init containers, sidecar containers, and regular containers are all removed from the runtime to ensure a clean restart of the pod. Ephemeral containers are kept, because they are ephemeral in nature and should not be executed again.
ContainerStatuses in API
Container statuses in the API are kept for observability and clarity. However, they will not affect how the kubelet restarts the pod and containers.
Sandbox
Sandbox is preserved. This means pod UID, IP, devices are all preserved. This ensures a faster restart and the pod will get the same resources.
Volumes
Volumes are kept. PodRestart focuses on container restart, instead of resetting the environment.
Note: In some cases, remounting the volumes might be desired. This is out of scope for this KEP. There are ongoing discussions around a separate KEP that focuses on marking volumes as “required for remount” during container-level restarts or RestartAllContainers actions.
Init Containers
Init containers are started in order, including sidecar containers.
- Requires init containers to be reentrant.
- A failing init container with restartAction=RestartAllContainers can keep the pod restarting (also possible today).
Regular Containers
All regular containers will be restarted during a RestartAllContainers action.
- Including succeeded containers with restartPolicy=OnFailure or restartPolicy=Never
- Including all failed containers with restartPolicy=Never
- For a RestartAllContainers action it makes more sense to restart all the containers; skipping some containers would make the behavior harder to reason about.
- In the case of Jobs, it is preferable to restart everything, so the worker can run from scratch again.
- Also possible today if the node got restarted.
- Failed / Succeeded containers can run multiple times if misconfigured.
Ephemeral Containers
Will not be restarted due to their ephemeral nature.
Probes
Probes are not deactivated during the restart. All probes are expected to fail during pod restart. The failure of probes should not trigger another pod restart.
Liveness Probes
Liveness probes on containers that were running before the restart are expected to fail (because the container is being restarted). The kubelet will coordinate the liveness probe with the SyncPod cycle to ensure that the container is started in order and not affected by liveness probes.
Readiness Probes
Readiness probes are expected to fail as well. It is expected that the readiness probe may render the container as not ready.
Startup Probes
Startup probes are expected to fail during the restart. After the restart, startup probes will become active and valid again. The execution of startup probes after the restart will affect the pod lifecycle (e.g. if a startup probe fails, the pod will be marked as failed if restartPolicy=Never).
Pod Status
[New] Pod condition AllContainersRestarting
To make the restart process observable, a new pod condition will be added to the Pod.status.conditions.
type: AllContainersRestarting
status: True / False
reason: ContainerExited
message: 'Container my-container exited with code 88, triggering pod restart'
The kubelet will set this condition to True at the beginning of the termination phase. The kubelet will set it to False at the end of the termination phase (with all containers removed from the runtime).
This condition has the following benefits:
- Restart status is kept across reboots and updates.
- Consistent with two existing principles: 1) the API server is the single source of truth; 2) the SyncLoop reads the Pod from the API server, updates the pod status, and performs actions.
- Pod lifecycle is reported to the API server and visible to user / other components
Existing Pod Conditions
When a container is stopped, pod condition Ready and ContainersReady will be marked as False.
However, pod condition Initialized will not be marked as false, because currently it is assumed that once a pod is initialized, it cannot be “uninitialized”. The reasoning is that PodRestart should be considered “restarting all containers of the pod”, not necessarily recreating the pod itself.
Pod Phase
The pod should be in the Pending phase throughout the restart. This means that a pod in the Running phase could be reverted to the Pending phase. This is possible today as well.
Kubelet Implementation
The in-place pod restart will be implemented in the kubelet as a state machine based on the PodCondition mentioned above. If the AllContainersRestarting condition is true, then the pod is in the Termination Phase. Otherwise, it is considered the Startup Phase (which is the same as pod regular startup).
When a RestartAllContainers rule is triggered, the kubelet will set the PodCondition AllContainersRestarting=True. In this state, the kubelet’s only goal is to kill and remove all of the pod’s containers. This process is similar to a normal pod shutdown but skips tearing down the sandbox. The container statuses from the previous run are preserved for history.
Once the kubelet verifies that all containers are removed, it transitions to startup phase by setting the PodCondition AllContainersRestarting=False. In this state, the kubelet’s goal is to start the pod from the beginning, preserving the existing sandbox. This is the same as a normal pod startup sequence.
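The two-state behavior described above could be sketched as a single decision function. `step` is a hypothetical name, and the action strings only label what the kubelet would do in each state; the real SyncLoop is considerably more involved.

```go
package main

import "fmt"

// step models one hypothetical sync iteration for a pod. It takes the current
// value of the AllContainersRestarting condition and the number of containers
// the runtime still reports, and returns the next condition value and action.
func step(restarting bool, containersInRuntime int) (bool, string) {
	switch {
	case restarting && containersInRuntime > 0:
		// Termination phase: keep cleaning up until the runtime is empty.
		return true, "kill and remove containers"
	case restarting:
		// Cleanup verified: flip the condition and start from init containers.
		return false, "start pod on existing sandbox"
	default:
		// Startup phase is indistinguishable from a regular pod sync.
		return false, "regular sync"
	}
}

func main() {
	restarting, remaining := true, 3
	for i := 0; i < 3; i++ {
		var action string
		restarting, action = step(restarting, remaining)
		fmt.Printf("restarting=%v action=%q\n", restarting, action)
		remaining = 0 // assume the cleanup finished within one iteration
	}
}
```

Because the decision depends only on the condition and on what the runtime reports, the same function gives a sensible answer after a kubelet or node restart: a True condition with an empty runtime moves straight to startup.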
Kubelet Restarts
If kubelet restarted in the Termination Phase, because the PodCondition is preserved on the API server, kubelet could continue the cleanup.
- If the kubelet did not preserve pod condition, it could also infer from the container statuses from the CRI that a RestartAllContainers action is triggered.
If kubelet restarted in the Startup Phase, it proceeds normally as today by synchronizing all pods. From kubelet’s perspective, the pod just got created and assigned.
Node Restarts
On node restarts, the kubelet and container runtime lose all containers. In the first pass, the kubelet syncs the pods assigned to it.
If the pod was previously restarted in place and was in the Termination Phase, it would have the pod condition AllContainersRestarting=True. Since the kubelet sees that no containers exist, it will set the pod condition AllContainersRestarting=False and proceed with the normal pod startup sequence.
If the pod was previously restarted in place and was in the Startup Phase, then the kubelet will proceed as if the pod had just been created.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
N/A
Unit tests
- k8s.io/api/pod
  - TestValidateContainerRestartRulesOption
- k8s.io/apis/core
- k8s.io/apis/core/v1/validations
  - TestValidatePodUpdate
  - TestValidateContainerRestartPolicy
- k8s.io/features
- k8s.io/kubelet
- k8s.io/kubelet/container
- k8s.io/kubelet/kuberuntime
- k8s.io/kubelet/prober
- k8s.io/kubelet/status
Integration tests
Unit and E2E tests are expected to provide sufficient coverage.
e2e tests
- Create a pod with a container that has a restartPolicyRule with the RestartAllContainers action. Verify that when the container (init, regular, or sidecar) exits with the specified code, the entire pod is restarted in-place (same UID, IP).
- Verify that init containers are re-executed after a pod restart is triggered.
- Verify that all regular and sidecar containers are restarted.
- Verify that the AllContainersRestarting condition is added to the pod status during the restart and removed after it completes.
List of other restart sequences that need to be tested:
- 1st or 2nd init container fails and triggers RestartAllContainers, all containers should be restarted.
- Init container failing after RestartAllContainers, restartPolicy=Never, the pod should be failed.
- Init container failing after RestartAllContainers, restartPolicy=Always, then init container eventually succeeds, the pod should be started.
- Sidecar failing before the regular container started and triggers RestartAllContainers, all containers should be restarted.
RestartAllContainers action should work with other features, including Topology Manager and Resource Manager.
Graduation Criteria
Alpha
- Feature implemented behind a RestartAllContainersOnContainerExits feature gate.
- The RestartAllContainers action is added to the API.
- Kubelet implementation of the in-place pod restart logic is complete.
- Initial e2e tests are completed and enabled to verify the core functionality.
- Documentation is added.
Beta
- Container restart policy functionality running behind feature flag for at least one release.
- Container restart policy runs well with Job controller.
GA
- No major bugs reported for three months.
- User feedback (ideally from at least two distinct users) is green.
Upgrade / Downgrade Strategy
The feature gate RestartAllContainersOnContainerExits will protect the new functionality.
- Upgrade: When upgrading, the API server should be upgraded before the kubelets. If a pod with the RestartAllContainers rule is scheduled on an older kubelet that doesn’t support the feature, the rule will be ignored, and the pod’s restartPolicy will be used.
- Downgrade: If the feature is disabled or kubelets are downgraded, any RestartAllContainers rules in existing pods will be ignored. The pod will revert to the behavior defined by its restartPolicy.
Version Skew Strategy
An older kubelet that is unaware of the RestartAllContainers action will ignore it and keep the existing behavior determined by the pod’s restart policy.
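The fallback behavior under version skew can be illustrated with a small decision function. `decide` and its return values are hypothetical labels; the restart policy semantics (Always, OnFailure, Never) are the standard ones, and a single policy string is used here for simplicity.

```go
package main

import "fmt"

// decide returns what happens after a container exits. matchedRestartAll says
// whether a RestartAllContainers rule matched; supported says whether this
// kubelet understands the action (new enough and feature gate enabled).
func decide(matchedRestartAll, supported bool, restartPolicy string, exitCode int32) string {
	if matchedRestartAll && supported {
		return "RestartAllContainers"
	}
	// The rule is ignored; fall back to the plain restart policy.
	switch restartPolicy {
	case "Always":
		return "RestartContainer"
	case "OnFailure":
		if exitCode != 0 {
			return "RestartContainer"
		}
	}
	return "NoRestart"
}

func main() {
	fmt.Println(decide(true, true, "Never", 88))
	fmt.Println(decide(true, false, "Never", 88))
	fmt.Println(decide(true, false, "OnFailure", 88))
}
```

The second call corresponds to an older kubelet or a disabled feature gate: with restartPolicy=Never the container is simply not restarted.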
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: RestartAllContainersOnContainerExits
  - Components depending on the feature gate: kube-apiserver, kubelet
Does enabling the feature change any default behavior?
No. The feature is opt-in. It only takes effect when the RestartAllContainers action is explicitly used in a container’s restartPolicyRules. Existing workloads are unaffected.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling the feature gate RestartAllContainersOnContainerExits on the API server and kubelets will cause the RestartAllContainers action to be ignored. Pods will fall back to the behavior defined by their restartPolicy.
What happens if we reenable the feature if it was previously rolled back?
If the feature is re-enabled, kubelets will once again recognize and enforce the RestartAllContainers rules for any pods that have them defined.
Are there any tests for feature enablement/disablement?
- Unit test for the API’s validation with the feature enabled and disabled.
- Unit test for the kubelet with the feature enabled and disabled.
- Unit test for API on the new field for the Pod API.
- The new API field is immutable.
- The option to allow the new API will be true if either the feature is enabled, or the API is already used in the existing pod.
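The last rule in the list above could look roughly like this. The `pod` struct and all function names are hypothetical simplifications; real validation works on the full Pod API types and returns field-level errors.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, flattened view of a Pod: it only lists the
// restartPolicyRules actions used by any of its containers.
type pod struct {
	RuleActions []string
}

// usesRestartAllContainers reports whether the pod already uses the new action.
func usesRestartAllContainers(p *pod) bool {
	if p == nil {
		return false
	}
	for _, a := range p.RuleActions {
		if a == "RestartAllContainers" {
			return true
		}
	}
	return false
}

// allowRestartAllContainers is true if the feature gate is enabled or the
// existing (old) pod already uses the action.
func allowRestartAllContainers(gateEnabled bool, oldPod *pod) bool {
	return gateEnabled || usesRestartAllContainers(oldPod)
}

// validate rejects the new action unless the option allows it.
func validate(newPod *pod, allow bool) error {
	if usesRestartAllContainers(newPod) && !allow {
		return errors.New("action RestartAllContainers is not allowed")
	}
	return nil
}

func main() {
	p := &pod{RuleActions: []string{"RestartAllContainers"}}
	fmt.Println(validate(p, allowRestartAllContainers(false, nil))) // create, gate off
	fmt.Println(validate(p, allowRestartAllContainers(false, p)))   // update of existing pod
}
```

Allowing the action when the old object already uses it keeps existing pods updatable after the feature gate is turned off.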
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
During a rollout, a cluster may have a mix of kubelets with the feature enabled and disabled. If a pod using the RestartAllContainers feature is scheduled on a node where the feature is not yet enabled, it will not have the desired restart behavior. This could lead to inconsistent behavior for a given workload during the rollout period, but it will not cause running workloads to fail.
What specific metrics should inform a rollback?
If pods do not specify the new RestartAllContainers action, the upgrade should not be affected by this feature.
If pods specified this action and are in a repeated crash loop backoff, especially if they are not progressing, this is a sign for rollback.
Optionally, if the metric kube_pod_container_status_restarts_total is enabled, a significant increase in the metric
is also a sign for rollback.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Yes, the upgrade->downgrade->upgrade path was tested manually to ensure that the RestartAllContainers
action is correctly ignored when the feature gate is disabled and resumes working when re-enabled.
Tested with a kind cluster with the feature gate enabled, created a pod with RestartAllContainers action.
The validation passed, and all containers were restarted after the source container exit.
Disabled the feature flag and restarted kubelet and kube-apiserver. The action is ignored, and the container did not restart.
Enabled the feature flag and restarted kubelet and kube-apiserver. All containers are restarted again.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The feature is only enabled if the Pod API specifies the RestartAllContainers action,
and the container exited with a matching exit code. Pod not specifying the action
will not use this feature.
Also, to verify a pod is restarted by this feature, the pod condition will have
AllContainersRestarting=True during the restart. After the restart, container status
will show the last termination status as restarted by this action.
Lastly, the restart count of all containers will be incremented during the action.
This can serve as supplementary evidence that the RestartAllContainers action was
triggered if the pod condition and container statuses are unavailable.
How can someone using this feature know that it is working for their instance?
- [] Events
  - Event Reason:
- API .status
  - Condition name: AllContainersRestarting
  - Other field: reason and message in PodCondition
- Other (treat as last resort) API .spec
  - Details: pod.containers.RestartPolicyRules with action=RestartAllContainers
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- There is no explicit cluster-wide SLO for this feature. Some best-effort experiment and measurement are listed below.
- A reasonable time for restarting all containers should be within the typical container restart latencies (accounting for exponential backoff), as if all containers were restarted sequentially. Each container has a default termination time of 2s, so a reasonable restart time per container (not including init container execution) is less than 5 seconds.
- A simple 2-container pod could restart its first container within 5s of the source container’s exit in 99% of cases. The restart duration could be longer if the pod contains more containers. This can be measured by the difference between container exit events and container started events of the pod.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: kube_pod_container_status_restarts_total
  - Aggregation method: Sum over time, grouped by container and pod.
  - Components exposing the metric: kube-state-metrics
- Other (treat as last resort)
  - Details: ContainerStatuses API will also include the last termination state of the restarted containers, indicating that the container was restarted due to RestartAllContainers.
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
No. It depends only on the standard Kubernetes components (kube-apiserver, kubelet) and a CRI-compatible container runtime.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
A new possible value “RestartAllContainers” for ContainerRestartRuleAction will be introduced.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
The size of the PodCondition API object will be increased to account for the new AllContainersRestarting status, example:
type: AllContainersRestarting
status: True / False
reason: ContainerExited
message: 'Container my-container exited with code 88, triggering pod restart'
- API type: PodCondition
- Estimated increase in size: 200B
- Estimated amount of new objects: at most one per pod.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server is unavailable, the kubelet will not be able to update the AllContainersRestarting condition. However, the kubelet will still be able to perform the in-place restart of the containers locally if the trigger conditions are met. Observability of the restart process will be delayed until the API server becomes available again.
What are other known failure modes?
Kubelet lost contact with API server:
- Detection: Kubelet runs in standalone mode or cannot connect to the API server.
- Mitigations: Not needed. RestartAllContainers should keep functioning without the API server.
- Testing: Covered by e2e tests that run the kubelet in standalone mode.
Kubelet restarted:
- Detection: Kubelet log shows that kubelet is restarted.
- Mitigations: Not needed. RestartAllContainers should keep functioning even if the kubelet was restarted during the container restarts.
- Testing: Covered by e2e tests that interrupt the kubelet during RestartAllContainers actions.
Infinite Restart Loop:
- Detection: The pod status shows a high container restartCount, and/or the pod frequently transitions into the AllContainersRestarting condition.
- Mitigations: Kubelet’s exponential backoff mechanism for container restarts will apply to RestartAllContainers actions. Additionally, operators can delete the pod.
- Diagnostics: Kubelet logs will indicate which container exited and triggered the RestartAllContainers action. Check container logs and pod spec to understand why the container kept exiting.
- Testing: Covered by e2e tests verifying that RestartAllContainers is triggered only once if the workload and PodSpec are configured properly.
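To illustrate how exponential backoff bounds the rate of a restart loop, here is a minimal sketch. The 10-second base and 5-minute cap are assumptions taken from the documented defaults for ordinary container restarts; whether RestartAllContainers uses exactly these values is not stated in this section.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values, mirroring the documented defaults for container restart
// backoff. They are not confirmed for RestartAllContainers specifically.
const (
	initialBackoff = 10 * time.Second
	maxBackoff     = 5 * time.Minute
)

// backoffDelay returns the delay applied before a restart, given the number
// of consecutive restarts that have already happened. The delay doubles on
// each restart and is capped at maxBackoff.
func backoffDelay(restarts int) time.Duration {
	d := initialBackoff
	for i := 0; i < restarts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	for n := 0; n <= 6; n++ {
		fmt.Printf("after %d restarts: wait %v\n", n, backoffDelay(n))
	}
}
```

Under these assumptions, a pod stuck in a loop settles at one restart attempt per cap interval, which keeps the loop visible through restartCount without overloading the node.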
Stuck in Termination Phase:
- Detection: The AllContainersRestarting condition is True for an extended period (beyond the expected termination time of containers).
- Mitigations: Operators can delete the pod.
- Diagnostics: Kubelet logs will indicate which container exited and triggered the RestartAllContainers action. Container runtime logs will indicate why the container restart failed.
- Testing: Not covered. This usually indicates an issue with the container runtime.
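The detection rule for a pod stuck in the termination phase could be checked along these lines. This is a sketch only: podCondition is a local stand-in for the API type, and the threshold is an operator-chosen value, not something defined by this KEP.

```go
package main

import (
	"fmt"
	"time"
)

// podCondition is a local stand-in holding the fields needed for the check.
type podCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// stuckRestarting reports whether the AllContainersRestarting condition has
// been True for longer than threshold, measured from its last transition.
func stuckRestarting(conds []podCondition, now time.Time, threshold time.Duration) bool {
	for _, c := range conds {
		if c.Type == "AllContainersRestarting" && c.Status == "True" {
			return now.Sub(c.LastTransitionTime) > threshold
		}
	}
	return false
}

func main() {
	now := time.Now()
	conds := []podCondition{
		{Type: "Ready", Status: "False", LastTransitionTime: now.Add(-20 * time.Minute)},
		{Type: "AllContainersRestarting", Status: "True", LastTransitionTime: now.Add(-15 * time.Minute)},
	}
	// The 10-minute threshold is an arbitrary example value.
	fmt.Println("stuck:", stuckRestarting(conds, now, 10*time.Minute))
}
```

A reasonable threshold would exceed the pod's termination grace period, since the condition is expected to stay True at least that long during a normal restart.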
What steps should be taken if SLOs are not being met to determine the problem?
- Review container logs to understand why container exited.
- Review kubelet logs for errors related to container termination or startup.
- Verify the container runtime’s health and responsiveness.
- Ensure that the kubelet can communicate with the API server and that the AllContainersRestarting condition is being correctly updated.
Implementation History
- 1.35: Implemented as Alpha feature
Drawbacks
Alternatives
Giving access to CRI API or subset
Pods could be given direct, albeit limited, access to the Container Runtime Interface (CRI) on the node. A container within the pod could make a CRI call to the kubelet or container runtime to request a restart of itself or other containers. To restart the whole pod, it would need to request the termination and recreation of all containers.
Pros
- Provides fine-grained control over the pod’s lifecycle from within the pod itself.
Cons
- Major security risk. Exposing the CRI API, even a subset, to workloads is a significant security concern. A compromised container could potentially disrupt other pods on the same node.
- Breaks abstraction. It violates the abstraction layer between the pod and the node infrastructure. Pods should be managed by the Kubernetes control plane and kubelet, not manage themselves at the runtime level.
- Increased complexity for developers. Application developers would need to understand and interact with the CRI, which is a low-level infrastructure detail.
Pod Self-orchestration
See also https://github.com/kubernetes/enhancements/issues/5309 . A more structured version of the CRI access idea is to introduce a formal concept of an “orchestration container” within a pod. One container in the pod would be designated as the orchestrator. This container would have special privileges to manage the lifecycle (start, stop, monitor) of other containers within the same pod.
Pros
- Structured and flexible management of container lifecycle. Provides a very flexible way to handle complex inter-container dependencies and startup/shutdown sequences.
- Powerful for specific use cases. Ideal for scenarios where containers have complex, ordered dependencies that go beyond the existing pod lifecycle.
Cons
- High complexity. This is a much larger and more complex feature than what is needed for a “restart the whole pod” action. It is probably overkill for the targeted use cases.
- Significant implementation effort. Implementing full pod self-orchestration is a major undertaking with wide-ranging implications for the kubelet and API.
- Imperative management of container lifecycles. Kubernetes tends to manage container and pod lifecycles declaratively. Imperative lifecycle control could impose a steep learning curve on users.
Liveness probes on regular containers that point to a sidecar container
Liveness probes can be configured on regular containers that point to a sidecar container. The exit of (or an unexpected response from) the sidecar container can lead to liveness failures, causing the regular containers to terminate / restart.
Pros
- The sidecar does not necessarily need to terminate; an error response is sufficient to trigger the regular container restart.
- Reuses the existing probes, which offer some flexibility and tolerance for transient failures.
Cons
- Does not trigger a full pod restart. It cannot re-run init containers or restart other containers.
- Indirect and complex. It can be difficult to debug and understand the relationship between the sidecar’s health and the main container’s lifecycle. It also needs to be configured for every regular container to achieve this.
- There is a delay between the sidecar detecting the failure and the liveness probe failing, based on the probe’s periodSeconds and failureThreshold.
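The delay mentioned in the last point can be approximated as periodSeconds multiplied by failureThreshold. The sketch below computes that figure; the probe struct is a local stand-in that borrows the field names of the real Probe type, and the calculation ignores timeoutSeconds and scheduling jitter.

```go
package main

import (
	"fmt"
	"time"
)

// probe holds the two liveness probe fields relevant to detection delay.
// It is a local stand-in, not the API type.
type probe struct {
	PeriodSeconds    int32
	FailureThreshold int32
}

// approxDetectionDelay estimates the upper bound on the time between the
// sidecar starting to fail and the probe being declared failed:
// failureThreshold consecutive failed probes, one per period.
func approxDetectionDelay(p probe) time.Duration {
	return time.Duration(p.PeriodSeconds) * time.Duration(p.FailureThreshold) * time.Second
}

func main() {
	// 10s and 3 are the documented default values for these fields.
	p := probe{PeriodSeconds: 10, FailureThreshold: 3}
	fmt.Println("approximate detection delay:", approxDetectionDelay(p))
}
```

By contrast, a restart rule that reacts to a container exit is evaluated when the exit is observed, without waiting for a series of probe periods.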