KEP-5532: Restart All Containers on Container Exits

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
- Risks and Mitigations
  - Unintended Pod Restart Loops
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes an extension to the container restart rules introduced in KEP-5307 to allow a container’s exit to trigger a restart of the entire pod. This is part of the Pod / Container Restart roadmap planned earlier (see discussion). This “in-place” pod restart will terminate and then restart all of the pod’s containers (including init and sidecar containers) while preserving the pod’s sandbox, UID, network namespace, attached devices, and IP address. This provides a more efficient way to reset a pod’s state compared to deleting and recreating the pod, which is particularly beneficial for workloads like AI/ML training, where rescheduling is costly.

Motivation

While KEP-5307 introduces container-level restart policies, there are scenarios where restarting the entire pod is more desirable than restarting a single container. The benefits of restarting the whole pod in-place include the following.

Re-run with Init Containers

Many applications rely on init containers to prepare the environment, such as mounting volumes with gcsfuse or performing other setup tasks. When a container fails, a full pod restart ensures that these init containers are re-executed, guaranteeing a clean and correctly configured environment for the new set of application containers.

Another scenario is when an init container “takes the next item from the queue” and the main container exits with an indication that it wants a “new item” to process. See also

Handles Sidecar Container Failures

Sidecar container failures sometimes render the main container not ready, and restarting the sidecar alone is insufficient. For example, if a sidecar that manages a remote volume fails and restarts, the main container may keep trying to access an outdated volume. With the RestartAllContainers action, the sidecar can force the main container to restart as well, giving it a clean environment.

One example is gcsfuse (https://github.com/GoogleCloudPlatform/gcsfuse), which employs an architecture in which a CSI driver and a sidecar work together.

Efficient In-Place Restart

Deleting and recreating a pod is a heavy operation involving the scheduler, node resource allocation, and re-initialization of networking and storage. An in-place restart, which preserves the pod sandbox and its associated resources (UID, IP, devices), is significantly faster and reduces resource churn.

This is especially helpful for ML training workloads, where computation resources are expensive and in-place restarts improve resource usage efficiency. It is also helpful for workloads that run for only seconds, where restarting in place is much more efficient than rescheduling. See also

Separates Watcher-sidecars from worker containers

In ML training workloads we have setups with a watcher process that listens for failures, and restarts the worker from the previous checkpoint if needed. Without a RestartAllContainers action, the watcher process and worker process have to be coupled in a single container, increasing complexity and decreasing cohesion. The RestartAllContainers action eliminates this coupling. See also

Improved Predictability and Debugging

Restarting all containers together brings the entire pod to a known good state. This is often easier to reason about and debug than a state where some containers are running while others have been restarted independently.

Goals

Introduce a RestartAllContainers action to the ContainerRestartRule API.
Implement the kubelet logic to perform an in-place pod restart, which includes:
- Terminating and removing all containers (not including prestop hooks and graceful termination).
- Preserving the pod sandbox, UID, IP, network namespace, user namespace and mappings.
- Preserving all volumes, including emptyDir and mounted volumes.
- Re-running init containers.
- Restarting all regular and sidecar containers.
Introduce a new PodCondition to make the pod restart process observable.
Restart the pod within 1 minute after the CRI detects the container terminated with matching exit code and rule.

Non-Goals

Introducing triggers for pod restart other than container exits (e.g., via a direct API call). This could be a future enhancement.
Tearing down and recreating all pod resources during the restart. The focus is on an efficient “in-place” restart of the containers that preserves the environment.

Proposal

This proposal extends the API defined in KEP-5307 by adding a new action, RestartAllContainers, to ContainerRestartRuleAction. When a container exits, the kubelet will evaluate the restartPolicyRules. If a rule with the RestartAllContainers action matches the exit condition (e.g., a specific exit code), the kubelet will initiate an in-place restart of the pod.
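The rule evaluation described above can be sketched as follows. This is a minimal illustration, not the kubelet's code: the types `RestartRule`, `Action`, and `Operator` and the function `firstMatchingAction` are hypothetical simplifications of the KEP-5307 API, and the sketch assumes that rules are evaluated in order and the first match wins.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the API types (not the real k8s.io/api definitions).
type Action string

const (
	ActionRestart              Action = "Restart"
	ActionRestartAllContainers Action = "RestartAllContainers"
)

type Operator string

const (
	OpIn    Operator = "In"
	OpNotIn Operator = "NotIn"
)

// RestartRule flattens a rule's action and exit-code requirement into one struct.
type RestartRule struct {
	Action   Action
	Operator Operator
	Values   []int32
}

// matches reports whether exitCode satisfies the rule's exit-code requirement.
func (r RestartRule) matches(exitCode int32) bool {
	found := false
	for _, v := range r.Values {
		if v == exitCode {
			found = true
			break
		}
	}
	switch r.Operator {
	case OpIn:
		return found
	case OpNotIn:
		return !found
	default:
		return false
	}
}

// firstMatchingAction returns the action of the first rule that matches exitCode.
// ok is false when no rule matches; the pod-level restartPolicy would then apply.
func firstMatchingAction(rules []RestartRule, exitCode int32) (action Action, ok bool) {
	for _, r := range rules {
		if r.matches(exitCode) {
			return r.Action, true
		}
	}
	return "", false
}

func main() {
	rules := []RestartRule{
		{Action: ActionRestartAllContainers, Operator: OpIn, Values: []int32{88}},
	}
	action, ok := firstMatchingAction(rules, 88)
	fmt.Println(action, ok)
	_, ok = firstMatchingAction(rules, 1)
	fmt.Println(ok)
}
```

With the single rule above, exit code 88 selects RestartAllContainers, while any other exit code falls through to the pod's restartPolicy.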

User Stories

Story 1: Rerun with init containers

As a developer, I have a pod where an init container is responsible for setting up a resource, like mounting a volume or preparing a configuration file, that the main container depends on. If the main application container fails in a way that corrupts this resource’s state, I want the entire pod to restart. This ensures the init container runs again to provide a clean setup before the application container starts. I can configure the main container to exit with a specific code that triggers the RestartAllContainers action.

Story 2: Efficient in-place restart

As an ML engineer, I run distributed training jobs where a sidecar container monitors the main training container. If the training process encounters a specific, retriable error, the sidecar detects it and needs to restart the whole worker pod from the last checkpoint. With this feature, I can program the sidecar to simply exit with a specific code. This triggers the RestartAllContainers action, which efficiently resets the worker without involving the Job controller for a full pod recreation or needing complex communication between the sidecar and the main container.

See details in https://docs.google.com/document/d/16zexVooHKPc80F4dVtUjDYK9DOpkVPRNfSv0zRtfFpk/edit?tab=t.0#heading=h.y6xl7juq7465

Story 3: RestartAllContainers with init container providing items from a queue

As a developer, I want a pod with an init container and a main container. The init container takes the next item from the queue, and the main container processes the item. The main container should be able to exit and indicate that it wants a “new item” to process.

Story 4: Restart main container on sidecar failures

As a developer, I have a pod with a sidecar container that provides resources to the main container. If the sidecar fails and restarts, the main container would be trying to access an outdated resource. I want to be able to restart all containers if the sidecar fails. This helps to keep my main container up-to-date with the sidecar containers.

Risks and Mitigations

Unintended Pod Restart Loops

A container might persistently exit with an exit code that triggers a RestartAllContainers action, causing the entire pod to enter a restart loop. This could consume significant node resources and mask the underlying problem.

Mitigation: The kubelet already implements an exponential backoff for container restarts. This same backoff mechanism will be applied to pod restarts triggered by this feature. This will introduce increasing delays between restart attempts, preventing rapid, resource-intensive restart loops and giving operators time to diagnose the issue.

Design Details

API

The proposal is to extend the ContainerRestartRuleAction enum with RestartAllContainers.

type ContainerRestartRuleAction string

const (
  // Restarts the container that exited.
  ContainerRestartRuleActionRestart ContainerRestartRuleAction = "Restart"
  // Restarts the entire pod.
  ContainerRestartRuleActionRestartAllContainers ContainerRestartRuleAction = "RestartAllContainers"
)

Example usage in a Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: my-ml-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: setup-envs
    image: setup
  - name: watcher-sidecar
    image: watcher
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      onExit:
        exitCodes:
          operator: In
          values: [88] # A specific exit code indicating the pod should be restarted.
  containers:
  - name: main-container
    image: training-app

The history of the container statuses will be preserved. The restart counts of all containers and of the pod will be incremented as well. This will be covered by unit tests and e2e tests, including tests of interoperability with the JobSet APIs.

If the pod restart policy is “Never” and an init container fails after the RestartAllContainers action is requested, the Pod will be marked as Failed.

Restart Phases

The pod restart can be split into two phases.

The first phase is pod termination. The kubelet compares the containerStatuses with restart rules and decides to terminate the pod. The kubelet sets the AllContainersRestarting=True pod condition to the API. The SyncLoop will try to 1) kill all running containers, 2) remove all init and regular containers from the container runtime. The sandbox is preserved to keep the pod IP, UID, devices, and network namespace. The API endpoint slice is also kept.

Steps to terminate the pod include:

Add pod condition AllContainersRestarting
Kill all running containers
- No ordering during the kill
- Best-effort: prestop hooks
- Termination grace periods are not respected.
Remove all init and regular containers from container runtime
- ContainerStatuses are kept in the API
- Exited containers on the runtime are removed
- Necessary for a clean restart; otherwise kubelet cannot tell if a container exited before the restart (expected) or after the restart (a new failure).
No changes to probes
No changes to other pod resources, such as sandbox, IP, network namespace, devices, volumes, etc.
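The termination steps above can be sketched against an in-memory model of the runtime. Everything here is hypothetical (`podRuntimeState`, `container`, `terminateForRestart`); a real implementation would issue CRI calls. The sketch also keeps ephemeral containers, which this KEP describes as not being removed.

```go
package main

import "fmt"

type kind string

const (
	kindInit      kind = "init"
	kindSidecar   kind = "sidecar"
	kindRegular   kind = "regular"
	kindEphemeral kind = "ephemeral"
)

type container struct {
	Name string
	Kind kind
}

// podRuntimeState is a hypothetical snapshot of what the runtime holds for one pod.
type podRuntimeState struct {
	SandboxID  string
	Containers []container
}

// terminateForRestart models the termination phase: every init, sidecar, and
// regular container is killed (no ordering, no grace period) and removed, while
// the sandbox and any ephemeral containers are left untouched.
func terminateForRestart(s podRuntimeState) (podRuntimeState, []string) {
	var kept []container
	var removed []string
	for _, c := range s.Containers {
		if c.Kind == kindEphemeral {
			kept = append(kept, c)
			continue
		}
		removed = append(removed, c.Name)
	}
	return podRuntimeState{SandboxID: s.SandboxID, Containers: kept}, removed
}

func main() {
	before := podRuntimeState{
		SandboxID: "sandbox-1",
		Containers: []container{
			{Name: "setup-envs", Kind: kindInit},
			{Name: "watcher-sidecar", Kind: kindSidecar},
			{Name: "main-container", Kind: kindRegular},
			{Name: "debugger", Kind: kindEphemeral},
		},
	}
	after, removed := terminateForRestart(before)
	fmt.Println("sandbox:", after.SandboxID)
	fmt.Println("removed:", removed)
	fmt.Println("kept:", after.Containers)
}
```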

The second phase is pod startup. With all containers terminated and removed, the kubelet unsets the AllContainersRestarting pod condition in the API. Because the kubelet sees no containers in the container runtime, it can proceed with the normal Pod startup actions in the SyncLoop. This will follow the regular pod startup flow, except that the sandbox already exists.

This includes the following steps:

The pod resources (sandbox, IP, devices, volumes, etc.) already exist; kubelet will skip recreating those resources.
Running init containers in sequence
- Any new failures will be handled according to restartPolicy, e.g. fail pod if restartPolicy=Never
- Only proceeds after success
Running all sidecar containers in sequence
- Only proceeds after startupProbe succeeds
Running regular containers
Running poststart hooks
Probes become active again

Termination Grace Periods

The TerminationGracePeriodSeconds is not respected. In many cases, best-effort cleanups and termination grace periods are desired for real terminations, such as a pod being deleted or evicted. However, they might not be expected for quick in-place restarts. Because the container will restart in-place relatively quickly, there shouldn’t be much concern about skipping the cleanup. The termination grace periods will still be respected if the pod is terminating (rather than restarting in place).

This provides “graceful termination” for real terminations and “fast and nongraceful termination” for in-place restarts.
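The distinction between the two termination paths can be expressed in a few lines. This is only a sketch: `terminationReason` and `gracePeriodSeconds` are hypothetical names, and modeling "not respected" as a grace period of zero is an assumption.

```go
package main

import "fmt"

type terminationReason int

const (
	reasonPodTerminating terminationReason = iota // pod deleted or evicted
	reasonInPlaceRestart                          // RestartAllContainers action
)

// gracePeriodSeconds returns the grace period to use when stopping a container.
// Real terminations honor the pod's configured value; in-place restarts do not.
func gracePeriodSeconds(reason terminationReason, podGracePeriod int64) int64 {
	if reason == reasonInPlaceRestart {
		return 0 // assumption: stop immediately, skipping graceful shutdown
	}
	return podGracePeriod
}

func main() {
	fmt.Println(gracePeriodSeconds(reasonPodTerminating, 30))
	fmt.Println(gracePeriodSeconds(reasonInPlaceRestart, 30))
}
```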

Rejected Alternative: Respect pod.Spec.terminationGracePeriodSeconds

An alternative would be to respect the terminationGracePeriod on the pod level; all containers would use the same pod-level termination grace period. This gives containers the opportunity to perform graceful termination even during restarts. However, this could cause “unexpected cleanup” to be performed during the PodRestart and could slow down the restart process.

This provides “graceful termination” for real terminations as well as “slow and graceful termination” for in-place restarts.

Potential future improvement: Customizable TerminationGracePeriod for RestartAllContainers

Another alternative is to allow users to specify a separate terminationGracePeriod for the RestartAllContainers action. With this setup, containers can have appropriate time to clean up for real terminations, and can have shorter (or even no) grace periods for in-place restarts. This is similar to the probe-level termination grace period, which overrides the pod-level termination grace period.

This does add extra complexity to the API and implementation. It can be extended in the future if there are feature requests for RestartAllContainers specific termination grace periods.

Prestop Hooks

Because termination grace periods are not respected, the prestop hooks will not be executed. If prestop hook execution is desired for in-place restarts, it could potentially be included with the Customizable TerminationGracePeriod for RestartAllContainers improvement.

Containers in Runtime

Init containers, sidecar containers, and regular containers are all removed from the runtime to ensure a clean restart of the pod. Ephemeral containers are kept, because they are ephemeral in nature and should not be executed again.

ContainerStatuses in API

Container statuses in the API are kept for observability and clarity. However, they will not affect how the kubelet restarts the pod and containers.

Sandbox

Sandbox is preserved. This means pod UID, IP, devices are all preserved. This ensures a faster restart and the pod will get the same resources.

Volumes

Volumes are kept. PodRestart focuses on container restart, instead of resetting the environment.

Note: In some cases, remounting the volumes might be desired. This is out of scope for this KEP. There are ongoing discussions around a separate KEP that focuses on marking volumes as “required for remount” during container-level restarts or RestartAllContainers actions.

Init Containers

Init containers are started in order, including sidecar containers.

Requires init containers to be reentrant.
A failing init container with restartAction=RestartAllContainers can keep the pod restarting (also possible today).

Regular Containers

All regular containers will be restarted during a RestartAllContainers action.

Including succeeded containers with restartPolicy=OnFailure or restartPolicy=Never
Including all failed containers with restartPolicy=Never
Restarting all containers is the most intuitive behavior for RestartAllContainers; skipping some containers would make the behavior harder to reason about.
In the case of Jobs, it is preferable to restart everything, so the worker can run from scratch again.
Also possible today if the node got restarted.
Failed / Succeeded containers can run multiple times if misconfigured.

Ephemeral Containers

Will not be restarted due to their ephemeral nature.

Probes

Probes are not deactivated during the restart. All probes are expected to fail during pod restart. The failure of probes should not trigger another pod restart.

Liveness Probes

Liveness probes on containers that were running before the restart are expected to fail (because the container is being restarted). The kubelet will coordinate the liveness probe with the SyncPod cycle to ensure that the container is started in order and not affected by liveness probes.

Readiness Probes

Readiness probes are expected to fail as well. It is expected that the readiness probe may render the container as not ready.

Startup Probes

Startup probes are expected to fail during the restart. After the restart, Startup probes will become active and valid again. The execution of startup probes after the restart will affect the pod lifecycle (e.g. if startup probe failed, the pod will be marked as failed if restartPolicy=Never).

Pod Status

[New] Pod condition AllContainersRestarting

To make the restart process observable, a new pod condition will be added to the Pod.status.conditions.

type: AllContainersRestarting
status: True / False
reason: ContainerExited
message: 'Container my-container exited with code 88, triggering pod restart'

The kubelet will set this condition to True at the beginning of the termination phase. The kubelet will set it to False at the end of the termination phase (with all containers removed from the runtime).
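Setting and clearing the condition amounts to an upsert keyed by condition type. The sketch below uses a hypothetical local `PodCondition` struct and `setCondition` helper rather than the real API types, to show that the pod carries at most one AllContainersRestarting entry across both transitions.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical mirror of the API type.
type PodCondition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// setCondition replaces the condition with the same Type, or appends it if absent.
func setCondition(conds []PodCondition, c PodCondition) []PodCondition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	conds := []PodCondition{{Type: "Ready", Status: "True"}}

	// Beginning of the termination phase.
	conds = setCondition(conds, PodCondition{
		Type:    "AllContainersRestarting",
		Status:  "True",
		Reason:  "ContainerExited",
		Message: "Container my-container exited with code 88, triggering pod restart",
	})
	// End of the termination phase: all containers removed from the runtime.
	conds = setCondition(conds, PodCondition{Type: "AllContainersRestarting", Status: "False"})

	fmt.Println(len(conds), conds[1].Status)
}
```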

This condition has the following benefits:

Restart status is kept across reboots and updates.
Consistent with two existing principles: 1) the API server is the single source of truth; 2) the SyncLoop reads the Pod from the API server, updates the pod status, and performs actions.
Pod lifecycle is reported to the API server and visible to user / other components

Existing Pod Conditions

When a container is stopped, pod condition Ready and ContainersReady will be marked as False.

However, pod condition Initialized will not be marked as false, because currently it is assumed that once a pod is initialized, it cannot be “uninitialized”. The reasoning is that PodRestart should be considered “restarting all containers of the pod”, not necessarily recreating the pod itself.

Pod Phase

The pod should be in the Pending phase throughout the restart. This means if the pod was in the Running phase, it could be reverted to the Pending phase. This is possible today as well.

Kubelet Implementation

The in-place pod restart will be implemented in the kubelet as a state machine based on the PodCondition mentioned above. If the AllContainersRestarting condition is true, then the pod is in the Termination Phase. Otherwise, it is considered the Startup Phase (which is the same as pod regular startup).

When a RestartAllContainers rule is triggered, the kubelet will set the PodCondition AllContainersRestarting=True. In this state, the kubelet’s only goal is to kill and remove all of the pod’s containers. This process is similar to a normal pod shutdown but skips tearing down the sandbox. The container statuses from the previous run are preserved for history.

Once the kubelet verifies that all containers are removed, it transitions to startup phase by setting the PodCondition AllContainersRestarting=False. In this state, the kubelet’s goal is to start the pod from the beginning, preserving the existing sandbox. This is the same as a normal pod startup sequence.
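The two-state machine described above can be sketched as a pure function of the condition and the number of containers the runtime still reports. `syncStep` and the action strings are hypothetical; they only illustrate the transitions, including the case where the condition is True but no containers exist (for example after a node restart).

```go
package main

import "fmt"

// syncStep decides what one sync iteration should do.
// restarting is the current value of the AllContainersRestarting condition;
// runtimeContainers is how many of the pod's containers the runtime still holds.
// It returns the action to take and the condition value after the step.
func syncStep(restarting bool, runtimeContainers int) (action string, nextRestarting bool) {
	if restarting {
		if runtimeContainers > 0 {
			// Termination phase: keep killing and removing containers.
			return "kill-and-remove-containers", true
		}
		// Everything is gone: leave the termination phase.
		return "set-condition-false", false
	}
	// Startup phase: identical to regular pod startup, reusing the sandbox.
	return "normal-startup", false
}

func main() {
	fmt.Println(syncStep(true, 3))
	fmt.Println(syncStep(true, 0))
	fmt.Println(syncStep(false, 0))
}
```

Because the decision depends only on state stored in the API and the runtime, a restarted kubelet can resume from whichever phase it finds.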

Kubelet Restarts

If kubelet restarted in the Termination Phase, because the PodCondition is preserved on the API server, kubelet could continue the cleanup.

If the kubelet did not preserve pod condition, it could also infer from the container statuses from the CRI that a RestartAllContainers action is triggered.

If kubelet restarted in the Startup Phase, it proceeds normally as today by synchronizing all pods. From kubelet’s perspective, the pod just got created and assigned.

Node Restarts

On node restarts, the kubelet and container runtime lose all containers. In the first pass, the kubelet syncs the pods assigned to it.

If the pod was previously restarted in place, and was in the Termination Phase, it would have the pod condition AllContainersRestarting=True. Since kubelet sees all containers do not exist, it will set the pod condition AllContainersRestarting=False and proceed with normal pod start up sequence.
If the pod was previously restarted in place, and was in the Startup Phase, then kubelet will proceed as if the pod just got created.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

N/A

Unit tests

k8s.io/api/pod
- TestValidateContainerRestartRulesOption
k8s.io/apis/core
k8s.io/apis/core/v1/validations
- TestValidatePodUpdate
- TestValidateContainerRestartPolicy
k8s.io/features
k8s.io/kubelet
k8s.io/kubelet/container
k8s.io/kubelet/kuberuntime
k8s.io/kubelet/prober
k8s.io/kubelet/status

Integration tests

Unit and E2E tests are expected to provide sufficient coverage.

e2e tests

Create a pod with a container that has a restartPolicyRule with the RestartAllContainers action. Verify that when the container (init, regular, or sidecar) exits with the specified code, the entire pod is restarted in-place (same UID, IP).
Verify that init containers are re-executed after a pod restart is triggered.
Verify that all regular and sidecar containers are restarted.
Verify that the AllContainersRestarting condition is added to the pod status during the restart and removed after it completes.
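The core e2e assertions above could be expressed with a helper along these lines. This is a sketch only: `podSnapshot` and `verifyInPlaceRestart` are hypothetical, and a real test would read these fields from the Pod status through the e2e framework.

```go
package main

import "fmt"

// podSnapshot captures the fields the e2e checks compare before and after a restart.
// RestartCounts covers init, sidecar, and regular containers (not ephemeral ones).
type podSnapshot struct {
	UID           string
	IP            string
	RestartCounts map[string]int32
}

// verifyInPlaceRestart checks that the pod kept its identity and that every
// tracked container was restarted at least once.
func verifyInPlaceRestart(before, after podSnapshot) error {
	if before.UID != after.UID {
		return fmt.Errorf("pod UID changed: %s -> %s", before.UID, after.UID)
	}
	if before.IP != after.IP {
		return fmt.Errorf("pod IP changed: %s -> %s", before.IP, after.IP)
	}
	for name, count := range before.RestartCounts {
		if after.RestartCounts[name] <= count {
			return fmt.Errorf("container %q was not restarted", name)
		}
	}
	return nil
}

func main() {
	before := podSnapshot{UID: "uid-1", IP: "10.0.0.5", RestartCounts: map[string]int32{"setup-envs": 0, "main-container": 0}}
	after := podSnapshot{UID: "uid-1", IP: "10.0.0.5", RestartCounts: map[string]int32{"setup-envs": 1, "main-container": 1}}
	fmt.Println(verifyInPlaceRestart(before, after))
}
```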

List of other restart sequences that need to be tested:

1st or 2nd init container fails and trigger RestartAllContainers, all containers should be restarted.
Init container failing after RestartAllContainers, restartPolicy=Never, the pod should be failed.
Init container failing after RestartAllContainers, restartPolicy=Always, then init container eventually succeeds, the pod should be started.
Sidecar failing before the regular container started and triggers RestartAllContainers, all containers should be restarted.

RestartAllContainers action should work with other features, including Topology Manager and Resource Manager.

Graduation Criteria

Alpha

Feature implemented behind a RestartAllContainersOnContainerExits feature gate.
The RestartAllContainers action is added to the API.
Kubelet implementation of the in-place pod restart logic is complete.
Initial e2e tests are completed and enabled to verify the core functionality.
Documentation is added.

Beta

Container restart policy functionality running behind feature flag for at least one release.
Container restart policy runs well with Job controller.

GA

No major bugs reported for three months.
User feedback (ideally from at least two distinct users) is green.

Upgrade / Downgrade Strategy

The feature gate RestartAllContainersOnContainerExits will protect the new functionality.

Upgrade: When upgrading, the API server should be upgraded before the kubelets. If a pod with the RestartAllContainers rule is scheduled on an older kubelet that doesn’t support the feature, the rule will be ignored, and the pod’s restartPolicy will be used.
Downgrade: If the feature is disabled or kubelets are downgraded, any RestartAllContainers rules in existing pods will be ignored. The pod will revert to the behavior defined by its restartPolicy.

Version Skew Strategy

Older kubelets that are unaware of the RestartAllContainers action will ignore it and keep the existing behavior determined by the pod’s restart policy.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: RestartAllContainersOnContainerExits
- Components depending on the feature gate: kube-apiserver, kubelet

Does enabling the feature change any default behavior?

No. The feature is opt-in. It only takes effect when the RestartAllContainers action is explicitly used in a container’s restartPolicyRules. Existing workloads are unaffected.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling the feature gate RestartAllContainersOnContainerExits on the API server and kubelets will cause the RestartAllContainers action to be ignored. Pods will fall back to the behavior defined by their restartPolicy.

What happens if we reenable the feature if it was previously rolled back?

If the feature is re-enabled, kubelets will once again recognize and enforce the RestartAllContainers rules for any pods that have them defined.

Are there any tests for feature enablement/disablement?

Unit test for the API’s validation with the feature enabled and disabled.
Unit test for the kubelet with the feature enabled and disabled.
Unit test for API on the new field for the Pod API.
- The new API field is immutable.
- The option to allow the new API will be true if either the feature is enabled, or the API is already used in the existing pod.
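The validation option described above follows a common pattern for gated API values: allow the value if the gate is on, or if the stored object already uses it so that existing pods remain valid after a rollback. A minimal sketch, with hypothetical names (`podSpec`, `allowRestartAllContainers`, `validate`) and a flattened pod model:

```go
package main

import (
	"errors"
	"fmt"
)

const actionRestartAllContainers = "RestartAllContainers"

// podSpec is a hypothetical, flattened view of the rule actions used by a pod's containers.
type podSpec struct {
	RuleActions []string
}

// usesRestartAllContainers reports whether any rule in the pod uses the new action.
func usesRestartAllContainers(p *podSpec) bool {
	if p == nil {
		return false
	}
	for _, a := range p.RuleActions {
		if a == actionRestartAllContainers {
			return true
		}
	}
	return false
}

// allowRestartAllContainers computes the validation option: the action is
// accepted when the gate is enabled or the existing pod already uses it.
func allowRestartAllContainers(gateEnabled bool, oldPod *podSpec) bool {
	return gateEnabled || usesRestartAllContainers(oldPod)
}

// validate rejects the new action when the option does not allow it.
func validate(newPod *podSpec, allow bool) error {
	if !allow && usesRestartAllContainers(newPod) {
		return errors.New("RestartAllContainers action is not allowed while the feature gate is disabled")
	}
	return nil
}

func main() {
	pod := &podSpec{RuleActions: []string{actionRestartAllContainers}}
	fmt.Println(validate(pod, allowRestartAllContainers(false, nil))) // create, gate off
	fmt.Println(validate(pod, allowRestartAllContainers(false, pod))) // update of existing pod, gate off
	fmt.Println(validate(pod, allowRestartAllContainers(true, nil)))  // create, gate on
}
```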

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

During a rollout, a cluster may have a mix of kubelets with the feature enabled and disabled. If a pod using the RestartAllContainers feature is scheduled on a node where the feature is not yet enabled, it will not have the desired restart behavior. This could lead to inconsistent behavior for a given workload during the rollout period, but it will not cause running workloads to fail.

What specific metrics should inform a rollback?

If pods do not specify the new RestartAllContainers action, the upgrade should not be affected by this feature.

If pods specified this action and are in a repeated crash loop backoff, especially if they are not progressing, this is a sign for rollback.

Optionally, if the metric kube_pod_container_status_restarts_total is enabled, a significant increase in the metric is also a sign for rollback.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, the upgrade->downgrade->upgrade path was tested manually to ensure that the RestartAllContainers action is correctly ignored when the feature gate is disabled and resumes working when re-enabled.

Tested with a kind cluster with the feature gate enabled, created a pod with RestartAllContainers action. The validation passed, and all containers were restarted after the source container exited.

Disabled the feature flag and restarted kubelet and kube-apiserver. The action is ignored, and the container did not restart.

Enabled the feature flag and restarted kubelet and kube-apiserver. All containers are restarted again.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The feature is only in use if the Pod API specifies the RestartAllContainers action and a container exits with a matching exit code. Pods that do not specify the action do not use this feature.

Also, to verify a pod is restarted by this feature, the pod condition will have AllContainersRestarting=True during the restart. After the restart, container status will show the last termination status as restarted by this action.

Lastly, the restart count of all containers will be incremented during the action. This can serve as supplementary evidence that the RestartAllContainers action was triggered if the pod condition and container statuses are unavailable.

How can someone using this feature know that it is working for their instance?

[] Events
- Event Reason:
API .status
- Condition name: AllContainersRestarting
- Other field: reason and message in PodCondition
Other (treat as last resort) API .spec
- Details: pod.containers.RestartPolicyRules with action=RestartAllContainers

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

There is no explicit cluster-wide SLO for this feature. Some best-effort experiment and measurement are listed below.
A reasonable time for restarting all containers should be within the typical container restart latencies (accounting for exponential backoff), as if all containers were restarted sequentially. Each container has a default termination time of 2s, so a reasonable restart time per container (not including init container execution) is less than 5 seconds.
A simple 2-container pod could restart its first container within 5s of the source container exit in 99% of cases. The restart duration could be longer if the pod contains more containers. This can be measured by the difference between container exit events and container started events of the pod.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: kube_pod_container_status_restarts_total
- Aggregation method: Sum over time, grouped by container and pod.
- Components exposing the metric: kube-state-metrics
Other (treat as last resort)
- Details: ContainerStatuses API will also include the last termination state of the restarted containers, indicating that the container was restarted due to RestartAllContainers.

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No. It depends only on the standard Kubernetes components (kube-apiserver, kubelet) and a CRI-compatible container runtime.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

A new possible value “RestartAllContainers” for ContainerRestartRuleAction will be introduced.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

The size of the PodCondition API object will be increased to account for the new AllContainersRestarting status, example:

type: AllContainersRestarting
status: True / False
reason: ContainerExited
message: 'Container my-container exited with code 88, triggering pod restart'

API type: PodCondition
Estimated increase in size: 200B
Estimated amount of new objects: at most one per pod.
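As a rough sanity check of the size estimate, the example condition can be serialized to JSON and measured. The struct below is a hypothetical mirror of PodCondition with only the fields shown in the example plus a transition timestamp; the stored size in etcd depends on the actual encoding and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podCondition is a simplified, hypothetical mirror of the API type.
type podCondition struct {
	Type               string `json:"type"`
	Status             string `json:"status"`
	LastTransitionTime string `json:"lastTransitionTime,omitempty"`
	Reason             string `json:"reason,omitempty"`
	Message            string `json:"message,omitempty"`
}

// conditionSize returns the length in bytes of the condition's JSON encoding.
func conditionSize(c podCondition) (int, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return 0, err
	}
	return len(b), nil
}

func main() {
	size, err := conditionSize(podCondition{
		Type:               "AllContainersRestarting",
		Status:             "True",
		LastTransitionTime: "2025-01-01T00:00:00Z",
		Reason:             "ContainerExited",
		Message:            "Container my-container exited with code 88, triggering pod restart",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("JSON-encoded condition: %d bytes\n", size)
}
```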

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server is unavailable, the kubelet will not be able to update the AllContainersRestarting condition. However, the kubelet will still be able to perform the in-place restart of the containers locally if the trigger conditions are met. Observability of the restart process will be delayed until the API server becomes available again.

What are other known failure modes?

Kubelet lost contact with API server:
- Detection: Kubelet runs in standalone mode or cannot connect to the API server.
- Mitigations: Not needed. RestartAllContainers should keep functioning without the API server.
- Testing: Covered by e2e tests that run kubelet in standalone mode.
Kubelet restarted:
- Detection: Kubelet log shows that kubelet is restarted.
- Mitigations: Not needed. RestartAllContainers should keep functioning even if the kubelet was restarted during the container restarts.
- Testing: Covered by e2e tests that interrupt kubelet during RestartAllContainers actions.
Infinite Restart Loop:
- Detection: The pod status shows a high container restartCount, or the pod frequently transitions into the AllContainersRestarting condition.
- Mitigations: Kubelet’s exponential backoff mechanism for container restarts will apply to RestartAllContainers actions. Additionally, operators can delete the pod.
- Diagnostics: Kubelet logs will indicate which container exited and why it triggered the RestartAllContainers action. Check the container logs and pod spec to understand why the container keeps exiting.
- Testing: Covered by e2e tests verifying that RestartAllContainers is triggered only once when the workload and PodSpec are configured properly.
Stuck in Termination Phase:
- Detection: AllContainersRestarting condition is True for an extended period (beyond the expected termination time of containers).
- Mitigations: Operators can delete the pod.
- Diagnostics: Kubelet logs will indicate which container exited and why it triggered the RestartAllContainers action. Container runtime logs will indicate why the container restart failed.
- Testing: Not covered. This usually indicates an issue with the container runtime.

What steps should be taken if SLOs are not being met to determine the problem?

Review container logs to understand why the container exited.
Review kubelet logs for errors related to container termination or startup.
Verify the container runtime’s health and responsiveness.
Ensure that the kubelet can communicate with the API server and that the AllContainersRestarting condition is being updated correctly.

Implementation History

1.35: Implemented as Alpha feature
- https://github.com/kubernetes/kubernetes/pull/134345

Drawbacks

Alternatives

Giving access to CRI API or subset

Pods could be given direct, albeit limited, access to the Container Runtime Interface (CRI) on the node. A container within the pod could make a CRI call to the kubelet or container runtime to request a restart of itself or other containers. To restart the whole pod, it would need to request the termination and recreation of all containers.

Pros

Provides fine-grained control over the pod’s lifecycle from within the pod itself.

Cons

Major security risk. Exposing the CRI API, even a subset, to workloads is a significant security concern. A compromised container could potentially disrupt other pods on the same node.
Breaks abstraction. It violates the abstraction layer between the pod and the node infrastructure. Pods should be managed by the Kubernetes control plane and kubelet, not manage themselves at the runtime level.
Increased complexity for developers. Application developers would need to understand and interact with the CRI, which is a low-level infrastructure detail.

Pod Self-orchestration

See also https://github.com/kubernetes/enhancements/issues/5309 . A more structured version of the CRI access idea is to introduce a formal concept of an “orchestration container” within a pod. One container in the pod would be designated as the orchestrator. This container would have special privileges to manage the lifecycle (start, stop, monitor) of other containers within the same pod.

Pros

Structured and flexible management of container lifecycle. Provides a very flexible way to handle complex inter-container dependencies and startup/shutdown sequences.
Powerful for specific use cases. Ideal for scenarios where containers have complex, ordered dependencies that go beyond the existing pod lifecycle.

Cons

High complexity. This is a much larger and more complex feature than what is needed for a “restart the whole pod” action. It is likely overkill for the use cases in this KEP.
Significant implementation effort. Implementing full pod self-orchestration is a major undertaking with wide-ranging implications for the kubelet and API.
Imperative management of container lifecycles. Kubernetes tends to manage container and pod lifecycles declaratively. An imperative kill could have a steep learning curve for users.

Liveness probes on regular containers that point to a sidecar container

Liveness probes can be configured on regular containers that point to a sidecar container. The exit of (or an unexpected response from) the sidecar container can cause liveness failures, which in turn cause the regular containers to terminate and restart.

Pros

The sidecar does not necessarily need to terminate; an error response is sufficient to trigger the regular container restart.
Reuses the existing probe mechanism, which offers some flexibility and tolerance through its thresholds.

Cons

Does not trigger a full pod restart. It cannot re-run init containers or restart other containers.
Indirect and complex. It can be difficult to debug and understand the relationship between the sidecar’s health and the main container’s lifecycle. It also needs to be configured on every regular container to achieve this.
There is a delay between the sidecar detecting the failure and the liveness probe failing, based on the probe’s periodSeconds and failureThreshold.
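The delay mentioned in the last point can be estimated from the probe settings. The sketch below is a rough model under stated assumptions, not kubelet behavior: it assumes probes fire exactly every periodSeconds, that the sidecar fails at an arbitrary moment between two probes, and that a failing probe may take up to timeoutSeconds to return. The probe type and detectionDelayBounds function are hypothetical names, and the model ignores kubelet scheduling jitter and the time needed to stop and restart the container.

```go
package main

import "fmt"

// probe is a hypothetical, simplified view of the liveness probe timing fields.
type probe struct {
	PeriodSeconds    int32
	TimeoutSeconds   int32
	FailureThreshold int32
}

// detectionDelayBounds returns a rough lower and upper bound, in seconds,
// on the time between the sidecar failing and the probe reaching
// failureThreshold consecutive failures.
//
// Lower bound: the failure happens just before a probe fires and each probe
// fails immediately, so (failureThreshold-1) further periods are needed.
// Upper bound: the failure happens just after a successful probe and the
// final failing probe takes the full timeout to return.
func detectionDelayBounds(p probe) (minSeconds, maxSeconds int32) {
	minSeconds = (p.FailureThreshold - 1) * p.PeriodSeconds
	maxSeconds = p.FailureThreshold*p.PeriodSeconds + p.TimeoutSeconds
	return minSeconds, maxSeconds
}

func main() {
	// Kubernetes defaults: periodSeconds=10, timeoutSeconds=1, failureThreshold=3.
	lo, hi := detectionDelayBounds(probe{PeriodSeconds: 10, TimeoutSeconds: 1, FailureThreshold: 3})
	fmt.Printf("estimated detection delay: %ds to %ds\n", lo, hi)
}
```

Under this model the default probe settings give a delay of roughly 20 to 31 seconds, whereas a restart rule that acts on container exit does not wait on a probe interval.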