KEP-5307: Container Restart Policy

KEP-5307: Container Restart Rules

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Wrapping entrypoint
- Non-declarative (callbacks based) restart policy
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP introduces container-level restart rules that the kubelet applies when a container exits. Users can mark specific exit codes of a container as non-failures so that the container is restarted in place even if the Pod has restartPolicy=Never. This matters for use cases where rescheduling a task is very expensive and restarting in place is preferable.

This KEP is the first part of a larger plan to improve container restart behavior. For more discussion and details, see this document.

Motivation

With the proliferation of AI/ML training jobs, where each job spans hundreds of Pods that use expensive hardware and are very expensive to schedule, in-place restarts are becoming increasingly important.

Consider a Pod that is part of a large training job. Progress on each training “step” is made only when all Pods have completed the calculation for that step. Each Pod starts from a checkpoint, all Pods make progress together, and they write a new checkpoint. If any Pod fails, the fastest way to restart the calculation is to interrupt all Pods by restarting them, so that they all start from the previous checkpoint. This restart therefore requires special handling.

There are a few reasons why the OnFailure restart policy will not work:

Failed hardware must result in a Pod failure and rescheduling. The two kinds of failure, one caused by a hardware issue and one caused by an in-place restart request, need to be differentiated.
Pods are often part of JobSets with the Job pod failure policy configured (see https://kubernetes.io/docs/tasks/job/pod-failure-policy/). The pod failure policy is a server-side policy and is not compatible with Pods that use restartPolicy OnFailure.

Goals

Introduce the Container RestartPolicyRules API, which allows a container to keep being restarted on specified exit codes.
Allow the API to be extended to support more scenarios in the future.

Non-Goals

Implement maxRestartTimes (https://github.com/kubernetes/enhancements/issues/3322).
Support all possible restart policy rules in this KEP. Some may be ideas for the future; for a more detailed discussion of possible actions and conditions, please refer to this document.

Proposal

User Stories (Optional)

Story 1

As an ML researcher, I’m orchestrating a large number of long-running AI/ML training workloads. Failures in such workloads are unavoidable for various reasons. When a workload fails (with a retriable exit code), I would like the container to be restarted quickly without rescheduling the Pod, because rescheduling consumes a significant amount of time and resources. Restarting the failed container “in-place” is critical for better utilization of compute resources. The container should only restart “in-place” if it failed due to a retriable error; the container and Pod should terminate and possibly be rescheduled if the failure is not retriable.

Since AI/ML training workloads are often declared as Jobs with a PodFailurePolicy, the kubelet should treat some errors as retriable and restart the container in place, without re-creating and rescheduling the Pod.

See https://github.com/kubernetes-sigs/jobset/issues/876 for the detailed description.

Notes/Constraints/Caveats (Optional)

Container Restart Count

A Pod with restartPolicy=Never may have containers that were restarted and whose restart count is higher than 0, because container-level restart rules can restart the container.

This is already possible for sidecar containers which have container-level restartPolicy=Always.

Job PodFailurePolicy

This will not affect how the Job podFailurePolicy interacts with Pod failures, because a container-level restart is not considered a Pod termination. The Job controller only checks and evaluates podFailurePolicy after a Pod is terminated, as mentioned in KEP-3329.

This KEP is informed by the discussion of future improvements for JobSet described here: https://github.com/kubernetes/enhancements/issues/3329#issuecomment-1571643421. Instead of making the Job controller handle container-level restarts and failures, the kubelet is better suited to handle the container restart policy. This aligns with the current implementation, in which the Job controller only restarts or reschedules a Pod after it has terminated and delegates the rest to the kubelet.

This enables efficient setups that accelerate container restarts and improve resource utilization. For example, a Job can be configured with podFailurePolicy for hardware failures (where the Pod needs to be rescheduled to another node), while its containers are configured with restartPolicyRules to restart in place on training errors.

maxRestartTimes

maxRestartTimes is another ongoing proposal, KEP-3322, that provides a Pod API for specifying a maximum number of restarts. How container restartPolicyRules should interact with the Pod maxRestartTimes is still being discussed. The current understanding is that container restarts triggered by restartPolicyRules count towards the container restarts seen by all other APIs.

Sidecar Containers

This proposal does not change how Sidecar containers are detected or how their lifecycle works. For future improvements on Sidecar containers, please see below.

Future Improvements

This proposal fits into a larger effort to support other container restart conditions and actions. Please refer to this document.

Risks and Mitigations

Unintended Restart Loops

A container might persistently exit with a “Restart” exit code due to an unresolvable underlying issue, leading to frequent restarts that consume node resources and potentially mask the problem.

The container restart will still follow the exponential backoff to avoid excessive resource consumption due to restarts.

Although this introduces an exponential delay for container restarts, it still aligns with the goal of expediting in-place container restart. First, the in-place restart avoids expensive Pod rescheduling to a different node. Second, if the container keeps exiting with a code listed in the restart rules and is stuck in a crash loop, the error is probably not retriable, and the exponential backoff keeps frequent restarts from overwhelming the node.
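
To give a sense of the delays involved, here is a minimal sketch of a doubling, capped backoff. The 10-second initial delay and 5-minute cap are assumptions taken from the kubelet's documented default crash-loop behavior, not values defined by this KEP, and backoffDelay is a hypothetical helper rather than kubelet code.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed defaults (documented CrashLoopBackOff behavior): the delay starts
// at 10s, doubles on every consecutive restart, and is capped at 5 minutes.
const (
	initialBackoff = 10 * time.Second
	maxBackoff     = 5 * time.Minute
)

// backoffDelay returns the wait before the n-th consecutive restart
// (n starts at 1). Hypothetical helper for illustration only.
func backoffDelay(n int) time.Duration {
	if n < 1 {
		return 0
	}
	d := initialBackoff
	for i := 1; i < n; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

func main() {
	for n := 1; n <= 7; n++ {
		fmt.Printf("restart %d: wait %s\n", n, backoffDelay(n))
	}
}
```

Under these assumptions the cap is reached on the sixth consecutive restart, so a container stuck on a "retriable" exit code costs the node at most one restart every five minutes.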

Design Details

The proposal is to extend the Pod’s Container API type with a new field restartPolicyRules.

The Pod’s specified restartPolicy (Always / Never / OnFailure) will act as the default behavior for each container. The user has the ability to specify a restartPolicy on the container, which will override the restartPolicy from the Pod. If the container restartPolicy is not specified, the pod restartPolicy will be used. Same as now, for Sidecar containers, the user needs to specify container restartPolicy=Always on an init container.

Additionally, each container can have its own restartPolicyRules. If the restartPolicyRules field is specified, the user must also specify the container restartPolicy, which is defined next to it. The restartPolicyRules field defines a list of rules to apply on container exit. Each rule consists of a condition (onExitCodes, OOM killed, eviction, resource contention, etc.) and an action (Restart, Terminate, TerminatePod, etc.). The rules are evaluated in order; if no rule’s condition matches, the default action falls back to the container’s restartPolicy.

The initial proposal supports only one action, “Restart”, to restart the container.

The initial proposal supports only exit codes as the condition for the rules.

The proposed API is as follows:

type ContainerRestartPolicy string

const (
  ContainerRestartPolicyAlways ContainerRestartPolicy = "Always"
  ContainerRestartPolicyNever ContainerRestartPolicy = "Never"
  ContainerRestartPolicyOnFailure ContainerRestartPolicy = "OnFailure"
)

type Container struct {
  // Omitting irrelevant fields...
  // RestartPolicy must be specified if RestartPolicyRules is specified.
  RestartPolicy *ContainerRestartPolicy

  // Represents a list of rules to be checked to determine if the
  // container should be restarted on exit. The rules are evaluated in
  // order. Once a rule matches a container exit condition, the remaining
  // rules are ignored. If no rule matches the container exit condition,
  // the container-level restart policy determines whether the container
  // is restarted or not. Constraints on the rules:
  // - At most 20 rules are allowed.
  // - Rules can have the same action.
  // - Identical rules are not forbidden in validations.
  RestartPolicyRules []ContainerRestartRule
}

// ContainerRestartRule describes how a container exit is handled.
type ContainerRestartRule struct {
  // Specifies the action taken on a container exit if the requirements
  // are satisfied. The only possible value is "Restart" to restart the
  // container.
  // +required
  Action ContainerRestartRuleAction

  // Represents the exit codes to check on container exits. The oneOf
  // field must be provided.
  // +optional
  // +oneOf=when
  ExitCodes *ContainerRestartRuleOnExitCodes

  // Other conditions in the future:
  // OOMKill *ContainerRestartRuleConditionOOMKill
  // RestartTimes *ContainerRestartRuleConditionRestartTimes
  // Exit *ContainerRestartRuleConditionExit
}

type ContainerRestartRuleAction string

const (
  ContainerRestartRuleActionRestart ContainerRestartRuleAction = "Restart"

  // Future actions: "Complete", "TerminatePod", "RestartPod".
)

// ContainerRestartRuleOnExitCodes describes the condition
// for handling an exited container based on its exit codes.
type ContainerRestartRuleOnExitCodes struct {
  // Represents the relationship between the container exit code(s) and the
  // specified values. Possible values are:
  //
  // - In: the requirement is satisfied if the container exit code is in the
  //   set of specified values.
  // - NotIn: the requirement is satisfied if the container exit code is
  //   not in the set of specified values.
  // +required
  Operator ContainerRestartRuleOnExitCodesOperator

  // Specifies the set of values to check for container exit codes.
  // At most 255 elements are allowed.
  Values []int32
}

type ContainerRestartRuleOnExitCodesOperator string

const (
  ContainerRestartRuleOnExitCodesOpIn ContainerRestartRuleOnExitCodesOperator = "In"
  ContainerRestartRuleOnExitCodesOpNotIn ContainerRestartRuleOnExitCodesOperator = "NotIn"
)
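
The evaluation order described above (first matching rule wins, otherwise fall back to the container’s restartPolicy) can be sketched as follows. This is an illustration under simplified, hypothetical types (Rule, shouldRestart), not the kubelet implementation; the action is left implicit because "Restart" is the only one this KEP supports.

```go
package main

import "fmt"

type RestartPolicy string
type Operator string

const (
	Always    RestartPolicy = "Always"
	Never     RestartPolicy = "Never"
	OnFailure RestartPolicy = "OnFailure"

	OpIn    Operator = "In"
	OpNotIn Operator = "NotIn"
)

// Rule is a simplified stand-in for ContainerRestartRule with an exit-code
// condition; its action is always "Restart".
type Rule struct {
	Operator Operator
	Values   []int32
}

// matches reports whether the exit code satisfies the rule's condition.
func (r Rule) matches(exitCode int32) bool {
	found := false
	for _, v := range r.Values {
		if v == exitCode {
			found = true
			break
		}
	}
	switch r.Operator {
	case OpIn:
		return found
	case OpNotIn:
		return !found
	}
	return false
}

// shouldRestart evaluates the rules in order. The first matching rule
// decides (always "Restart" here); if none matches, the container-level
// restartPolicy is applied.
func shouldRestart(policy RestartPolicy, rules []Rule, exitCode int32) bool {
	for _, r := range rules {
		if r.matches(exitCode) {
			return true
		}
	}
	switch policy {
	case Always:
		return true
	case OnFailure:
		return exitCode != 0
	default:
		return false
	}
}

func main() {
	rules := []Rule{{Operator: OpIn, Values: []int32{42}}}
	fmt.Println("Never, exit 42:", shouldRestart(Never, rules, 42))
	fmt.Println("Never, exit 1: ", shouldRestart(Never, rules, 1))
	fmt.Println("Never, exit 0: ", shouldRestart(Never, rules, 0))
}
```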

To restart a container only when it exits with code 42, specify the following in the Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
    # restartPolicy must be specified to specify restartPolicyRules
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      when:
        exitCodes:
          operator: In
          values: [42]

Below is an example of the shape of the API for future improvements. NOT all actions and conditions shown are included in this KEP.

To deploy a pod with

an init container that should be retried up to 10 times,
a sidecar container,
a regular container that should only be restarted on exit code 42, and
a regular (keystone) container, the exit of which should fail the pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  restartPolicy: Never
  initContainers:
  - name: retry-init
    image: xxx
    restartPolicy: Never # This needs to be specified because restart rules are specified.
    restartPolicyRules:
    - action: Complete
      when:
        restartTimes: 10
    - action: Restart
      when:
        exitCodes:
          operator: NotIn
          values: [0]
  - name: sidecar
    image: xxx
    restartPolicy: Always # Indicates a sidecar container
  containers:
  - name: regular-container
    image: xxx
    restartPolicy: Never
    restartPolicyRules:
    - action: Restart
      when:
        exitCodes:
          operator: In
          values: [42]
  - name: keystone-container
    image: xxx
    restartPolicyRules:
    - action: TerminatePod
      when:
        exitCodes:
          operator: NotIn
          values: []

The proposal is to support the following combinations:

The action can only be Restart;
Only onExitCodes rules are allowed, no other conditions;
The operator can be either In or NotIn;
Values only support an array of integers and no wildcard.
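
The constraints above, together with the limits quoted in the API comments (at most 20 rules, at most 255 values), could be checked roughly as in the sketch below. The types and the validateRules helper are hypothetical simplifications, not the actual validation code in kube-apiserver.

```go
package main

import "fmt"

// Simplified stand-ins for the proposed API types.
type ExitCodes struct {
	Operator string
	Values   []int32
}

type Rule struct {
	Action    string
	ExitCodes *ExitCodes
}

// Limits quoted from the API comments in this KEP.
const (
	maxRules  = 20
	maxValues = 255
)

// validateRules returns one message per violated constraint.
func validateRules(restartPolicy string, rules []Rule) []string {
	var errs []string
	if len(rules) == 0 {
		return errs
	}
	if restartPolicy == "" {
		errs = append(errs, "restartPolicy must be set when restartPolicyRules is set")
	}
	if len(rules) > maxRules {
		errs = append(errs, fmt.Sprintf("at most %d rules are allowed", maxRules))
	}
	for i, r := range rules {
		if r.Action != "Restart" {
			errs = append(errs, fmt.Sprintf("rules[%d]: unsupported action %q", i, r.Action))
		}
		if r.ExitCodes == nil {
			errs = append(errs, fmt.Sprintf("rules[%d]: exitCodes is required", i))
			continue
		}
		if op := r.ExitCodes.Operator; op != "In" && op != "NotIn" {
			errs = append(errs, fmt.Sprintf("rules[%d]: unsupported operator %q", i, op))
		}
		if len(r.ExitCodes.Values) > maxValues {
			errs = append(errs, fmt.Sprintf("rules[%d]: at most %d values are allowed", i, maxValues))
		}
	}
	return errs
}

func main() {
	rules := []Rule{
		{Action: "Restart", ExitCodes: &ExitCodes{Operator: "In", Values: []int32{42}}},
		{Action: "TerminatePod"}, // a future action, rejected by this sketch
	}
	for _, e := range validateRules("Never", rules) {
		fmt.Println(e)
	}
}
```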

With the limitations above, the API has no effect on containers with pod-level and container-level restartPolicy=Always, because the only action is Restart. The same holds for containers with pod-level and container-level restartPolicy=OnFailure, except that exit code 0 can be configured as restartable, which is effectively the same as restartPolicy=Always.

For containers with restartPolicy=Never, it allows restarting the container for a subset of exit codes. The sync and restart logic will be implemented in k8s.io/kubelet/container.

Similarly for sidecar init containers with restartPolicy=Always, setting restartPolicyRules has no effect because the only action is Restart.

This API change is only intended to restart the container if the container itself exited with one of the given exit codes. It is not intended to change the behavior of other mechanisms that cause a container to be restarted, for example, pod resize or pod restart.

See more discussion on how this API interacts with other components like the Job controller in Notes/Constraints/Caveats (Optional).

This API will support regular “app” containers as discussed above.

For init containers, this API repeatedly restarts the init container if it fails with an exit code specified in the restartPolicyRules, until it succeeds (exit=0). For Pods with restartPolicy=Never, the restartPolicyRules override the Pod-level policy: if the container exits with a code specified in the restartPolicyRules, the kubelet restarts it until it succeeds (exit=0) or fails (exits with a code not in the restartPolicyRules).
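
As a rough illustration of this init container behavior, the sketch below replays a sequence of observed exit codes for one init container in a Pod with restartPolicy=Never. The initOutcome helper and its inputs are hypothetical and only model the decision, not the kubelet's sync loop or backoff.

```go
package main

import "fmt"

// initOutcome replays successive exit codes of a single init container.
// restartOn holds the exit codes listed in the container's restart rules.
// It returns how many restarts happened and whether the container succeeded.
func initOutcome(exits []int32, restartOn map[int32]bool) (restarts int, succeeded bool) {
	for _, code := range exits {
		if code == 0 {
			return restarts, true // success: the next container may start
		}
		if !restartOn[code] {
			return restarts, false // not covered by the rules: the Pod fails
		}
		restarts++ // covered by the rules: restarted in place
	}
	// No more observed exits: the container has not succeeded yet.
	return restarts, false
}

func main() {
	rules := map[int32]bool{42: true}

	r, ok := initOutcome([]int32{42, 42, 0}, rules)
	fmt.Printf("exits 42,42,0: restarts=%d succeeded=%v\n", r, ok)

	r, ok = initOutcome([]int32{42, 1}, rules)
	fmt.Printf("exits 42,1:    restarts=%d succeeded=%v\n", r, ok)
}
```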

For sidecar containers, this API effectively has no effect, because the sidecar container is always restarted.

For ephemeral containers, this API is not allowed, because restarting the ephemeral containers is not meaningful.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/apis/core
k8s.io/apis/core/v1/validations
k8s.io/features
k8s.io/kubelet
k8s.io/kubelet/container

Integration tests

Unit and E2E tests provide sufficient coverage for the feature. Integration tests may be added to cover any gaps that are discovered in the future.

e2e tests

Verify that containers can specify restartPolicyRules.
Verify that containers that exit with exit codes specified in the restartPolicyRules are restarted and the pod keeps running.
Verify that containers that exit with exit codes not specified in the restartPolicyRules are not restarted and the pod fails.
Verify that PodFailurePolicy works with the restartPolicyRules; containers restarted by the restartPolicyRules should not fail the Pod or trigger PodFailurePolicy.

E2E tests:

Graduation Criteria

Alpha

Container restart policy added to the API.
Container restart policy implemented behind a feature flag.
Initial e2e tests completed and enabled.
Public documentation on pod restart policy is updated to distinguish between pod restart policy, container restart policy, and container restart rules.

Beta

Container restart policy functionality running behind feature flag for at least one release.
Container restart policy runs well with Job controller.
All monitoring requirements completed.
All testing requirements completed.
All known pre-release issues and gaps resolved.

GA

No major bugs reported for three months.
User feedback (ideally from at least two distinct users) is green.

Upgrade / Downgrade Strategy

API server should be upgraded before Kubelets. Kubelets should be downgraded before the API server.

Version Skew Strategy

Older kubelets that are unaware of the container restart policy will ignore this field and keep the existing behavior determined by the Pod’s restart policy.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ContainerRestartRules
- Components depending on the feature gate: kubelet, kube-apiserver

Does enabling the feature change any default behavior?

No. The feature introduces a new API field, restartPolicyRules, to the container spec. If this field is not specified, the existing behavior determined by the Pod’s restartPolicy remains unchanged.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. To roll back, the feature gate should be disabled in the API server and kubelets, and they should be restarted.

If a Pod was created with a container-level restart policy and/or restartPolicyRules while the feature gate was enabled, and the feature gate is later disabled, the container-level restart policy and rules will persist, but they will have no effect and will be ignored by the kubelet.

Once the feature is disabled, Pods cannot be created with a container-level restart policy, except for Sidecar init containers with restart policy Always. Restart policy rules on newly created Pods will be silently dropped.

What happens if we reenable the feature if it was previously rolled back?

If the feature is re-enabled, the kubelet will once again recognize and enforce the restartPolicyRules for any Pods that have the field defined. The container restart logic described in the KEP will become active again.

Are there any tests for feature enablement/disablement?

Unit test for the API’s validation with the feature enabled and disabled:
- See https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/apis/core/validation/validation_test.go#L29065 and https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/apis/core/validation/validation_test.go#L9357
Unit test for the kubelet with the feature enabled
- See https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/kubelet/kubelet_test.go#L2476, https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/kubelet/kubelet_pods_test.go#L3302, and https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/kubelet/kuberuntime/kuberuntime_manager_test.go#L2112
Unit test for the new field on the Pod API. First, enable the feature gate and create a Pod with a container that includes restart rules; validation should pass and the Pod API should match the expected result. Second, disable the feature gate; validation of the Pod API should still pass and match the expected result. Lastly, re-enable the feature gate; validation should pass and match the expected result. This is achieved through the ValidationOptions: if the podSpec contains a restart policy or the feature gate is enabled, AllowContainerRestartPolicyRules is true. See https://github.com/kubernetes/kubernetes/blob/9630ab9581afbac9835d53f9e620a1240a1d2d91/pkg/api/pod/util_test.go#L5965
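
The ValidationOptions behavior described in the last item can be sketched as below. Only the option name AllowContainerRestartPolicyRules comes from the text above; the Container type and the helpers usesContainerRestartPolicy and validationOptions are hypothetical simplifications, not the code in pkg/api/pod.

```go
package main

import "fmt"

// Container is a simplified stand-in; rule contents are irrelevant here.
type Container struct {
	RestartPolicy      string
	RestartPolicyRules []string
}

type PodValidationOptions struct {
	AllowContainerRestartPolicyRules bool
}

// usesContainerRestartPolicy reports whether an already stored pod spec
// sets a container-level restart policy or restart rules.
func usesContainerRestartPolicy(containers []Container) bool {
	for _, c := range containers {
		if c.RestartPolicy != "" || len(c.RestartPolicyRules) > 0 {
			return true
		}
	}
	return false
}

// validationOptions allows the fields when the gate is on, or when the
// existing object already uses them, so stored Pods keep validating after
// the gate is turned off.
func validationOptions(gateEnabled bool, existing []Container) PodValidationOptions {
	return PodValidationOptions{
		AllowContainerRestartPolicyRules: gateEnabled || usesContainerRestartPolicy(existing),
	}
}

func main() {
	stored := []Container{{RestartPolicy: "Never", RestartPolicyRules: []string{"restart-on-42"}}}

	fmt.Println("gate on, new Pod:    ", validationOptions(true, nil).AllowContainerRestartPolicyRules)
	fmt.Println("gate off, stored Pod:", validationOptions(false, stored).AllowContainerRestartPolicyRules)
	fmt.Println("gate off, new Pod:   ", validationOptions(false, nil).AllowContainerRestartPolicyRules)
}
```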

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If this feature is being actively used in a cluster that has this feature partially enabled on some nodes, the Pod may behave differently on exit. Pods on nodes with this feature may restart in-place, while pods on nodes without this feature may not be restarted.

What specific metrics should inform a rollback?

Repeated restarts of containers or pods.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Manual testing was performed to verify the upgrade and rollback paths.

Upgrade: A cluster with the feature disabled was upgraded to a version with the feature enabled. Pods with container-level restartPolicy and restartPolicyRules were deployed and observed to behave as expected.
Rollback: A cluster with the feature enabled was rolled back to a version with the feature disabled. Previously created pods continued to run and have the container-level restartPolicy and restartPolicyRules, but these fields were ignored. New Pods cannot be created with container-level restartPolicy, and restartPolicyRules are dropped silently.
Upgrade->Downgrade->Upgrade: This path was tested by performing the above steps sequentially. The feature behaved as expected at each stage, with restartPolicyRules being respected when the feature was enabled and ignored when disabled.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Operators can determine if the feature is in use by checking the Pod spec for the presence of the restartPolicyRules field within container definitions. Operators can track the ContainerStatus.RestartCount to see how many times the container has restarted.

Additionally, monitoring the kube_pod_container_status_restarts_total metric can indicate container restarts that might be governed by these rules.
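
As one possible way to perform the Pod spec check, the sketch below scans a Pod's JSON (for example, as returned by the API server) and lists the containers that define restartPolicyRules. The helper name and the trimmed-down structs are hypothetical; only the fields needed for the check are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container decodes only the fields needed for the check.
type container struct {
	Name               string            `json:"name"`
	RestartPolicyRules []json.RawMessage `json:"restartPolicyRules"`
}

type pod struct {
	Spec struct {
		InitContainers []container `json:"initContainers"`
		Containers     []container `json:"containers"`
	} `json:"spec"`
}

// containersWithRestartRules returns the names of all init and regular
// containers in the Pod that define at least one restart rule.
func containersWithRestartRules(podJSON []byte) ([]string, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return nil, err
	}
	var names []string
	for _, group := range [][]container{p.Spec.InitContainers, p.Spec.Containers} {
		for _, c := range group {
			if len(c.RestartPolicyRules) > 0 {
				names = append(names, c.Name)
			}
		}
	}
	return names, nil
}

func main() {
	sample := []byte(`{"spec":{"containers":[
		{"name":"trainer","restartPolicy":"Never","restartPolicyRules":[{"action":"Restart"}]},
		{"name":"logger"}]}}`)

	names, err := containersWithRestartRules(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("containers using restart rules:", names)
}
```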

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Other field: ContainerStatuses
  - Container statuses will have the history of the container restarts.
Other (treat as last resort)
- Details: The metric kube_pod_container_status_restarts_total will show the total count of container restarts.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The rate of unexpected container restarts (i.e., restarts not matching any of the restartPolicyRules) should remain below 1%.
The time taken for a container to restart after an exit code matching restartPolicyRules should be within typical container restart latencies, accounting for exponential backoff.
Kubelet SLOs should not be impacted.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: kube_pod_container_status_restarts_total
- Aggregation method: Sum over time, grouped by container and pod.
- Components exposing the metric: kube-state-metrics
Other (treat as last resort)
- Details: PodStatus API will also have a full history of containers restarted in ContainerStatuses field. Containers restarted by RestartPolicyRules will be included in the statuses history.

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

Enabling this feature will introduce a new field, restartPolicyRules, on the Container API type.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

The size of the Container API type will increase. A rule can hold at most 256 int32 exit values plus the operator name (“In” or “NotIn”), so it adds at most 1029 B (256 × 4 B for the values plus 5 B for “NotIn”).

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The container will keep running or be restarted by the kubelet. Deletion of the pod / container may be delayed.

What are other known failure modes?

If kubelet becomes unavailable or is being restarted, there might be delays in container restarts.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

1.34: Implemented in Alpha
- https://github.com/kubernetes/kubernetes/pull/132642
- https://github.com/kubernetes/kubernetes/pull/133243

Drawbacks

Alternatives

Wrapping entrypoint

One way to implement this KEP as a DIY solution is to wrap the entrypoint of the container with a program that implements the exit code handling policy. This solution does not scale well, because the wrapper must work on multiple operating systems across many images, so it is hard to implement universally.

Non-declarative (callbacks based) restart policy

An alternative to the declarative failure policy is an approach that allows containers to dynamically decide their fate. For example, a callback is invoked on an “orchestration container” in a Pod when any other container has failed, and the “orchestration container” may decide the fate of that container: restart it or keep it as failed.

This may be a possibility long term, but even then, both approaches can work in conjunction.