KEP-3085: Pod networking ready condition

Status: Beta (Implementable)
Created: 2021-12-14
Latest: v1.36
Milestones: Alpha v1.28, Beta v1.29
Owning SIG: SIG Node

KEP-3085: Pod Conditions for Starting and Completion of Sandbox Creation

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Readiness to start the containers in a pod, marked by successful pod sandbox creation, is a critical phase in a pod’s lifecycle that the kubelet orchestrates across multiple components: in-tree volume plugins (ConfigMap, Secret, EmptyDir, etc.), CSI plugins and the container runtime (which in turn invokes a runtime handler and CNI plugins). Completion of all these phases puts the pod sandbox in a state where the containers in a pod can be started. This KEP proposes a PodReadyToStartContainers condition in pod status to indicate that a pod has reached a state where its containers are ready to be started. The PodReadyToStartContainers condition will mark an important milestone in the pod’s lifecycle, similar to ContainersReady and the overall Ready conditions in pod status today. An alternate name like SandboxReady is avoided since Kubernetes does not directly surface low level sandbox related concepts to all users.

Motivation

Today, the scheduler surfaces a specific pod condition, PodScheduled, that clearly identifies whether a pod got scheduled by the scheduler and when scheduling completed. However, no specific condition around initialization of successfully scheduled pods, from the perspective of completion of pod sandbox creation, is surfaced to cluster administrators in a scoped and consumable fashion.

There is an existing pod condition: Initialized that tracks execution of init containers. For pods without init containers, the Initialized condition is set when the Kubelet starts to process a pod before any sandbox creation activities start. For pods with init containers, the Initialized condition is set when init containers have been pulled and executed to completion. Therefore, the existing Initialized condition is insufficient and inaccurate for tracking completion of sandbox creation and readiness to start containers for all pods in a cluster. This distinction becomes especially relevant in multi-tenant clusters where individual tenants own the pod specs (including the set of init containers) while the cluster administrators are in charge of storage plugins, networking plugins and container runtime handlers.

The Kubelet can start to launch the containers specified in a pod immediately after pod sandbox creation is completed successfully. A new dedicated condition marking the successful creation of the pod sandbox and readiness to start containers, PodReadyToStartContainers, will benefit cluster operators (especially of multi-tenant clusters) who are responsible for configuration and operational aspects of the various components that play a role in pod sandbox creation: CSI plugins, CRI runtime and associated runtime handlers, CNI plugins, etc. The duration between the lastTransitionTime field of the PodReadyToStartContainers condition (with status set to true for a pod for the first time) and that of the existing PodScheduled condition will allow metrics collection services to compute the total latency of all the components involved in pod sandbox creation as an SLI. Cluster operators can use this to publish SLOs around pod initialization to their customers who launch workloads on the cluster.

Custom pod controllers/operators can use a dedicated condition indicating completion of pod sandbox creation and readiness to start containers to make better decisions around how to reconcile a pod failing to become ready. As a specific example, a custom controller for managing pods that refer to PVCs associated with node local storage (e.g. Rook-Ceph) may decide to recreate PVCs (based on a specified PVC template in the custom resource the controller is managing) if the sandbox creation is repeatedly failing to complete, indicated by the new PodReadyToStartContainers condition reporting false. Such a controller can leave PVCs intact and only recreate pods if sandbox creation completes successfully (indicated by the new PodReadyToStartContainers condition reporting true) but the pod’s containers fail to become ready. Further details are covered in a user story below.

When a pod’s sandbox no longer exists, the status of the PodReadyToStartContainers condition will be set to false. The duration between a pod’s DeletionTimestamp and the subsequent lastTransitionTime of the PodReadyToStartContainers condition (with status set to false) will indicate the latency of pod termination. This can also be surfaced by metrics collection services as an SLI. Note that surfacing any dedicated conditions around termination of the pod sandbox is unnecessary and beyond the scope of this KEP.
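The termination latency described above can be sketched as a small calculation over a pod object. This is only an illustration in Python, not kubelet code: the function name `termination_latency_seconds` is hypothetical, the pod is assumed to be a plain dictionary shaped like the JSON returned by the API server, and the timestamps are made up for the example.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp such as 2022-12-06T15:33:49Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def termination_latency_seconds(pod):
    """Seconds between the pod's deletion timestamp and the transition of
    PodReadyToStartContainers to False, or None if either is missing.

    Assumption: the caller only passes pods whose sandbox was ready before
    deletion; otherwise the False condition may predate the delete request.
    """
    deleted_at = pod.get("metadata", {}).get("deletionTimestamp")
    if deleted_at is None:
        return None
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodReadyToStartContainers" and cond.get("status") == "False":
            delta = parse_ts(cond["lastTransitionTime"]) - parse_ts(deleted_at)
            return delta.total_seconds()
    return None


# Illustrative pod: deletion timestamp 15:33:47Z, sandbox gone at 15:33:49Z.
pod = {
    "metadata": {"deletionTimestamp": "2022-12-06T15:33:47Z"},
    "status": {"conditions": [
        {"type": "PodReadyToStartContainers", "status": "False",
         "lastTransitionTime": "2022-12-06T15:33:49Z"},
    ]},
}
print(termination_latency_seconds(pod))
```

A metrics collection service would typically feed this value into a histogram rather than print it.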

Individual container creation (including pulling images from a registry) takes place after the successful completion of pod sandbox creation. Updates to pod container status to report latencies associated with creation of individual containers within a pod are beyond the scope of this KEP.

Goals

  • Surface a new pod condition PodReadyToStartContainers to indicate readiness to start containers immediately following the successful completion of pod sandbox creation by Kubelet.
  • Describe how the new pod condition can be consumed by external services to determine state and duration of pod sandbox creation.

Non-Goals

  • Modify the meaning of the existing Initialized condition
  • Specify metrics collection based on the conditions around pod sandbox creation
  • Specify additional conditions (beyond PodReadyToStartContainers with status set to false) to indicate sandbox teardown
  • Surface beginning and completion of creation of individual containers

Proposal

This KEP proposes enhancements to the Kubelet to report the readiness to start containers of a pod (immediately following successful pod sandbox creation) as a new pod condition with type: PodReadyToStartContainers. Metric collection and monitoring services can use the fields associated with the PodReadyToStartContainers condition to report sandbox creation state and latency either at a per-pod cardinality or aggregate the data based on various properties of the pod: number of volumes, storage class of PVCs, runtime class, custom annotations for CNI and IPAM plugins, arbitrary labels and annotations on pods, etc. Certain pod controllers can use the pod sandbox conditions to determine an optimal reconciliation strategy for pods and associated resources (like PVCs).

User Stories (Optional)

User Stories For Consuming PodReadyToStartContainers Condition

Surfacing the readiness to start containers of a pod (immediately following successful pod sandbox creation) as a new pod condition - PodReadyToStartContainers - in pod status can be consumed in different ways:

Story 1: Consuming PodReadyToStartContainers Condition Per Pod In A Monitoring Service

A cluster operator may already depend on a service like Kube State Metrics for monitoring the state of their Kubernetes clusters. The cluster operator may want such a service to surface pod sandbox creation state and latency at a granular level for each pod (due to the ambiguity around Initialized state as described earlier). For this story, we are assuming the service has been enhanced to [1] consume the new PodReadyToStartContainers pod condition as described in this KEP and [2] implement informers and state to distinguish between the first time Kubelet is ready to launch containers in a pod and a subsequent instance of Kubelet being ready to launch containers in a pod (after sandbox destruction) over the lifetime of the pod.

The operator can use PromQL queries to aggregate and analyze data (around pod sandbox creation) based on custom pod labels and annotations (already surfaced by a service like Kube State Metrics) indicating specific workload types across different namespaces. For example, annotations and labels could be used to differentiate pod sandbox creation state and latencies for “sensitive database” workloads, “sensitive analysis” workloads and “untrusted build” workloads each of which maps to pods mounting PVCs from different storage classes (depending on the level of encryption desired), using a specific runtime class (depending on the level of isolation desired - microvm vs runc based) and specific IPAM characteristics around reachability of the pods. Access to the pod labels and annotations along with the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator’s cluster and tenants. The values associated with such labels and annotations may not map to distinct namespaces, existing pod fields or other API object fields in a Kubernetes cluster.

Depending on the metrics and monitoring pipeline, as the cluster scales up, cardinality of data at a per-pod level (surfaced from a service like Kube State Metrics) may lead to excessive load on a monitoring backend like Prometheus. At such a point, the cluster operator may decide to create and deploy their own custom monitoring service that uses a pod informer and aggregates (based on custom pod labels and annotations) state and latency of pod sandbox creation into a histogram which is ultimately reported to Prometheus. As with the previous approach, access to the pod labels and annotations and the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator’s cluster and tenants and may not map to distinct namespaces, pod fields or other API object fields in the cluster.

The data from the above monitoring services can be used as SLIs with associated SLOs configured around sandbox creation state and latency (besides other metrics like scheduling latency) for each specific workload type depending on specific user requirements such as: desired encryption of persistent data (if any), runtime isolation and network reachability (governed by different IPAM plugins).

Story 2: Consuming PodReadyToStartContainers Condition In A Controller

A controller managing a set of pods along with associated resources like networking configuration, storage or arbitrary dynamic resources (in the future) can evaluate the PodReadyToStartContainers condition to optimize the set of actions the controller needs to execute when bringing up pods and encountering failures in the process. Depending on whether the pod sandbox is ready to start containers, the controller may decide to destroy and re-create the associated resources that are required for the sandbox creation to complete (to start containers) or simply try to re-create the pod while keeping the resources intact.

A specific example of the above would be a controller for stateful application pods that mount PVCs that bind to node local PVs. Let’s assume the stateful application has built-in data replication capabilities and the controller supports PVC templates to dynamically generate PVCs. When trying to bring up fresh pods (after earlier pods got terminated), there could be a problem with the CSI plugin that mounts the node local PV into the pod. In such a situation, the sandbox creation will not complete. Based on the PodReadyToStartContainers condition, the controller may decide to create a fresh PVC. If sandbox creation does complete successfully (marked by PodReadyToStartContainers reporting true) but the pod fails to enter a Ready state, the controller will retain the PVC (to avoid any data replication) and only try to recreate the pod. Having access to the new PodReadyToStartContainers condition allows the controller to optimize its reconciliation strategy and realize the desired state more efficiently.
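The decision the controller makes in this story could look roughly like the Python sketch below. The action names (`recreate-pvc-and-pod`, `recreate-pod-only`, `none`) and both function names are hypothetical and chosen only for illustration. A real controller would also wait for some timeout before acting on a false condition, which is omitted here.

```python
def condition_status(pod_status, cond_type):
    """Return the status string of a condition type, or "Unknown" if absent."""
    for cond in pod_status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return "Unknown"


def reconcile_action(pod_status):
    """Pick a reconciliation action for a pod that has not become ready.

    Hypothetical policy: if the sandbox never became ready, the storage is
    suspect, so recreate the PVC and the pod. If the sandbox is ready but the
    pod is not Ready, keep the PVC and recreate only the pod.
    """
    if condition_status(pod_status, "PodReadyToStartContainers") != "True":
        return "recreate-pvc-and-pod"
    if condition_status(pod_status, "Ready") != "True":
        return "recreate-pod-only"
    return "none"


sandbox_failing = {"conditions": [
    {"type": "PodScheduled", "status": "True"},
    {"type": "PodReadyToStartContainers", "status": "False"},
]}
containers_failing = {"conditions": [
    {"type": "PodReadyToStartContainers", "status": "True"},
    {"type": "Ready", "status": "False"},
]}
print(reconcile_action(sandbox_failing))
print(reconcile_action(containers_failing))
```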

PodReadyToStartContainers Condition Fields In Different User Scenarios

In each of the scenarios below, nearly identical PodReadyToStartContainers conditions that would result from different scenarios/problems are grouped together. The unique scenarios are detailed after describing the values associated with the fields of the PodReadyToStartContainers condition. To make each scenario concrete, a specific set of timestamps in the future is chosen. The PodScheduled condition is mentioned in the stories but conditions after pod sandbox creation (e.g. Initialized and Ready) are skipped. A service monitoring latency of initial pod sandbox creation is assumed to implement a pod informer and appropriate state to distinguish between the first time a pod sandbox becomes ready to start containers versus a subsequent instance of readiness over the lifetime of the pod.

Scenario 1: Stateless pod scheduled on a healthy node and cluster

A user launches a simple, stateless runc based pod with no init containers in a healthy cluster. The pod gets successfully scheduled at 2022-12-06T15:33:46Z and pod sandbox is ready after three seconds at 2022-12-06T15:33:49Z.

The pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T15:33:50Z (after pod sandbox creation is complete and containers are ready to start):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring latency of initial pod sandbox creation will record a latency of three seconds in this scenario based on the delta between lastTransitionTime timestamp associated with PodReadyToStartContainers and PodScheduled conditions.
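As a minimal sketch of that calculation, assuming the pod status is available as a plain dictionary shaped like the YAML above (the helper names `parse_ts`, `find_condition` and `sandbox_creation_latency` are hypothetical):

```python
from datetime import datetime, timezone


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp as used in pod conditions."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_condition(conditions, cond_type, status):
    """Return the first condition with the given type and status, or None."""
    for cond in conditions:
        if cond.get("type") == cond_type and cond.get("status") == status:
            return cond
    return None


def sandbox_creation_latency(pod_status):
    """Seconds from PodScheduled=True to PodReadyToStartContainers=True,
    or None while the sandbox is not ready yet."""
    conditions = pod_status.get("conditions", [])
    scheduled = find_condition(conditions, "PodScheduled", "True")
    ready = find_condition(conditions, "PodReadyToStartContainers", "True")
    if scheduled is None or ready is None:
        return None
    delta = parse_ts(ready["lastTransitionTime"]) - parse_ts(scheduled["lastTransitionTime"])
    return delta.total_seconds()


# Pod status from Scenario 1 after sandbox creation has completed.
scenario_1 = {"conditions": [
    {"type": "PodReadyToStartContainers", "status": "True",
     "lastTransitionTime": "2022-12-06T15:33:49Z"},
    {"type": "PodScheduled", "status": "True",
     "lastTransitionTime": "2022-12-06T15:33:46Z"},
]}
print(sandbox_creation_latency(scenario_1))  # 3.0
```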

Scenario 2: Pods with startup delays due to problems with CSI, CNI or Runtime Handler plugins

In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in a ten-second delay before pod sandbox creation completes, after which containers can be started. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z while the pod sandbox is created and containers are ready to start after ten seconds at 2022-12-06T15:33:56Z.

For each scenario below, the pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod and the pod sandbox creation has started but not completed):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

For each scenario, the pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (after pod sandbox is ready - and containers are ready to start - after ten seconds):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:56Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring duration of pod sandbox creation (marked by readiness to start containers) will record a latency of ten seconds in these scenarios based on the delta between lastTransitionTime timestamps associated with PodReadyToStartContainers and PodScheduled conditions with status set to true. For each observation associated with a scenario below, the monitoring service also associates a label with the metric indicating the RuntimeClass of the pods and the StorageClass of PVCs referenced by the pod. This enables further grouping of the data during analysis.

A cluster-wide SLO around initial pod sandbox creation latencies configured with a threshold of less than ten seconds will record a breach in these scenarios. Further analysis of the metrics based on labels indicating the RuntimeClass of the pods and the StorageClass of PVCs referenced by the pod will enable the cluster administrators to isolate the cause of the breaches to specific infrastructure plugins as detailed below.
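The grouping step can be sketched as follows. The observation records, the label keys (`runtime_class`, `storage_class`) and their values are hypothetical. A real pipeline would more likely express this as a query against the monitoring backend than as application code.

```python
from collections import defaultdict

# SLO from the text: sandbox creation should take less than ten seconds.
SLO_THRESHOLD_SECONDS = 10.0


def breaches_by_label(observations, label, threshold=SLO_THRESHOLD_SECONDS):
    """Count observations whose latency is not below the threshold,
    grouped by the value of one label."""
    counts = defaultdict(int)
    for obs in observations:
        if obs["latency_seconds"] >= threshold:
            counts[obs["labels"].get(label, "<none>")] += 1
    return dict(counts)


# Hypothetical observations: one healthy pod and two delayed pods.
observations = [
    {"latency_seconds": 3.0,
     "labels": {"runtime_class": "runc", "storage_class": "none"}},
    {"latency_seconds": 10.0,
     "labels": {"runtime_class": "runc", "storage_class": "csi-encrypted"}},
    {"latency_seconds": 10.0,
     "labels": {"runtime_class": "microvm", "storage_class": "none"}},
]
print(breaches_by_label(observations, "runtime_class"))
print(breaches_by_label(observations, "storage_class"))
```

Looking at breaches per label value is what lets an administrator narrow a cluster-wide breach down to, for example, one storage class or one runtime handler.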

Stateful pod encountering sandbox creation delays from attaching a PV backed by a CSI plugin

A stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. This results in several retries that ultimately succeed after nine seconds.

Stateless pod encountering sandbox creation delays from allocating IP from a CNI/IPAM plugin

A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to a spike of load in the IP allocation service, the CNI plugin times out several times but ultimately succeeds in getting an IP address and configuring the pod network after nine seconds.

Stateless pod encountering sandbox creation delays from microvm based sandbox initialization

A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel multiple times but ultimately initializes the virtual machine based sandbox environment successfully after nine seconds.

Scenario 3: Pod unable to start due to problems with CSI, CNI or Runtime Handler plugins

In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in pod sandbox creation never completing and the pod never being ready to start containers. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, but pod sandbox creation runs into problems that do not eventually resolve and result in repeated failures as kubelet tries to start the pod.

For each scenario below, the pod will report the following conditions in pod status at all times after 2022-12-06T15:33:47Z (after pod sandbox creation started until the pod is deleted manually or by a controller):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring state of pod sandbox creation will record a metric indicating failure to create pod sandbox beyond a configured duration.
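One way such a check could be written is sketched below, assuming the service polls pod status and supplies its own notion of "now" and of the configured duration. The function name `sandbox_creation_stalled` is hypothetical. The sketch does not exclude terminated pods, which also report the condition as false, so a real service would need to filter those first.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp as used in pod conditions."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sandbox_creation_stalled(pod_status, now, timeout):
    """True if PodReadyToStartContainers has been False for longer than timeout."""
    for cond in pod_status.get("conditions", []):
        if cond.get("type") == "PodReadyToStartContainers" and cond.get("status") == "False":
            return now - parse_ts(cond["lastTransitionTime"]) > timeout
    return False


# Pod status reported in this scenario at all times after 15:33:47Z.
stuck = {"conditions": [
    {"type": "PodReadyToStartContainers", "status": "False",
     "lastTransitionTime": "2022-12-06T15:33:47Z"},
    {"type": "PodScheduled", "status": "True",
     "lastTransitionTime": "2022-12-06T15:33:46Z"},
]}
timeout = timedelta(minutes=2)  # hypothetical configured duration
shortly_after = datetime(2022, 12, 6, 15, 34, 0, tzinfo=timezone.utc)
much_later = datetime(2022, 12, 6, 15, 38, 47, tzinfo=timezone.utc)
print(sandbox_creation_stalled(stuck, shortly_after, timeout))
print(sandbox_creation_stalled(stuck, much_later, timeout))
```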

A cluster-wide SLO around success rate of pod sandbox creation may record a breach due to the pod sandbox creation failures. Further analysis of the metrics aggregated based on labels (associated with the metrics) indicating the RuntimeClass of the pods and the StorageClass of PVCs referenced by the pod will enable the cluster administrators to associate the failures with specific infrastructure plugins as detailed below.

Stateful pod encountering sandbox creation failures when attaching a PV backed by a CSI plugin

A stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. The failure to attach never resolves, thus blocking pod sandbox creation.

Stateless pod encountering sandbox creation failures when allocating IP from a CNI/IPAM plugin

A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to problems in the IP allocation service, the CNI plugin fails to get an IP address and is unable to configure the pod network. This blocks pod sandbox creation.

Stateless pod encountering sandbox creation failures from microvm based sandbox initialization

A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel repeatedly and is unable to initialize the virtual machine based sandbox environment.

Scenario 4: Pod Sandbox restart after a successful initial startup and crash

In each of the scenarios under this section, a pod sandbox is successfully created but eventually gets destroyed due to problems in the host or the sandbox environment. As a result, the pod sandbox has to be re-created (and pod networking reconfigured) by Kubelet in coordination with the CRI runtime. In each scenario, the pod is successfully scheduled at 2022-12-06T15:33:46Z and the pod sandbox is ready after 5 seconds. The sandbox is destroyed after two hours. Re-creation of the sandbox runs into problems but eventually succeeds after nine seconds.

The pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (a few seconds after the initial pod sandbox is ready):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:52Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T17:33:46Z (right after pod sandbox is destroyed):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:46Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T17:34:00Z (a few seconds after the new pod sandbox is ready):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:52Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled

A service monitoring restarts associated with successfully created pod sandboxes will record a restart in these scenarios. A service measuring initial pod sandbox creation latency will need to implement logic (for example, using pod informers and state) to differentiate the initial pod sandbox creation from later pod sandbox creations resulting from node crashes/reboots or sandbox crashes.
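A rough sketch of the restart detection follows, assuming the service sees an ordered sequence of status values for the PodReadyToStartContainers condition of a single pod. The function name `count_sandbox_restarts` is hypothetical, and a real informer may miss intermediate states, which this sketch does not account for.

```python
def count_sandbox_restarts(statuses):
    """Count how many times the condition returned to "True" after having
    been "True" earlier and then "False" (a True -> False -> True cycle)."""
    restarts = 0
    seen_ready = False     # the sandbox has been ready at least once
    lost_sandbox = False   # the sandbox went away after having been ready
    for status in statuses:
        if status == "True":
            if seen_ready and lost_sandbox:
                restarts += 1
            seen_ready = True
            lost_sandbox = False
        elif status == "False" and seen_ready:
            lost_sandbox = True
    return restarts


# Sequence observed in this scenario: creating, ready, destroyed, ready again.
print(count_sandbox_restarts(["False", "True", "False", "True"]))
# A pod that starts once and is then terminated has no restart.
print(count_sandbox_restarts(["False", "True", "False"]))
```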

Node crash

A regular runc based pod is scheduled on a node whose kernel crashes two hours after the pod sandbox was created successfully. The node restarts quickly (resulting in no pod evictions) and kubelet has to re-create the pod sandbox.

Sandbox crash

A pod is configured with a microvm based runtime handler. The virtual machine sandbox is created successfully but suffers a crash due to problems with the guest kernel two hours after pod creation. As a result, kubelet has to re-create the pod sandbox (and reconfigure pod networking).

Scenario 5: Graceful pod sandbox termination

A user launches a pod that runs successfully but is eventually deleted by a controller after several hours. The pod was scheduled at 2022-12-06T12:33:46Z and the sandbox became ready at 2022-12-06T12:33:48Z. The delete request is invoked at 2022-12-06T15:33:47Z and the pod is terminated by Kubelet at 2022-12-06T15:33:49Z.

The pod will report the following conditions in pod status at 2022-12-06T15:33:46Z (right before the pod delete request is invoked):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:48Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled

The pod will report the following conditions in pod status at 2022-12-06T15:33:49Z (right after the pod termination has been processed by Kubelet but the pod is yet to be completely deleted from API server):

status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled

Scenario 6: Volume mounting issues

Volume failures are another area where the PodReadyToStartContainers condition helps users deduce what went wrong. The following cases illustrate this.

Mounting Volumes from missing Secrets

A pod is configured to mount a volume from a secret. The corresponding secret is missing so the volume mount fails.

Mounting Volumes from missing ConfigMaps

A pod is configured to mount a volume from a ConfigMap. The corresponding ConfigMap is missing so the volume mount fails.

In both of these cases, it is possible to deduce a failure, but you would have to view the events. Without this condition, the latest status is ContainerCreating and no conditions help to distinguish this issue.

Events are best-effort, so they are not guaranteed to be delivered. In many cases, these events may be missed, leaving users confused about why their containers are stuck. The PodReadyToStartContainers condition reflects the failure in these cases. Both of these cases fail in the volume initialization phase, so PodReadyToStartContainers would be set to False, notifying users that their pods had an issue.

Notes/Constraints/Caveats (Optional)

A monitoring service measuring duration of initial sandbox creation of a pod (based on readiness to launch containers in the pod) should differentiate between the initial and subsequent sandbox creations (if any due to node crash/sandbox crash) and track them separately. This can be achieved using a pod informer whose event handler stores (in a persistent store or as custom annotations on the pod) the lastTransitionTime field for PodReadyToStartContainers condition observed when it had status = true for the first time. Later, if the pod sandbox is recreated, the lastTransitionTime for the condition to indicate readiness to start containers can be differentiated from the initial readiness to start containers based on whether the initial data exists (either in the persistent store or pod annotations).
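The bookkeeping described above might be sketched as follows. The class name `FirstReadyTracker` is hypothetical. An in-memory dictionary stands in for the persistent store or pod annotations mentioned in the text, and the method would be called from a pod informer's event handler.

```python
class FirstReadyTracker:
    """Remembers, per pod UID, the lastTransitionTime observed the first time
    PodReadyToStartContainers had status True."""

    def __init__(self):
        # Stand-in for a persistent store or custom pod annotations.
        self._first_ready = {}

    def observe(self, pod_uid, pod_status):
        """Classify an observed pod status.

        Returns "initial" if the True condition belongs to the first sandbox
        creation, "subsequent" if it belongs to a later one, and None if the
        condition is not currently True.
        """
        for cond in pod_status.get("conditions", []):
            if cond.get("type") == "PodReadyToStartContainers" and cond.get("status") == "True":
                ts = cond["lastTransitionTime"]
                first = self._first_ready.setdefault(pod_uid, ts)
                return "initial" if first == ts else "subsequent"
        return None


def status(value, ts):
    """Build a minimal pod status with a single condition (helper for the demo)."""
    return {"conditions": [{"type": "PodReadyToStartContainers",
                            "status": value, "lastTransitionTime": ts}]}


tracker = FirstReadyTracker()
events = [
    status("True", "2022-12-06T15:33:52Z"),   # first sandbox ready
    status("False", "2022-12-06T17:33:46Z"),  # sandbox destroyed
    status("True", "2022-12-06T17:33:52Z"),   # sandbox re-created
]
print([tracker.observe("pod-uid-1", s) for s in events])
```

Only observations classified as "initial" would then be used when computing the latency of initial sandbox creation.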

Measuring duration of sandbox creation accurately beyond the initial sandbox creation (based on readiness to launch containers in the pod) is not possible with the PodReadyToStartContainers condition alone. This is similar to other ready conditions like ContainersReady and overall pod Ready, which get updated after containers are restarted without a specific marker of when the process of restarting the containers or bringing the pod back into a ready state began following an event like a node crash.

When deriving SLOs based on SLIs around state and duration of sandbox creation, user error scenarios should be filtered. In the context of pod sandbox creation, such errors can surface due to:

  • References to a secret or configmap that does not exist and never gets created. As a result, a pod referencing a missing secret or configmap will never go past the volume initialization phase.
  • References to a secret or configmap that gets created at a point of time after the pod gets scheduled. In such scenarios, the volume initialization phase of the pod will be stuck until the referenced secrets/configmaps are created in the cluster.

The metric collection service that generates SLIs can filter pods affected by the above situations by evaluating FailedMount pod events associated with the pod and matching a regular expression of the form MountVolume.SetUp failed for volume "(secret|config-map) .*" : (secret|config-map) ".*" not found.
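A filter along those lines could be sketched as below. The exact wording of kubelet event messages is an assumption here and can differ between versions, so the pattern `MISSING_REF`, the function name `is_user_error` and the sample messages are illustrative only and should be checked against the events actually seen in a cluster.

```python
import re

# Assumed message shape: the volume name in quotes, followed by the kind and
# name of the missing object. Adjust to the messages your kubelet emits.
MISSING_REF = re.compile(
    r'MountVolume\.SetUp failed for volume ".*" : '
    r'(secret|configmap|config-map) ".*" not found'
)


def is_user_error(events):
    """True if any FailedMount event points at a missing secret or configmap."""
    return any(
        event.get("reason") == "FailedMount"
        and MISSING_REF.search(event.get("message", "")) is not None
        for event in events
    )


# Hypothetical events for two pods.
missing_secret = [{
    "reason": "FailedMount",
    "message": 'MountVolume.SetUp failed for volume "creds" : secret "db-creds" not found',
}]
plugin_timeout = [{
    "reason": "FailedMount",
    "message": 'MountVolume.SetUp failed for volume "data" : rpc error: code = DeadlineExceeded',
}]
print(is_user_error(missing_secret))
print(is_user_error(plugin_timeout))
```

Pods for which the function returns true would be excluded from the SLI, since the delay is caused by the pod spec rather than by the infrastructure.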

Risks and Mitigations

The main risk associated with PodReadyToStartContainers is any potential confusion with the existing Initialized condition. Both the existing Initialized condition and the new proposed condition refer to distinct stages in a pod’s overall initialization. Documentation will help mitigate this risk.

Design Details

The Kubelet will set a new condition on a pod: PodReadyToStartContainers to surface that Kubelet is ready to start containers in a pod sandbox immediately following successful completion of sandbox creation for the pod. A new PodConditionType corresponding to PodReadyToStartContainers will be added in api/core/v1/types.go. No changes are required in the Pod Status API for this enhancement.

Determining status of sandbox creation for a pod

Today, syncPod() in Kubelet is invoked with the kubecontainer.PodStatus (distinct from the v1.PodStatus API) associated with a given pod. podSandboxChanged() in kubeGenericRuntimeManager evaluates the SandboxStatuses field in PodStatus to determine whether a new pod sandbox will need to be created for a pod. The same logic will be used to determine whether a sandbox is ready for a pod in the Kubelet status manager.

PodReadyToStartContainers condition details

Kubelet will initially generate the PodReadyToStartContainers condition as part of existing calls to generateAPIPodStatus() early during syncPod(). The status field will be set to true if a sandbox is ready (determined by invoking podSandboxChanged() as described above). The status field will be set to false if a sandbox is found to be not ready.

Kubelet will generate the PodReadyToStartContainers condition for the final time (in the life of a pod) as part of existing calls to generateAPIPodStatus() early during syncTerminatedPod(). Prior invocations of killPod() (as part of syncTerminatingPod) will result in the absence of a sandbox corresponding to the pod. As a result, the status field of the PodReadyToStartContainers condition will be set to false (determined by invoking podSandboxChanged() as described above).

During periods of API server or etcd unavailability combined with a Kubelet restart/crash (covered in more detail below), the lastTransitionTime field of the PodReadyToStartContainers condition that ultimately gets persisted, once Kubelet has restarted and the API server is available again, will be as close as possible to the actual change in the condition (that could not be persisted).

Changes of the status field will result in lastTransitionTime field getting updated (by the Kubelet Status Manager).
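The relationship between the status field and lastTransitionTime can be illustrated with a short Python sketch. This is not the kubelet's Go implementation, only a model of the semantics described here: the timestamp moves when the status changes and stays put otherwise. The function name `update_condition` is hypothetical.

```python
def update_condition(previous, new_status, now):
    """Return the condition to report, given the previously reported one.

    lastTransitionTime is set to `now` only when the status differs from the
    previous status (or when there is no previous condition).
    """
    if previous is not None and previous["status"] == new_status:
        return dict(previous)
    return {
        "type": "PodReadyToStartContainers",
        "status": new_status,
        "lastTransitionTime": now,
    }


# Replaying Scenario 1: the sandbox is not ready at :47 and :48, ready at :49.
cond = update_condition(None, "False", "2022-12-06T15:33:47Z")
cond = update_condition(cond, "False", "2022-12-06T15:33:48Z")
print(cond["lastTransitionTime"])  # still 2022-12-06T15:33:47Z
cond = update_condition(cond, "True", "2022-12-06T15:33:49Z")
print(cond["status"], cond["lastTransitionTime"])
```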

Enhancements in Kubelet Status Manager

Today, the Kubelet Status Manager surfaces APIs for other Kubelet components to issue pod status updates. It caches the pod status and issues patches to the API server when necessary. This infrastructure will be used for managing the new pod conditions as well.

The Kubelet Status Manager will surface a new GeneratePodReadyToStartContainers API. This will be invoked by Kubelet’s generateAPIPodStatus() to populate the pod status that is passed to setPodStatus. This is similar to the existing pod conditions generator functions: GeneratePodReadyCondition and GeneratePodInitializedCondition. If updates through generateAPIPodStatus() are found to be inaccurate (for example if Kubelet is very busy), invocation of GeneratePodReadyToStartContainers could also be added right after createSandbox in kubeGenericRuntimeManager returns successfully.

updateStatusInternal() in the Kubelet Status Manager will be enhanced to mark updateLastTransitionTime for the new PodReadyToStartContainers condition when changes in the status of the conditions are detected.
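
The transition-time rule can be sketched as follows. This is an illustration under assumptions, not the status manager's code: `Condition` and `stampTransitionTime` are hypothetical names, and the sketch only shows that the timestamp is carried over when the status is unchanged and refreshed when it flips.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified, hypothetical stand-in for a pod condition.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// stampTransitionTime (hypothetical helper) keeps the previous timestamp when
// the status of the same condition type is unchanged, and uses now otherwise.
func stampTransitionTime(old []Condition, next *Condition, now time.Time) {
	for _, c := range old {
		if c.Type == next.Type && c.Status == next.Status {
			next.LastTransitionTime = c.LastTransitionTime
			return
		}
	}
	next.LastTransitionTime = now
}

func main() {
	created := time.Date(2023, 10, 3, 22, 14, 22, 0, time.UTC)
	old := []Condition{{Type: "PodReadyToStartContainers", Status: "False", LastTransitionTime: created}}

	next := Condition{Type: "PodReadyToStartContainers", Status: "True"}
	stampTransitionTime(old, &next, created.Add(30*time.Second))
	fmt.Println(next.Status, next.LastTransitionTime.Format(time.RFC3339))
}
```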

Unavailability of API Server or etcd along with Kubelet Restart

If pod sandbox creation completed successfully on a node but API server became unavailable, the Kubelet status manager will retry issuing the patches to the API server. However, the Kubelet may get restarted (or crash) while the API server is unavailable with the pod status updates not yet persisted. In such a situation (expected to be quite rare), the timestamp associated with the lastTransitionTime field in the new conditions will not be accurate due to inability to persist or cache them. The lastTransitionTime field will get updated on subsequent generateAPIPodStatus() calls based on the state of the CRI sandbox and the corresponding timestamps will be persisted. This aligns with handling of other Kubelet managed conditions (ContainersReady, (Pod) Ready) when API server is unavailable and Kubelet restarts resulting in the status manager cache getting dropped.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

A review of the existing E2E tests reveals that coverage of the basic, existing pod conditions (populated by Kubelet) is sparse. While the existing pod conditions are quite mature, we will consider adding explicit validation of some of the subtle aspects of current behavior around the pod conditions (e.g. ensuring that the Initialized condition of a pod without init containers is set very early, even for a pod that will never reach the sandbox creation state due to missing volume dependencies and will thus never actually initialize).

Unit tests

New unit tests will be mainly scoped to the Kubelet status package, which the bulk of the enhancements above will target.

  • k8s.io/kubernetes/pkg/kubelet/status: June 13, 2022 - 82.2
  • k8s.io/kubernetes/pkg/kubelet/kubelet: June 13, 2022 - 64.5
  • k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go: June 13, 2022 - 71.3

Note that while the Kubelet package overall has low coverage, the changes in the context of this KEP are scoped to the generateAPIPodStatus method in the kubelet_pods.go file, which is a tiny portion of the overall Kubelet package.

Integration tests

N/A. See notes about e2e tests below.

e2e tests

Tests List

  • Pod Conditions Test
  • GracefulNodeShutdown test
    • Add test to check that the status of the PodReadyToStartContainers condition is set to false after termination (added as part of k/k PR#121044)
  • [] Volume Mounting Issues
    • Add test to verify sandbox condition for missing configmap (added as part of k/k PR#121321).
    • Add test to verify sandbox condition for missing secret.
  • [] Dynamic Resource Allocation (DRA) Allocation Ordering
    • Add test to verify the order between the PodReadyToStartContainers condition and devicemanager Allocate() gRPC calls to the device plugin, ensuring the condition is set at the expected point relative to resource allocation.

E2E tests will be introduced to cover the user scenarios mentioned above. Tests will involve launching pods with the characteristics listed below and verifying that the pod status has the new PodReadyToStartContainers condition with the status and reason fields populated with the expected values:

  1. A basic pod that launches successfully without any problems.
  2. A pod with references to a configmap (as a volume) that has not been created causing the pod sandbox creation to not complete until the configmap is created later.
  3. A pod with references to a secret (as a volume) that has not been created causing the pod sandbox creation to not complete until the secret is created later.
  4. A pod whose node is rebooted leading to the sandbox being recreated.
  5. A pod that requests device resources managed by the DRA framework, to verify the order between the PodReadyToStartContainers condition and DeviceManager Allocate() gRPC calls.
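
The check performed in each of these scenarios can be sketched as a small helper. This is a sketch under assumptions: `PodCondition`, `findCondition` and `expectCondition` are hypothetical names that use a simplified struct rather than the real API types or the e2e framework.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a pod condition.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// findCondition returns the condition of the given type, or nil if the pod
// status does not report it.
func findCondition(conds []PodCondition, condType string) *PodCondition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// expectCondition returns an error describing the mismatch, if any.
func expectCondition(conds []PodCondition, condType, wantStatus string) error {
	c := findCondition(conds, condType)
	if c == nil {
		return fmt.Errorf("condition %s not found", condType)
	}
	if c.Status != wantStatus {
		return fmt.Errorf("condition %s: got status %s, want %s", condType, c.Status, wantStatus)
	}
	return nil
}

func main() {
	// Scenario 2: the configmap is missing, so sandbox creation has not completed.
	pending := []PodCondition{
		{Type: "PodScheduled", Status: "True"},
		{Type: "PodReadyToStartContainers", Status: "False"},
	}
	if err := expectCondition(pending, "PodReadyToStartContainers", "False"); err != nil {
		fmt.Println("unexpected:", err)
		return
	}
	fmt.Println("condition has the expected status")
}
```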

Tests for pod conditions in the GracefulNodeShutdown e2e_node test will be enhanced to check that the status of the new pod sandbox conditions is false after graceful termination of a pod.

Testing of updates to pod conditions in the conformance test “Pods, completes the lifecycle of a Pod and the PodStatus” will be enhanced to cover resetting the new pod sandbox conditions.

Graduation Criteria

Alpha

  • Kubelet will report pod sandbox conditions if the feature flag PodReadyToStartContainersCondition is enabled.
  • E2E tests added for pod conditions.
  • E2E test for sandbox condition if pod fails to mount volume.

Beta

  • Condition is moved from a package constant in Kubelet to an API-defined condition.
  • Gather feedback from cluster operators and developers of services or controllers that consume these conditions.
  • Implement suggestions from feedback as feasible.
  • Feature Flag defaults to enabled.
  • Add test case for graceful shutdown.
  • Add test case for sandbox condition if pod fails to mount volume from a missing secret.
  • Clarify and define the order between the PodReadyToStartContainers condition and the DRA Allocate() (devicemanager’s Allocate gRPC calls to the device plugin).
    Add documentation and a test to verify the behaviour.

GA

  • All tests are passing with no known flakiness.
  • All feedback addressed around the new pod sandbox conditions.
  • No open decision items around the new pod sandbox conditions.
  • Feature Flag removed.

Upgrade / Downgrade Strategy

The new condition will be managed by the Kubelet. When upgrading a node to a version of the Kubelet that can set the new condition, new pods launched on that node will surface the new condition. If the Kubelet on the node is later downgraded, evicted pods that have not been deleted may remain. For such pods, the new condition will continue to be reported even though the node now runs a version of the Kubelet that does not support it.

Version Skew Strategy

The new condition will be managed by the Kubelet. Since the control plane components are not involved, handling of version skew is not applicable.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: PodReadyToStartContainersCondition
    • Components depending on the feature gate: Kubelet
Does enabling the feature change any default behavior?

Yes, there will be a new condition for all pods.
In normal cases, the PodReadyToStartContainers condition will be set to true.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled once it has been enabled. However, the new pod sandbox condition is persisted in existing pods and will continue to be reported after the feature is disabled, until those pods are deleted.

What happens if we reenable the feature if it was previously rolled back?

New pods created since re-enablement will report the new pod sandbox condition.

Are there any tests for feature enablement/disablement?

Unit tests (as outlined in the Unit tests section above) will be used to confirm that the new pod condition introduced is being:

  • evaluated and applied by the Kubelet Status manager when the feature is enabled.
  • not evaluated nor applied when the feature is disabled.
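
The shape of such a unit test can be sketched as below. This is an illustration under assumptions: the `gateEnabled` boolean is a hypothetical stand-in for the PodReadyToStartContainersCondition feature gate check, and `buildConditions` and `hasCondition` are hypothetical helpers, not Kubelet functions.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a pod condition.
type PodCondition struct {
	Type   string
	Status string
}

// buildConditions adds the new condition only when the gate is enabled.
// gateEnabled is an assumption standing in for the real feature gate lookup.
func buildConditions(gateEnabled, sandboxReady bool) []PodCondition {
	conds := []PodCondition{{Type: "PodScheduled", Status: "True"}}
	if !gateEnabled {
		return conds
	}
	status := "False"
	if sandboxReady {
		status = "True"
	}
	return append(conds, PodCondition{Type: "PodReadyToStartContainers", Status: status})
}

// hasCondition reports whether a condition of the given type is present.
func hasCondition(conds []PodCondition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return true
		}
	}
	return false
}

func main() {
	for _, enabled := range []bool{true, false} {
		conds := buildConditions(enabled, true)
		fmt.Printf("gate=%v reported=%v\n", enabled, hasCondition(conds, "PodReadyToStartContainers"))
	}
}
```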

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This flag is only relevant for the Kubelet. Therefore, the new condition will be reported for pods scheduled on nodes that have the feature enabled.

A controller or service that consumes the new pod condition should be enabled only after rollout of the new condition has succeeded on all nodes. Similarly, the controller or service that consumes the new pod condition should be disabled before the rollback. This helps prevent a controller/service consuming the condition getting the data from pods running in a subset of the nodes in the middle of a rollout or rollback.

If a controller or service consumes these pod conditions but the cluster has turned off this feature, the controller or service will never match on this pod condition as new pods will never set this condition.

If the feature is rolled back after the conditions have been set, the controller or service will never see a new condition update. The conditions can be assumed to be locked in place, as no further patches will be made to them.
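
One way a consumer could cope with pods that never report the condition is to treat absence as a distinct state rather than as false. The sketch below is only a suggestion under assumptions; `SandboxState` and `classify` are hypothetical names and are not part of this KEP.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a pod condition.
type PodCondition struct {
	Type   string
	Status string
}

// SandboxState is a hypothetical tri-state view of the condition.
type SandboxState string

const (
	SandboxUnknown  SandboxState = "Unknown"  // condition not reported at all
	SandboxReady    SandboxState = "Ready"    // condition reported with status True
	SandboxNotReady SandboxState = "NotReady" // condition reported with any other status
)

// classify distinguishes "not reported" from "reported as not ready", so a
// controller does not misread pods from nodes where the feature is disabled.
func classify(conds []PodCondition) SandboxState {
	for _, c := range conds {
		if c.Type != "PodReadyToStartContainers" {
			continue
		}
		if c.Status == "True" {
			return SandboxReady
		}
		return SandboxNotReady
	}
	return SandboxUnknown
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify([]PodCondition{{Type: "PodReadyToStartContainers", Status: "False"}}))
	fmt.Println(classify([]PodCondition{{Type: "PodReadyToStartContainers", Status: "True"}}))
}
```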

What specific metrics should inform a rollback?

A sharp increase in the number of PATCH requests to the API server from Kubelets after enabling this feature is a sign of a potential problem and can inform a rollback. A cluster operator may monitor

apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}

for this.

This may be the case in clusters that use a special runtime environment like microVM/Kata, where the sandbox may crash repeatedly (without ever getting a chance to start containers) resulting in lots of potential updates due to the new condition “flapping”. However, in such environments, this may already be the case with existing pod conditions like ContainersReady and Ready (unless the sandbox environment/VM crashes very early before a single container is run). Batching of pod status updates from the Kubelet status manager will also help.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

In the node-e2e tests, we test upgrade and rollback by toggling the feature gate on and off.

The Upgrade->downgrade->upgrade testing was done manually using the alpha version in 1.28 with the following steps:

  1. Start the cluster with the PodReadyToStartContainersCondition enabled:
kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml

using config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": true
nodes:
- role: control-plane
- role: worker

Create a pod that has a failed pod sandbox. The easiest way to do this is to create a pod that fails to mount a volume. In this case the mount fails because the ConfigMap (clusters-config-file) does not exist.

using pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: config-map-mount
spec:
  containers:
    - name: test-container
      image: registry.k8s.io/busybox
      command: [ "/bin/sh", "-c", "env" ]
      volumeMounts:
      - mountPath: /clusters-config
        name: clusters-config-volume
  volumes:
    - configMap:
        name: clusters-config-file
      name: clusters-config-volume
kubectl create -f pod.yaml

This pod will have a PodReadyToStartContainers condition that says False.

    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-10-03T22:14:22Z"
      status: "False"
      type: PodReadyToStartContainers
  2. To test the downgrade, we create a new kind cluster with the feature turned off.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": false
nodes:
- role: control-plane
- role: worker

When you inspect the conditions, the PodReadyToStartContainers condition will not be present.

  3. To test re-enablement, we create a new kind cluster with the feature turned on.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": true
nodes:
- role: control-plane
- role: worker

This pod will have a PodReadyToStartContainers condition that says False.

    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2023-10-03T22:14:22Z"
      status: "False"
      type: PodReadyToStartContainers

This demonstrates that the feature is working again for the pod.

If you take a pod that is able to create a sandbox, then you should see True in cases where the condition exists.
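
The manual inspection above could also be scripted. The following sketch assumes the pod has been dumped as JSON (for example with `kubectl get pod config-map-mount -o json`) and reads the condition using only the Go standard library; the `podDoc` struct and `readyToStartStatus` function are hypothetical helpers that model only the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podDoc models only the parts of the pod JSON that this sketch reads.
type podDoc struct {
	Status struct {
		Conditions []struct {
			Type   string `json:"type"`
			Status string `json:"status"`
		} `json:"conditions"`
	} `json:"status"`
}

// readyToStartStatus returns the status of PodReadyToStartContainers and
// whether the condition is present in the given pod JSON.
func readyToStartStatus(podJSON []byte) (status string, found bool, err error) {
	var p podDoc
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return "", false, err
	}
	for _, c := range p.Status.Conditions {
		if c.Type == "PodReadyToStartContainers" {
			return c.Status, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Sample input mirroring the condition shown in the steps above.
	sample := []byte(`{"status":{"conditions":[
		{"type":"PodReadyToStartContainers","status":"False",
		 "lastTransitionTime":"2023-10-03T22:14:22Z"}]}}`)
	status, found, err := readyToStartStatus(sample)
	fmt.Println(status, found, err)
}
```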

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This question isn’t totally relevant for this feature, since this is an administrator-enabled feature controlled via kubelet flag, not something the user controls with an API server resource spec.

Checking the Pod conditions on nodes with this feature enabled is the simplest way to check if the feature is enabled properly on a vanilla k8s cluster.

How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name: PodReadyToStartContainers reported for pod
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

There are no SLOs for this feature. We don’t expect any changes to the existing SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Other (treat as last resort)
    • Details: There are no specific SLIs for the Kubelet Status Manager
Are there any missing metrics that would be useful to have to improve observability of this feature?

New metrics may be added to the Kubelet status manager to surface fine-grained information about updates to overall pod status as well as specific pod conditions. However, such a change affects the whole Kubelet Status Manager (rather than specific pod conditions) and is thus beyond the scope of this KEP.

A general Kubernetes metrics collector like Kube State Metrics (which already consumes pod conditions and surfaces them as metrics) will need to be enhanced to consume the new pod condition in this KEP.

Dependencies

Does this feature depend on any specific services running in the cluster?

No, this feature does not have any dependencies. Other metric oriented services in the cluster may depend on this.

Scalability

Will enabling / using this feature result in any new API calls?

Yes, the new pod condition will result in the Kubelet Status Manager making additional PATCH calls on the pod status fields.

The Kubelet Status Manager already has infrastructure to cache pod status updates (including pod conditions) and issue the PATCH in a batch.
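
The effect of caching and batching can be illustrated with a toy model. This is not the status manager's implementation; `statusCache`, `Set` and `Flush` are hypothetical names, and the sketch only shows how several updates for the same pod between two syncs can collapse into one PATCH.

```go
package main

import "fmt"

// PodCondition is a simplified, hypothetical stand-in for a pod condition.
type PodCondition struct {
	Type   string
	Status string
}

// statusCache is a hypothetical, simplified stand-in for a status cache that
// keeps only the latest unsent status per pod.
type statusCache struct {
	pending map[string][]PodCondition // keyed by pod UID
}

func newStatusCache() *statusCache {
	return &statusCache{pending: make(map[string][]PodCondition)}
}

// Set records the latest conditions for a pod, overwriting any earlier
// update that has not been sent yet.
func (c *statusCache) Set(podUID string, conds []PodCondition) {
	c.pending[podUID] = conds
}

// Flush issues one patch per pod with pending changes and returns the number
// of patches issued.
func (c *statusCache) Flush(patch func(podUID string, conds []PodCondition)) int {
	issued := 0
	for uid, conds := range c.pending {
		patch(uid, conds)
		delete(c.pending, uid)
		issued++
	}
	return issued
}

func main() {
	cache := newStatusCache()
	cache.Set("pod-a", []PodCondition{{Type: "PodReadyToStartContainers", Status: "True"}})
	cache.Set("pod-a", []PodCondition{
		{Type: "PodReadyToStartContainers", Status: "True"},
		{Type: "ContainersReady", Status: "True"},
	})
	issued := cache.Flush(func(uid string, conds []PodCondition) {
		fmt.Printf("PATCH %s with %d conditions\n", uid, len(conds))
	})
	fmt.Println("patches issued:", issued)
}
```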

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

Slight increase (a few bytes) of the Pod API object due to persistence of the additional condition in the pod status.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If etcd/API server is unavailable, pod status cannot be updated. So the PodReadyToStartContainers condition associated with pod status cannot be updated either. The pod status manager already retries the API server requests later (based on data cached in the Kubelet) and that should help.

If the Kubelet is ready to start containers in a pod (right after pod sandbox creation completes) but the API server becomes unavailable before the condition indicating readiness to start containers can be patched, and the Kubelet then crashes or restarts while the API server remains unavailable, the lastTransitionTime field may be inaccurate. This is described in the section above.

What are other known failure modes?

None so far

What steps should be taken if SLOs are not being met to determine the problem?

SLOs are not applicable to pod status fields. Overall Kubernetes node level SLOs may leverage this feature.

Implementation History

Drawbacks

The main drawback associated with the new pod sandbox conditions is a slight potential increase in calls to the API server from the Kubelet to patch the new PodReadyToStartContainers condition in a pod’s status. Typically, this would involve two extra patch calls for pod status in the lifetime of most pods (if the status manager does not batch them with other pod status updates): one when pod sandbox creation completes and another when the pod is terminated. However, there could be a higher number of patch calls to the API server if the pod sandbox environment (like a microVM) starts successfully and then crashes in a restart loop.

Caching of updates to pod status by the pod status manager and batching pod status updates (which is already in place) can help mitigate frequent patch calls to API server.

Alternatives

Dedicated fields or annotations for the pod sandbox creation timestamps

Timestamps around completion of pod sandbox creation may be surfaced as a dedicated field in the pod status rather than a pod condition. However, since the successful creation of a pod sandbox is essentially a “milestone” in the life of a pod (similar to Scheduled, Ready, etc), pod conditions are the ideal place to surface it, which aligns well with existing conditions like ContainersReady and the overall Ready.

A dedicated annotation on the pod for surfacing this data is another potential approach. However, usage of annotations for Kubelet managed data is typically discouraged.

Surface pod sandbox creation latency instead of timestamps

Surfacing the amount of time it took to successfully create a pod sandbox is an alternative to surfacing the condition around completion of pod sandbox creation (whose delta from pod scheduled condition reflects the latency). The latency data would surface the same information from a pod initialization SLI perspective as mentioned in the Motivations section. Implementing this approach would require an API change on the pod status to surface the latency data (as this no longer fits the structure of a pod condition). This data cannot be consumed by other controllers as mentioned in User Stories section.
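
For comparison, the delta mentioned above is straightforward for a consumer to compute from the two conditions. The sketch below assumes both lastTransitionTime values are available as RFC 3339 strings, as they appear in pod status; `sandboxLatency` is a hypothetical helper and the sample timestamps are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// sandboxLatency returns the time between the PodScheduled transition and the
// PodReadyToStartContainers transition, both given as RFC 3339 timestamps.
func sandboxLatency(scheduledAt, readyToStartAt string) (time.Duration, error) {
	scheduled, err := time.Parse(time.RFC3339, scheduledAt)
	if err != nil {
		return 0, fmt.Errorf("parsing PodScheduled time: %w", err)
	}
	ready, err := time.Parse(time.RFC3339, readyToStartAt)
	if err != nil {
		return 0, fmt.Errorf("parsing PodReadyToStartContainers time: %w", err)
	}
	return ready.Sub(scheduled), nil
}

func main() {
	// Illustrative timestamps only; not taken from a real cluster.
	d, err := sandboxLatency("2023-10-03T22:14:17Z", "2023-10-03T22:14:22Z")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sandbox creation latency:", d)
}
```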

Report sandbox creation latency as an aggregated metric

The time it took for the pod sandbox to become ready can be reported directly as a Prometheus metric aggregated in a histogram. However, aggregating the data at the Kubelet level prevents a metric collection service from classifying the data based on interesting fields on a pod (runtime class, storage class of PVCs, number of PVCs, etc) or using custom labels and annotations on pods that indicate workload characteristics (that the cluster operator may wish to use as a basis for aggregating the metrics).

This also prevents other controllers from acting on sandbox status as mentioned in User Stories section.

Report sandbox creation stages using Kubelet tracing

The Kubelet is being instrumented to emit traces based on OpenTelemetry around sandbox creation stages (as well as several other parts of the pod lifecycle).

To implement the pod sandbox creation latency SLI/SLO use cases, the tracing infrastructure needs to be able to:

  • Collect all traces around CRI sandbox creation for all pods with no sampling.
  • Look up pod fields from the API server (associated with a pod’s trace), such as labels, annotations, storage classes of PVCs referred to by the pod, runtime class, etc., that are of interest to cluster operators and their users for classifying and aggregating the metrics.
  • Look-up a pod’s Scheduled condition fields to determine the beginning of pod sandbox creation.

Since the lookup of the pod fields and existing conditions is necessary for SLIs around pod sandbox creation latency, surfacing the PodReadyToStartContainers condition in pod status will allow a metric collection service to directly access the relevant data without requiring the ability to collect and parse OpenTelemetry traces. As mentioned in the User Stories, popular community managed services like Kube State Metrics can consume the PodReadyToStartContainers condition with a trivial set of changes. Enhancing them to collect and parse OpenTelemetry traces with no sampling, and to map that data to the associated data from the API server, would be complex from an engineering and operational perspective.

For controllers using the pod sandbox conditions to determine reconciliation strategy, access to the pod is typically necessary while collecting and parsing traces would be unusual.

Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod

Each infrastructural plugin that Kubelet calls out to (in the process of setting up a pod sandbox) can mark start and completion timestamps on the pod as conditions. This approach would be similar to how readiness gates work today. However, CSI and CRI plugins will need to be enlightened about fields in a pod (like status conditions) and set up a client to the API server (to update the conditions), which they may not implement in order to stay orchestrator agnostic.

Use a dedicated service between Kubelet and CRI runtime to mark sandbox ready condition on a pod

An on-host binary that runs as a service and proxies CRI API calls between the CRI runtime and Kubelet can intercept the successful creation of a pod sandbox in response to CRI RunPodSandbox. Next, using an API server client, the binary can mark extended conditions on a pod to indicate the state of sandbox creation. While this approach works without requiring any additional changes to Kubelet, it has a couple of disadvantages. First, it requires configuration and management of a separate proxy binary between Kubelet and the CRI runtime on the cluster nodes. Second, the proxy binary will need to replicate the logic in the Kubelet status manager to efficiently interact with the API server (as well as cache the status and retry in case of API server outages) regarding updates to pod sandbox status. Therefore, isolating the logic around pod sandbox conditions in a separate binary that intercepts API calls between Kubelet and the CRI runtime is not preferred.

Have Kubelet mark sandbox ready condition on a pod using extended conditions

Instead of a “native” condition as proposed in this KEP, an “extended” condition may be used by Kubelet to mark the PodReadyToStartContainers condition. Such a condition may look like: kubernetes.io/pod-ready-to-start-containers. However, internal/core Kubernetes components (like Kubelet) do not use “extended” conditions today, so this approach would be unusual.

Infrastructure Needed (Optional)