KEP-3085: Pod networking ready condition
KEP-3085: Pod Conditions for Starting and Completion of Sandbox Creation
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- User Stories (Optional)
- User Stories For Consuming PodReadyToStartContainers Condition
- PodReadyToStartContainers Condition Fields In Different User Scenarios
- Scenario 1: Stateless pod scheduled on a healthy node and cluster
- Scenario 2: Pods with startup delays due to problems with CSI, CNI or Runtime Handler plugins
- Story 3: Pod unable to start due to problems with CSI, CNI or Runtime Handler plugins
- Story 4: Pod Sandbox restart after a successful initial startup and crash
- Story 5: Graceful pod sandbox termination
- Story 6: Volume mounting issues
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
- User Stories (Optional)
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Dedicated fields or annotations for the pod sandbox creation timestamps
- Surface pod sandbox creation latency instead of timestamps
- Report sandbox creation latency as an aggregated metric
- Report sandbox creation stages using Kubelet tracing
- Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod
- Use a dedicated service between Kubelet and CRI runtime to mark sandbox ready condition on a pod
- Have Kubelet mark sandbox ready condition on a pod using extended conditions
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Readiness to start the containers in a pod, marked by successful pod sandbox
creation, is a critical phase in a pod’s lifecycle that the kubelet orchestrates
across multiple components: in-tree volume plugins (ConfigMap, Secret, EmptyDir,
etc), CSI plugins and container runtime (which in turn invokes a runtime handler
and CNI plugins). Completion of all these phases puts the pod sandbox in a
state where the containers in a pod can be started. This KEP proposes a
PodReadyToStartContainers condition in pod status to indicate a pod has
reached a state where its containers are ready to be started. The
PodReadyToStartContainers condition will mark an important milestone in the
pod’s lifecycle similar to ContainersReady and the overall Ready conditions
in pod status today. An alternate name like SandboxReady is avoided since
Kubernetes does not directly surface low level sandbox related concepts to all
users.
Motivation
Today, the scheduler surfaces a specific pod condition: PodScheduled that
clearly identifies whether a pod got scheduled by the scheduler and when
scheduling completed. However, no specific condition around initialization of
successfully scheduled pods from the perspective of completion of pod sandbox
creation is surfaced to cluster administrators in a scoped and consumable
fashion.
There is an existing pod condition: Initialized that tracks execution of init
containers. For pods without init containers, the Initialized condition is set
when the Kubelet starts to process a pod before any sandbox creation activities
start. For pods with init containers, the Initialized condition is set when
init containers have been pulled and executed to completion. Therefore, the
existing Initialized condition is insufficient and inaccurate for tracking
completion of sandbox creation and readiness to start containers for all pods in
a cluster. This distinction becomes especially relevant in multi-tenant clusters
where individual tenants own the pod specs (including the set of init
containers) while the cluster administrators are in charge of storage plugins,
networking plugins and container runtime handlers.
The Kubelet can start to launch the containers specified in a pod immediately after pod sandbox creation is completed successfully. A new dedicated condition marking the successful creation of pod sandbox and readiness to start containers - PodReadyToStartContainers - will benefit cluster operators (especially of multi-tenant clusters) who are responsible for configuration and operational aspects of the various components that play a role in pod sandbox creation: CSI plugins, CRI runtime and associated runtime handlers, CNI plugins, etc. The duration between the lastTransitionTime field of the PodReadyToStartContainers condition (with status set to true for a pod for the first time) and the existing PodScheduled condition will allow metrics collection services to compute total latency of all the components involved in pod sandbox creation as an SLI. Cluster operators can use this to publish SLOs around pod initialization to their customers who launch workloads on the cluster.
Custom pod controllers/operators can use a dedicated condition indicating
completion of pod sandbox creation and readiness to start containers to make
better decisions around how to reconcile a pod failing to become ready. As a
specific example, a custom controller for managing pods that refer to PVCs
associated with node local storage (e.g. Rook-Ceph) may decide to recreate PVCs
(based on a specified PVC template in the custom resource the controller is
managing) if the sandbox creation is repeatedly failing to complete, indicated
by the new PodReadyToStartContainers condition reporting false. Such a
controller can leave PVCs intact and only recreate pods if sandbox creation
completes successfully (indicated by the new PodReadyToStartContainers
condition reporting true) but the pod’s containers fail to become ready.
Further details are covered in a user story below.
When a pod’s sandbox no longer exists, the status of
PodReadyToStartContainers condition will be set to false. The duration
between a pod’s DeletionTimestamp and subsequent lastTransitionTime of
PodReadyToStartContainers condition (with status set to false) will
indicate the latency of pod termination. This can also be surfaced by metrics
collection services as an SLI. Note that surfacing any dedicated conditions
around termination of pod sandbox is unnecessary and beyond the scope of this
KEP.
Individual container creation (including pulling images from a registry) takes place after the successful completion of pod sandbox creation. Updates to pod container status to report latencies associated with creation of individual containers within a pod are beyond the scope of this KEP.
Goals
- Surface a new pod condition PodReadyToStartContainers to indicate readiness to start containers immediately following the successful completion of pod sandbox creation by Kubelet.
- Describe how the new pod condition can be consumed by external services to determine state and duration of pod sandbox creation.
Non-Goals
- Modify the meaning of the existing Initialized condition
- Specify metrics collection based on the conditions around pod sandbox creation
- Specify additional conditions (beyond PodReadyToStartContainers with status set to false) to indicate sandbox teardown
- Surface beginning and completion of creation of individual containers
Proposal
This KEP proposes enhancements to the Kubelet to report the readiness to start
containers of a pod (immediately following successful pod sandbox creation) as a
new pod condition with type: PodReadyToStartContainers. Metric collection and
monitoring services can use the fields associated with the
PodReadyToStartContainers condition to report sandbox creation state and
latency either at a per-pod cardinality or aggregate the data based on various
properties of the pod: number of volumes, storage class of PVCs, runtime class,
custom annotations for CNI and IPAM plugins, arbitrary labels and annotations on
pods, etc. Certain pod controllers can use the pod sandbox conditions to
determine an optimal reconciliation strategy for pods and associated resources
(like PVCs).
User Stories (Optional)
User Stories For Consuming PodReadyToStartContainers Condition
Surfacing the readiness to start containers of a pod (immediately following
successful pod sandbox creation) as a new pod condition -
PodReadyToStartContainers - in pod status can be consumed in different ways:
Story 1: Consuming PodReadyToStartContainers Condition Per Pod In A Monitoring Service
A cluster operator may already depend on a service like Kube State Metrics
for monitoring the
state of their Kubernetes clusters. The cluster operator may want such a service
to surface pod sandbox creation state and latency at a granular level for each
pod (due to the ambiguity around Initialized state as described earlier). For
this story, we are assuming the service has been enhanced to [1] consume the new
PodReadyToStartContainers pod condition as described in this KEP and [2]
implement informers and state to distinguish between the first time Kubelet is
ready to launch containers in a pod and a subsequent instance of Kubelet being
ready to launch containers in a pod (after sandbox destruction) over the
lifetime of the pod.
The operator can use PromQL queries to aggregate and analyze data (around pod sandbox creation) based on custom pod labels and annotations (already surfaced by a service like Kube State Metrics) indicating specific workload types across different namespaces. For example, annotations and labels could be used to differentiate pod sandbox creation state and latencies for “sensitive database” workloads, “sensitive analysis” workloads and “untrusted build” workloads each of which maps to pods mounting PVCs from different storage classes (depending on the level of encryption desired), using a specific runtime class (depending on the level of isolation desired - microvm vs runc based) and specific IPAM characteristics around reachability of the pods. Access to the pod labels and annotations along with the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator’s cluster and tenants. The values associated with such labels and annotations may not map to distinct namespaces, existing pod fields or other API object fields in a Kubernetes cluster.
Depending on the metrics and monitoring pipeline, as the cluster scales up, cardinality of data at a per pod level (surfaced from a service like Kube State Metrics) may lead to excessive load on a monitoring backend like Prometheus. At such a point, the cluster operator may decide to create and deploy their own custom monitoring service that uses a pod informer and aggregates (based on custom pod labels and annotations) state and latency of pod sandbox creation into a histogram which is ultimately reported to Prometheus. As with the previous approach, access to the pod labels and annotations and the sandbox latency data at a per-pod cardinality is essential to enable the aggregation based on factors that have special/custom meaning for the operator’s cluster and tenants and may not map to distinct namespaces, pod fields or other API object fields in the cluster.
The data from the above monitoring services can be used as SLIs with associated SLOs configured around sandbox creation state and latency (besides other metrics like scheduling latency) for each specific workload type depending on specific user requirements such as: desired encryption of persistent data (if any), runtime isolation and network reachability (governed by different IPAM plugins).
Story 2: Consuming PodReadyToStartContainers Condition In A Controller
A controller managing a set of pods along with associated resources like
networking configuration, storage or arbitrary dynamic resources (in the future)
can evaluate the PodReadyToStartContainers condition to optimize the set of
actions the controller needs to execute when bringing up pods and encountering
failures in the process. Depending on whether the pod sandbox is ready to start
containers, the controller may decide to destroy and re-create the associated
resources that are required for the sandbox creation to complete (to start
containers) or simply try to re-create the pod while keeping the resources
intact.
A specific example of the above would be a controller for stateful application
pods that mount PVCs that bind to node local PVs. Let’s assume the stateful
application has built-in data replication capabilities and the controller
supports PVC templates to dynamically generate PVCs. When trying to bring up
fresh pods (after earlier pods got terminated), there could be a problem with
the CSI plugin that mounts the node local PV into the pod. In such a situation,
the sandbox creation will not complete. Based on the PodReadyToStartContainers
condition, the controller may decide to create a fresh PVC. If sandbox creation
does complete successfully (marked by PodReadyToStartContainers reporting
true) but the pod fails to enter a Ready state, the controller will retain the
PVC (to avoid any data replication) and only try to recreate the pod. Having
access to the new PodReadyToStartContainers condition allows the controller to
optimize its reconciliation strategy and realize the desired state more
efficiently.
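The decision described in this story can be sketched roughly as follows. This is only an illustration: the Observed struct, the Action values, the failure counter, and the ReadyTimedOut flag are hypothetical names introduced here and are not part of this KEP or of any Kubernetes API.

```go
package main

// Action is a hypothetical reconciliation outcome chosen by the controller.
type Action string

const (
	ActionNone              Action = "none"
	ActionRecreatePVCAndPod Action = "recreate-pvc-and-pod"
	ActionRecreatePodOnly   Action = "recreate-pod-keep-pvc"
)

// Observed is a hypothetical summary of what the controller saw for one pod.
type Observed struct {
	SandboxReady    bool // PodReadyToStartContainers has status "True"
	PodReady        bool // the overall Ready condition has status "True"
	SandboxFailures int  // consecutive reconciles that saw SandboxReady == false
	ReadyTimedOut   bool // the pod did not become Ready within the controller's deadline
}

// Decide picks the cheapest action that can still make progress:
// a fresh PVC only when the sandbox repeatedly fails to become ready,
// and a pod-only recreation when the sandbox is fine but the pod is not Ready.
func Decide(o Observed, maxSandboxFailures int) Action {
	switch {
	case o.PodReady:
		return ActionNone
	case !o.SandboxReady && o.SandboxFailures >= maxSandboxFailures:
		return ActionRecreatePVCAndPod
	case o.SandboxReady && o.ReadyTimedOut:
		return ActionRecreatePodOnly
	default:
		// Still within tolerance: keep waiting.
		return ActionNone
	}
}
```

A real controller would likely derive these inputs from the pod's status conditions in its reconcile loop; the threshold and deadline are policy choices left to the controller author.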
PodReadyToStartContainers Condition Fields In Different User Scenarios
In each of the scenarios below, nearly identical PodReadyToStartContainers
conditions that would result from different scenarios/problems are grouped
together. The unique scenarios are detailed after describing the values
associated with the fields of the PodReadyToStartContainers condition. To make
each scenario concrete, a specific set of timestamps in the future is chosen.
The PodScheduled condition is mentioned in the stories but conditions after
pod sandbox creation (e.g. Initialized and Ready) are skipped. A service
monitoring latency of initial pod sandbox creation is assumed to implement a pod
informer and appropriate state to distinguish between the first time a pod
sandbox becomes ready to start containers versus a subsequent instance of
readiness over the lifetime of the pod.
Scenario 1: Stateless pod scheduled on a healthy node and cluster
A user launches a simple, stateless runc based pod with no init containers in a healthy cluster. The pod gets successfully scheduled at 2022-12-06T15:33:46Z and pod sandbox is ready after three seconds at 2022-12-06T15:33:49Z.
The pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T15:33:50Z (after pod sandbox creation is complete and containers are ready to start):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
A service monitoring latency of initial pod sandbox creation will record a
latency of three seconds in this scenario based on the delta between
lastTransitionTime timestamp associated with PodReadyToStartContainers and
PodScheduled conditions.
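As a rough sketch of the calculation described above, the latency is the difference between the two lastTransitionTime values once both conditions report "True". The PodCondition struct below is a trimmed-down, hypothetical stand-in for the condition entries shown in this scenario, not the real v1.PodCondition type, and the function names are illustrative.

```go
package main

import (
	"errors"
	"time"
)

// PodCondition is a hypothetical, minimal mirror of a pod status condition.
type PodCondition struct {
	Type               string
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// findTrue returns the condition of the given type if its status is "True".
func findTrue(conds []PodCondition, condType string) (PodCondition, bool) {
	for _, c := range conds {
		if c.Type == condType && c.Status == "True" {
			return c, true
		}
	}
	return PodCondition{}, false
}

// SandboxCreationSeconds returns the seconds elapsed between PodScheduled and
// PodReadyToStartContainers. Both conditions must have status "True".
func SandboxCreationSeconds(conds []PodCondition) (float64, error) {
	scheduled, ok := findTrue(conds, "PodScheduled")
	if !ok {
		return 0, errors.New("pod is not scheduled yet")
	}
	ready, ok := findTrue(conds, "PodReadyToStartContainers")
	if !ok {
		return 0, errors.New("pod is not ready to start containers yet")
	}
	return ready.LastTransitionTime.Sub(scheduled.LastTransitionTime).Seconds(), nil
}
```

With the timestamps from this scenario (15:33:46Z and 15:33:49Z), the function returns three seconds; while PodReadyToStartContainers is still "False" it returns an error instead of a misleading value.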
Scenario 2: Pods with startup delays due to problems with CSI, CNI or Runtime Handler plugins
In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in a ten second delay for pod sandbox creation to complete after which, containers can be started. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z while pod sandbox is created and containers are ready to start after ten seconds at 2022-12-06T15:33:56Z.
For each scenario below, the pod will report the following conditions in pod status at 2022-12-06T15:33:47Z (right after Kubelet worker starts processing the pod and the pod sandbox creation has started but not complete):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
For each scenario, the pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (after pod sandbox is ready - and containers are ready to start - after ten seconds):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:56Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
A service monitoring duration of pod sandbox creation (marked by readiness to
start containers) will record a latency of ten seconds in these scenarios based
on the delta between lastTransitionTime timestamps associated with
PodReadyToStartContainers and PodScheduled conditions with status set to
true. For each observation associated with a scenario below, the monitoring
service also associates a label with the metric indicating RuntimeClass of the
pods and StorageClass of PVCs referred by the pod. This enables further grouping
of the data during analysis.
A cluster-wide SLO around initial pod sandbox creation latencies configured with a threshold of less than ten seconds will record a breach in these scenarios. Further analysis of the metrics based on labels indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to isolate the cause of the breaches to specific infrastructure plugins as detailed below.
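A minimal sketch of the label-based grouping and threshold check described above might look like the following. The Sample, LabelKey, and Summary types are hypothetical, and treating a latency equal to the threshold as a breach is an assumption derived from the "less than ten seconds" wording.

```go
package main

// Sample is one hypothetical per-pod observation of initial sandbox creation latency.
type Sample struct {
	RuntimeClass string
	StorageClass string
	Seconds      float64
}

// LabelKey groups samples by the labels attached to the metric.
type LabelKey struct {
	RuntimeClass string
	StorageClass string
}

// Summary aggregates the samples that share one LabelKey.
type Summary struct {
	Count      int
	Breaches   int
	MaxSeconds float64
}

// Summarize groups samples by label and counts SLO breaches.
// A sample breaches the SLO when its latency is not below thresholdSeconds.
func Summarize(samples []Sample, thresholdSeconds float64) map[LabelKey]Summary {
	out := make(map[LabelKey]Summary)
	for _, s := range samples {
		key := LabelKey{RuntimeClass: s.RuntimeClass, StorageClass: s.StorageClass}
		sum := out[key]
		sum.Count++
		if s.Seconds >= thresholdSeconds {
			sum.Breaches++
		}
		if s.Seconds > sum.MaxSeconds {
			sum.MaxSeconds = s.Seconds
		}
		out[key] = sum
	}
	return out
}
```

Grouping by label makes it visible that, for example, breaches cluster under one runtime class or storage class, which narrows the investigation to the corresponding plugin.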
Stateful pod encountering sandbox creation delays from attaching a PV backed by a CSI plugin
A Stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. This results in several retries that ultimately succeed after nine seconds.
Stateless pod encountering sandbox creation delays from allocating IP from a CNI/IPAM plugin
A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to a spike of load in the IP allocation service, the CNI plugin times out several times but ultimately succeeds getting an IP address and configuring the pod network after nine seconds.
Stateless pod encountering sandbox creation delays from microvm based sandbox initialization
A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel multiple times but ultimately initializes the virtual machine based sandbox environment successfully after nine seconds.
Story 3: Pod unable to start due to problems with CSI, CNI or Runtime Handler plugins
In each of the scenarios under this section, problems or delays with infrastructural plugins like CSI/CNI/CRI result in pod sandbox creation never completing and the pod never being ready to start containers. In each scenario, the pod gets successfully scheduled at 2022-12-06T15:33:46Z, but pod sandbox creation runs into problems that do not eventually resolve and result in repeated failures as kubelet tries to start the pod.
For each scenario below, the pod will report the following conditions in pod status at all times after 2022-12-06T15:33:47Z (after pod sandbox creation started until the pod is deleted manually or by a controller):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:47Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
A service monitoring state of pod sandbox creation will record a metric indicating failure to create pod sandbox beyond a configured duration.
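One way such a service might detect this state is sketched below, assuming it checks how long the condition has stayed "False". The function name and the five-minute default limit are hypothetical choices, not values from this KEP.

```go
package main

import "time"

// defaultLimit is a hypothetical "configured duration" after which a pod whose
// sandbox is still not ready is reported as failing.
const defaultLimit = 5 * time.Minute

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// SandboxStuck reports whether PodReadyToStartContainers has had status
// "False" for longer than limit, measured from its lastTransitionTime to now.
func SandboxStuck(status string, lastTransition, now time.Time, limit time.Duration) bool {
	return status == "False" && now.Sub(lastTransition) > limit
}
```

For the condition shown above (status "False" since 15:33:47Z), a check at 15:40:00Z with the five-minute limit would report the pod as stuck, while a check at 15:34:00Z would not.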
A cluster-wide SLO around success rate of pod sandbox creation may record a breach due to the pod sandbox creation failures. Further analysis of the metrics aggregated based on labels (associated with the metrics) indicating RuntimeClass of the pods and StorageClass of PVCs referred by the pod will enable the cluster administrators to associate the failures to specific infrastructure plugins as detailed below.
Stateful pod encountering sandbox creation failures when attaching a PV backed by a CSI plugin
A Stateful pod refers to a PVC bound to a PV backed by a CSI plugin. After the pod is scheduled on a node, the CSI plugin runs into problems in the storage control plane when trying to attach the PV to the node. The failure to attach never resolves thus blocking pod sandbox creation.
Stateless pod encountering sandbox creation failures when allocating IP from a CNI/IPAM plugin
A pod is scheduled on a node in an experimental pre-production cluster where the operator has configured a new CNI plugin using a centralized IP allocation mechanism. Due to problems in the IP allocation service, the CNI plugin fails to get an IP address and is unable to configure the pod network. This blocks pod sandbox creation.
Stateless pod encountering sandbox creation failures from microvm based sandbox initialization
A pod configured with a special microvm based runtime class is scheduled on a node. The runtimeclass handler encounters crashes in the guest kernel repeatedly and is unable to initialize the virtual machine based sandbox environment.
Story 4: Pod Sandbox restart after a successful initial startup and crash
In each of the scenarios under this section, a pod sandbox is successfully created but eventually gets destroyed due to problems in the host or the sandbox environment. As a result, the pod sandbox has to be re-created (and pod networking reconfigured) by Kubelet in coordination with CRI runtime. In each scenario, the pod is successfully scheduled at 2022-12-06T15:33:46Z and pod sandbox is ready after 5 seconds. The sandbox is destroyed after two hours. Re-creation of the sandbox runs into problems but eventually succeeds after nine seconds.
The pod will report the following conditions in pod status at 2022-12-06T15:34:00Z (a few seconds after initial pod sandbox is ready):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:52Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T17:33:46Z (right after pod sandbox is destroyed):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:46Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T17:34:00Z (a few seconds after the new pod sandbox is ready):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T17:33:52Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:46Z"
    status: "True"
    type: PodScheduled
A service monitoring restarts associated with successfully created pod sandboxes will record a restart in these scenarios. A service measuring initial pod sandbox creation latency will need to implement logic (for example, using pod informers and state) to differentiate the initial pod sandbox creation from the latter pod sandbox creations resulting from node crashes/reboots or sandbox crashes.
Node crash
A regular runc based pod is scheduled on a node whose kernel crashes after two hours of the pod sandbox getting created successfully. The node restarts quickly (resulting in no pod evictions) and kubelet has to re-create the pod sandbox.
Sandbox crash
A pod is configured with a microvm based runtime handler. The virtual machine sandbox is created successfully but suffers a crash due to problems with the guest kernel after two hours of the pod creation. As a result, kubelet has to re-create the pod sandbox (and reconfigure pod networking).
Story 5: Graceful pod sandbox termination
A user launches a pod that runs successfully but is eventually deleted by a controller after several hours. The pod was scheduled at 2022-12-06T12:33:46Z and the sandbox became ready at 2022-12-06T12:33:48Z. The delete request is invoked at 2022-12-06T15:33:47Z and the pod is terminated by Kubelet at 2022-12-06T15:33:49Z.
The pod will report the following conditions in pod status at 2022-12-06T15:33:46Z (right before the pod delete request is invoked):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:48Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled
The pod will report the following conditions in pod status at 2022-12-06T15:33:49Z (right after the pod termination has been processed by Kubelet but the pod is yet to be completely deleted from API server):
status:
  conditions:
  ...
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T15:33:49Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2022-12-06T12:33:46Z"
    status: "True"
    type: PodScheduled
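The Motivation section describes termination latency as the time between a pod's deletion timestamp and the transition of PodReadyToStartContainers to false. A minimal sketch of that calculation follows, assuming for this scenario that the deletion timestamp equals the time the delete request was invoked. The PodCondition struct and function names are hypothetical simplifications, not Kubernetes API types.

```go
package main

import "time"

// PodCondition is a hypothetical, minimal mirror of a pod status condition.
type PodCondition struct {
	Type               string
	Status             string // "True" or "False"
	LastTransitionTime time.Time
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// TerminationSeconds returns the seconds between the deletion timestamp and
// the moment PodReadyToStartContainers transitioned to "False". The boolean
// is false when the condition does not describe a teardown that followed
// the deletion request.
func TerminationSeconds(deletion time.Time, cond PodCondition) (float64, bool) {
	if cond.Type != "PodReadyToStartContainers" || cond.Status != "False" {
		return 0, false
	}
	if cond.LastTransitionTime.Before(deletion) {
		// The sandbox was already gone before deletion was requested.
		return 0, false
	}
	return cond.LastTransitionTime.Sub(deletion).Seconds(), true
}
```

With the timestamps in this story (delete request at 15:33:47Z, condition set to "False" at 15:33:49Z), the sketch yields a termination latency of two seconds.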
Story 6: Volume mounting issues
The following cases show where it is useful to deduce information about volume failures. PodReadyToStartContainers provides a useful condition for these cases.
Mounting Volumes from missing Secrets
A pod is configured to mount a volume from a secret. The corresponding secret is missing so the volume mount fails.
Mounting Volumes from missing ConfigMaps
A pod is configured to mount a volume from a ConfigMap. The corresponding ConfigMap is missing so the volume mount fails.
In both of these cases, it is possible to deduce a failure, but you would have to view the events. Without this condition, the latest status is ContainerCreating and no conditions help to distinguish this issue.
Events are best-effort so they are not guaranteed to happen. In many cases, it is possible that these events are missed and users are left confused
about why their containers are stuck. The PodReadyToStartContainers condition reflects a failure in these cases. Both of these cases fail in the volume initialization phase so PodReadyToStartContainers would be set to False, notifying users that their pods had an issue.
Notes/Constraints/Caveats (Optional)
A monitoring service measuring duration of initial sandbox creation of a pod
(based on readiness to launch containers in the pod) should differentiate
between the initial and subsequent sandbox creations (if any due to node
crash/sandbox crash) and track them separately. This can be achieved using a
pod informer whose event handler stores (in a persistent store or as custom
annotations on the pod) the lastTransitionTime field for
PodReadyToStartContainers condition observed when it had status = true for
the first time. Later, if the pod sandbox is recreated, the lastTransitionTime
for the condition to indicate readiness to start containers can be
differentiated from the initial readiness to start containers based on whether
the initial data exists (either in the persistent store or pod annotations).
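The bookkeeping described above can be sketched as follows. The in-memory map stands in for the persistent store or pod annotations mentioned in the text, and the tracker type, its method, and the result values are hypothetical names used only for illustration.

```go
package main

import (
	"sync"
	"time"
)

// Result classifies one observation of the PodReadyToStartContainers condition.
type Result int

const (
	Ignored        Result = iota // status not "True", or an already-seen transition
	InitialReady                 // first time the pod became ready to start containers
	RecreatedReady               // ready again after the sandbox was re-created
)

// InitialReadyTracker remembers, per pod UID, the first lastTransitionTime
// observed with status "True". A real service would persist this state.
type InitialReadyTracker struct {
	mu    sync.Mutex
	first map[string]time.Time
}

// NewInitialReadyTracker returns an empty tracker.
func NewInitialReadyTracker() *InitialReadyTracker {
	return &InitialReadyTracker{first: make(map[string]time.Time)}
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

// Observe would be called from a pod informer's event handler with the
// current status and lastTransitionTime of PodReadyToStartContainers.
func (t *InitialReadyTracker) Observe(podUID, status string, lastTransition time.Time) Result {
	if status != "True" {
		return Ignored
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	first, seen := t.first[podUID]
	switch {
	case !seen:
		t.first[podUID] = lastTransition
		return InitialReady
	case lastTransition.After(first):
		return RecreatedReady
	default:
		// Same transition delivered again, for example on an informer resync.
		return Ignored
	}
}
```

Only observations classified as InitialReady would feed the initial sandbox creation latency metric; RecreatedReady observations can be counted separately as sandbox re-creations.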
Measuring duration of sandbox creation accurately beyond the initial sandbox
creation (based on readiness to launch containers in the pod) is not
possible with the PodReadyToStartContainers condition alone. This is
similar to other ready conditions like ContainersReady and overall pod Ready
which get updated after containers are restarted without a specific marker of
when the process of restarting the containers or bringing the pod back into a
ready state began following an event like a node crash.
When deriving SLOs based on SLIs around state and duration of sandbox creation, user error scenarios should be filtered. In the context of pod sandbox creation, such errors can surface due to:
- References to a secret or configmap that does not exist and never gets created. As a result, a pod referencing a missing secret or configmap will never go past the volume initialization phase.
- References to a secret or configmap that gets created at a point of time after
the pod gets scheduled. In such scenarios, volume initialization phase of the
pod will be stuck until the referenced secrets/configmaps are created in the
cluster.
The metric collection service that generates SLIs can filter pods affected by
the above situations by evaluating
FailedMount pod events associated with the pod and matching a regular expression of the form MountVolume.SetUp failed for volume "(secret|config-map) .*" : (secret|config-map) ".*" not found.
Risks and Mitigations
The main risk associated with PodReadyToStartContainers is any potential
confusion with the existing Initialized condition. Both the existing
Initialized condition and the new proposed condition refer to distinct
stages in a pod’s overall initialization. Documentation will help mitigate this
risk.
Design Details
The Kubelet will set a new condition on a pod: PodReadyToStartContainers to
surface that Kubelet is ready to start containers in a pod sandbox immediately
following successful completion of sandbox creation for the pod. A new
PodConditionType corresponding to PodReadyToStartContainers will be added in
api/core/v1/types.go. No changes are required in the Pod Status API for this
enhancement.
Determining status of sandbox creation for a pod
Today, syncPod() in Kubelet is invoked with the kubecontainer.PodStatus
(distinct from the v1.PodStatus API) associated with a given pod.
podSandboxChanged() in kubeGenericRuntimeManager evaluates the
SandboxStatuses field in PodStatus to determine whether a new pod sandbox
will need to be created for a pod. The same logic will be used to determine
whether a sandbox is ready for a pod in the Kubelet status manager.
PodReadyToStartContainers condition details
Kubelet will initially generate the PodReadyToStartContainers condition as
part of existing calls to generateAPIPodStatus() early during syncPod(). The
status field will be set to true if a sandbox is ready (determined by
invoking podSandboxChanged() as described above). The status field
will be set to false if a sandbox is found to be not ready.
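The rule above can be illustrated with a simplified sketch. It is not the Kubelet's podSandboxChanged() implementation: the types are hypothetical stand-ins, the slice is assumed to be ordered newest first, and the check is reduced to "exactly one sandbox is ready and it is the newest one".

```go
package main

// SandboxState is a hypothetical stand-in for the runtime's sandbox state.
type SandboxState int

const (
	SandboxStateReady SandboxState = iota
	SandboxStateNotReady
)

// SandboxStatus is a hypothetical, minimal mirror of one entry in the
// SandboxStatuses field of the Kubelet's internal pod status.
type SandboxStatus struct {
	State SandboxState
}

// sandboxUsable reports whether containers could be started without creating
// a new sandbox. statuses is assumed to be ordered newest first.
func sandboxUsable(statuses []SandboxStatus) bool {
	if len(statuses) == 0 {
		return false // no sandbox has been created yet
	}
	ready := 0
	for _, s := range statuses {
		if s.State == SandboxStateReady {
			ready++
		}
	}
	// More than one ready sandbox, or a newest sandbox that is not ready,
	// means a new sandbox would have to be created.
	return ready == 1 && statuses[0].State == SandboxStateReady
}

// ReadyToStartContainersStatus maps the sandbox check to the status value
// reported in the PodReadyToStartContainers condition.
func ReadyToStartContainersStatus(statuses []SandboxStatus) string {
	if sandboxUsable(statuses) {
		return "True"
	}
	return "False"
}
```

An empty list of sandbox statuses, as seen before the first sandbox is created or after the pod has been killed, maps to "False" under this sketch.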
Kubelet will generate the PodReadyToStartContainers condition for the final
time (in the life of a pod) as part of existing calls to
generateAPIPodStatus() early during syncTerminatedPod(). Prior invocations
of killPod() (as part of syncTerminatingPod) will result in the absence of a
sandbox corresponding to the pod. As a result, the status field of the
PodReadyToStartContainers condition will be set to false (determined by
invoking podSandboxChanged() as described above).
During periods of API server or etcd unavailability combined with a Kubelet
restart/crash (covered in more detail below),
the lastTransitionTime field of PodReadyToStartContainers condition that
ultimately gets persisted upon Kubelet restarting and API server becoming
available again is as close as possible to an actual change in the condition
(that could not be persisted).
Changes of the status field will result in lastTransitionTime field getting
updated (by the Kubelet Status Manager).
Enhancements in Kubelet Status Manager
Today, the Kubelet Status Manager surfaces APIs for other Kubelet components to issue pod status updates. It caches the pod status and issues patches to the API server when necessary. This infrastructure will be used for managing the new pod conditions as well.
The Kubelet Status Manager will surface a new
GeneratePodReadyToStartContainers API. This will be invoked by Kubelet’s
generateAPIPodStatus() to populate the pod status that is passed to
setPodStatus. This is similar to the existing pod conditions generator
functions: GeneratePodReadyCondition and GeneratePodInitializedCondition. If
updates through generateAPIPodStatus() are found to be inaccurate (for example
if Kubelet is very busy), invocation of GeneratePodReadyToStartContainers
could also be added right after createSandbox in kubeGenericRuntimeManager
returns successfully.
updateStatusInternal() in the Kubelet Status Manager will be enhanced to mark
updateLastTransitionTime for the new PodReadyToStartContainers condition
when changes in the status of the conditions are detected.
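A minimal sketch of how the generator and the lastTransitionTime handling could fit together is shown below. The types are simplified stand-ins for the v1 API types, and generateReadyToStartContainers, stampTransitionTime and at are hypothetical names used only for illustration; the actual status manager code may differ.

```go
package main

import (
	"fmt"
	"time"
)

// ConditionStatus and PodCondition are simplified stand-ins for the
// corresponding k8s.io/api/core/v1 types.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type PodCondition struct {
	Type               string
	Status             ConditionStatus
	LastTransitionTime time.Time
}

// generateReadyToStartContainers maps sandbox readiness to a condition.
func generateReadyToStartContainers(sandboxReady bool) PodCondition {
	status := ConditionFalse
	if sandboxReady {
		status = ConditionTrue
	}
	return PodCondition{Type: "PodReadyToStartContainers", Status: status}
}

// stampTransitionTime carries the previous timestamp forward when the status
// is unchanged, and uses now when the status flipped or nothing was cached.
func stampTransitionTime(old *PodCondition, next PodCondition, now time.Time) PodCondition {
	next.LastTransitionTime = now
	if old != nil && old.Status == next.Status {
		next.LastTransitionTime = old.LastTransitionTime
	}
	return next
}

// at builds a timestamp from Unix seconds (illustrative values only).
func at(sec int64) time.Time { return time.Unix(sec, 0).UTC() }

func main() {
	// Sandbox not ready yet: the condition is False, stamped at t=100.
	c1 := stampTransitionTime(nil, generateReadyToStartContainers(false), at(100))
	// Sandbox becomes ready: the status flips, so the timestamp moves to t=105.
	c2 := stampTransitionTime(&c1, generateReadyToStartContainers(true), at(105))
	// A later sync without a status change keeps t=105.
	c3 := stampTransitionTime(&c2, generateReadyToStartContainers(true), at(130))
	fmt.Println(c1.Status, c2.Status, c3.LastTransitionTime.Unix()) // prints: False True 105
}
```

Passing nil as the old condition models an empty status manager cache, in which case the timestamp reflects the time of the current sync rather than the original transition.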
Unavailability of API Server or etcd along with Kubelet Restart
If pod sandbox creation completed successfully on a node but API server became
unavailable, the Kubelet status manager will retry issuing the patches to the
API server. However, the Kubelet may get restarted (or crash) while the API
server is unavailable with the pod status updates not yet persisted. In such a
situation (expected to be quite rare), the timestamp associated with the
lastTransitionTime field in the new conditions will not be accurate due to
inability to persist or cache them. The lastTransitionTime field will get
updated on subsequent generateAPIPodStatus() calls based on the state of the
CRI sandbox and the corresponding timestamps will be persisted. This aligns with
handling of other Kubelet managed conditions (ContainersReady, (Pod) Ready) when
API server is unavailable and Kubelet restarts resulting in the status manager
cache getting dropped.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
A review of the existing E2E tests reveals that coverage of the basic, existing
pod conditions (populated by Kubelet) is sparse. While the existing pod
conditions are quite mature, we will consider adding explicit validation of some
of the subtle aspects of current behavior around the pod conditions (e.g.
ensuring Initialized condition of a pod without init containers is set very
early for a pod that will never reach the sandbox creation state due to
missing volume dependencies and will thus never actually initialize).
Unit tests
New unit tests will be mainly scoped to the Kubelet status package that a bulk
of the enhancements above will target.
- k8s.io/kubernetes/pkg/kubelet/status: June 13, 2022 - 82.2
- k8s.io/kubernetes/pkg/kubelet/kubelet: June 13, 2022 - 64.5
- k8s.io/kubernetes/pkg/kubelet/kubelet_pods.go: June 13, 2022 - 71.3
Note above that while the Kubelet package overall has low coverage, the
changes in the context of this KEP are scoped to the generateAPIPodStatus
method which is a tiny portion of the overall Kubelet package in the
kubelet_pods.go file.
Integration tests
N/A. See notes about e2e tests below.
e2e tests
Tests List
- Pod Conditions Test
- GracefulNodeShutdown test
- Add test to check that the status of the PodReadyToStartContainers condition is set to false after termination (added as part of k/k PR#121044)
- [] Volume Mounting Issues
- Add test to verify sandbox condition for missing configmap (added as part of k/k PR#121321).
- Add test to verify sandbox condition for missing secret.
- [] Dynamic Resource Allocation (DRA) Allocation Ordering
- Add test to verify the order between the PodReadyToStartContainers condition and devicemanager Allocate() gRPC calls to the device plugin, ensuring the condition is set at the expected point relative to resource allocation.
E2E tests will be introduced to cover the user scenarios mentioned above. Tests
will involve launching pods with the characteristics listed below and
verifying that the pod status has the new PodReadyToStartContainers condition with
status and reason fields populated with expected values:
- A basic pod that launches successfully without any problems.
- A pod with references to a configmap (as a volume) that has not been created causing the pod sandbox creation to not complete until the configmap is created later.
- A pod with references to a secret (as a volume) that has not been created causing the pod sandbox creation to not complete until the secret is created later.
- A pod whose node is rebooted leading to the sandbox being recreated.
- A pod that requests device resources managed by the DRA framework, to verify the order between the PodReadyToStartContainers condition and DeviceManager Allocate() gRPC calls.
Tests for pod conditions in the GracefulNodeShutdown e2e_node test will be
enhanced to check that the new pod sandbox conditions have status false after
graceful termination of a pod.
Testing of Pod condition updates in the conformance test “Pods, completes the lifecycle of a Pod and the PodStatus” will be enhanced to cover resetting the
new pod sandbox conditions.
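The shape of such an e2e check can be sketched as a small polling helper. This is an illustrative sketch only: PodCondition is a simplified stand-in for v1.PodCondition, and findCondition, waitForCondition and the get hook (which stands in for a pod lookup against the API server) are hypothetical names, not part of the e2e framework.

```go
package main

import (
	"fmt"
	"time"
)

// PodCondition is a simplified stand-in for v1.PodCondition.
type PodCondition struct {
	Type   string
	Status string
	Reason string
}

// findCondition returns the condition of the given type, or nil if absent.
func findCondition(conds []PodCondition, condType string) *PodCondition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// waitForCondition polls get() until the condition has the wanted status or
// the attempts run out. get is a hypothetical hook standing in for fetching
// the pod from the API server.
func waitForCondition(get func() []PodCondition, condType, want string, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if c := findCondition(get(), condType); c != nil && c.Status == want {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Fake pod: the condition stays False until the missing configmap
	// "appears" on the third poll.
	polls := 0
	get := func() []PodCondition {
		polls++
		status := "False"
		if polls >= 3 {
			status = "True"
		}
		return []PodCondition{{Type: "PodReadyToStartContainers", Status: status}}
	}
	ok := waitForCondition(get, "PodReadyToStartContainers", "True", 5, time.Millisecond)
	fmt.Println(ok, polls) // prints: true 3
}
```

The fake getter mirrors the missing-configmap scenario: the condition is expected to stay False while the volume dependency is unmet and to turn True once it is satisfied.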
Graduation Criteria
Alpha
- Kubelet will report pod sandbox conditions if the feature flag PodReadyToStartContainersCondition is enabled.
- E2E tests added for pod conditions.
- E2E test for sandbox condition if pod fails to mount volume.
Beta
- Condition is moved from a package constant in Kubelet to an API-defined condition.
- Gather feedback from cluster operators and developers of services or controllers that consume these conditions.
- Implement suggestions from feedback as feasible.
- Feature Flag defaults to enabled.
- Add test case for graceful shutdown.
- Add test case for sandbox condition if pod fails to mount volume from a missing secret.
- Clarify and define the order between the PodReadyToStartContainers condition and the DRA Allocate() (devicemanager’s Allocate gRPC calls to the device plugin). Add documentation and a test to verify the behaviour.
GA
- All tests are passing with no known flakiness.
- All feedback addressed around the new pod sandbox conditions.
- No open decision items around the new pod sandbox conditions.
- Feature Flag removed.
Upgrade / Downgrade Strategy
The new condition will be managed by the Kubelet. When upgrading a node to a version of the Kubelet that can set the new condition, new pods launched on that node will surface the new condition. If the Kubelet on the node is later downgraded, evicted pods that have not been deleted may remain. For such pods, a node running a version of the Kubelet that does not support the new condition will continue to report them with the new condition still present.
Version Skew Strategy
The new condition will be managed by the Kubelet. Since the control plane components are not involved, handling of version skew is not applicable.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: PodReadyToStartContainersCondition
- Components depending on the feature gate: Kubelet
Does enabling the feature change any default behavior?
Yes, there will be a new condition for all pods.
In normal cases, the PodReadyToStartContainers condition will be set to true.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature can be disabled once it has been enabled. However, the new pod sandbox condition will remain persisted in existing pods and will continue to be reported after the feature is disabled until those pods are deleted.
What happens if we reenable the feature if it was previously rolled back?
New pods created since re-enablement will report the new pod sandbox condition.
Are there any tests for feature enablement/disablement?
Unit tests (as outlined in the Unit tests section above) will be used to confirm that the new pod condition introduced is being:
- evaluated and applied by the Kubelet Status manager when the feature is enabled.
- not evaluated nor applied when the feature is disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This flag is only relevant for the Kubelet. Therefore, the new condition will be reported for pods scheduled on nodes that have the feature enabled.
A controller or service that consumes the new pod condition should be enabled only after rollout of the new condition has succeeded on all nodes. Similarly, the controller or service that consumes the new pod condition should be disabled before the rollback. This helps prevent a controller/service consuming the condition getting the data from pods running in a subset of the nodes in the middle of a rollout or rollback.
If a controller or service consumes these pod conditions but the cluster has turned off this feature, the controller or service will never match on this pod condition as new pods will never set this condition.
If the feature is rolled back after the conditions have been set, the controller or service will never see a new condition update. The conditions can be assumed to be locked in place, as no future patches will be made to them.
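Because the condition can be absent (feature disabled or not yet rolled out on a node) as well as True or False, a consumer may want to distinguish three cases rather than two. The following is a minimal sketch under that assumption; SandboxSignal and sandboxSignal are hypothetical names, and PodCondition is a simplified stand-in for v1.PodCondition.

```go
package main

import "fmt"

// PodCondition is a simplified stand-in for v1.PodCondition.
type PodCondition struct {
	Type   string
	Status string
}

// SandboxSignal is a hypothetical three-valued result for consumers.
type SandboxSignal int

const (
	SignalAbsent   SandboxSignal = iota // condition not reported on this pod
	SignalNotReady                      // condition reported, status is not "True"
	SignalReady                         // condition reported with status "True"
)

// sandboxSignal classifies a pod by its PodReadyToStartContainers condition.
// Any status other than "True" is treated as not ready.
func sandboxSignal(conds []PodCondition) SandboxSignal {
	for _, c := range conds {
		if c.Type != "PodReadyToStartContainers" {
			continue
		}
		if c.Status == "True" {
			return SignalReady
		}
		return SignalNotReady
	}
	return SignalAbsent
}

func main() {
	oldNode := []PodCondition{{Type: "Ready", Status: "True"}}
	newNode := []PodCondition{{Type: "PodReadyToStartContainers", Status: "True"}}
	fmt.Println(sandboxSignal(oldNode) == SignalAbsent, sandboxSignal(newNode) == SignalReady) // prints: true true
}
```

Treating an absent condition as "no information" instead of "not ready" avoids misclassifying pods on nodes where the feature is disabled during a partial rollout or rollback.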
What specific metrics should inform a rollback?
A sharp increase in the number of PATCH requests to the API Server from Kubelets after enabling this feature is a sign of a potential problem and can inform a rollback. A cluster operator may monitor
apiserver_request_total{verb="PATCH", resource="pods", subresource="status"}
for this.
This may be the case in clusters that use a special runtime environment like microVM/Kata, where the sandbox may crash repeatedly (without ever getting a chance to start containers) resulting in lots of potential updates due to the new condition “flapping”. However, in such environments, this may already be the case with existing pod conditions like ContainersReady and Ready (unless the sandbox environment/VM crashes very early before a single container is run). Batching of pod status updates from the Kubelet status manager will also help.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
In the node-e2e tests, we test upgrade and rollback via toggling the feature gates on/off.
The upgrade->downgrade->upgrade testing was done manually using the alpha
version in 1.28 with the following steps:
- Start the cluster with the PodReadyToStartContainersCondition feature gate enabled:
kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml
using config.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": true
nodes:
- role: control-plane
- role: worker
- Create a pod that has a failed pod sandbox. The easiest way to do this is to create a pod that fails to mount a volume. This case fails because the ConfigMap (clusters-config-file) does not exist.
using pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: config-map-mount
spec:
  containers:
    - name: test-container
      image: registry.k8s.io/busybox
      command: [ "/bin/sh", "-c", "env" ]
      volumeMounts:
        - mountPath: /clusters-config
          name: clusters-config-volume
  volumes:
    - configMap:
        name: clusters-config-file
      name: clusters-config-volume
kubectl create -f pod.yaml
This pod will have a PodReadyToStartContainers condition that says False.
conditions:
- lastProbeTime: null
  lastTransitionTime: "2023-10-03T22:14:22Z"
  status: "False"
  type: PodReadyToStartContainers
- To test the downgrade, we create a new kind cluster with the feature turned off.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": false
nodes:
- role: control-plane
- role: worker
When you inspect the conditions, the PodReadyToStartContainers condition will not be present.
- To test re-enablement, we create a new kind cluster with the feature turned on.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "PodReadyToStartContainersCondition": true
nodes:
- role: control-plane
- role: worker
This pod will have a PodReadyToStartContainers condition that says False.
conditions:
- lastProbeTime: null
  lastTransitionTime: "2023-10-03T22:14:22Z"
  status: "False"
  type: PodReadyToStartContainers
This demonstrates that the feature is working again for the pod.
If you take a pod that is able to create a sandbox, then you should see True in cases where the condition exists.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
This question isn’t totally relevant for this feature, since this is an administrator-enabled feature controlled via kubelet flag, not something the user controls with an API server resource spec.
Checking the Pod conditions on nodes with this feature enabled is the simplest way to check if the feature is enabled properly on a vanilla k8s cluster.
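One way to automate this check is to decode a Pod object (for example the output of kubectl get pod with -o json) and look for the condition. The sketch below assumes only the standard status.conditions layout of a Pod; podDoc and readyToStartStatus are hypothetical names, and the sample JSON is illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podDoc captures only the fields of a Pod object that are needed here.
type podDoc struct {
	Status struct {
		Conditions []struct {
			Type               string `json:"type"`
			Status             string `json:"status"`
			LastTransitionTime string `json:"lastTransitionTime"`
		} `json:"conditions"`
	} `json:"status"`
}

// readyToStartStatus returns the status of the PodReadyToStartContainers
// condition and whether the condition was present at all.
func readyToStartStatus(podJSON []byte) (string, bool, error) {
	var p podDoc
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return "", false, err
	}
	for _, c := range p.Status.Conditions {
		if c.Type == "PodReadyToStartContainers" {
			return c.Status, true, nil
		}
	}
	return "", false, nil
}

func main() {
	// Illustrative sample, shaped like a trimmed Pod object.
	sample := []byte(`{"status":{"conditions":[
		{"type":"PodScheduled","status":"True"},
		{"type":"PodReadyToStartContainers","status":"False","lastTransitionTime":"2023-10-03T22:14:22Z"}
	]}}`)
	status, found, err := readyToStartStatus(sample)
	fmt.Println(status, found, err) // prints: False true <nil>
}
```

A result with found set to false on a node where the feature gate is expected to be on suggests the feature is not enabled for that Kubelet.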
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name: PodReadyToStartContainers reported for pod
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
There are no SLOs for this feature. We don’t expect any changes to the existing SLOs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Other (treat as last resort)
- Details: There are no specific SLIs for the Kubelet Status Manager
Are there any missing metrics that would be useful to have to improve observability of this feature?
New metrics may be added to the Kubelet status manager to surface fine grained information about updates to overall pod status as well as specific pod conditions. However, such a change affects the whole Kubelet Status Manager (rather than specific pod conditions) and thus beyond the scope of this KEP.
A general Kubernetes metrics collector like Kube State Metrics (which already consumes pod conditions and surfaces them as metrics) will need to be enhanced to consume the new pod condition in this KEP.
Dependencies
Does this feature depend on any specific services running in the cluster?
No, this feature does not have any dependencies. Other metric oriented services in the cluster may depend on this.
Scalability
Will enabling / using this feature result in any new API calls?
Yes, the new pod condition will result in the Kubelet Status Manager making additional PATCH calls on the pod status fields.
The Kubelet Status Manager already has infrastructure to cache pod status updates (including pod conditions) and issue the PATCH in a batch.
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
Slight increase (a few bytes) of the Pod API object due to persistence of the additional condition in the pod status.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If etcd/API server is unavailable, pod status cannot be updated. So the
PodReadyToStartContainers condition associated with pod status cannot be
updated either. The pod status manager already retries the API server requests
later (based on data cached in the Kubelet) and that should help.
If the Kubelet is ready to start containers in a pod (right after pod sandbox
creation completes) on a node but API server becomes unavailable (before the
condition to indicate readiness to start containers can be patched) and Kubelet
crashes or restarts (shortly after API server becoming and staying unavailable),
the lastTransitionTime field may be inaccurate. This is described in the
section above.
What are other known failure modes?
None so far
What steps should be taken if SLOs are not being met to determine the problem?
SLOs are not applicable to pod status fields. Overall Kubernetes node level SLOs may leverage this feature.
Implementation History
- Alpha in 1.25.
- PodHasNetwork renamed to PodReadyToStartContainers in 1.28.
- Beta promotion to 1.29
- Moving PodReadyToStartContainers to staging/src/k8s.io/api/core/v1/types.go as an API constant
- Added e2e tests:
- Updated pod-lifecycle documentation to reflect beta promotion, and feature-gate enabled by default.
- Published blog: PodReadyToStartContainers Condition Moves to Beta
Drawbacks
The main drawback associated with the new pod sandbox conditions involves a
slight potential increase in calls to the API Server from Kubelet to patch
status = true for the new PodReadyToStartContainers condition in a pod’s
status. Typically, this would involve up to two extra patch calls for pod status in the
lifetime of most pods (if the status manager does not batch them with other pod
status updates): one when pod sandbox creation completes and another when the
pod is terminated. However, there could be a higher number of patch calls to API
Server if the pod sandbox environment (like a microvm) starts successfully and
then crashes in a restart loop.
Caching of updates to pod status by the pod status manager and batching pod status updates (which is already in place) can help mitigate frequent patch calls to API server.
Alternatives
Dedicated fields or annotations for the pod sandbox creation timestamps
Timestamps around completion of pod sandbox creation may be surfaced as a
dedicated field in the pod status rather than a pod condition. However, since
the successful creation of pod sandbox is essentially a “milestone” in the life
of a pod (similar to Scheduled, Ready, etc), pod conditions are the ideal place
to surface it, and doing so aligns well with the existing conditions like
ContainersReady and overall Ready.
A dedicated annotation on the pod for surfacing this data is another potential approach. However, usage of annotations for Kubelet managed data is typically discouraged.
Surface pod sandbox creation latency instead of timestamps
Surfacing the amount of time it took to successfully create a pod sandbox is an alternative to surfacing the condition around completion of pod sandbox creation (whose delta from the pod scheduled condition reflects the latency). The latency data would surface the same information from a pod initialization SLI perspective as mentioned in the Motivation section. Implementing this approach would require an API change on the pod status to surface the latency data (as this no longer fits the structure of a pod condition). This data cannot be consumed by other controllers as mentioned in the User Stories section.
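For comparison, the latency can already be derived from the proposed condition by a consumer. The sketch below computes the delta between the lastTransitionTime of the PodScheduled and PodReadyToStartContainers conditions. PodCondition is a simplified stand-in for v1.PodCondition; transitionTime, sandboxCreationLatency and mustParse are hypothetical helpers, and the timestamps are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// PodCondition is a simplified stand-in for v1.PodCondition.
type PodCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// transitionTime returns the lastTransitionTime of a condition that is
// currently "True", and false if there is no such condition.
func transitionTime(conds []PodCondition, condType string) (time.Time, bool) {
	for _, c := range conds {
		if c.Type == condType && c.Status == "True" {
			return c.LastTransitionTime, true
		}
	}
	return time.Time{}, false
}

// sandboxCreationLatency is the delta between PodScheduled turning True and
// PodReadyToStartContainers turning True.
func sandboxCreationLatency(conds []PodCondition) (time.Duration, bool) {
	scheduled, ok1 := transitionTime(conds, "PodScheduled")
	ready, ok2 := transitionTime(conds, "PodReadyToStartContainers")
	if !ok1 || !ok2 || ready.Before(scheduled) {
		return 0, false
	}
	return ready.Sub(scheduled), true
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	conds := []PodCondition{
		{Type: "PodScheduled", Status: "True", LastTransitionTime: mustParse("2023-10-03T22:14:00Z")},
		{Type: "PodReadyToStartContainers", Status: "True", LastTransitionTime: mustParse("2023-10-03T22:14:22Z")},
	}
	d, ok := sandboxCreationLatency(conds)
	fmt.Println(d, ok) // prints: 22s true
}
```

Note that if a sandbox is recreated after a crash, the condition's lastTransitionTime reflects the most recent transition, so a delta computed this way would describe the latest sandbox rather than the first one.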
Report sandbox creation latency as an aggregated metric
The time it took the pod sandbox to become ready can be directly reported as a Prometheus metric aggregated in a histogram. However, aggregating the data at the Kubelet level prevents a metric collection service from classifying the data based on interesting fields on a pod (runtime class, storage class of PVCs, number of PVCs, etc) or using custom labels and annotations on pods that indicate workload characteristics (that the cluster operator may wish to use as a basis for aggregating the metrics).
This also prevents other controllers from acting on sandbox status as mentioned in the User Stories section.
Report sandbox creation stages using Kubelet tracing
The Kubelet is being instrumented to emit traces based on OpenTelemetry around sandbox creation stages (as well as several other parts of the pod lifecycle).
To implement the pod sandbox creation latency SLI/SLO use cases, the tracing infrastructure needs to be able to:
- Collect all traces around CRI sandbox creation for all pods with no sampling.
- Look up pod fields from the API server (associated with a pod’s trace) like labels/annotations/storage classes of PVCs referred to by the pod/runtimeclass/etc. that are of interest to cluster operators and their users for classifying and aggregating the metrics.
- Look-up a pod’s Scheduled condition fields to determine the beginning of pod sandbox creation.
Since the lookup of the pod fields and existing conditions is necessary for SLIs
around pod sandbox creation latency, surfacing the PodReadyToStartContainers
condition in pod status will allow a metric collection service to directly
access the relevant data without requiring the ability to collect and parse
OpenTelemetry traces. As mentioned in the User Stories, popular community
managed services like Kube State Metrics can consume the
PodReadyToStartContainers condition with a trivial set of changes. Enhancing
them to collect and parse OpenTelemetry traces with no sampling and to map the
data to associated data from the API server will be complex from an engineering
and operational perspective.
For controllers using the pod sandbox conditions to determine reconciliation strategy, access to the pod is typically necessary while collecting and parsing traces would be unusual.
Have CSI/CNI/CRI plugins mark their start and completion timestamps while setting up their respective portions for a pod
Each infrastructural plugin that Kubelet calls out to (in the process of setting up a pod sandbox) can mark start and completion timestamps on the pod as conditions. This approach would be similar to how readiness gates work today. However, CSI and CRI plugins will need to be made aware of fields in a pod (like status conditions) and set up a client to the API server (to update the conditions), which they may not implement in order to stay orchestrator agnostic.
Use a dedicated service between Kubelet and CRI runtime to mark sandbox ready condition on a pod
An on-host binary that runs as a service and proxies CRI API calls between the
CRI runtime and Kubelet can intercept the successful creation of a pod sandbox
in response to CRI RunPodSandbox. Next, using an API server client, the binary
can mark extended conditions on a pod to indicate state of sandbox creation.
While this approach works, without requiring any additional changes to Kubelet,
it has a couple of disadvantages: First, this approach requires configuration
and management of a separate proxy binary between Kubelet and CRI runtime in the
cluster nodes. Second, the proxy binary will need to replicate the logic in
Kubelet status manager to efficiently interact with the API server (as well as
cache the status and retry in case of API server outages) regarding updates to
pod sandbox status. Therefore isolating the logic around pod sandbox conditions
to a separate binary intercepting API calls between kubelet and the CRI runtime
is not preferred.
Have Kubelet mark sandbox ready condition on a pod using extended conditions
Instead of a “native” condition as proposed in this KEP, an “extended” condition
may be used by Kubelet to mark the PodReadyToStartContainers condition. Such a
condition may look like: kubernetes.io/pod-ready-to-start-containers. However,
internal/core Kubernetes components (like Kubelet) do not use “extended”
conditions today. So this approach may be unusual.