KEP-3756: Robust VolumeManager reconstruction after kubelet restart

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Introduction
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

After kubelet is restarted, it loses track of all volumes it mounted for running Pods. It tries to restore this state from the API server, where kubelet can find Pods that should be running, and from the host’s OS, where it can find actually mounted volumes. We know this process is imperfect. This KEP tries to rework the process. While the work is technically a bugfix, it changes large parts of kubelet, and we’d like to have it behind a feature gate to provide users a way to get back to the old implementation in case of problems.

This work started as part of KEP 1710 and even went alpha in v1.26, but we’d like to have a separate feature + feature gate to be able to graduate VolumeManager reconstruction faster.

Motivation

Goals

During kubelet startup, allow it to populate additional information about how existing volumes are mounted. KEP 1710 needs to know what mount options the previous kubelet used when mounting the volumes, to be able to tell if they need any change or not.
Fix #105536: Volumes are not cleaned up (unmounted) after kubelet restart, which needs a similar VolumeManager refactoring.
In general, make volume cleanup more robust.

Non-Goals

Introduction

VolumeManager is a piece of kubelet that mounts volumes that should be mounted (i.e. a Pod that needs the volume exists) and unmounts volumes that are not needed any longer (all Pods that used them were deleted).

VolumeManager keeps two caches:

DesiredStateOfWorld (DSW) contains volumes that should be mounted.
ActualStateOfWorld (ASW) contains currently mounted volumes. A volume in ASW can be marked as:
- Globally mounted - it is mounted in /var/lib/kubelet/plugins/<plugin>/...
  - This mount is optional and depends on volume plugin / CSI driver capabilities. If it’s supported, each volume has only a single global mount.
- Mounted into Pod local directory - it is mounted in /var/lib/kubelet/pods/<pod UID>/volumes/.... Each pod that uses a volume gets its own local mount, because each pod has a different <pod UID>. If the volume plugin / CSI driver supports the global mount mentioned above, each pod local mount is typically a bind-mount from the global mount.
In addition, both global and local mounts can be marked as uncertain, when kubelet is not 100% sure if the volume is fully mounted there. Typically, this happens when a CSI driver times out NodeStage / NodePublish calls and kubelet can’t be sure if the CSI driver has finished mounting the volume after the timeout. Kubelet then needs to call NodeStage / NodePublish again if the volume is still needed by some Pods, or call NodeUnstage / NodeUnpublish if all Pods that needed the volume were deleted.
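
The handling of uncertain mounts described above can be illustrated with a minimal sketch. The type `MountState` and the function `nextPublishCall` are hypothetical names invented for this illustration; they are not kubelet's actual ASW types. The sketch only models the pod-local mount (NodePublish / NodeUnpublish).

```go
package main

import "fmt"

// MountState is a hypothetical three-valued mount state used only in this
// sketch; kubelet's real ASW types are different.
type MountState int

const (
	NotMounted MountState = iota
	Uncertain             // e.g. NodePublish timed out, the outcome is unknown
	Mounted
)

// nextPublishCall returns the CSI node call that would move a pod-local
// mount towards the desired state, or "" when nothing needs to be done.
func nextPublishCall(state MountState, neededByPod bool) string {
	switch {
	case neededByPod && state != Mounted:
		// Not mounted or uncertain, but a Pod still needs it: (re)try the mount.
		return "NodePublish"
	case !neededByPod && state != NotMounted:
		// Mounted or uncertain, and no Pod needs it: clean it up.
		return "NodeUnpublish"
	default:
		return ""
	}
}

func main() {
	fmt.Println(nextPublishCall(Uncertain, true))     // NodePublish
	fmt.Println(nextPublishCall(Uncertain, false))    // NodeUnpublish
	fmt.Println(nextPublishCall(Mounted, true) == "") // true
}
```

The point of the uncertain state is that it is treated as "possibly mounted" in both directions: it is retried when still needed and cleaned up when not.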

VolumeManager runs two separate goroutines:

reconciler that periodically compares ASW and DSW and tries to move ASW towards DSW.
DesiredStateOfWorldPopulator (DSWP) that periodically lists Pods from PodManager and adds them to DSW. This DSWP is marked as hasAddedPods=true (“fully populated”) only after it has read all Pods from files (static pods) and the API server (i.e. sourcesReady.AllReady returns true).

Both ASW and DSW caches exist only in memory and are lost when the kubelet process dies. It’s relatively easy to populate DSW - just list all Pods from the API server and static pods and collect their volumes. Populating ASW is complicated and is actually the source of several problems that we want to fix in this KEP.

Volume reconstruction is a process where kubelet tries to create a single valid PersistentVolumeSpec or VolumeSpec for a volume from the OS, typically from the mount table, by looking at what’s mounted at /var/lib/kubelet/pods/*/volumes/XYZ. This process is imperfect: it populates only the (Persistent)VolumeSpec fields that are necessary to unmount the volume (i.e. to call volumePlugin.TearDown + UnmountDevice).
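
As a rough illustration of the first step of reconstruction, the sketch below extracts the pod UID from a mount path under /var/lib/kubelet/pods. The two path components after `volumes/` (a plugin directory and a volume name), the directory name `kubernetes.io~csi`, and all identifiers are assumptions made for this example; the text above only states the `/var/lib/kubelet/pods/<pod UID>/volumes/...` prefix.

```go
package main

import (
	"fmt"
	"strings"
)

const podsDir = "/var/lib/kubelet/pods"

// reconstructedMount holds what this sketch can learn from the path alone.
// PluginDir and VolumeName rely on an assumed directory layout.
type reconstructedMount struct {
	PodUID     string
	PluginDir  string
	VolumeName string
}

// parsePodVolumePath splits a path of the assumed form
// /var/lib/kubelet/pods/<pod UID>/volumes/<plugin dir>/<volume name>.
func parsePodVolumePath(p string) (reconstructedMount, error) {
	prefix := podsDir + "/"
	if !strings.HasPrefix(p, prefix) {
		return reconstructedMount{}, fmt.Errorf("%q is not under %s", p, podsDir)
	}
	parts := strings.Split(strings.TrimPrefix(p, prefix), "/")
	if len(parts) != 4 || parts[1] != "volumes" {
		return reconstructedMount{}, fmt.Errorf("%q does not look like a pod volume path", p)
	}
	for _, part := range parts {
		if part == "" {
			return reconstructedMount{}, fmt.Errorf("%q has an empty path component", p)
		}
	}
	return reconstructedMount{PodUID: parts[0], PluginDir: parts[2], VolumeName: parts[3]}, nil
}

func main() {
	m, err := parsePodVolumePath("/var/lib/kubelet/pods/1234/volumes/kubernetes.io~csi/pvc-1")
	fmt.Println(m.PodUID, m.PluginDir, m.VolumeName, err)

	_, err = parsePodVolumePath("/tmp/somewhere/else")
	fmt.Println(err != nil)
}
```

Everything beyond these path-derived fields has to come from the volume plugin or, later, from the API server.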

Today, kubelet populates VolumeManager’s DSW first, from static Pods and pods received from the API server. ASW is populated from the OS after DSW is fully populated (hasAddedPods==true) and only volumes missing in DSW are added there. In other words, kubelet reconstructs only the volumes for Pods that were running, but were deleted from API server before kubelet started. (If the pod is still in the API server, Running, its volumes would be in DSW).

We assumed that this was enough, because if a volume is in DSW, the VolumeManager will try to mount the volume, and it will eventually reach ASW.

We needed to add a complex workaround to actually unmount a volume if it’s initially in DSW, but user deletes all Pods that need it before the volume reaches ASW.

Proposal

We propose to reverse the kubelet startup process.

1. Quickly reconstruct ASW from the OS and add all found volumes to ASW as uncertain when kubelet starts. “Quickly” means the process should look only at the OS and files/directories in /var/lib/kubelet/pods and it should not require the API server or any network calls. In particular, the API server may not be available at this stage of kubelet startup.
2. In parallel to 1., start DSWP and populate DSW from the API server and static pods.
3. When connection to the API server becomes available, complete reconstructed information in ASW with data from the API server (e.g. from node.status). This typically happens in parallel to the previous step.

Benefits:

All volumes are reconstructed from the OS. As a result, ASW can contain the real information about how the volumes are mounted, e.g. their mount options. This will help with KEP 1710.
Some issues become much easier to fix, e.g.
- #105536
- We can remove workarounds for #96635 and #70044, they will get fixed naturally by the refactoring.

We also propose to split this work out of KEP 1710, as it can be useful outside of SELinux relabeling and could graduate separately. To split the feature, we propose feature gate NewVolumeManagerReconstruction.

User Stories (Optional)

Story 1

(This is not a new story, we want to keep this behavior)

As a cluster admin, I want kubelet to resume where it stopped when it was restarted or its machine was rebooted, so I don’t need to clean up / unmount any volumes manually.

It must be able to recognize what happened in the meantime and either unmount any volumes of Pods that were deleted in the API server or mount volumes for newly created Pods.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

The whole VolumeManager startup was rewritten as part of KEP 1710. It can contain bugs that are not trivial to find, because kubelet can be used in a number of situations that we don’t have in CI. For example, we found out (and fixed) a case where the API server is actually a static Pod in kubelet that is starting. We don’t know what other kubelet configurations people use, so we decided to write a KEP and move the new VolumeManager startup behind a feature gate.

Design Details

This section documents the design of both the proposed and the old VolumeManager startup, including the volume reconstruction performed during it.

Proposed VolumeManager startup

When kubelet starts, VolumeManager starts DSWP and reconciler in parallel.

However, the first thing that the reconciler does before reconciling DSW and ASW is that it scans /var/lib/kubelet/pods/*, reconstructs all found volumes and adds them to ASW as uncertainly mounted and uncertainly attached. Only information that is available in the Pod directory on the disk is reconstructed into ASW, because kubelet may not have a connection to the API server at this point.

The volume reconstruction can be imperfect:

It can miss devicePath, which may not be possible to reconstruct from the OS.
For CSI volumes, it cannot decide whether a volume is attachable, and therefore whether to put it into or remove it from node.status.volumesInUse, because it cannot read CSIDriver from the API server yet.

Kubelet puts the volumes to ASW as uncertainly attached and with the possibly wrong devicePath it got from the volume plugin. Kubelet stores the list of reconstructed volumes in volumesNeedUpdateFromNodeStatus to fix both devicePath and attach-ability from node.status.volumesAttached once it establishes a connection to the API server.

After ASW is populated, reconciler starts its reconciliation loop:

1. mountOrAttachVolumes() - mounts (and attaches, if necessary) volumes that are in DSW, but not in ASW. This can happen even before DSW is fully populated.
2. updateReconstructedFromNodeStatus() - once kubelet gets connection to the API server and reads its own node.status, volumes in volumesNeedUpdateFromNodeStatus (i.e. all reconstructed volumes) are updated from node.status.volumesAttached, overwriting any previous uncertain attach-ability and devicePath of uncertain mounts (i.e. potentially overwriting the reconstructed devicePath or even devicePath from MountDevice / SetUp that ended as uncertain). This happens only once, volumesNeedUpdateFromNodeStatus is cleared afterwards.
3. (Only once): Add all reconstructed volumes to node.status.volumesInUse.
4. Only after DSW was fully populated (i.e. VolumeManager can tell if a volume is really needed or not), and ASW was fixed from node.status, VolumeManager can start unmounting volumes and calls:
  1. unmountVolumes() - unmounts pod local volume mounts (TearDown) that are in ASW and are not in DSW.
  2. unmountDetachDevices() - unmounts global volume mounts (UnmountDevice) of volumes that are in ASW and are not in DSW.
  3. cleanOrphanVolumes() - tries to clean up volumesFailedReconstruction. Here kubelet cannot call appropriate volume plugin to unmount a volume, because kubelet failed to reconstruct the volume spec from /var/lib/kubelet/pods/<uid>/volumes/xyz. Kubelet at least tries to unmount the directory and clean up any orphan files there. This happens only once, volumesFailedReconstruction is cleared afterwards.
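
The ordering and gating of the loop above can be sketched as follows. This is an illustrative model only: `reconcilerState` and its fields are hypothetical, the function returns the names of the steps it would run instead of doing any work, and the one-time node.status.volumesInUse update is left out because the text does not say what gates it.

```go
package main

import "fmt"

// reconcilerState is a hypothetical snapshot of what the reconciler knows.
type reconcilerState struct {
	dswPopulated         bool     // DSWP has read all Pod sources
	apiServerReachable   bool     // kubelet can read its own node.status
	needNodeStatusUpdate bool     // volumesNeedUpdateFromNodeStatus is non-empty
	failedReconstruction []string // volumesFailedReconstruction
}

// reconcileOnce returns the steps one loop iteration would execute.
func (s *reconcilerState) reconcileOnce() []string {
	// Mounting is always allowed, even before DSW is fully populated.
	steps := []string{"mountOrAttachVolumes"}

	// One-time fix-up of reconstructed volumes from node.status.
	if s.needNodeStatusUpdate && s.apiServerReachable {
		steps = append(steps, "updateReconstructedFromNodeStatus")
		s.needNodeStatusUpdate = false
	}

	// Unmounting is gated on a complete DSW and a fixed-up ASW.
	if s.dswPopulated && !s.needNodeStatusUpdate {
		steps = append(steps, "unmountVolumes", "unmountDetachDevices")
		if len(s.failedReconstruction) > 0 {
			steps = append(steps, "cleanOrphanVolumes")
			s.failedReconstruction = nil // happens only once
		}
	}
	return steps
}

func main() {
	s := &reconcilerState{needNodeStatusUpdate: true, failedReconstruction: []string{"vol-x"}}
	fmt.Println(s.reconcileOnce()) // API server not reachable yet: only mounting

	s.apiServerReachable = true
	s.dswPopulated = true
	fmt.Println(s.reconcileOnce()) // fix-up runs, then unmounting and cleanup

	fmt.Println(s.reconcileOnce()) // steady state: mount + unmount steps only
}
```

The sketch makes the key property visible: nothing is unmounted until kubelet knows both what should be mounted and what it reconstructed correctly.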

Note that e.g. mountOrAttachVolumes can call volumePlugin.MountDevice / SetUp() on a reconstructed volume (because it was added to ASW as uncertain) and finally update ASW, while the VolumeManager is still waiting for the API server to update devicePath of the same volume in ASW (step 2. above). We made sure that updateReconstructedFromNodeStatus() will update the devicePath only for volumes that are still uncertain, so that it does not overwrite the certain ones.
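
The "only overwrite uncertain volumes" rule can be sketched as below. The types and the map-based model of node.status.volumesAttached (volume name to devicePath) are simplifications invented for this example, not kubelet's real data structures.

```go
package main

import "fmt"

// attachedVolume is a hypothetical, simplified ASW entry.
type attachedVolume struct {
	DevicePath string
	Uncertain  bool
}

// updateFromNodeStatus copies devicePath from the node status into ASW, but
// only for volumes that are still uncertain. It returns the new (empty) list
// of volumes that need an update, mirroring the "only once" behavior.
func updateFromNodeStatus(asw map[string]*attachedVolume, needUpdate []string, volumesAttached map[string]string) []string {
	for _, name := range needUpdate {
		vol, ok := asw[name]
		if !ok || !vol.Uncertain {
			// Already confirmed by MountDevice / SetUp: keep its devicePath.
			continue
		}
		if devicePath, found := volumesAttached[name]; found {
			vol.DevicePath = devicePath
		}
	}
	return nil
}

func main() {
	asw := map[string]*attachedVolume{
		"vol-a": {DevicePath: "", Uncertain: true},          // reconstructed
		"vol-b": {DevicePath: "/dev/sdb", Uncertain: false}, // mounted meanwhile
	}
	nodeStatus := map[string]string{"vol-a": "/dev/sda", "vol-b": "/dev/sdz"}

	remaining := updateFromNodeStatus(asw, []string{"vol-a", "vol-b"}, nodeStatus)
	fmt.Println(asw["vol-a"].DevicePath, asw["vol-b"].DevicePath, len(remaining))
	// Prints: /dev/sda /dev/sdb 0
}
```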

Old VolumeManager startup

When kubelet starts, VolumeManager starts DSWP and the reconciler in parallel.

The reconciler then periodically does:

1. unmountVolumes() - unmounts (TearDown) pod local volumes that are in ASW and are not in DSW. Since the ASW is initially empty, this call becomes useful later.
2. mountOrAttachVolumes() - mounts (and attaches, if necessary) volumes that are in DSW, but not in ASW. This will eventually happen for all volumes in DSW, because ASW is empty. This is actually how ASW is populated.
3. unmountDetachDevices() - unmounts (UnmountDevice) global volume mounts of volumes that are in ASW and are not in DSW.
4. Only once after DSW is fully populated:
  1. VolumeManager calls sync(), which scans /var/lib/kubelet/pods/* and reconstructs only volumes that are not already in ASW. In addition, volumes that are in DSW are reconstructed, but not added to ASW. (If a volume is in DSW, we expect that it reaches ASW via mountOrAttachVolumes() in step 2. above.)
    - devicePath of reconstructed volumes is populated from node.status.volumesAttached right away.
    - In the next reconciliation loop, reconstructed volumes that are not in DSW are finally unmounted in step 1. above.
    - There is a workaround to add a reconstructed volume to ASW when it was initially in DSW, but all pods that used the volume were deleted before the volume was mounted and reached ASW. (#110670)
  2. VolumeManager reports all reconstructed volumes in node.status.volumesInUse (that’s why VolumeManager reconstructs volumes, even if it does not add them to ASW).
  3. For volumes that failed reconstruction kubelet cannot call appropriate volume plugin to unmount them. Kubelet at least tries to unmount the directory and clean up any orphan files there.
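
To contrast with the proposed flow, here is a minimal sketch of which volumes the old sync() adds to ASW: only those found on disk that are neither already in ASW nor present in DSW. The function name and the use of plain string sets are assumptions for illustration; in the proposed flow every volume found on disk would be added (as uncertain) instead.

```go
package main

import "fmt"

// addedToASWOldFlow models the old sync(): a volume found on disk is added
// to ASW only if it is not already in ASW and not in DSW.
func addedToASWOldFlow(onDisk []string, inASW, inDSW map[string]bool) []string {
	var added []string
	for _, v := range onDisk {
		if inASW[v] || inDSW[v] {
			// Either already tracked, or expected to reach ASW via mounting.
			continue
		}
		added = append(added, v)
	}
	return added
}

func main() {
	onDisk := []string{"vol-a", "vol-b", "vol-c"}
	inASW := map[string]bool{"vol-a": true} // already mounted after restart
	inDSW := map[string]bool{"vol-b": true} // its Pod still exists

	fmt.Println(addedToASWOldFlow(onDisk, inASW, inDSW)) // [vol-c]
}
```

Here vol-c belongs to a Pod that was deleted while kubelet was down, which is the only case the old flow reconstructs into ASW.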

Observability

Today, any errors during volume reconstruction are exposed only as log messages. We propose adding these new metrics, both to the old and new VolumeManager code:

reconstruct_volume_operations_total / reconstruct_volume_operations_errors_total: nr. of all / unsuccessfully reconstructed volumes.
- In the new VolumeManager code, this will include all volume mounts in /var/lib/kubelet/pods/*/volumes
- In the old VolumeManager it will include only volumes that were not already in ASW (volumes already in ASW are not reconstructed).
force_cleaned_failed_volume_operations_total / force_cleaned_failed_volume_operation_errors_total: nr. of all / unsuccessful cleanups of volumes that failed reconstruction.
orphan_pod_cleaned_volumes_errors: nr. of pods that failed cleanup with errors like orphaned pod "<uid>" found, but XYZ failed in the last sync. These messages can be a symptom of failed reconstruction (e.g. #105536). Note that kubelet logs this periodically, so incrementing this metric on every log message would not be useful. Instead, cleanupOrphanedPodDirs needs to be changed to collect the errors found during one /var/lib/kubelet/pods/ check and report the collected “nr of errors during the last housekeeping sweep (every 2 seconds)”. There is no label that would distinguish between each error cause.
orphan_pod_cleaned_volumes: nr. of total pods that were attempted to be cleaned up by cleanupOrphanedPodDirs in the last sync, both successful and failed.
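
The two kinds of metrics above behave differently: the `*_total` pairs only ever grow, while the orphan pod metrics describe the last housekeeping sweep and are overwritten each time. A minimal sketch of that difference follows, using plain Go values rather than a real metrics library; all type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample is a placeholder failure used by the usage example.
var errExample = errors.New("example failure")

// opCounter models a pair such as reconstruct_volume_operations_total and
// reconstruct_volume_operations_errors_total: monotonically increasing.
type opCounter struct {
	Total  uint64
	Errors uint64
}

// observe records one operation and whether it failed.
func (c *opCounter) observe(err error) {
	c.Total++
	if err != nil {
		c.Errors++
	}
}

// sweepGauge models orphan_pod_cleaned_volumes and
// orphan_pod_cleaned_volumes_errors: values for the last sweep only.
type sweepGauge struct {
	Pods   int
	Errors int
}

// sweep summarizes one housekeeping pass; each element is the cleanup
// result for one orphaned pod directory.
func sweep(results []error) sweepGauge {
	g := sweepGauge{Pods: len(results)}
	for _, err := range results {
		if err != nil {
			g.Errors++
		}
	}
	return g
}

func main() {
	var reconstruct opCounter
	reconstruct.observe(nil)
	reconstruct.observe(errExample)
	fmt.Println(reconstruct.Total, reconstruct.Errors) // 2 1

	last := sweep([]error{nil, errExample, errExample})
	fmt.Println(last.Pods, last.Errors) // 3 2
	last = sweep([]error{nil}) // the next sweep replaces the previous values
	fmt.Println(last.Pods, last.Errors) // 1 0
}
```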

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

All files are in k8s.io/kubernetes/pkg/kubelet/volumemanager/reconciler/, data taken on 2023-01-26.

The old reconciler + reconstruction:

reconciler.go: 77.1%
reconstruct.go: 75.7%

The new reconciler + reconstruction:

reconciler_new.go: 73.3%
- The coverage is lower than reconciler.go, because parts of reconcile.go code are tested by unit tests in different packages. With force-enabled SELinuxMountReadWriteOnce gate in today’s master(f21c60341740874703ce12e070eda6cdddfd9f7b), I got reconciler_new.go coverage 93.3%.
reconstruct_new.go: 66.2%
- updateReconstructedDevicePaths does not have unit tests, this will be added before Beta release.

Common code:

reconciler_common.go: 86.2%
reconstruct_common.go: 75.8%

Integration tests

None.

e2e tests

“Should test that pv used in a pod that is deleted while the kubelet is down cleans up when the kubelet returns”: https://storage.googleapis.com/k8s-triage/index.html?sig=storage&test=Should%20test%20that%20pv%20used%20in%20a%20pod%20that%20is%20deleted%20while%20the%20kubelet%20is%20down%20cleans%20up%20when%20the%20kubelet%20returns
“Should test that pv used in a pod that is force deleted while the kubelet is down cleans up when the kubelet returns”: https://storage.googleapis.com/k8s-triage/index.html?sig=storage&test=Should%20test%20that%20pv%20used%20in%20a%20pod%20that%20is%20force%20deleted%20while%20the%20kubelet%20is%20down%20cleans%20up%20when%20the%20kubelet%20returns

Both are for the old reconstruction code, because we don’t have a job that enables alpha features + runs [Disruptive] tests.

Graduation Criteria

Alpha

Feature implemented behind a feature flag

Beta

Gather feedback from developers

GA

Allowing time for feedback.
No flakes in CI.

Deprecation

Announce deprecation and support policy of the existing flag
There is no need to wait for two versions to pass after introducing the functionality that deprecates the flag (to address version skew), because the feature is local to a single kubelet.
Address feedback on usage/changed behavior, provided on GitHub issues
Deprecate the flag

Upgrade / Downgrade Strategy

The feature is enabled by a single feature gate on kubelet and does not require any special upgrade / downgrade handling.

Version Skew Strategy

The feature affects only how kubelet starts. It has no implications on other Kubernetes components or other kubelets. Therefore, we don’t see any issues with any version skew.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: NewVolumeManagerReconstruction
- Components depending on the feature gate: kubelet

Does enabling the feature change any default behavior?

It changes how kubelet starts and how it cleans volume mounts. It has no visible effect in any API object.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled without any issues.

What happens if we reenable the feature if it was previously rolled back?

Nothing interesting happens. This feature changes how kubelet starts and how it cleans volume mounts. It has no visible effect in any API object nor structure of data / mount table in the host OS.

Are there any tests for feature enablement/disablement?

We have unit tests for the feature disabled or enabled. It affects only kubelet startup and we don’t change the format of data present in the OS (mount table, content of /var/lib/kubelet/pods/), so we don’t have automated tests to start kubelet with the feature enabled and then disable it or vice versa.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If this feature is buggy, kubelet either does not come up at all (crashes, hangs) or does not unmount volumes that it should unmount.

What specific metrics should inform a rollback?

reconstruct_volume_operations_total, reconstruct_volume_operations_errors_total, force_cleaned_failed_volume_operations_total, force_cleaned_failed_volume_operation_errors_total, orphan_pod_cleaned_volumes_errors

See Observability in the Design Details section. All newly introduced metrics will be added both to “old” and “new” VolumeManager, so users can compare these metrics with the feature gate enabled and disabled and see if downgrade actually helped.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, see https://github.com/kubernetes/enhancements/issues/3756#issuecomment-1906255361 (and expand Details).

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

They can check if the FeatureGate is enabled on a node, e.g. by monitoring kubernetes_feature_enabled metric. Or read kubelet logs.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details: logs during kubelet startup.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

These two metrics are populated during kubelet startup:

reconstruct_volume_operations_errors_total should be zero. An error here means that kubelet was not able to reconstruct its cache of mounted volumes and appropriate volume plugin was not called to clean up a volume mount. There could be a leaked file or directory on the filesystem.
force_cleaned_failed_volume_operation_errors_total should be zero. An error here means that kubelet was not able to unmount a volume even with all fallbacks it has. There is at least a leaked directory on the filesystem, there could be also a leaked mount.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - reconstruct_volume_operations_total
  - reconstruct_volume_operations_errors_total
  - force_cleaned_failed_volume_operations_total
  - force_cleaned_failed_volume_operation_errors_total
  - orphan_pod_cleaned_volumes_errors
- Components exposing the metric: kubelet

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Kubelet startup could be slower, but that would be a bug. In theory, the old and new VolumeManager startup does the same things, just in a different order.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Kubelet won’t start unmounting volumes that are not needed. But that was the behavior also before this KEP.

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Check kubelet logs. There should be errors about a failed volume reconstruction, together with the directory where the volume was supposed to be mounted. Ensure that:

There is no Pod that uses the volume on the node.
The directory of the volume is not mounted there.
The directory and all its parents up to /var/lib/kubelet/pods/<uid>/volumes are removed.
If possible, locate global mount of the volume (if it exists) in /var/lib/kubelet/plugins/<volume plugin name> and unmount + remove it. The actual directory varies by volume plugin.
- For CSI volumes, if the CSI driver supports NodeStageVolume CSI call, the location is /var/lib/kubelet/plugins/kubernetes.io/csi/<csi driver name>/<sha256sum of pv.spec.csi.volumeHandle>/globalmount. Otherwise, there is no global mount directory.
- EmptyDir, Projected, DownwardAPI, Secrets and ConfigMaps do not have global mount directory.
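
When locating the CSI global mount by hand, the directory name can be computed from the volume handle. The sketch below assumes that "sha256sum" above means the lowercase hex-encoded SHA-256 of the raw pv.spec.csi.volumeHandle string; the function name, driver name and handle are made up for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path"
)

// csiPluginsDir is the CSI plugin directory named in the text above.
const csiPluginsDir = "/var/lib/kubelet/plugins/kubernetes.io/csi"

// csiGlobalMountPath builds the expected global mount directory for a CSI
// volume, assuming a hex-encoded SHA-256 of the volume handle.
func csiGlobalMountPath(driverName, volumeHandle string) string {
	sum := sha256.Sum256([]byte(volumeHandle))
	return path.Join(csiPluginsDir, driverName, hex.EncodeToString(sum[:]), "globalmount")
}

func main() {
	// Hypothetical driver name and volume handle.
	fmt.Println(csiGlobalMountPath("example.csi.driver", "abc"))
}
```

The resulting path can then be checked against the mount table before unmounting and removing it.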

Implementation History

1.26: Alpha version was implemented as part of KEP 1710 and behind SELinuxMountReadWriteOnce feature gate.
1.27: Splitting out as a separate KEP, targeting Beta in this release.
1.30: GA.
