KEP-5381: Mutable PersistentVolume Node Affinity

Implementation History
ALPHA Implementable
Created 2025-06-02
Latest v1.35
Milestones
Alpha v1.35
Beta v1.36
Stable v1.38
Ownership
Owning SIG
SIG Storage
Primary Authors

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes making the PersistentVolume.spec.nodeAffinity field mutable, so that the affinity of a volume can be changed after creation. This allows users to migrate data or enable features without interrupting workloads.

Motivation

Currently, PersistentVolume.spec.nodeAffinity is set at creation time and cannot be changed. However, users may modify a volume to take advantage of new features offered by the storage provider, or to accommodate changing business requirements. Such modifications can be expressed through VolumeAttributesClass in Kubernetes. Sometimes, though, a modification to a volume also changes its accessibility, such as:

  1. migration of data from one zone to regional storage;
  2. enabling features that are not supported by all client nodes.

In these scenarios, the nodeAffinity becomes inaccurate, causing the scheduler to make decisions based on outdated information. This results in pods:

  • being scheduled to nodes that cannot access the volume, getting stuck in a ContainerCreating state;
  • or being rejected from nodes that actually can access the volume, getting stuck in a Pending state.

By making the PersistentVolume.spec.nodeAffinity field mutable, we give storage providers a way to propagate the latest accessibility requirements to the scheduler, improving the reliability of stateful pod scheduling.

Goals

  • Make PersistentVolume.spec.nodeAffinity field mutable.

Non-Goals

  • Enable CSI drivers to return a new accessibility requirement on ControllerModifyVolume (future work).
  • Modifying the core scheduling logic of Kubernetes.
  • Implementing cloud provider-specific solutions within Kubernetes core.
  • Re-scheduling running pods with volumes being modified, or directly interrupting workloads.

Proposal

  1. Change APIServer validation to allow PersistentVolume.spec.nodeAffinity to be updated.
  2. Ensure the scheduler re-schedules pending pods that use a PV that has been updated (already implemented).
  3. When a Pod is scheduled to a node that does not match the volume's node affinity, kubelet should fail the Pod.
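The first step can be pictured as a feature-gated check in update validation. The following is a minimal sketch in Python, not the actual API server code: `PVSpec`, `validate_pv_update`, and the error string are hypothetical names used only for illustration, and both the old and the new object are assumed to already carry a node affinity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PVSpec:
    """Hypothetical stand-in for the parts of a PV spec relevant here."""
    capacity: str
    # The affinity is kept opaque; only equality matters for this check.
    node_affinity: Tuple[str, str]


def validate_pv_update(old: PVSpec, new: PVSpec, gate_enabled: bool) -> List[str]:
    """Return validation errors; an empty list means the update is allowed."""
    errors: List[str] = []
    # With the gate off, a changed affinity is rejected as before.
    # With the gate on, the change is accepted.
    if new.node_affinity != old.node_affinity and not gate_enabled:
        errors.append("spec.nodeAffinity: field is immutable")
    return errors


zonal = PVSpec("100Gi", ("topology.kubernetes.io/zone", "cn-beijing-g"))
regional = PVSpec("100Gi", ("topology.kubernetes.io/region", "cn-beijing"))
print(validate_pv_update(zonal, regional, gate_enabled=False))
print(validate_pv_update(zonal, regional, gate_enabled=True))
```

An update that leaves the affinity untouched passes regardless of the gate, which is why existing clients are unaffected.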

User Stories (Optional)

Story 1

As the owner of a stateful workload, I want to take advantage of the new regional storage offered by the storage provider, to improve the availability of my application. I need a way to tell the scheduler that the volume is now accessible in every zone, so that the pod can be scheduled to another zone when the current zone is down.

In this case, the old affinity would be:

required:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
      - cn-beijing-g

We would like to change it to:

required:
  nodeSelectorTerms:
  - matchExpressions:
    - key: topology.kubernetes.io/region
      operator: In
      values:
      - cn-beijing

For now this change has to be applied manually; hopefully it will be integrated into CSI in the future.
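To see the effect of this change on scheduling, the two affinities can be evaluated against node labels. The sketch below models only the `In` operator, with terms ORed and expressions within a term ANDed. The helper names and the second zone `cn-beijing-h` are assumptions made for illustration and are not taken from the KEP.

```python
from typing import Dict, List

LabelSet = Dict[str, str]


def expression_matches(expr: dict, labels: LabelSet) -> bool:
    """Evaluate one matchExpression; only the In operator is modelled."""
    if expr["operator"] != "In":
        raise NotImplementedError("this sketch only handles the In operator")
    return labels.get(expr["key"]) in expr["values"]


def affinity_matches(terms: List[dict], labels: LabelSet) -> bool:
    """A node matches if any term matches; a term needs all its expressions."""
    return any(
        all(expression_matches(e, labels) for e in term["matchExpressions"])
        for term in terms
    )


OLD_TERMS = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In",
     "values": ["cn-beijing-g"]}]}]
NEW_TERMS = [{"matchExpressions": [
    {"key": "topology.kubernetes.io/region", "operator": "In",
     "values": ["cn-beijing"]}]}]

# Hypothetical nodes: one in the original zone, one in another zone
# of the same region.
node_g = {"topology.kubernetes.io/zone": "cn-beijing-g",
          "topology.kubernetes.io/region": "cn-beijing"}
node_h = {"topology.kubernetes.io/zone": "cn-beijing-h",
          "topology.kubernetes.io/region": "cn-beijing"}

for name, labels in (("node_g", node_g), ("node_h", node_h)):
    print(name, affinity_matches(OLD_TERMS, labels),
          affinity_matches(NEW_TERMS, labels))
```

Under the old affinity only the node in `cn-beijing-g` is eligible; under the new one both nodes are, which is what allows a pod to move when its zone is down.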

Story 2

As a cluster operator, I’m conducting an upgrade to a new storage category offered by our storage provider. However, once upgraded, the volume cannot be attached to some legacy nodes in the cluster. I need a way to convey this new requirement to the scheduler, so that my pods will not get stuck in a ContainerCreating state.

In this case, the old affinity would be:

required:
  nodeSelectorTerms:
  - matchExpressions:
    - key: provider.com/disktype.cloud_ssd
      operator: In
      values:
      - available

Type A nodes support only cloud_ssd, while Type B nodes support both cloud_ssd and cloud_essd. We will only allow the modification if the volume is attached to a Type B node. I also want to make sure that Pods using the new cloud_essd volume are not scheduled to Type A nodes.

We would like to change the affinity to:

required:
  nodeSelectorTerms:
  - matchExpressions:
    - key: provider.com/disktype.cloud_essd
      operator: In
      values:
      - available
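One way to express this change is a JSON merge patch against the PV. The sketch below applies RFC 7386 merge semantics locally to show which parts of the object change. The PV content (including the `capacity` value) is hypothetical, and how the patch is sent to the API server is left out.

```python
import json


def merge_patch(target, patch):
    """Apply a JSON merge patch (RFC 7386).

    Objects are merged key by key, null deletes a key, and any other
    value (including a list) replaces the target value wholesale.
    """
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


def affinity_for(disk_type: str) -> dict:
    """Build the nodeAffinity used in this story for a given disk type."""
    return {"required": {"nodeSelectorTerms": [{"matchExpressions": [
        {"key": f"provider.com/disktype.{disk_type}", "operator": "In",
         "values": ["available"]}]}]}}


# Hypothetical PV before the upgrade.
pv = {"spec": {"capacity": {"storage": "100Gi"},
               "nodeAffinity": affinity_for("cloud_ssd")}}
patch = {"spec": {"nodeAffinity": affinity_for("cloud_essd")}}

patched = merge_patch(pv, patch)
print(json.dumps(patched, indent=2))
```

Because merge patch replaces lists as a whole, the patch has to carry the complete `nodeSelectorTerms` list rather than only the changed expression.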

Notes/Constraints/Caveats (Optional)

It is the storage provider’s responsibility to ensure that the running workload is not interrupted while the data is being moved.

Whoever modifies the PersistentVolume.spec.nodeAffinity field should ensure that no running Pods on nodes with incompatible labels are using the PV. Kubernetes will not verify this, because such a check would be expensive and racy.

If the incompatibility does happen (i.e. someone updated nodeAffinity, making running Pods violate the new nodeAffinity), we don’t guarantee that those Pods will continue to run without any issue. However, we try our best not to interrupt them:

  • For volumes that are not yet present in the Node.status.volumesAttached field, we fail the Pods that use them, since we are sure the Pods have never been running (see Handling race condition below).
  • We will not detach the volume. So if the volume is actually accessible (which depends on the storage provider), the Pod can continue to run.
  • For CSI drivers with requiresRepublish set to true, we will stop calling NodePublishVolume periodically, and an event is emitted.
  • For CSI drivers with requiresRepublish set to false, an event is emitted on kubelet restart. Otherwise the pod should continue to run; the affinity is not re-evaluated while the pod is running.

Note that Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution behaves similarly. Currently, if the node labels change and the Pod nodeAffinity becomes incompatible, the pod continues to run until kubelet restarts, at which point kubelet fails the pod.

Risks and Mitigations

Users are likely to roll out workload and PV nodeAffinity changes at the same time. This may trigger a race condition in which workload pods are scheduled to a node that matches the old nodeAffinity, but the volume cannot be used on that node.

To mitigate this risk, we let kubelet fail the mis-scheduled pods. Hopefully, the workload controller will create replacement pods for them.

If the user is running an incompatible scheduler that does not respect PV nodeAffinity, we may end up in an endless loop of creating and then failing pods. This should be fine since we already have many cases like this. We mitigate this by adding a note to the release notes.

Design Details

Handling race condition

There is a race condition between volume modification and pod scheduling:

  1. User modifies the volume on the storage provider side.
  2. A new Pod is created and the scheduler schedules it with the old affinity.
  3. User sets the new affinity on the PV.
  4. KCM/external-attacher attaches the volume to the node, and finds the affinity mismatch.

If this happens, the pod will be stuck in a ContainerCreating state. Kubelet should detect this condition and reject the pod. Hopefully another controller (e.g. the StatefulSet controller) will re-create the pod, which will then be scheduled to a correct node.

Specifically, in waitForVolumeAttach(), kubelet should reject the pod (setting the pod phase to ‘Failed’) if the volume is not present in the node.status.volumesAttached list and the volume nodeAffinity does not match the current node.

We check volumesAttached to ensure the Pods have never been running, to avoid interrupting running Pods. We don’t check the VolumeAttachment object, so that this also works for non-CSI volumes.
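The rejection rule described above combines two conditions. A minimal sketch follows, assuming the affinity check has already been evaluated for the current node. `should_fail_pod` and the volume name are hypothetical, and this is a Python model of the decision, not the kubelet implementation.

```python
from typing import Iterable


def should_fail_pod(volume_name: str,
                    volumes_attached: Iterable[str],
                    affinity_matches_node: bool) -> bool:
    """Decide whether a pod waiting for this volume should be failed.

    The pod is failed only when both conditions hold:
      * the volume is not listed in node.status.volumesAttached, so the
        pod cannot have been running with it, and
      * the volume's nodeAffinity does not match the current node.
    """
    already_attached = volume_name in set(volumes_attached)
    return (not already_attached) and (not affinity_matches_node)


# Hypothetical unique volume name as it might appear in volumesAttached.
VOL = "kubernetes.io/csi/example.com^vol-1"

print(should_fail_pod(VOL, [], affinity_matches_node=False))     # mis-scheduled
print(should_fail_pod(VOL, [VOL], affinity_matches_node=False))  # left alone
```

The second call shows the protective case: an attached volume is never a reason to fail a pod, even if the affinity no longer matches.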

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • pkg/apis/core/validation: 2025-09-30 - 85.1%

  • pkg/kubelet/volumemanager: 2025-09-30 - 72.1%

  • pkg/kubelet/volumemanager/reconciler: 2025-09-30 - 82.7%

  • Will test that kubelet volume manager correctly fails pods with a mismatched volume node affinity

  • Will test that kubelet volume manager does not fail pods whose volumes are already attached

  • Will test that API validation allows volume node affinity update if the feature gate is enabled

Integration tests
  • Test that modifying PV nodeAffinity triggers rescheduling of pending pods.
e2e tests

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Initial e2e tests completed and enabled

Beta

  • Gather feedback from developers and surveys
  • Additional tests are in Testgrid and linked in KEP
  • All monitoring requirements completed
  • All testing requirements completed
  • All known pre-release issues and gaps resolved

GA

  • 2 examples of real-world usage
  • Allowing 2 releases for feedback
  • All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

APIServer and kubelet can be upgraded / downgraded independently.

Upgrade the external storage controller after APIServer to take advantage of the new feature if desired. Alternatively, admins can use the new feature manually with kubectl.

Downgrade/reconfigure the external storage controller before APIServer to avoid PV nodeAffinity updates being rejected.

Version Skew Strategy

This feature involves changes to the kubelet and APIServer, but they are not strongly coupled.

An n-3 kubelet will not be able to fail the mis-scheduled pods. The mis-scheduled pods will be stuck in ContainerCreating status. If the kubelet is upgraded afterwards, it will properly fail those pods. Users can also manually delete the pods if they don’t want to upgrade kubelet soon. If the user does not actually update the PV nodeAffinity, there will be no such mis-scheduled pods and everything should be fine.

kube-scheduler is not directly affected. It just reads the latest PV nodeAffinity when making scheduling decisions, regardless of whether it has been updated.

An old external storage controller should work fine with a new APIServer, since it will not update PV nodeAffinity.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: MutablePVNodeAffinity
    • Components depending on the feature gate: kubelet, kube-apiserver
Does enabling the feature change any default behavior?

PV spec.nodeAffinity becomes mutable.

If a pod is scheduled to a node that is incompatible with the PV’s nodeAffinity, the pod will fail. Previously, it would be stuck in ContainerCreating status.

This should be rare before enabling this feature, since PV nodeAffinity cannot be updated and a CSI driver cannot change the topology reported from NodeGetInfo. So this is only possible if the user edited the node labels manually, or is running an incompatible scheduler. Existing workflows are unlikely to be affected by this behavior change.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Once disabled, PV node affinity cannot be updated any more. Already updated PVs will still keep the updated node affinity.

What happens if we reenable the feature if it was previously rolled back?

Nothing special.

Are there any tests for feature enablement/disablement?

Will add unit tests to verify the validation and kubelet behavior when the feature gate is enabled or disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?

High value of kubelet_admission_rejections_total{reason="VolumeNodeAffinity"}

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Unfortunately, no metric records updates to a specific field. Operators should check the APIServer audit log.

Operators may also use storage controller specific metrics.
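As an illustration of the audit log approach, the sketch below scans JSON-lines audit events for patch requests against PersistentVolumes whose request body touches spec.nodeAffinity. It assumes the audit policy records request bodies (`requestObject`) for PVs, and it only recognizes patches whose body is a JSON object. `update` requests are skipped because they carry the full object whether or not the affinity changed. The function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def pv_affinity_patches(audit_lines: Iterable[str]) -> Iterator[dict]:
    """Yield audit events that patch spec.nodeAffinity on a PV."""
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only look at completed requests to avoid counting a request twice.
        if event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "persistentvolumes":
            continue
        if event.get("verb") != "patch":
            continue
        request = event.get("requestObject")
        # A JSON Patch body is a list of operations; it is not handled here.
        if not isinstance(request, dict):
            continue
        spec = request.get("spec")
        if isinstance(spec, dict) and "nodeAffinity" in spec:
            yield event
```

Feeding the function the lines of an audit log file yields the matching events, from which the PV name (`objectRef.name`) and the requesting user can be read.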

How can someone using this feature know that it is working for their instance?
  1. nodeAffinity can now be updated for existing volumes
  2. pods that cannot run because their volume can’t be attached are now failed by kubelet

As a consequence, if a Pod was previously stuck due to an out-of-date PV nodeAffinity, the user can now update the PV to correct the nodeAffinity, and see the Pod eventually enter the Running state. For Pods stuck in ContainerCreating because the storage provider is unable to attach the volume to the scheduled node, the Pod will be rejected by kubelet and re-created on a correct node. For Pods stuck in Pending because no suitable node is available, the scheduler will retry scheduling the Pod according to the updated nodeAffinity.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

A count of PV nodeAffinity field updates. However, there are so many fields that it is not reasonable to add a metric for each field, or one specific to this field.

Dependencies

Does this feature depend on any specific services running in the cluster?

No. But an external storage controller can depend on this feature.

Scalability

Will enabling / using this feature result in any new API calls?

No if unused. One PATCH PV request from external storage controller or human operator per affinity update.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No if unused. Otherwise, it depends on the external storage controller implementation, which may make API calls to actually migrate the data in the volume.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No. There is only a slight increase in kubelet CPU usage to check node affinity.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Nothing changed.

What are other known failure modes?
  • endless loop of pod failure and recreation
    • Detection: rapidly increasing kubelet_admission_rejections_total{reason="VolumeNodeAffinity"}
    • Mitigations: scale down the workload to zero. These pods would not work anyway
    • Diagnostics: check scheduler logs to see why PV node affinity is ignored
    • Testing: No. This should not happen in a conformant cluster.
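The detection step above amounts to watching the growth rate of a counter. A minimal sketch computes a per-minute rate from two samples of kubelet_admission_rejections_total{reason="VolumeNodeAffinity"}. The threshold of 10 per minute is an arbitrary assumption, the function names are hypothetical, and counter resets (for example after a kubelet restart) are not handled.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def rejections_per_minute(samples: Sequence[Sample]) -> float:
    """Average increase per minute between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be ordered by increasing time")
    return (v1 - v0) * 60.0 / (t1 - t0)


def looks_like_failure_loop(samples: Sequence[Sample],
                            threshold_per_minute: float = 10.0) -> bool:
    """Flag a rate above the (assumed) threshold as a possible loop."""
    return rejections_per_minute(samples) > threshold_per_minute


# Two hypothetical scrapes taken two minutes apart.
scrapes = [(0.0, 10.0), (120.0, 70.0)]
print(rejections_per_minute(scrapes), looks_like_failure_loop(scrapes))
```

A suitable threshold depends on cluster size and workload churn, so it should be tuned rather than copied from this sketch.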
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2025-09: targeting alpha in v1.35
  • 2025-09-30: prototype implemented

Drawbacks

Alternatives

Integrate with the CSI spec and VolumeAttributesClass

We proposed a plan for this integration in a previous version of this KEP, but we did not reach consensus because too few storage providers (SPs) wanted to implement the feature. The main concerns were about the race condition between the scheduler and PV updates.

We will try this again in the future.

Infrastructure Needed (Optional)