KEP-4876: Mutable CSINode Allocatable Property

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes changes to make the CSINode.Spec.Drivers[*].Allocatable.Count field mutable and introduces a mechanism to update it dynamically based on user configuration at the CSI driver level. These updates can be triggered either by periodic intervals or by failure detection (such as volume attachment failures due to insufficient capacity). This improvement enhances the reliability of stateful pod scheduling by addressing mismatches between reported and actual attachment capacity on nodes.

Motivation

Currently, a mismatch between the reported and actual attachment capacity on nodes can result in permanent scheduling errors and stuck workloads. This occurs when volume slots are taken after a CSI driver starts up, which results in kube-scheduler assigning stateful pods to nodes lacking the necessary capacity to support them. This mismatch can happen due to various scenarios, such as:

Operations out of band with respect to CSI drivers and Kubernetes:
- Manual attachment of volumes by administrators or external controllers.
Multi-driver scenarios:
- When multiple CSI drivers are used on a node and one driver’s operations affect the available capacity for others.
Other devices consuming available slots:
- Network interfaces taking up slots.
- GPU or specialized hardware attachments that weren’t present during CSI driver initialization.

These scenarios can lead to the CSI driver reporting an initial capacity that becomes inaccurate over time, causing the scheduler to make decisions based on outdated information. This results in pods being scheduled to nodes without sufficient capacity, ultimately getting stuck in a ContainerCreating state.

By making the CSINode.Spec.Drivers[*].Allocatable.Count field mutable and introducing a mechanism to update it dynamically, we can ensure that the scheduler always has information which more accurately represents the actual state of the world, significantly improving the reliability of stateful pod scheduling.

Goals

Make CSINode.Spec.Drivers[*].Allocatable.Count mutable.
Enable CSI drivers to define the interval at which the Allocatable.Count value on each node is updated through the CSIDriver object.
Automatically update CSINode.Spec.Drivers[*].Allocatable.Count upon detecting a failure in volume attachment due to insufficient capacity.

Non-Goals

Modifying the core scheduling logic of Kubernetes.
Implementing cloud provider-specific solutions within Kubernetes core.
Re-scheduling pods stuck in a ContainerCreating state.

Proposal

User Stories (Optional)

Story 1

As a cluster administrator, I want the reported attachment capacity on nodes to accurately reflect the actual capacity, so that stateful pods are reliably scheduled and do not become stuck in a ContainerCreating state due to insufficient capacity.

Story 2

As a cluster operator, I use volumes during node setup for components like kubelet, containerd, and additional drivers. These boot volumes, which are not managed by CSI, may be detached after setup, and I need a way to reclaim these slots for other uses. The current static capacity reporting doesn’t allow for this flexibility.

Story 3

As a cluster operator, I need the Kubernetes scheduler to accurately count the number of available device slots for both storage volumes and network interfaces. On certain machine types, network interfaces and volumes share device slots, and network interfaces may be dynamically attached after the CSI driver is registered. This results in an inaccurate Allocatable.Count for volumes, causing stateful pods to be scheduled on nodes with insufficient capacity, ultimately getting stuck in a ContainerCreating state.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

The following risks are identified:

Frequent updates/retrieval of the CSINode object could increase API server load.
Frequent calls to a CSI driver’s NodeGetInfo RPC endpoint may become expensive, particularly if the operation involves retrieving information from a remote server or performing resource-intensive tasks. Specifically, this is a concern at scale, where the cumulative cost of multiple nodes repeatedly querying for updates is more impactful.
There’s a race condition where the scheduler might assign a stateful pod to a node with insufficient capacity if the CSINode.Spec.Drivers[*].Allocatable.Count value hasn’t been updated in time.

The risks are mitigated as follows:

The use of the Kubernetes informer pattern in the scheduler. The scheduler uses a CSINode informer and lister to efficiently access and watch CSINode objects (this logic is already present).
Users opt in to this feature per CSI driver by configuring the CSIDriver object. Administrators can tune the update interval through the NodeAllocatableUpdatePeriodSeconds attribute in the CSIDriver object to suit their requirements.
A reactive update mechanism is implemented to immediately update the CSINode.Spec.Drivers[*].Allocatable.Count value if a pod fails to enter a running state due to volume attachment failures caused by insufficient capacity. This ensures that even if a race occurs, Kubernetes quickly corrects itself and prevents further scheduling errors.

Design Details

Feature Gate

A new feature gate - MutableCSINodeAllocatableCount - will be introduced to control the functionality implemented by this KEP. When the feature gate is disabled, the CSINode object will remain immutable, maintaining the current behavior.

API Changes

CSINode

The CSINode.Spec.Drivers[*].Allocatable.Count field will be made mutable. No changes to the object structs are needed; only the validation logic needs to be revised. For reference, these are the API fields this KEP proposes to make mutable.

// CSINodeDriver holds information about the specification of one CSI driver installed
type CSINodeDriver struct {
    ...
    // allocatable represents the volume resources of a node that are available for sc
    // +optional
    Allocatable *VolumeNodeResources
}

// VolumeNodeResources is a set of resource limits for scheduling of volumes.
type VolumeNodeResources struct {
    // Maximum number of unique volumes managed by the CSI driver that can be used on
    // A volume that is both attached and mounted on a node is considered to be used o
    // The same rule applies for a unique volume that is shared among multiple pods on
    // If this field is not specified, then the supported number of volumes on this no
    // +optional
    Count *int32
}

CSIDriver

A new field, NodeAllocatableUpdatePeriodSeconds, will be added to the CSIDriverSpec struct. This field allows a CSI driver to specify the interval at which the Kubelet should periodically query a driver’s NodeGetInfo RPC endpoint to update the CSINode object. If this field is not set, no updates occur (neither periodic nor upon detecting capacity-related failures), and the allocatable count remains static.

// CSIDriverSpec is the specification of a CSIDriver.
type CSIDriverSpec struct {
    ...
    // nodeAllocatableUpdatePeriodSeconds specifies the interval between periodic updates of
    // the CSINode allocatable capacity for this driver. When set, both periodic updates and
    // updates triggered by capacity-related failures are enabled. If not set, no updates
    // occur (neither periodic nor upon detecting capacity-related failures), and the
    // allocatable.count remains static. The minimum allowed value for this field is 10 seconds.
    //
    // This field is mutable.
    //
    // +featureGate=MutableCSINodeAllocatableCount
    // +optional
    NodeAllocatableUpdatePeriodSeconds *int64
}
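
The 10-second minimum stated in the field comment could be enforced along the following lines. This is a minimal sketch, not the upstream validation code: validateUpdatePeriod and minUpdatePeriodSeconds are hypothetical names, and rejecting every set value below 10 is an assumption drawn only from the comment above.

```go
package main

import "fmt"

// minUpdatePeriodSeconds mirrors the 10-second minimum stated in the field comment.
const minUpdatePeriodSeconds int64 = 10

// validateUpdatePeriod is a hypothetical helper, not the upstream validation code.
// A nil period means the field is unset, which is valid and leaves updates disabled.
func validateUpdatePeriod(period *int64) error {
	if period == nil {
		return nil
	}
	if *period < minUpdatePeriodSeconds {
		return fmt.Errorf("nodeAllocatableUpdatePeriodSeconds must be at least %d, got %d",
			minUpdatePeriodSeconds, *period)
	}
	return nil
}

func main() {
	tooShort, valid := int64(5), int64(60)
	fmt.Println(validateUpdatePeriod(nil) == nil)       // unset is accepted
	fmt.Println(validateUpdatePeriod(&tooShort) != nil) // below the minimum is rejected
	fmt.Println(validateUpdatePeriod(&valid) == nil)    // at or above the minimum is accepted
}
```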

VolumeError

A new field, ErrorCode, will be added to the VolumeError struct to facilitate detection of capacity-related errors:

// Captures an error encountered during a volume operation.
type VolumeError struct {
    ...
    // errorCode is a numeric gRPC code representing the error encountered during Attach or Detach operations.
    //
    // This is an optional field that requires the MutableCSINodeAllocatableCount feature gate being enabled to be set.
    //
    // +featureGate=MutableCSINodeAllocatableCount
    // +optional
    ErrorCode *int32
}

Validation Changes

The ValidateCSINodeUpdate function in the API validation code path will be modified to allow updates to the Allocatable.Count field when the feature gate is enabled:

func ValidateCSINodeUpdate(new, old *storage.CSINode) field.ErrorList {
    allErrs := ValidateCSINode(new)
    
    if utilfeature.DefaultFeatureGate.Enabled(features.MutableCSINodeAllocatableCount) {
        for _, oldDriver := range old.Spec.Drivers {
            for _, newDriver := range new.Spec.Drivers {
                // Allow Allocatable.Count to be modified
                // Ensure all other fields are unchanged
            }
        }
    } else {
        // Existing validation logic for when feature gate is disabled
    }
    return allErrs
}

This updated logic allows the Allocatable.Count field to be modified when the feature gate is enabled, while ensuring all other fields remain immutable. When the feature gate is disabled, it falls back to the existing validation logic for backward compatibility.
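
One way to express "only Allocatable.Count may change" is to compare each pair of same-named drivers with the mutable field cleared. The sketch below assumes simplified stand-in types rather than the real storage API structs, and existingDriversUnchangedExceptCount is a hypothetical helper name. Because the stand-in VolumeNodeResources holds only Count, ignoring the whole Allocatable pointer is equivalent to ignoring Count here.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-ins for the storage API types; the real structs differ.
type VolumeNodeResources struct{ Count *int32 }

type CSINodeDriver struct {
	Name         string
	NodeID       string
	TopologyKeys []string
	Allocatable  *VolumeNodeResources
}

// existingDriversUnchangedExceptCount is a hypothetical helper. It reports whether
// every driver present in both lists is identical apart from its Allocatable value.
func existingDriversUnchangedExceptCount(oldDrivers, newDrivers []CSINodeDriver) bool {
	newByName := make(map[string]CSINodeDriver, len(newDrivers))
	for _, d := range newDrivers {
		newByName[d.Name] = d
	}
	for _, o := range oldDrivers {
		n, found := newByName[o.Name]
		if !found {
			continue // added or removed drivers are out of scope for this sketch
		}
		// o and n are copies, so clearing the pointer does not touch the caller's data.
		o.Allocatable, n.Allocatable = nil, nil
		if !reflect.DeepEqual(o, n) {
			return false
		}
	}
	return true
}

func main() {
	ten, seven := int32(10), int32(7)
	before := []CSINodeDriver{{Name: "example.csi.driver", NodeID: "node-a", Allocatable: &VolumeNodeResources{Count: &ten}}}
	countOnly := []CSINodeDriver{{Name: "example.csi.driver", NodeID: "node-a", Allocatable: &VolumeNodeResources{Count: &seven}}}
	nodeIDToo := []CSINodeDriver{{Name: "example.csi.driver", NodeID: "node-b", Allocatable: &VolumeNodeResources{Count: &seven}}}
	fmt.Println(existingDriversUnchangedExceptCount(before, countOnly)) // count-only change
	fmt.Println(existingDriversUnchangedExceptCount(before, nodeIDToo)) // NodeID also changed
}
```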

CSI Node Updater

A new plugin-level updater will be implemented in kubernetes/pkg/volume/csi/csi_node_updater.go to manage periodic updates of CSINode allocatable counts. This updater watches for changes to CSIDriver objects and manages per-driver update goroutines based on the NodeAllocatableUpdatePeriodSeconds setting.

Implementation details

// csiNodeUpdater watches for changes to CSIDriver objects and manages the lifecycle
// of per-driver goroutines that periodically update CSINodeDriver.Allocatable information
type csiNodeUpdater struct {
    // Informer for CSIDriver objects
    driverInformer cache.SharedIndexInformer
    
    // Map of driver names to stop channels for update goroutines
    driverUpdaters sync.Map
    
    // Ensures the updater is only started once
    once sync.Once
}

Update behavior

When a CSIDriver object is added or updated with NodeAllocatableUpdatePeriodSeconds set, the updater checks if the driver is installed on the node before running periodic updates.

When NodeAllocatableUpdatePeriodSeconds is modified, the updater automatically adjusts by stopping the old goroutine and starting a new one. Setting the period to 0 or nil stops updates entirely. Driver uninstallation or CSIDriver object deletion also stops the update goroutine for that specific driver.

func (u *csiNodeUpdater) runPeriodicUpdate(driverName string, period time.Duration, stopCh <-chan struct{}) {
    ticker := time.NewTicker(period)
    defer ticker.Stop()
    
    for {
        select {
        case <-ticker.C:
            if err := updateCSIDriver(driverName); err != nil {
                klog.ErrorS(err, "Failed to update CSIDriver", "driver", driverName)
            }
        case <-stopCh:
            return
        }
    }
}

Error handling

If updateCSIDriver() fails, the error is logged but the allocatable count retains its current value. Updates continue at the configured interval regardless of individual failures.
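
The stop-and-restart behavior described above can be sketched with a map of per-driver stop channels. This is an illustrative stand-in, not the csiNodeUpdater implementation: driverUpdaters, reconcile, stop, and running are hypothetical names, and the sketch assumes reconcile is called from a single goroutine (for example, a serialized event handler).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// driverUpdaters is a hypothetical, simplified stand-in for csiNodeUpdater.
type driverUpdaters struct {
	stopChans sync.Map          // driver name -> chan struct{}
	update    func(name string) // called on every tick; stands in for the CSINode update
}

// stop ends the update goroutine for the named driver, if one is running.
func (d *driverUpdaters) stop(name string) {
	if ch, loaded := d.stopChans.LoadAndDelete(name); loaded {
		close(ch.(chan struct{}))
	}
}

// reconcile replaces any running goroutine for the driver with one that ticks at
// the new period. A nil or zero period only stops the existing goroutine.
// Assumption: callers do not invoke reconcile concurrently for the same driver.
func (d *driverUpdaters) reconcile(name string, periodSeconds *int64) {
	d.stop(name)
	if periodSeconds == nil || *periodSeconds <= 0 {
		return
	}
	stopCh := make(chan struct{})
	d.stopChans.Store(name, stopCh)
	period := time.Duration(*periodSeconds) * time.Second
	go func() {
		ticker := time.NewTicker(period)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				d.update(name)
			case <-stopCh:
				return
			}
		}
	}()
}

// running reports whether an update goroutine is registered for the driver.
func (d *driverUpdaters) running(name string) bool {
	_, ok := d.stopChans.Load(name)
	return ok
}

func main() {
	d := &driverUpdaters{update: func(name string) { fmt.Println("tick for", name) }}
	period := int64(30)
	d.reconcile("example.csi.driver", &period)
	fmt.Println(d.running("example.csi.driver"))
	d.reconcile("example.csi.driver", nil) // unsetting the period stops updates
	fmt.Println(d.running("example.csi.driver"))
}
```

Closing the stop channel, rather than sending on it, lets the goroutine exit even if it is in the middle of an update when the stop is requested.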

NodeInfoManager Interface Extension

The existing NodeInfoManager interface will be extended to include a new method for updating the CSINode object:

// Interface implements an interface for managing labels of a node
type Interface interface {
    CreateCSINode() (*storagev1.CSINode, error)
    ...
    // UpdateCSINode updates the CSINode object
    UpdateCSINode() error
}

CSINode Update Behavior

This table explains how updates to the CSINode.Spec.Drivers[*].Allocatable.Count field are handled, depending on the status of the MutableCSINodeAllocatableCount feature flag and the NodeAllocatableUpdatePeriodSeconds field in the CSIDriver object.

Feature Flag Status	`NodeAllocatableUpdatePeriodSeconds`	Behavior
Enabled	Set	Periodic updates occur at the defined interval, and reactive updates occur when an invalid state is detected (volume attachment failures due to `ResourceExhausted`)
Enabled	Not set	No updates occur; `Allocatable.Count` remains static
Disabled	Set	`NodeAllocatableUpdatePeriodSeconds` is ignored; `Allocatable.Count` remains static and immutable
Disabled	Not set	No updates occur; `Allocatable.Count` remains static and immutable
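
The four rows above reduce to a single condition: updates happen only when the feature gate is enabled and the field is set. A minimal sketch, where allocatableUpdatesActive is a hypothetical helper name and only the set/unset distinction from the table is modeled:

```go
package main

import "fmt"

// allocatableUpdatesActive is a hypothetical helper that encodes the table:
// both periodic and reactive updates require the gate and a set period.
func allocatableUpdatesActive(gateEnabled bool, periodSeconds *int64) bool {
	return gateEnabled && periodSeconds != nil
}

func main() {
	period := int64(60)
	for _, gate := range []bool{true, false} {
		for _, p := range []*int64{&period, nil} {
			fmt.Printf("gate=%t set=%t updates=%t\n", gate, p != nil, allocatableUpdatesActive(gate, p))
		}
	}
}
```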

Pod Construction Changes

To address race conditions where the scheduler assigns stateful pods to nodes with insufficient capacity, Kubelet’s pod construction process during WaitForAttachAndMount will now handle ResourceExhausted errors returned by CSI drivers during the ControllerPublishVolume RPC.

The ResourceExhausted error is directly reported on the VolumeAttachment object associated with the relevant attachment. To facilitate easier detection of ResourceExhausted errors from VolumeAttachment statuses, we propose adding an ErrorCode field to the VolumeError struct.

if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
    if isResourceExhaustedError(err) {
        // Update CSINode using a backoff mechanism
        // Generate event for affected pod
    } else {
        // Existing error handling
    }
}

This change ensures that, when a pod fails to be constructed due to insufficient volume attachment capacity, both of the following happen:

The CSINode object is promptly updated to reflect the actual available capacity, improving future scheduling decisions.
An event is added to the pod, providing visibility to cluster operators and enabling automated actions by components like the Kubernetes descheduler to fix the stateful pods stuck in ContainerCreating.
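
With a numeric code on the attach error, detecting the capacity case becomes a simple comparison. The sketch below uses a simplified stand-in for VolumeError and a hypothetical helper named isResourceExhausted. The value 8 is the numeric gRPC status code for RESOURCE_EXHAUSTED. How the kubelet actually obtains the attach error is not shown.

```go
package main

import "fmt"

// grpcResourceExhausted is the numeric value of the gRPC RESOURCE_EXHAUSTED status code.
const grpcResourceExhausted int32 = 8

// VolumeError is a simplified stand-in for the API type with the proposed field.
type VolumeError struct {
	Message   string
	ErrorCode *int32
}

// isResourceExhausted is a hypothetical helper. It treats a missing error or a
// missing code as "not a capacity failure" so that older objects are handled safely.
func isResourceExhausted(attachErr *VolumeError) bool {
	return attachErr != nil && attachErr.ErrorCode != nil && *attachErr.ErrorCode == grpcResourceExhausted
}

func main() {
	exhausted := grpcResourceExhausted
	fmt.Println(isResourceExhausted(&VolumeError{Message: "no free slots", ErrorCode: &exhausted}))
	fmt.Println(isResourceExhausted(&VolumeError{Message: "no code recorded"}))
	fmt.Println(isResourceExhausted(nil))
}
```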

Scheduler Enhancements

The CSI volume limits scheduler plugin currently only registers for CSINode “Add” events. To ensure the scheduler promptly reacts to “Update” events as well, we need to modify the EventsToRegister() function in the scheduler plugin to include:

{Event: framework.ClusterEvent{Resource: framework.CSINode, ActionType: framework.Update}}

This enhancement ensures that when the Allocatable.Count property is updated, the scheduler re-queues previously unschedulable pods to attempt scheduling with updated capacity information.
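
The re-queue decision for a CSINode update can be illustrated as a comparison of old and new counts: an increase may make a pending pod schedulable, while a decrease cannot. This is a sketch under simplified assumptions, using plain maps of driver name to count instead of the scheduler framework types. shouldRequeue is a hypothetical name.

```go
package main

import "fmt"

// shouldRequeue is a hypothetical helper. It reports whether any driver present
// in both snapshots has a higher allocatable count after the update.
func shouldRequeue(oldCounts, newCounts map[string]int32) bool {
	for driver, newCount := range newCounts {
		if oldCount, ok := oldCounts[driver]; ok && newCount > oldCount {
			return true
		}
	}
	return false
}

func main() {
	before := map[string]int32{"example.csi.driver": 4}
	fmt.Println(shouldRequeue(before, map[string]int32{"example.csi.driver": 6})) // capacity grew
	fmt.Println(shouldRequeue(before, map[string]int32{"example.csi.driver": 2})) // capacity shrank
}
```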

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/kubelet: 2024-09-24 - 51%
k8s.io/kubernetes/pkg/apis/storage/validation: 2024-09-24 - 96%
k8s.io/kubernetes/pkg/volume/plugins.go: 2024-09-24 - 27.9%
k8s.io/kubernetes/pkg/volume/csi/nodeinfomanager: 2024-09-24 - 76.6%

Integration tests

N/A, this enhancement does not introduce configuration parameters or CLI options that are used to start binaries. See e2e and graduation criteria for a comprehensive list of code coverage.

e2e tests

Test the end-to-end workflow of updating CSINode.Spec.Drivers[*].Allocatable.Count using a CSI driver.

testgrid.k8s.io: https://testgrid.k8s.io/presubmits-kubernetes-nonblocking#pull-kubernetes-e2e-kind-alpha-beta-features

Graduation Criteria

Alpha

Beta

All unit tests/integration/e2e tests completed and enabled:

[✅] Test the end-to-end workflow of updating CSINode.Spec.Drivers[*].Allocatable.Count using a CSI driver.
[✅] CSINode Updater
[✅] VolumeAttachment Error Code
[✅] Scheduler QueueingHintFn
- Test when Allocatable value is increased that Stateful pod is queued.
- Test when Allocatable value is decreased that Stateful pod is not queued.
[✅] Feature Gate / API Validation
- Test Allocatable value is updated when feature gate is enabled.
- Test Allocatable value is unchanged when feature gate is disabled.

GA

[✅] No bug reports / feedback / improvements to address in k/k.
[✅] No bug reports in Cluster Autoscaler as a result of this enhancement (this KEP does not affect CA, but we have added this requirement out of an abundance of caution, as requested by CA.)

Upgrade / Downgrade Strategy

Upgrade Strategy
- Upgrade the API server first to support mutable CSINode.Spec.Drivers[*].Allocatable.Count and the new NodeAllocatableUpdatePeriodSeconds field in CSIDriver object.
- Upgrade nodes
- Update CSI drivers to take advantage of the new feature, if desired.
Downgrade Strategy
- If downgrading the API server, ensure that nodes are downgraded first to avoid rejected CSINode update attempts.
- CSI drivers using the NodeAllocatableUpdatePeriodSeconds feature should be reconfigured to not use this field before downgrading the API server.

Version Skew Strategy

This enhancement primarily involves changes to the kubelet and the API server, with no impact on the scheduler. Here’s how the system will behave in various version skew scenarios:

API Server considerations
- Older API server versions will reject updates to the CSINode.Spec.Drivers[*].Allocatable.Count field and won’t recognize the NodeAllocatableUpdatePeriodSeconds field in the CSIDriver object.
Kubelet version considerations
- Newer kubelet (with this feature) + Older API server: The kubelet will attempt to update the CSINode.Spec.Drivers[*].Allocatable.Count field due to capacity failures, but these updates will be rejected by the API server.
- Older kubelet + Newer API server: Volume attachment failures due to capacity issues will not trigger CSINode updates during pod construction.
Scheduler considerations
- The scheduler is not directly affected by this change and will continue to use the latest CSINode.Spec.Drivers[*].Allocatable.Count value for scheduling decisions, regardless of whether it’s being updated or not.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: MutableCSINodeAllocatableCount
- Components depending on the feature gate: kube-apiserver, kubelet, kube-scheduler.

Does enabling the feature change any default behavior?

The CSINode.Spec.Drivers[*].Allocatable.Count field becomes mutable, and the kubelet will attempt to update this field when a pod fails to enter a ready state because a volume attachment failed due to insufficient capacity.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by turning off the feature gate.

What happens if we reenable the feature if it was previously rolled back?

The CSINode.Spec.Drivers[*].Allocatable.Count field will become mutable again.

Are there any tests for feature enablement/disablement?

Yes, unit tests will be implemented to verify the behavior of the ValidateCSINodeUpdate function when the feature gate is enabled and disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The rollout or rollback of this feature is designed such that it cannot fail in a way that impacts cluster operation.

During rollout, if the API server / Kubelet doesn’t support the feature or if there’s a version mismatch, update attempts to CSINode.Allocatable will fail gracefully, maintaining the existing value. This ensures that the worst-case scenario is simply a continuation of the current behavior, rather than a failure state.

For rollback, disabling the feature gate will immediately stop any updates to the allocatable property. Kubernetes will continue using the last known value, which may be outdated but won’t cause operational issues.

In essence, the feature’s best-effort nature and feature gate protection make it resilient against rollout or rollback failures. The primary risk is temporary inconsistency in reported capacities during transition periods, but this does not impact running workloads or overall cluster stability.

What specific metrics should inform a rollback?

Unexpected API server errors

apiserver_request_total{group="storage.k8s.io", resource="csinodes", verb=~"UPDATE|PATCH", code=~"4..|5.."} - A sustained failure rate of 4xx/5xx HTTP responses for >= 2 minutes indicates the feature is misbehaving and warrants rollback.

API server latency degradation

apiserver_request_duration_seconds{resource="csinodes", verb=~"UPDATE|PATCH"} - Significant increases in p95 or p99 latency for CSINode updates are not expected and may suggest API server contention.

Besides this, since the enhancement implements best-effort updates to the CSINode.Allocatable property, the only scenarios that would necessitate a rollback are:

Unexpected kubelet crashes after enabling the feature.
API server crashes related to CSINode updates.

In both cases, component crashes would be evident through standard monitoring of node and control plane health.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, the following test scenarios were validated in the Alpha release:

Upgrade path: API server and Kubelet upgrades were tested with the feature gate enabled, confirming that CSINode updates begin working once both components support the feature.
Downgrade path: Confirmed that, when the feature gate is disabled or components are downgraded, CSINode.Allocatable remains at its last value and becomes immutable again.
upgrade->downgrade->upgrade path: Verified that the full cycle works as expected, with CSINode updates resuming when the feature is re-enabled without requiring additional configuration.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

An operator can determine if this feature is in use by checking the CSIDriver objects in their cluster for the nodeAllocatableUpdatePeriodSeconds field. If this field is set on a CSI driver, the feature is being used. This is similar to how operators check for other CSI capabilities through fields in the CSIDriver object, such as fsGroupPolicy or podInfoOnMount.

How can someone using this feature know that it is working for their instance?

API .status
- VolumeAttachment.Status.Errors[].ErrorCode will be populated with the gRPC error code when a ResourceExhausted error occurs during a driver’s ControllerPublishVolume RPC.
- CSINode.Spec.Drivers[*].Allocatable.Count will be updated periodically based on the nodeAllocatableUpdatePeriodSeconds configuration in the CSIDriver object.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

For this enhancement, the following SLOs are reasonable:

99.9% of CSINode updates (both periodic and reactive) should complete within 1 second of being triggered.
The introduction of this feature should not increase the overall API server error rate (5xx errors) by more than 0.1%.
No measurable impact on pod startup latency, as CSINode updates are performed asynchronously.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Operators can measure the rate of PATCH/UPDATE calls to the csinodes API resource that return a status code of 200. A consistent rate matching the configured NodeAllocatableUpdatePeriodSeconds indicates that periodic updates are working as expected:

apiserver_request_total{group="storage.k8s.io", resource="csinodes", verb=~"PATCH|UPDATE", code="200"}

Are there any missing metrics that would be useful to have to improve observability of this feature?

The following metrics could provide more granular visibility into the feature’s operation. They were not added because the Kubernetes API server already exposes metrics that give sufficient visibility into CSINode update activity, including response status codes and latency.

csi_node_updates_total: Could track CSINode.Spec.Drivers[*].Allocatable updates attempted (periodic/reactive).
csi_node_update_errors_total: Could track failed update attempts.
csi_node_update_duration_seconds: Could track update latency.

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature primarily depends on CSI drivers implementing the NodeGetInfo RPC to report volume attachment limits. If a CSI driver is unavailable, the CSINode.Spec.Drivers[*].Allocatable value remains at its last known value. Degraded performance or high error rates in CSI drivers may cause periodic or reactive updates to fail, but this only results in using the last known value, with no impact on existing workloads.

Beyond CSI drivers, which are already a requirement for volume operations, this feature introduces no additional service dependencies. It builds upon existing Kubernetes components (kubelet and API server) and their normal operation.

Scalability

Will enabling / using this feature result in any new API calls?

Yes, there will be new API calls to update the CSINode object:

API call type: PATCH
Estimated throughput: Depends on the `NodeAllocatableUpdatePeriodSeconds` setting and the frequency of volume attachment failures.
Originating component: Kubelet
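
As a rough way to reason about the periodic share of this throughput, the estimate below assumes every tick on every node issues exactly one PATCH per opted-in driver. That assumption is not confirmed by this KEP, and reactive updates from attachment failures come on top. maxPeriodicPatchesPerSecond is a hypothetical helper.

```go
package main

import "fmt"

// maxPeriodicPatchesPerSecond estimates the cluster-wide rate of periodic CSINode
// PATCH calls, assuming one PATCH per node, per opted-in driver, per period.
func maxPeriodicPatchesPerSecond(nodes, drivers int, periodSeconds int64) float64 {
	return float64(nodes*drivers) / float64(periodSeconds)
}

func main() {
	// Example: 6000 nodes, one opted-in driver, 60-second period.
	fmt.Printf("%.1f PATCH/s\n", maxPeriodicPatchesPerSecond(6000, 1, 60))
}
```

Under these assumptions, doubling the period halves the periodic request rate, which is the main lever administrators have for limiting API server load.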

Will enabling / using this feature result in introducing new API types?

No, this feature does not introduce new API types.

Will enabling / using this feature result in any new calls to the cloud provider?

No, this feature does not introduce new calls to the cloud provider directly. However, CSI drivers may make additional calls to retrieve updated capacity information.

Will enabling / using this feature result in increasing size or count of the existing API objects?

API Object: CSIDriver
Estimated increase in size: New `NodeAllocatableUpdatePeriodSeconds` field (approximately 32 bytes)
Estimated amount of new objects: No new objects, only modification of existing CSIDriver objects

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

This feature should not impact existing SLIs/SLOs. The CSINode updates are asynchronous and should not directly affect pod startup times or API responsiveness.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

The feature may result in a slight increase in CPU and network usage on nodes due to periodic CSINode updates and more frequent calls to the CSI driver’s NodeGetInfo RPC.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

This feature should not result in resource exhaustion of node resources. The additional goroutine and API calls are minimal and should not significantly impact the node’s resources.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

When the API server is unavailable, CSINode update attempts fail and are logged. The periodic update goroutines continue running and retry at their configured intervals. Additionally, ResourceExhausted errors cannot trigger immediate updates since VolumeAttachment statuses cannot be read. Existing allocatable values remain unchanged and stateful workloads continue running normally.

What are other known failure modes?

No other known failure modes.

What steps should be taken if SLOs are not being met to determine the problem?

N/A

Implementation History

2024-08-08 - Enhancement proposed in sig-storage.
2024-09-25 - Enhancement officially submitted to Kubernetes.
2025-04-23 - Kubernetes v1.33: Enhancement implemented and released in Alpha.
2025-08-27 - Kubernetes v1.34: Enhancement graduated to Beta.
2025-12-17 - Kubernetes v1.35: Feature gate enabled by default in Beta.

Drawbacks

Alternatives

Implementing a custom scheduler: This approach was rejected for several reasons.
- It would significantly degrade the customer experience, as users would need to deploy and manage an additional component.
- This issue is not a niche use case; it affects a wide range of CSI drivers and cloud providers.
- The default Kubernetes scheduler heavily relies on the CSINode allocatable object to make informed decisions about node capacity. Implementing a custom scheduler is arguably a workaround that does not address the root cause and inherent limitation of the immutable CSINode object today.
