KEP-5030: Integrate CSI Volume attach limits with cluster autoscaler
KEP-5030: Integrate Volume Attach limit into cluster autoscaler
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Cluster Autoscaler changes
- Kubernetes Scheduler change
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Fix cluster-autoscaler (CAS) to be aware of node’s volume attach limits when scaling new nodes and prevent scheduler from placing pods on nodes that do not have a particular CSI driver installed.
Motivation
When scaling up to satisfy pending pods, cluster-autoscaler (CAS) currently does not take into account the volume attach limits (available via CSI) that an upcoming node may have. This can result in too few nodes being created to satisfy the pending pods. With this KEP, we will change CAS so that it takes CSI volume attach limits into account, via templated CSINode objects, when it runs simulations to estimate the number of nodes needed for pending pods or runs scheduler simulations on upcoming nodes.
There is also a gap in the NodeVolumeLimits scheduler plugin. The gap was left intentionally because CAS runs this plugin without any templated CSINode objects when creating new nodes, so the plugin permits placing an unlimited number of pods on a node even if no CSI driver is installed on it. With this KEP, we aim to close that gap so that the scheduler will not place pods on nodes that aren’t reporting any CSI driver information, if a CSI driver opts in to this behavior.
To summarize, before this KEP:
- Scheduler CSI plugin assumes that “no information about a CSI driver published in a CSINode” means “no limits for volumes from that driver”.
- For existing Nodes with CSI driver information already published, CA correctly takes the volume limits into account when running scheduler filters in simulations (e.g. when packing pending Pods on existing Nodes in the cluster at the beginning of the loop).
- For fake “upcoming” Nodes created in-memory by CA during scale-up simulations, the corresponding “upcoming” CSINode is not created or taken into account. The volume limits are therefore ignored when running scheduler filters, so CA packs more Pods per Node than actually fit and undershoots scale-ups.
- For existing Nodes with CSI driver information already published, scheduler correctly takes the volume limits into account when scheduling.
- For new Nodes with not all CSI driver information published yet, scheduler can let Pods in that can’t actually run on the Node.
After:
- By default, the scheduler CSI plugin still assumes that “no information about a CSI driver published in a CSINode” means “the node can handle an unlimited number of volumes”.
- Only when explicitly opted in via the CSIDriver instance does the scheduler CSI plugin assume that “no information about a CSI driver published in a CSINode” means “the node cannot handle any volumes”.
- No change for existing Nodes with CSI driver information already published - CA and scheduler still behave correctly.
- Scheduler waits until all relevant CSI driver info is published before scheduling a Pod, removing the race condition for new Nodes.
- Cluster Autoscaler correctly simulates “upcoming” CSINodes for “upcoming” Nodes and makes correct scale-up decisions.
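The undershoot described in the list above comes down to simple arithmetic. The following is a minimal sketch, not the actual CAS estimator: the helper nodesNeeded, its parameters, and the numbers are hypothetical, and it assumes every pending pod needs exactly one volume from the same CSI driver.

```go
package main

import "fmt"

// nodesNeeded estimates how many identical new nodes are required.
// podsPerNodeByResources: pods that fit on one node by CPU/memory alone.
// attachLimit: per-node volume attach limit; 0 means "not known", which
// is treated as unlimited (the behavior for upcoming nodes before this KEP).
// Hypothetical helper: assumes one volume per pod.
func nodesNeeded(pendingPods, podsPerNodeByResources, attachLimit int) int {
	perNode := podsPerNodeByResources
	if attachLimit > 0 && attachLimit < perNode {
		perNode = attachLimit // the attach limit is the tighter constraint
	}
	if pendingPods <= 0 || perNode <= 0 {
		return 0 // nothing to place, or invalid input
	}
	return (pendingPods + perNode - 1) / perNode // ceiling division
}

func main() {
	// 40 pending pods, 20 pods fit per node by CPU/memory.
	fmt.Println("limit ignored:", nodesNeeded(40, 20, 0))
	// Same workload, but each node can attach only 8 volumes.
	fmt.Println("limit honored:", nodesNeeded(40, 20, 8))
}
```

With these illustrative numbers, ignoring the attach limit yields 2 nodes while honoring it yields 5, which is the gap this KEP aims to close.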
Goals
- Modify cluster-autoscaler so that it is aware of CSI volume limits.
- Fix the scheduler so that it doesn’t schedule pods that require a given CSI volume to a node that doesn’t have the CSI driver installed.
Non-Goals
- Deschedule pods that can’t fit on a node because of race conditions.
- Fixing other autoscalers like Karpenter is out of scope for current proposal.
Proposal
This proposal changes both cluster-autoscaler and the built-in Kubernetes scheduler:
- Fix cluster-autoscaler so that it takes into account attach limits when scaling nodes from 0 in a nodegroup.
- Fix cluster-autoscaler so that it takes into account attach limits when scaling nodegroups with existing nodes.
- Fix the built-in Kubernetes scheduler so that, with admin opt-in via the CSIDriver object, it does not schedule pods to nodes that don’t have the CSI driver installed.
To reiterate, we are not going to change the default scheduling policy of pods that use CSI volumes. The new scheduler behavior, which prevents pod placement on nodes without the CSI driver, requires explicit opt-in by cluster admins.
We decided to make the change an explicit opt-in because:
- It completely decouples the CAS and scheduler changes. When CAS imports the scheduler, we preserve the default behaviour, and the cluster-admin or Kubernetes distributor can enable the new behaviour only once they are sure it is safe to do so. See Implementation section for when it is safe to enable new behaviour in a cluster.
- For autoscalers such as Karpenter, which may not yet have CSI node awareness built in, this allows the cluster-admin or Kubernetes distro to decide whether to block pod scheduling to nodes without the driver.
- This allows us to release the scheduler changes sooner and completely decoupled from the various autoscalers, because the new feature requires explicit opt-in by the cluster-admin.
User Stories (Optional)
Story 1
- User has more than one pod that is pending because no existing node has any attach limit left.
- Cluster autoscaler evaluates existing nodegroups.
- It picks a nodegroup based on existing criteria and accurately determines the number of nodes it needs to spin up based on the volumes that pending pods require.
Story 2
- A Kubernetes admin has one or more nodes where the CSI driver is not installed.
- Without explicitly tainting the node or using node affinity in workloads, nodes which don’t have CSI driver installed aren’t used for scheduling pods that require volume.
Notes/Constraints/Caveats (Optional)
- To fully utilize CSI node limit awareness in cluster-autoscaler, the cloudprovider interface MUST implement TemplateNodeInfo interface that also returns CSINode object with templated nodeinfo - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L383
- To prevent pod placement on nodes without CSI driver, the CSIDriver object must have an explicit opt-in.
Risks and Mitigations
Design Details
Cluster Autoscaler changes
We can split the implementation in cluster-autoscaler in two parts:
- Scaling a node-group that already has one or more nodes.
- Scaling a node-group that has no nodes (scaling from zero).
Scaling a node-group that already has one or more nodes.
To ensure that nodes which were recently started but do not have the CSI driver installed yet are considered upcoming nodes, and hence are properly handled by the scale-up operation, we propose a mechanism similar to the one recently introduced for DRA resources. See section “Handling Node Readiness” for more details.
We propose adding volume limits and installed CSI driver information to framework.NodeInfo objects:
type NodeInfo struct {
....
....
// CSINode is the CSINode object exposed by this Node.
CSINode *storagev1.CSINode
..
}
- We propose that, when saving ClusterState, we capture and add CSINode information in the cluster snapshot. The updated signature of the SetClusterState function would look like:
SetClusterState(nodes []*apiv1.Node, scheduledPods []*apiv1.Pod, draSnapshot *drasnapshot.Snapshot, csiSnapshot *csisnapshot.Snapshot) error
Both the delta and basic snapshot implementations would store csiSnapshot along with DRA and other information.
- Since scaling a nodegroup requires creating a sanitized templateNodeInfo from existing nodeInfo objects, we need to ensure that we create sanitized CSINode objects from the real CSINode objects associated with the existing nodeInfo object in the nodegroup. We need to make associated changes in node_info_utils.go to take that into account:
templateNodeInfo := framework.NewNodeInfo(sanitizedExample.Node(), sanitizedExample.LocalResourceSlices, expectedPods...)
if example.CSINode != nil {
templateNodeInfo.AddCSINode(createSanitizedCSINode(example.CSINode, templateNodeInfo))
}
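What sanitizing a CSINode might involve can be sketched as follows. This is an illustration only: the CSINode and CSINodeDriver types below are simplified local stand-ins for the storage/v1 API types, and sanitizeCSINode is a hypothetical helper, not the createSanitizedCSINode function referenced above. The assumption is that sanitizing means deep-copying the example object and replacing node-specific identity with the template node's name while preserving driver names and limits.

```go
package main

import "fmt"

// Simplified stand-ins for the storage/v1 types (assumption for illustration).
type CSINodeDriver struct {
	Name             string
	NodeID           string
	AllocatableCount *int32 // nil means the driver reports no limit
}

type CSINode struct {
	Name    string
	Drivers []CSINodeDriver
}

// sanitizeCSINode returns a deep copy of example, renamed for the template
// node. Driver names and attach limits are kept; node-specific IDs are not.
func sanitizeCSINode(example *CSINode, templateNodeName string) *CSINode {
	if example == nil {
		return nil
	}
	out := &CSINode{
		Name:    templateNodeName,
		Drivers: make([]CSINodeDriver, 0, len(example.Drivers)),
	}
	for _, d := range example.Drivers {
		c := CSINodeDriver{Name: d.Name, NodeID: templateNodeName}
		if d.AllocatableCount != nil {
			v := *d.AllocatableCount // copy the value so the pointer is not shared
			c.AllocatableCount = &v
		}
		out.Drivers = append(out.Drivers, c)
	}
	return out
}

func main() {
	limit := int32(25)
	real := &CSINode{
		Name:    "node-a",
		Drivers: []CSINodeDriver{{Name: "disk.csi.example.com", NodeID: "i-0abc", AllocatableCount: &limit}},
	}
	tmpl := sanitizeCSINode(real, "template-node-1")
	fmt.Println(tmpl.Name, tmpl.Drivers[0].Name, *tmpl.Drivers[0].AllocatableCount)
}
```

Copying the limit by value keeps later changes to the template from leaking back into the snapshot of the real node.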
- We propose that, when getting nodeInfosForGroups, the returned nodeInfo map also contains CSINode information, which can be used later for scheduling decisions.
nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.taintConfig, currentTime)
This should generally work out of the box when nodeInfo is extracted from the previously stored cluster snapshot via:
// will return wrapped framework.NodeInfo with both DRA and CSINode information
ctx.ClusterSnapshot.GetNodeInfo(node.Name)
Please note that we will have to handle scaling from 0 separately from scaling from 1, because in the former case no CSI volume limit information is available when no node exists in the NodeGroup.
- We further propose creating, or extending the existing, StorageInfos interface, so that both the scheduler and CAS can work with the previously created fake CSINode objects. Without this change, the hinting_simulator and the estimator, which trigger scheduler plugin runs, will not be able to find the templated CSINode object we created in the previous step.
Making the aforementioned changes should allow us to handle scaling of nodes from 1.
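The lookup problem described above can be pictured with a small lister that consults templated objects as well as real ones. This is a sketch under stated assumptions: the CSINode type is a simplified stand-in, and the CSINodeLister interface and snapshotLister type are hypothetical names, not the actual StorageInfos interface or the client-go lister.

```go
package main

import "fmt"

// Simplified stand-in for the storage/v1 CSINode type (assumption).
type CSINode struct {
	Name    string
	Drivers map[string]int32 // driver name -> attach limit
}

// CSINodeLister is a hypothetical minimal lookup interface that a
// scheduler plugin could use to find the CSINode for a node.
type CSINodeLister interface {
	Get(nodeName string) (*CSINode, error)
}

// snapshotLister serves both real CSINodes and in-memory templated ones,
// so simulations on upcoming nodes can find volume limits.
type snapshotLister struct {
	real      map[string]*CSINode
	templated map[string]*CSINode
}

func (s *snapshotLister) Get(nodeName string) (*CSINode, error) {
	if n, ok := s.templated[nodeName]; ok {
		return n, nil
	}
	if n, ok := s.real[nodeName]; ok {
		return n, nil
	}
	return nil, fmt.Errorf("csinode %q not found", nodeName)
}

func main() {
	var lister CSINodeLister = &snapshotLister{
		real:      map[string]*CSINode{"node-a": {Name: "node-a", Drivers: map[string]int32{"disk.csi.example.com": 25}}},
		templated: map[string]*CSINode{"template-node-1": {Name: "template-node-1", Drivers: map[string]int32{"disk.csi.example.com": 25}}},
	}
	for _, name := range []string{"node-a", "template-node-1", "unknown"} {
		n, err := lister.Get(name)
		if err != nil {
			fmt.Println("lookup failed:", err)
			continue
		}
		fmt.Println("found", n.Name, "with limit", n.Drivers["disk.csi.example.com"])
	}
}
```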
Scaling from zero
Scaling from zero should work similarly to scaling from 1. The main problem is that we have no NodeInfo that can tell us the CSI attach limit of the node being spun up in a NodeGroup.
We propose to enhance the TemplateNodeInfo function to report CSI volume limits via the mechanism that was implemented for DRA. We are therefore not proposing a brand new mechanism for reporting CSI volume limits; we are reusing the existing mechanism available from the cloudprovider’s implementation of NodeInfosForGroups.
A future enhancement could incorporate https://github.com/kubernetes/autoscaler/issues/7799 when it becomes available.
Kubernetes Scheduler change
We also propose that the new scheduler behavior is opt-in via a new field in CSIDriver. If a given node is not reporting any installed CSI drivers and the CSIDriver has explicitly opted in, we do not schedule pods that need CSI volumes to that node.
type CSIDriverSpec struct {
....
....
// If set to true, the scheduler prevents pod placement
// on nodes where no CSI driver is installed.
// Defaults to false.
PreventPodPlacementWithoutDriver *bool
}
The proposed change is small and a draft PR is available here - https://github.com/kubernetes/kubernetes/pull/130702. This will stop too many pods from crowding a node when a new node is spun up and is not yet reporting volume limits.
Along with this, we will also enhance error reporting from scheduler when scheduling of a pod fails in NodeVolumeLimits plugin, due to CSINode related errors:
- When the driver is missing on the node, we will return the CSIDriverMissingOnNode error.
- When the CSINode object itself is missing on the node, we will return CSINodeMissing.
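The opt-in check and the two error cases can be sketched together. This is a minimal illustration, not the plugin code from the draft PR: the CSINode type is a simplified stand-in, checkDriverOnNode is a hypothetical helper, and modeling CSIDriverMissingOnNode and CSINodeMissing as Go sentinel errors is an assumption (the scheduler may surface them differently). Only the driver-presence decision is shown; counting attached volumes against the limit is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed representation of the two error conditions named in the KEP.
var (
	ErrCSINodeMissing         = errors.New("CSINodeMissing")
	ErrCSIDriverMissingOnNode = errors.New("CSIDriverMissingOnNode")
)

// Simplified stand-in: only the list of installed driver names matters here.
type CSINode struct {
	Drivers []string
}

// checkDriverOnNode decides whether a pod using driverName may be placed.
// preventWithoutDriver mirrors the proposed *bool field: nil or false keeps
// the default behavior ("no driver information" means "no limits").
func checkDriverOnNode(csiNode *CSINode, driverName string, preventWithoutDriver *bool) error {
	optedIn := preventWithoutDriver != nil && *preventWithoutDriver
	if !optedIn {
		return nil // default: missing information never blocks placement
	}
	if csiNode == nil {
		return ErrCSINodeMissing
	}
	for _, d := range csiNode.Drivers {
		if d == driverName {
			return nil
		}
	}
	return ErrCSIDriverMissingOnNode
}

func main() {
	optIn := true
	driver := "disk.csi.example.com" // hypothetical driver name
	fmt.Println(checkDriverOnNode(nil, driver, nil))
	fmt.Println(checkDriverOnNode(nil, driver, &optIn))
	fmt.Println(checkDriverOnNode(&CSINode{}, driver, &optIn))
	fmt.Println(checkDriverOnNode(&CSINode{Drivers: []string{driver}}, driver, &optIn))
}
```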
We also need to ensure that StorageInfos interface that is shared between CAS and scheduler is extended for CSINode objects, so that CAS can run scheduler plugins with templated CSINode objects.
Handling Node Readiness
We propose to handle node readiness in a similar way to how it was handled for DRA in https://github.com/kubernetes/autoscaler/pull/8109. The basic idea is that we use TemplateNodeInfo to determine which CSI drivers are expected to be available on the node, and if the node doesn’t yet have those drivers installed, we consider the node not-ready.
The current handling of TemplateNodeInfo has an issue that reduces its usefulness when the cloudprovider has not implemented the changes necessary for DRA or CSI: even when a nodegroup already has one or more nodes in it, the current implementation always defers to the templated NodeInfo returned by the cloudprovider. While not blocking for this KEP, we will try to address this issue when implementing the necessary changes for CSI.
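The readiness comparison can be sketched as a set difference between the drivers the template expects and the drivers the node's CSINode reports. This is an illustration under assumptions: missingDrivers and csiReady are hypothetical helpers operating on plain driver-name slices, not the CAS implementation, and the driver names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// missingDrivers returns the expected driver names that the node has not
// published yet, sorted for stable output.
func missingDrivers(expected, published []string) []string {
	have := make(map[string]bool, len(published))
	for _, d := range published {
		have[d] = true
	}
	var missing []string
	for _, d := range expected {
		if !have[d] {
			missing = append(missing, d)
		}
	}
	sort.Strings(missing)
	return missing
}

// csiReady reports whether every expected driver is published on the node.
func csiReady(expected, published []string) bool {
	return len(missingDrivers(expected, published)) == 0
}

func main() {
	// Expected drivers would come from the node group's template.
	expected := []string{"disk.csi.example.com", "file.csi.example.com"}
	published := []string{"disk.csi.example.com"}

	fmt.Println("ready:", csiReady(expected, published))
	fmt.Println("still waiting for:", missingDrivers(expected, published))
}
```

A node with no expected drivers is trivially ready under this check, so node groups without CSI drivers in their template are unaffected.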
Alternatives:
1. We propose a label, similar to GPULabel, added to each node that is supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren’t considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
However, we also propose that a node be considered ready as soon as the corresponding CSI driver is reported as installed via the corresponding CSINode object.
A node which is ready but does not have the CSI driver installed within a certain time limit will be considered NotReady and removed from the cluster.
2. A more exhaustive solution to node readiness is being proposed in https://github.com/kubernetes/enhancements/pull/5416. We are open to using it when it becomes usable from CAS.
When is it safe to prevent pod placement?
Generally speaking it is safe to prevent pod placement to nodes without CSI driver in scheduler when running an autoscaler that has support for CSI attach limit awareness. Cluster Autoscaler currently supports this with the enable-csi-node-aware-scheduling feature flag (starting with version 1.35). Other autoscalers in the ecosystem may support this in the future.
Obviously it is also safe to prevent pod placement if your cluster doesn’t have any autoscalers.
What happens if cluster-admin opts in to prevent pod scheduling but autoscaler does not have CSI attach limit awareness?
If the autoscaler has picked up the updated NodeVolumeLimits plugin from the scheduler but has the enable-csi-node-aware-scheduling flag disabled in CAS (or has no CSINode awareness), then CAS will not be able to schedule any pods that use CSI volumes during its simulations on new nodes. The scheduler plugin will keep rejecting the simulated node because it will not have any CSINode information. Autoscaling will then be more or less broken for pods that require CSI volumes.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
After this proposal is implemented, simulated scheduling in CAS should work with fake CSINode objects
which report real volume limits and hence scheduling should accurately count number of required nodes
for pending pods.
We will also update the unit tests in scheduler to handle new error conditions.
- k8s.io/autoscaler/cluster-autoscaler/core: 06/10/2025 - 77.3%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go: 14/10/2025 - 78%
Integration tests
None
e2e tests
Cluster AutoScaler
We are planning to add e2e tests that verify behaviour of cluster autoscaler when it scales nodes for pods that require volumes.
We will add tests that validate both scaling from 0 and scaling from 1 use cases.
Kube Scheduler
We will add e2e tests in the k/k repo for the scheduler, so that scheduler behaviour is tested for the following conditions:
- When CSINode is reported but the driver is not installed.
- When no CSINode is reported from the node at all.
Please note other conditions are already tested via - https://github.com/kubernetes/kubernetes/blob/9b9cd768a05782b6cfeef62bec7696b441d7ad93/test/e2e/storage/csimock/csi_volume_limit.go#L15
Graduation Criteria
Alpha
- All of the planned code changes for alpha will be done in cluster-autoscaler and kubernetes (scheduler in particular) repository.
- We plan to implement changes in cluster-autoscaler so that it can consider volume limits when scaling cluster.
- Make changes in kube-scheduler so that it can stop scheduling pods that require a CSI volume if the underlying CSI driver is not installed on the node, with CSIDriver opt-in.
- Initial e2e tests completed and enabled.
- All of the changes in CAS and kube-scheduler will be behind the VolumeLimitScaling featuregate.
Upgrade / Downgrade Strategy
In general, upgrading and downgrading cluster-autoscaler should be fine; it only changes how CA scales nodes.
If customers have opted in to prevent pod placement via the aforementioned CSIDriver change, it is not recommended to disable the enable-csi-node-aware-scheduling flag.
Version Skew Strategy
This feature has no interaction with kubelet and other components running on the node.
The interaction between CAS (or other autoscalers such as Karpenter) and kube-scheduler is resolved by requiring explicit opt-in
via CSIDriver to prevent pod placement.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: VolumeLimitScaling (in kube-scheduler and kube-apiserver)
- enable-csi-node-aware-scheduling flag in CAS.
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane? Yes, it should require restart of CAS and kube-scheduler.
Does enabling the feature change any default behavior?
No, the default scheduler and autoscaler behavior stays the same. A CSI driver must opt in via its CSIDriver instance to get the new behavior.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
For CAS:
- Yes. This will simply cause the old behaviour to be restored, but PreventPodPlacementWithoutDriver should be disabled (if enabled manually) in the CSIDriver object before disabling this feature in CAS.
For kube-scheduler:
- The feature gate in kube-scheduler can be disabled without problems.
What happens if we reenable the feature if it was previously rolled back?
For CAS:
- The feature will start working the same as before.
For kube-scheduler:
- It should work fine.
Are there any tests for feature enablement/disablement?
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A rollout of this feature in CAS would be considered failing if CAS does not create the appropriate number of nodes to accommodate the CSI volumes required by pods.
A rollout of this feature in kube-scheduler would be considered failing if kube-scheduler still places pods on nodes that don’t have the CSI driver installed.
What specific metrics should inform a rollback?
In CAS, if the unschedulable_pods_count metric consistently reports pods pending scheduling, that is generally
a good indication that something is broken in CAS. This in itself doesn’t mean those pending pods use CSI volumes,
but we are considering enhancing existing metrics with that information.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Depends on cluster-autoscaler running in the cluster.
Scalability
Will enabling / using this feature result in any new API calls?
After the changes in this PR are merged, CAS now may have to read CSINode objects
before scaling decisions, but CAS was already reading CSINode objects via
scheduler plugins it vendors, because those plugins need CSINode listers.
Overall - this should not result in any new API calls.
Will enabling / using this feature result in introducing new API types?
In the v1.35 alpha release, we are not considering introducing new API types yet.
In v1.36 we are changing the CSIDriver object by adding the field PreventPodPlacementWithoutDriver.
Will enabling / using this feature result in any new calls to the cloud provider?
In general, it should not result in any new calls to the cloud provider. If anything, once this feature is enabled in both CAS and kube-scheduler, it should prevent scheduling of pods to nodes which can’t reasonably accommodate them, and hence should reduce the number of API calls we make to the cloudprovider.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
Drawbacks
Alternatives
Certain Kubernetes vendors taint the node when a new node is created and CSI driver has logic to remove the taint when CSI driver starts on the node.