KEP-5030: Integrate CSI Volume attach limits with cluster autoscaler

Implementation History
ALPHA Implementable
Created 2025-01-09
Latest v1.36
Milestones
Alpha v1.35
Beta v1.37
Stable v1.38
Ownership
Owning SIG
SIG Autoscaling
Primary Authors


Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Fix cluster-autoscaler (CAS) to be aware of node’s volume attach limits when scaling new nodes and prevent scheduler from placing pods on nodes that do not have a particular CSI driver installed.

Motivation

When scaling new nodes to satisfy pending pods in a cluster, cluster-autoscaler (CAS) currently does not take into account the volume attach limits (available via CSI) that an upcoming node may have. This can result in too few nodes being created to satisfy the pending pods. With this KEP, we will change CAS so that, when it runs simulations to estimate the number of nodes needed for pending pods or runs scheduler simulations on upcoming nodes, it takes CSI volume attach limits into account via templated CSINode objects.

There is also a gap in the implementation of the NodeVolumeLimits scheduler plugin. The gap was left intentionally because CAS runs this plugin without any templated CSINode objects while creating new nodes, so the plugin permits an unlimited number of pods to be placed on a node even if no CSI driver is installed on it. With this KEP, we aim to close that gap so that the scheduler does not place pods on nodes that aren’t reporting any CSI driver information, if the CSI driver opts in to this behavior.

To summarize, before this KEP:

  • Scheduler CSI plugin assumes that “no information about a CSI driver published in a CSINode” means “no limits for volumes from that driver”.
  • For existing Nodes with CSI driver information already published, CA correctly takes the volume limits into account when running scheduler filters in simulations (e.g. when packing pending Pods on existing Nodes in the cluster at the beginning of the loop).
  • For fake “upcoming” Nodes created in-memory by CA during scale-up simulations the corresponding “upcoming” CSINode is not created/taken into account. So the volume limits are not taken into account when running scheduler filters, which makes CA pack more Pods per Node than actually fit, which makes it undershoot scale-ups.
  • For existing Nodes with CSI driver information already published, scheduler correctly takes the volume limits into account when scheduling.
  • For new Nodes with not all CSI driver information published yet, scheduler can let Pods in that can’t actually run on the Node.

After:

  • By default, the scheduler CSI plugin still assumes that “no information about a CSI driver published in a CSINode” means “the node can handle unlimited amount of volumes”.
  • Only when explicitly opted in via the CSIDriver instance, the scheduler CSI plugin assumes that “no information about a CSI driver published in a CSINode” means “the node cannot handle any volumes”.
  • No change for existing Nodes with CSI driver information already published - CA and scheduler still behave correctly.
  • Scheduler waits until all relevant CSI driver info is published before scheduling a Pod, removing the race condition for new Nodes.
  • Cluster Autoscaler correctly simulates “upcoming” CSINodes for “upcoming” Nodes and makes correct scale-up decisions.

Goals

  • Modify cluster-autoscaler so that it is aware of CSI volume limits.
  • Fix the scheduler so that it doesn’t schedule pods that require a given CSI volume onto a node that doesn’t have the CSI driver installed.

Non-Goals

  • Deschedule pods that can’t fit on a node because of race conditions.
  • Fixing other autoscalers like Karpenter is out of scope for current proposal.

Proposal

This proposal changes both cluster-autoscaler and the built-in Kubernetes scheduler:

  1. Fix cluster-autoscaler so that it takes into account attach limits when scaling nodes from 0 in a nodegroup.
  2. Fix cluster-autoscaler so that it takes into account attach limits when scaling nodegroups with existing nodes.
  3. Fix the built-in Kubernetes scheduler so that, with admin opt-in via the CSIDriver object, we do not schedule pods to nodes that don’t have the CSI driver installed.

To reiterate, we are not going to change the default scheduling policy for pods that use CSI volumes. The new scheduler behavior, which prevents pod placement on nodes without the CSI driver, requires explicit opt-in by cluster admins.

We decided to make the change an explicit opt-in because:

  1. It completely decouples the CAS and scheduler changes. When CAS imports the scheduler, the default behaviour is preserved; the new behaviour should be enabled only once the cluster-admin or Kubernetes distributor is sure that it is safe to do so. See the Implementation section for when it is safe to enable the new behaviour in a cluster.
  2. For autoscalers such as Karpenter, which may not yet have CSI node awareness built in, this allows the cluster-admin or Kubernetes distro to decide whether to block pod scheduling to nodes without the driver.
  3. It allows us to release the scheduler changes sooner, completely decoupled from the various autoscalers, because the new feature requires explicit opt-in by the cluster-admin.

User Stories (Optional)

Story 1

  • User has more than one pod that is pending because no existing node has any attach limit left.
  • Cluster autoscaler evaluates existing nodegroups.
  • It picks a nodegroup based on existing criteria and accurately determines the number of nodes it needs to spin up based on the volumes that the pending pods require.
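The effect of attach limits on the node count in Story 1 can be illustrated with a small, self-contained sketch. It is not CAS code: `pendingPod` and `estimateNodes` are hypothetical names, only one CSI driver is modeled, and CPU/memory constraints are ignored. It packs pods first-fit onto identical new nodes and compares the estimate with and without a known attach limit.

```go
package main

import "fmt"

// pendingPod is a hypothetical stand-in for a pending pod; only the number
// of CSI volumes it needs from a single driver is modeled.
type pendingPod struct {
	name    string
	volumes int
}

// estimateNodes packs pods first-fit onto identical new nodes that can each
// attach at most attachLimit volumes. attachLimit <= 0 means "no limit
// known", which mirrors a simulation that has no CSINode for the new node.
// Pods that need more volumes than a single node allows are reported back.
func estimateNodes(pods []pendingPod, attachLimit int) (nodes int, unschedulable []string) {
	var used []int // volumes already placed on each simulated node
	for _, p := range pods {
		if attachLimit > 0 && p.volumes > attachLimit {
			unschedulable = append(unschedulable, p.name)
			continue
		}
		placed := false
		for i := range used {
			if attachLimit <= 0 || used[i]+p.volumes <= attachLimit {
				used[i] += p.volumes
				placed = true
				break
			}
		}
		if !placed {
			used = append(used, p.volumes)
		}
	}
	return len(used), unschedulable
}

func main() {
	pods := []pendingPod{{"a", 2}, {"b", 2}, {"c", 2}, {"d", 2}, {"e", 2}}
	withoutLimit, _ := estimateNodes(pods, 0)
	withLimit, _ := estimateNodes(pods, 4)
	fmt.Println(withoutLimit, withLimit) // prints "1 3"
}
```

With five pods of two volumes each and a limit of four volumes per node, ignoring the limit yields one node, while honoring it yields three; the difference is the undershoot this KEP aims to remove.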

Story 2

  • A Kubernetes admin has one or more nodes where the CSI driver is not installed.
  • Without explicitly tainting the nodes or using node affinity in workloads, nodes that don’t have the CSI driver installed aren’t used for scheduling pods that require volumes.

Notes/Constraints/Caveats (Optional)

  1. To fully utilize CSI node limit awareness in cluster-autoscaler, the cloudprovider implementation MUST implement the TemplateNodeInfo interface so that it also returns a CSINode object with the templated nodeinfo - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup.go#L383
  2. To prevent pod placement on nodes without CSI driver, the CSIDriver object must have an explicit opt-in.

Risks and Mitigations

Design Details

Cluster Autoscaler changes

We can split the implementation in cluster-autoscaler in two parts:

  • Scaling a node-group that already has one or more nodes.
  • Scaling a node-group that doesn’t have one or more nodes (Scaling from zero).

Scaling a node-group that already has one or more nodes.

  1. To ensure that nodes which were recently started but do not have CSI driver installed yet are considered as upcoming nodes and hence are properly handled via scaleup operation, we propose a mechanism similar to recently introduced mechanism for DRA resources. See section - “Handling Node Readiness” for more details.

  2. We propose adding volume limits and installed CSI driver information to framework.NodeInfo objects:

type NodeInfo struct {
    ....
    ....
    // CSINode is the CSINode object exposed by this Node.
    CSINode *storagev1.CSINode
    ..
}

  3. We propose that, when saving ClusterState, we capture CSINode information and add it to the cluster snapshot. The updated signature of the SetClusterState function would look like:

SetClusterState(nodes []*apiv1.Node, scheduledPods []*apiv1.Pod, draSnapshot *drasnapshot.Snapshot, csiSnapshot *csisnapshot.Snapshot) error

Both the delta and basic snapshot implementations would store csiSnapshot along with the DRA and other information.

  4. Since scaling a nodegroup requires creating a sanitized templateNodeInfo from existing nodeInfo objects, we need to ensure that we create sanitized CSINode objects from the real CSINode objects associated with the existing nodeInfo objects in the nodegroup. We need to make the associated changes in node_info_utils.go to take that into account:
templateNodeInfo := framework.NewNodeInfo(sanitizedExample.Node(), sanitizedExample.LocalResourceSlices, expectedPods...)
if example.CSINode != nil {
  templateNodeInfo.AddCSINode(createSanitizedCSINode(example.CSINode, templateNodeInfo))
}
  5. We propose that, when getting nodeInfosForGroups, the returned nodeInfo map also contains CSINode information, which can be used later for scheduling decisions.
nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.taintConfig, currentTime)

This should generally work out of the box when nodeInfo is extracted from the previously stored cluster snapshot via:

// will return wrapped framework.NodeInfo with both DRA and CSINode information
ctx.ClusterSnapshot.GetNodeInfo(node.Name)

Please note that we will have to handle scaling from 0 separately from scaling from 1, because in the former case no CSI volume limit information is available: no node exists in the NodeGroup.

  6. We further propose creating a StorageInfos interface, or extending the existing one, so that both the scheduler and CAS can work with the previously created fake CSINode objects. Without this change, the hinting_simulator and the estimator, which trigger scheduler plugin runs, will not be able to find the templated CSINode object created in the previous step.

Making the aforementioned changes should allow us to handle scaling of nodes from 1.
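The sanitization step for CSINode objects described in the list above could look roughly like the following sketch. It uses simplified local stand-ins (`csiNode`, `csiNodeDriver`) instead of the real storage/v1 types, and `sanitizeCSINode` is a hypothetical helper, not the createSanitizedCSINode function from the proposal. Rewriting NodeID to the template name is an assumption made here so that the template does not carry the example node's identity.

```go
package main

import "fmt"

// csiNodeDriver and csiNode are simplified local stand-ins for the
// storage/v1 CSINodeDriver and CSINode types; only the fields this sketch
// needs are modeled.
type csiNodeDriver struct {
	Name        string
	NodeID      string
	Allocatable *int32 // max attachable volumes; nil means "not reported"
}

type csiNode struct {
	Name    string
	Drivers []csiNodeDriver
}

// sanitizeCSINode returns a deep copy of example, renamed for the template
// node. Driver names and limits are preserved. NodeID is replaced with the
// template name (an assumption of this sketch).
func sanitizeCSINode(example *csiNode, templateNodeName string) *csiNode {
	if example == nil {
		return nil
	}
	out := &csiNode{
		Name:    templateNodeName,
		Drivers: make([]csiNodeDriver, 0, len(example.Drivers)),
	}
	for _, d := range example.Drivers {
		cp := csiNodeDriver{Name: d.Name, NodeID: templateNodeName}
		if d.Allocatable != nil {
			limit := *d.Allocatable // copy so the template does not alias the real object
			cp.Allocatable = &limit
		}
		out.Drivers = append(out.Drivers, cp)
	}
	return out
}

func main() {
	limit := int32(25)
	existing := &csiNode{
		Name: "node-1",
		Drivers: []csiNodeDriver{
			{Name: "disk.csi.example.com", NodeID: "instance-1", Allocatable: &limit},
		},
	}
	tmpl := sanitizeCSINode(existing, "template-node-0")
	fmt.Println(tmpl.Name, tmpl.Drivers[0].Name, *tmpl.Drivers[0].Allocatable)
}
```

The deep copy matters because the simulation may mutate template objects; sharing pointers with the CSINode taken from the snapshot would leak those changes back into real cluster state.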

Scaling from zero

Scaling from zero should work similarly to scaling from 1, but the main problem is that we do not have a NodeInfo that can tell us the CSI attach limit of the node being spun up in the NodeGroup.

We propose to enhance the TemplateNodeInfo function to report CSI volume limits via the mechanism that was implemented for DRA. We are therefore not proposing a brand-new mechanism for reporting CSI volume limits; we are reusing the existing mechanism available from the cloudprovider’s implementation of NodeInfosForGroups.
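As a rough illustration of where template limits can come from, the sketch below picks a source of CSI attach limits for a node group. All names (`driverLimits`, `nodeGroup`, `templateLimits`) are hypothetical, and the preference order (a real node in the group first, then the cloud provider's template) is an assumption of this sketch rather than the behavior of the current implementation.

```go
package main

import "fmt"

// driverLimits maps a CSI driver name to its per-node attach limit.
type driverLimits map[string]int32

// nodeGroup is a hypothetical, minimal view of a node group.
type nodeGroup struct {
	name string
	// existing holds limits observed on a real node of the group;
	// nil when the group currently has zero nodes.
	existing driverLimits
	// fromProvider holds limits reported by the cloud provider's template;
	// nil when the provider does not report CSI information.
	fromProvider driverLimits
}

// templateLimits returns the limits to use for a simulated node of the
// group, together with a label describing where they came from.
func templateLimits(g nodeGroup) (driverLimits, string) {
	switch {
	case g.existing != nil:
		return g.existing, "existing-node"
	case g.fromProvider != nil:
		return g.fromProvider, "cloud-provider-template"
	default:
		return nil, "unknown"
	}
}

func main() {
	empty := nodeGroup{
		name:         "group-a",
		fromProvider: driverLimits{"disk.csi.example.com": 16},
	}
	limits, source := templateLimits(empty)
	fmt.Println(source, limits["disk.csi.example.com"])
}
```

When the source is "unknown", the simulation has no CSINode for the upcoming node, which is exactly the scale-from-zero gap described above.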

A future enhancement could incorporate https://github.com/kubernetes/autoscaler/issues/7799 when it becomes available.

Kubernetes Scheduler change

We also propose that the new scheduler behavior is opt-in via a new field in CSIDriver. If a given node is not reporting any installed CSI drivers and the CSIDriver has explicitly opted in, we do not schedule pods that need CSI volumes to that node.

type CSIDriverSpec struct {
    ....
    ....
    // if set to true, it will cause scheduler to prevent pod placement
    // to nodes where no CSI driver is installed.
    //   Defaults: false
    PreventPodPlacementWithoutDriver *bool
}

The proposed change is small and a draft PR is available here - https://github.com/kubernetes/kubernetes/pull/130702. This will stop too many pods from crowding a node when a new node is spun up and is not yet reporting volume limits.

Along with this, we will also enhance error reporting from scheduler when scheduling of a pod fails in NodeVolumeLimits plugin, due to CSINode related errors:

  1. When driver is missing on the node, we will return CSIDriverMissingOnNode error.
  2. When CSINode object itself is missing on the node, we will return CSINodeMissing.
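The decision logic described in this section can be sketched as follows. This is a simplified model, not the NodeVolumeLimits plugin: `csiNode` and `checkVolumeLimit` are hypothetical, the two named errors are modeled as plain sentinel errors (their real representation in the scheduler is not specified here), and `errVolumeLimitExceeded` is invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors used by this sketch. The first two names come from the
// text above; representing them as error values is an assumption.
var (
	errCSINodeMissing         = errors.New("CSINodeMissing")
	errCSIDriverMissingOnNode = errors.New("CSIDriverMissingOnNode")
	errVolumeLimitExceeded    = errors.New("volume limit exceeded") // hypothetical
)

// csiNode is a simplified stand-in for a CSINode object: it maps each
// installed driver to its attach limit. A negative limit means the driver
// is installed but reports no limit.
type csiNode struct {
	limits map[string]int
}

// checkVolumeLimit decides whether a pod needing `requested` more volumes
// from `driver` fits on a node that already has `attached` volumes.
// preventWithoutDriver models the CSIDriver opt-in field.
func checkVolumeLimit(node *csiNode, driver string, attached, requested int, preventWithoutDriver bool) error {
	if node == nil {
		if preventWithoutDriver {
			return errCSINodeMissing
		}
		return nil // default: no CSINode means no known limit
	}
	limit, installed := node.limits[driver]
	if !installed {
		if preventWithoutDriver {
			return errCSIDriverMissingOnNode
		}
		return nil // default: missing driver info means unlimited
	}
	if limit >= 0 && attached+requested > limit {
		return errVolumeLimitExceeded
	}
	return nil
}

func main() {
	fresh := &csiNode{limits: map[string]int{}} // node is up, driver not yet registered
	fmt.Println(checkVolumeLimit(fresh, "disk.csi.example.com", 0, 1, false)) // prints "<nil>"
	fmt.Println(checkVolumeLimit(fresh, "disk.csi.example.com", 0, 1, true))  // prints "CSIDriverMissingOnNode"
}
```

Keeping the default branch permissive is what lets autoscalers without CSINode awareness continue to work until the opt-in is set.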

We also need to ensure that StorageInfos interface that is shared between CAS and scheduler is extended for CSINode objects, so that CAS can run scheduler plugins with templated CSINode objects.

Handling Node Readiness

We propose to handle node readiness in a similar way to how it was handled for DRA (https://github.com/kubernetes/autoscaler/pull/8109). The basic idea is that we use TemplateNodeInfo to determine which CSI drivers are expected on the node; if the node doesn’t yet have those drivers installed, we consider the node not ready.

Currently, the handling of TemplateNodeInfo has an issue that reduces its usefulness when the cloudprovider has not implemented the changes necessary for DRA or CSI: even when the nodegroup already has one or more nodes, the current implementation always defers to the templated NodeInfo returned by the cloudprovider. While this is not blocking for this KEP, we will try to address the issue when implementing the necessary changes for CSI.
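A minimal sketch of the readiness comparison, assuming the expected driver names come from the template and the installed ones from the node's CSINode object. `missingDrivers` and `csiReady` are hypothetical helpers, not functions from cluster-autoscaler.

```go
package main

import (
	"fmt"
	"sort"
)

// missingDrivers returns the sorted names of drivers that the template
// expects but that the node has not reported as installed.
func missingDrivers(expected, installed []string) []string {
	have := make(map[string]struct{}, len(installed))
	for _, d := range installed {
		have[d] = struct{}{}
	}
	var missing []string
	for _, d := range expected {
		if _, ok := have[d]; !ok {
			missing = append(missing, d)
		}
	}
	sort.Strings(missing)
	return missing
}

// csiReady reports whether every expected driver is installed. Extra
// drivers on the node do not affect readiness.
func csiReady(expected, installed []string) bool {
	return len(missingDrivers(expected, installed)) == 0
}

func main() {
	expected := []string{"disk.csi.example.com", "file.csi.example.com"}
	installed := []string{"disk.csi.example.com"}
	fmt.Println(csiReady(expected, installed), missingDrivers(expected, installed))
}
```

A node that fails this check would be treated as an upcoming node by the simulation, so pods needing the missing driver are not assumed to fit on it yet.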

Alternatives:

  1. We propose a label, similar to GPULabel, added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren’t considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.

However, we also propose that a node will be considered ready as soon as corresponding CSI driver is being reported as installed via corresponding CSINode object.

A node which is ready but does not have CSI driver installed within certain time limit will be considered as NotReady and removed from the cluster.

  2. A more exhaustive solution to node readiness is being proposed (https://github.com/kubernetes/enhancements/pull/5416); we are open to using it when it becomes usable from CAS.

When it is safe to Prevent pod placement?

Generally speaking it is safe to prevent pod placement to nodes without CSI driver in scheduler when running an autoscaler that has support for CSI attach limit awareness. Cluster Autoscaler currently supports this with the enable-csi-node-aware-scheduling feature flag (starting with version 1.35). Other autoscalers in the ecosystem may support this in the future.

Obviously it is also safe to prevent pod placement if your cluster doesn’t have any autoscalers.

What happens if cluster-admin opts-in to prevent pod scheduling but autoscaler does not have CSI attach limit awareness?

If the autoscaler has picked up the updated NodeVolumeLimits plugin from the scheduler but has the enable-csi-node-aware-scheduling flag disabled in CAS (or has no CSINode awareness), then CAS will not be able to schedule any pods that use CSI volumes during its simulations on new nodes. The scheduler plugin will keep rejecting the simulated node because the node will not have any CSINode information. As a result, autoscaling will be effectively broken for pods that require CSI volumes.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

After this proposal is implemented, simulated scheduling in CAS should work with fake CSINode objects which report real volume limits and hence scheduling should accurately count number of required nodes for pending pods.

We will also update the unit tests in scheduler to handle new error conditions.

  • k8s.io/autoscaler/cluster-autoscaler/core: 06/10/2025 - 77.3%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go: 14/10/2025 - 78%
Integration tests

None

e2e tests
Cluster AutoScaler

We are planning to add e2e tests that verify behaviour of cluster autoscaler when it scales nodes for pods that require volumes.

We will add tests that validate both scaling from 0 and scaling from 1 use cases.

Kube Scheduler

We will add e2e tests in the k/k repo for the scheduler, so that scheduler behaviour is tested for the following conditions:

  1. When CSINode is reported but driver is not installed.
  2. When no CSINode is reported from the node at all.

Please note other conditions are already tested via - https://github.com/kubernetes/kubernetes/blob/9b9cd768a05782b6cfeef62bec7696b441d7ad93/test/e2e/storage/csimock/csi_volume_limit.go#L15

Graduation Criteria

Alpha

  • All of the planned code changes for alpha will be done in cluster-autoscaler and kubernetes (scheduler in particular) repository.
  • We plan to implement changes in cluster-autoscaler so that it can consider volume limits when scaling cluster.
  • Make changes in kube-scheduler so that, with CSIDriver opt-in, it stops scheduling pods that require a CSI volume if the underlying CSI driver is not installed on the node.
  • Initial e2e tests completed and enabled.
  • All of the changes in CAS and kube-scheduler will be behind VolumeLimitScaling featuregate.

Upgrade / Downgrade Strategy

In general, upgrading and downgrading cluster-autoscaler should be fine; it only changes how CA scales nodes.

If customers have opted in to prevent pod placement via the aforementioned CSIDriver change, it is not recommended to disable the enable-csi-node-aware-scheduling flag.

Version Skew Strategy

This feature has no interaction with kubelet and other components running on the node.

The interaction between CAS (or other autoscalers such as Karpenter) and kube-scheduler is resolved by requiring explicit opt-in via CSIDriver to prevent pod placement.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: VolumeLimitScaling (in kube-scheduler and kube-apiserver)
    • enable-csi-node-aware-scheduling flag in CAS.
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane? Yes, it should require restart of CAS and kube-scheduler.
Does enabling the feature change any default behavior?

No, the scheduler and autoscaler default behavior is the same. A CSI driver must opt in via its CSIDriver instance to get the new behavior.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

For CAS:

  • Yes. This will simply cause old behaviour to be restored, but PreventPodPlacementWithoutDriver should be disabled (if enabled manually) in CSIDriver object before disabling this feature in CAS.

For kube-scheduler:

  • The feature gate in kube-scheduler can be disabled without problems.
What happens if we reenable the feature if it was previously rolled back?

For CAS:

  • The feature will start working the same as before.

For kube-scheduler:

  • It should work fine.
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout of this feature in CAS would be considered failing if CAS is not creating the appropriate number of nodes to accommodate the CSI volumes required by pods.

A rollout of this feature in kube-scheduler would be considered failing if kube-scheduler is still placing pods on nodes that don’t have the CSI driver installed.

What specific metrics should inform a rollback?

In CAS, if the unschedulable_pods_count metric consistently reports pods pending scheduling, that is generally a good indication that something is broken in CAS. By itself, this doesn’t mean that those pending pods use CSI volumes, but we are considering enhancing existing metrics with that information.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Depends on cluster-autoscaler running in the cluster.

Scalability

Will enabling / using this feature result in any new API calls?

After the changes in this PR are merged, CAS may have to read CSINode objects before making scaling decisions. However, CAS was already reading CSINode objects via the scheduler plugins it vendors, because those plugins need CSINode listers.

Overall - this should not result in any new API calls.

Will enabling / using this feature result in introducing new API types?

In the v1.35 alpha release, we are not considering introducing new API types yet.

In v1.36 we are making changes into CSIDriver object by adding the field PreventPodPlacementWithoutDriver.

Will enabling / using this feature result in any new calls to the cloud provider?

In general, it should not result in any new calls to the cloud provider. If anything, once this feature is enabled in both CAS and kube-scheduler, it should prevent scheduling of pods onto nodes that can’t reasonably accommodate them, and hence reduce the number of API calls we make to the cloudprovider.

Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Certain Kubernetes vendors taint the node when a new node is created and CSI driver has logic to remove the taint when CSI driver starts on the node.

Infrastructure Needed (Optional)