KEP-4049: Storage Capacity Scoring of Nodes for Dynamic Provisioning

Implementation History
Stage: Alpha
Status: Implementable
Created: 2023-05-30
Latest: v1.33

Milestones
Alpha: v1.33
Beta: TBD
Stable: TBD

Ownership
Owning SIG: SIG Storage

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding a way to score nodes for dynamic provisioning of PVs. This scoring method is based on storage capacity in the VolumeBinding plugin. By considering the amount of free space that nodes have, it is possible to dynamically schedule pods on the node that has the most or least free space.

Motivation

Storage capacity needs to be considered when:

  • we want to resize after a node-local PV is scheduled. In this case we need to select a node with as much free space as possible.
  • we want to select a node with less free node space to reduce the number of nodes as much as possible.

Goals

  • To modify the scoring logic so that it takes dynamic provisioning into account, in addition to the current logic, which considers only static provisioning.

Non-Goals

  • To change how nodes are scored for static provisioning.

Proposal

  • Node scores based on available space can be taken into account when performing dynamic provisioning.

A cluster admin can configure the scoring logic using a new field in VolumeBindingArgs of kubescheduler.config.k8s.io. The scoring logic is global for the whole cluster, and we propose two values:

  • Prefer a node with the least allocatable.
  • Prefer a node with the maximum allocatable.

Considering the common scenario of local storage, we want to leave room for volume expansion after node allocation. The default setting is to prefer a node with the maximum allocatable.

User Stories (Optional)

Story 1

We want to leave room for volume expansion after node allocation. In this case, we want to select the node that has the maximum amount of free space.

Story 2

We want to reduce the number of nodes as much as possible to reduce costs when using a cloud environment. In this case, we want to select the node with the least free space that still satisfies the request.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Risk | Impact | Mitigation
Misconfiguration of storage capacity scoring parameters | Medium | Provide documentation
Potential performance overhead due to additional scoring calculations | Low | Optimize scoring algorithms
Loss of optimized scheduling after downgrading to a version without this feature | Medium | Explain the impact of downgrading in documentation

Design Details

We modify the existing VolumeBinding plugin to achieve scoring of nodes for dynamic provisioning.

Modify stateData to be able to store StorageCapacity

We modify the struct called PodVolumes contained in stateData to score nodes for dynamic provisioning.

The struct of stateData is as follows:

type stateData struct {
	...
	// podVolumesByNode holds the pod's volume information found in the Filter
	// phase for each node
	// it's initialized in the PreFilter phase
	podVolumesByNode map[string]*PodVolumes
	...
}

By making the following changes to PodVolumes, CSIStorageCapacity can be stored.

+ type DynamicProvision struct {
+ 	PVC      *v1.PersistentVolumeClaim
+ 	Capacity *storagev1.CSIStorageCapacity
+ }

type PodVolumes struct {
	StaticBindings []*BindingInfo
- 	DynamicProvisions []*v1.PersistentVolumeClaim
+ 	DynamicProvisions []*DynamicProvision
}

Get the capacity of nodes for dynamic provisioning

Add a CSIStorageCapacity to the return values of the volumeBinder.hasEnoughCapacity method. For dynamic provisioning, the returned CSIStorageCapacity is stored in the DynamicProvision.Capacity field.

- func (b *volumeBinder) hasEnoughCapacity(provisioner string, claim *v1.PersistentVolumeClaim, storageClass *storagev1.StorageClass, node *v1.Node) (bool, error) {
+ func (b *volumeBinder) hasEnoughCapacity(provisioner string, claim *v1.PersistentVolumeClaim, storageClass *storagev1.StorageClass, node *v1.Node) (bool, *storagev1.CSIStorageCapacity, error) {
	quantity, ok := claim.Spec.Resources.Requests[v1.ResourceStorage]
	if !ok {
		// No capacity to check for.
- 		return true, nil
+ 		return true, nil, nil
	}

	// Only enabled for CSI drivers which opt into it.
	driver, err := b.csiDriverLister.Get(provisioner)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// Either the provisioner is not a CSI driver or the driver does not
			// opt into storage capacity scheduling. Either way, skip
			// capacity checking.
- 			return true, nil
+ 			return true, nil, nil
		}
- 		return false, err
+ 		return false, nil, err
	}
	if driver.Spec.StorageCapacity == nil || !*driver.Spec.StorageCapacity {
- 		return true, nil
+ 		return true, nil, nil
	}

	// Look for a matching CSIStorageCapacity object(s).
	// TODO (for beta): benchmark this and potentially introduce some kind of lookup structure (https://github.com/kubernetes/enhancements/issues/1698#issuecomment-654356718).
	capacities, err := b.csiStorageCapacityLister.List(labels.Everything())
	if err != nil {
- 		return false, err
+ 		return false, nil, err
	}

	sizeInBytes := quantity.Value()
	for _, capacity := range capacities {
		if capacity.StorageClassName == storageClass.Name &&
			capacitySufficient(capacity, sizeInBytes) &&
			b.nodeHasAccess(node, capacity) {
			// Enough capacity found.
- 			return true, nil
+ 			return true, capacity, nil
		}
	}

	// TODO (?): this doesn't give any information about which pools were considered and why
	// they had to be rejected. Log that above? But that might be a lot of log output...
	klog.V(4).InfoS("Node has no accessible CSIStorageCapacity with enough capacity for PVC",
		"node", klog.KObj(node), "PVC", klog.KObj(claim), "size", sizeInBytes, "storageClass", klog.KObj(storageClass))
- 	return false, nil
+ 	return false, nil, nil
}
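As a rough illustration of how a caller might consume the extra return value, the sketch below collects one DynamicProvision per claim. It is not the scheduler's code: Claim and Capacity are simplified stand-ins for v1.PersistentVolumeClaim and storagev1.CSIStorageCapacity, and capacityChecker and collectProvisions are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// Claim and Capacity are simplified stand-ins (assumptions for this sketch)
// for v1.PersistentVolumeClaim and storagev1.CSIStorageCapacity.
type Claim struct {
	Name           string
	RequestedBytes int64
}

type Capacity struct {
	StorageClassName string
	Bytes            int64
}

// DynamicProvision mirrors the shape of the struct proposed above.
type DynamicProvision struct {
	PVC      *Claim
	Capacity *Capacity
}

// capacityChecker has the same three-value shape as the modified
// hasEnoughCapacity: (sufficient, matching capacity or nil, error).
type capacityChecker func(claim *Claim) (bool, *Capacity, error)

// collectProvisions is a hypothetical helper. It returns one DynamicProvision
// per claim, or ok == false as soon as one claim does not fit on the node.
func collectProvisions(claims []*Claim, check capacityChecker) ([]*DynamicProvision, bool, error) {
	provisions := make([]*DynamicProvision, 0, len(claims))
	for _, claim := range claims {
		sufficient, capacity, err := check(claim)
		if err != nil {
			return nil, false, err
		}
		if !sufficient {
			return nil, false, nil
		}
		// capacity may be nil when no capacity check applies to the claim.
		provisions = append(provisions, &DynamicProvision{PVC: claim, Capacity: capacity})
	}
	return provisions, true, nil
}

func main() {
	pool := &Capacity{StorageClassName: "local", Bytes: 100}
	check := func(claim *Claim) (bool, *Capacity, error) {
		switch {
		case claim.RequestedBytes == 0:
			return true, nil, nil // nothing to check, nothing recorded
		case claim.RequestedBytes <= pool.Bytes:
			return true, pool, nil
		default:
			return false, nil, nil
		}
	}
	claims := []*Claim{{Name: "data", RequestedBytes: 10}, {Name: "no-request"}}
	provisions, ok, err := collectProvisions(claims, check)
	fmt.Println(len(provisions), ok, err)
}
```

Entries whose Capacity is nil are kept in the slice; in this KEP they are simply skipped later at scoring time.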

Scoring of nodes for dynamic provisioning

The Score method in the current VolumeBinding plugin scores nodes considering only static provisioning. The scoring applies to every entry in podVolumes.StaticBindings.

This KEP adds scoring of nodes for dynamic provisioning to the Score method of the VolumeBinding plugin. The scoring applies to every entry in podVolumes.DynamicProvisions whose Capacity is not nil.

Scoring for dynamic provisioning is performed only if there are no StaticBindings. In other words, if the Pod has only static bindings, or both static bindings and dynamic provisions, scoring is done as before, using static provisioning only. If there are only dynamic provisions, the following values are set in classResources and passed to the scorer function:

  • Requested: provision.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]
  • Capacity: CSIStorageCapacity

By doing this, we can calculate node scores for dynamic provisioning based on the Shape setting of VolumeBindingArgs, taking into account the amount of free space the nodes have.

// Score invoked at the score extension point.
func (pl *VolumeBinding) Score(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	if pl.scorer == nil {
		return 0, nil
	}
	state, err := getStateData(cs)
	if err != nil {
		return 0, framework.AsStatus(err)
	}
	podVolumes, ok := state.podVolumesByNode[nodeName]
        if !ok {
		return 0, nil
	}
-       // group by storage class
+
        classResources := make(classResourceMap)
-       for _, staticBinding := range podVolumes.StaticBindings {
-               class := staticBinding.StorageClassName()
-               storageResource := staticBinding.StorageResource()
-               if _, ok := classResources[class]; !ok {
-                       classResources[class] = &StorageResource{
-                               Requested: 0,
-                               Capacity:  0,
+       if len(podVolumes.StaticBindings) != 0 {
+               // group static binding volumes by storage class
+               for _, staticBinding := range podVolumes.StaticBindings {
+                       class := staticBinding.StorageClassName()
+                       storageResource := staticBinding.StorageResource()
+                       if _, ok := classResources[class]; !ok {
+                               classResources[class] = &StorageResource{
+                                       Requested: 0,
+                                       Capacity:  0,
+                               }
+                       }
+                       classResources[class].Requested += storageResource.Requested
+                       classResources[class].Capacity += storageResource.Capacity
+               }
+       } else {
+               // group dynamic binding volumes by storage class
+               for _, provision := range podVolumes.DynamicProvisions {
+                       if provision.Capacity == nil {
+                               continue
+                       }
+                       class := *provision.PVC.Spec.StorageClassName
+                       if _, ok := classResources[class]; !ok {
+                               classResources[class] = &StorageResource{
+                                       Requested: 0,
+                                       Capacity:  0,
+                               }
                        }
+                       requestedQty := provision.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]
+                       classResources[class].Requested += requestedQty.Value()
+                       classResources[class].Capacity += provision.Capacity.Capacity.Value()
                }
-               classResources[class].Requested += storageResource.Requested
-               classResources[class].Capacity += storageResource.Capacity
        }
+
        return pl.scorer(classResources), nil
}
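The grouping logic in the diff above can be exercised in isolation. The following is a minimal sketch under simplifying assumptions: staticBinding, dynamicProvision, and groupByClass are hypothetical stand-ins that carry plain integers instead of Kubernetes API objects, and only StorageResource keeps the field names used in the diff.

```go
package main

import "fmt"

// StorageResource uses the same field names as in the diff above.
type StorageResource struct {
	Requested int64
	Capacity  int64
}

// staticBinding and dynamicProvision are simplified stand-ins (assumptions).
type staticBinding struct {
	Class     string
	Requested int64
	Capacity  int64
}

type dynamicProvision struct {
	Class     string
	Requested int64
	Capacity  *int64 // nil when no CSIStorageCapacity was recorded
}

// groupByClass aggregates requested and available bytes per storage class.
// Dynamic provisions are considered only when there are no static bindings.
func groupByClass(static []staticBinding, dynamic []dynamicProvision) map[string]*StorageResource {
	resources := make(map[string]*StorageResource)
	add := func(class string, requested, capacity int64) {
		if _, ok := resources[class]; !ok {
			resources[class] = &StorageResource{}
		}
		resources[class].Requested += requested
		resources[class].Capacity += capacity
	}
	if len(static) != 0 {
		for _, b := range static {
			add(b.Class, b.Requested, b.Capacity)
		}
		return resources
	}
	for _, p := range dynamic {
		if p.Capacity == nil {
			continue // no capacity information, so nothing to score
		}
		add(p.Class, p.Requested, *p.Capacity)
	}
	return resources
}

func main() {
	hundred := int64(100)
	dynamicOnly := groupByClass(nil, []dynamicProvision{
		{Class: "fast", Requested: 10, Capacity: &hundred},
		{Class: "fast", Requested: 20, Capacity: &hundred},
		{Class: "slow", Requested: 5}, // skipped: Capacity is nil
	})
	fmt.Println(*dynamicOnly["fast"]) // {30 200}
}
```

When at least one static binding is present, the dynamic provisions are ignored entirely, which matches the branch structure of the proposed Score method.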

Users can select the scoring logic from the following options in VolumeBindingArgs. The same scoring logic applies to all Pods and their PVCs.

  • (a) Prefer a node with the least allocatable.
  • (b) Prefer a node with the maximum allocatable.

Considering the common scenario of local storage, we want to leave room for volume expansion after node allocation. The default setting is to prefer a node with the maximum allocatable.
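To make the two options concrete, the sketch below scores a single storage class by linear interpolation between utilization shape points. The shapePoint type, the scoreFor helper, the two example shapes, and the scaling of a 0..10 shape score to a 0..100 node score are assumptions made for illustration; the actual scorer and its defaults may differ.

```go
package main

import "fmt"

// shapePoint is a hypothetical stand-in for a utilization shape point.
type shapePoint struct {
	Utilization int64 // percent of capacity requested, 0..100
	Score       int64 // 0..10
}

// Example shapes (assumed values, sorted by ascending Utilization).
var (
	// mostFree gives the highest score to the lowest utilization.
	mostFree = []shapePoint{{Utilization: 0, Score: 10}, {Utilization: 100, Score: 0}}
	// leastFree gives the highest score to the highest utilization.
	leastFree = []shapePoint{{Utilization: 0, Score: 0}, {Utilization: 100, Score: 10}}
)

// scoreFor interpolates linearly between shape points and returns a score
// on a 0..100 scale for one storage class.
func scoreFor(shape []shapePoint, requested, capacity int64) int64 {
	if capacity == 0 || len(shape) == 0 {
		return 0
	}
	utilization := requested * 100 / capacity
	if utilization <= shape[0].Utilization {
		return shape[0].Score * 10
	}
	for i := 1; i < len(shape); i++ {
		lo, hi := shape[i-1], shape[i]
		if utilization <= hi.Utilization {
			slope := (hi.Score - lo.Score) * 10
			return lo.Score*10 + slope*(utilization-lo.Utilization)/(hi.Utilization-lo.Utilization)
		}
	}
	return shape[len(shape)-1].Score * 10
}

func main() {
	// A request of 10 units against a node with 100 units free (10% utilization)
	// and a node with 20 units free (50% utilization).
	fmt.Println(scoreFor(mostFree, 10, 100), scoreFor(mostFree, 10, 20))   // 90 50
	fmt.Println(scoreFor(leastFree, 10, 100), scoreFor(leastFree, 10, 20)) // 10 50
}
```

With the first shape, the node with more free space wins (90 versus 50), which leaves room for later expansion. With the second shape, the fuller node wins (50 versus 10), which packs volumes onto fewer nodes.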

Conditions for scoring static or dynamic provisioning

In the Score function, the score is calculated in the existing way (only static provisioning is taken into account) if at least one PVC is statically provisioned. Otherwise, the score is calculated from dynamic provisioning.

Implementation idea:

func (pl *VolumeBinding) Score(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	...

+ 	if len(podVolumes.StaticBindings) != 0 {
+ 		return staticScore, nil // Same value as the current method
+ 	}
+ 	return dynamicScore, nil // Proposed in this KEP
- 	return pl.scorer(classResources), nil
}

Feature Gate Consolidation

The StorageCapacityScoring feature gate will now control the functionality previously managed by the VolumeCapacityPriority feature gate, which will be deprecated. This consolidation focuses on enabling node scoring based on storage capacity, limited to the behaviors necessary for StorageCapacityScoring. Specifically, the utilization shape points are supported because they are required for StorageCapacityScoring. However, the weight of storage class has not been implemented (ref1, ref2), and there are no plans to require it for StorageCapacityScoring, so it will not be implemented. For more details on the original proposal, see KEP-1845.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Nothing in particular.

Unit tests

The following unit tests are planned:

  • Are the scores assigned to nodes for dynamic provisioning appropriate for the amount of free space?
  • Do scoring based on free space for dynamic provisioning and scoring for StaticBindings both work?

Integration tests

The scoring function will be tested in test/integration/volumescheduling/storage_capacity_scoring_test.go.

e2e tests

The following e2e tests are planned:

  • When only static provisioning is available, or a mixture of static provisioning and dynamic provisioning is available:
    • Does it pass traditional tests?
  • When only dynamic provisioning is available:
    • Is the Pod placed on the node with the largest available space by default?
    • When VolumeBindingArgs is set to “Prefer a node with the maximum allocatable”, is the Pod placed on the node with the largest available space?
    • When VolumeBindingArgs is set to “Prefer a node with the least allocatable”, is the Pod placed on the node that meets the requested size but has the smallest available space?
    • Does the Pod placement fail if no node meets the requested size?
    • Even when the Pod is recreated, is the placement in the node performed as expected above?

Graduation Criteria

Alpha

  • Add StorageCapacityScoring feature gate
  • E2e tests completed

Beta

  • One release with positive feedback from users

GA

  • No users complaining about the new behavior

Upgrade / Downgrade Strategy

  1. Upgrading the cluster to support storage capacity scoring for dynamic provisioning:

    • After the upgrade, the scheduler will be able to score nodes based on their storage capacity for dynamic provisioning. This will involve additional checks and calculations to ensure that nodes with sufficient capacity are prioritized.
    • Existing configurations and API usage will remain compatible, but administrators may need to review and adjust their storage class configurations to fully leverage the new scoring mechanism.
  2. Downgrading the cluster to a version without storage capacity scoring for dynamic provisioning:

    • If the cluster is downgraded, the scheduler will revert to the previous behavior where storage capacity scoring for dynamic provisioning is not considered.
    • Any Pods created after the upgrade will still exist, but their scheduling will no longer take storage capacity into account, potentially leading to less optimal placement.
    • No additional changes to invocations or configurations are required, but administrators should be aware that the enhanced scheduling capabilities will be lost.

Version Skew Strategy

Nothing in particular.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: StorageCapacityScoring
    • Components depending on the feature gate: kube-scheduler

Does enabling the feature change any default behavior?

Yes. When the feature is enabled, scheduling behavior changes: nodes are also scored for Pods that use only dynamically provisioned PVCs.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, this feature can be disabled after it has been enabled by setting the feature gate to false again. In doing so, the scoring for VolumeBinding will revert to the current method. This change won’t affect the behavior of existing Pods.

What happens if we reenable the feature if it was previously rolled back?

Re-enabling the feature from a rolled-back state will result in scheduling that considers dynamic provisioning. There will be no impact on existing running Pods.

Are there any tests for feature enablement/disablement?

Yes. We will add unit tests with and without the feature gate enabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Turning the feature gate flag on/off only changes scheduling scoring. So there is no possibility of impacting workloads that are already running.

What specific metrics should inform a rollback?

A spike in the metric schedule_attempts_total{result="error|unschedulable"} after this feature gate is enabled.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not applicable, yet.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

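Yes. The VolumeCapacityPriority feature gate will be deprecated, and its functionality will be controlled by the StorageCapacityScoring feature gate.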

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

If enabled, this feature applies to all workloads that use PVCs with delayed binding. In addition, a non-zero value of the metric plugin_execution_duration_seconds{plugin="VolumeBinding",extension_point="Score"} is a sign that this feature is in use. Unfortunately, there is no way to distinguish whether only static provisioning is being considered (the current behavior) or both static and dynamic provisioning are being considered (the new behavior).

How can someone using this feature know that it is working for their instance?

Pods that use only dynamically provisioned PVCs will be scheduled to nodes with more available capacity.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature may affect the time taken by scheduling. Concrete SLOs will be clarified during the beta phase.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

This will be clarified during the beta phase.

  • Metrics
    • Metric name: plugin_execution_duration_seconds{plugin="VolumeBinding",extension_point="Score"}

Are there any missing metrics that would be useful to have to improve observability of this feature?

Nothing in particular.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, it may affect the time taken by scheduling.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No, this feature will not exhaust node resources such as PIDs, sockets, or inodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The behavior in such cases does not change. This proposal only modifies one of the plugins in the kube-scheduler.

What are other known failure modes?

Not applicable, yet.

What steps should be taken if SLOs are not being met to determine the problem?

Check the kube-scheduler logs.

Implementation History

  • 2023-05-30 Initial KEP sent out for review

Drawbacks

  • The implementation of storage capacity scoring for dynamic provisioning may introduce complexity in the scheduling process. This could potentially lead to increased scheduling latency as the scheduler performs additional checks and calculations.

Alternatives

Weighting Static Provisioning Scores and Dynamic Provisioning Scores

The scoring function will return the sum of the static score and the dynamic score, each multiplied by their respective weights. The weights are determined by the ratio of static and dynamic capacities.

Implementation idea for the Score function:

func (pl *VolumeBinding) Score(...) (int64, *framework.Status) {
	...
	return int64(staticWeight*float64(staticScore) + (1-staticWeight)*float64(dynamicScore)), nil
}

Ultimately, the current design was chosen. The reasons are as follows:

  • Conflict issue: In this approach, there is a possibility that the static provisioning and dynamic provisioning scores could cancel each other out, leading to inaccurate scoring.
  • Feasibility of implementation: The current design was deemed more feasible and clearer in terms of implementation.
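The conflict issue can be shown with a small worked example. The sketch below is one possible reading of this alternative, not code from any implementation: weightedScore is a hypothetical helper, and taking the static weight as the static share of the total capacity is an assumption about how "the ratio of static and dynamic capacities" would be applied.

```go
package main

import "fmt"

// weightedScore is a hypothetical helper for the rejected alternative.
// The static weight is assumed to be the static share of the total capacity.
func weightedScore(staticScore, dynamicScore, staticCapacity, dynamicCapacity int64) int64 {
	total := staticCapacity + dynamicCapacity
	if total == 0 {
		return 0
	}
	staticWeight := float64(staticCapacity) / float64(total)
	return int64(staticWeight*float64(staticScore) + (1-staticWeight)*float64(dynamicScore))
}

func main() {
	// Node A is ideal for the static volumes and poor for the dynamic ones.
	// Node B is the opposite. Both have equal static and dynamic capacity.
	nodeA := weightedScore(100, 0, 100, 100)
	nodeB := weightedScore(0, 100, 100, 100)
	fmt.Println(nodeA, nodeB) // 50 50
}
```

Both nodes end up with the same score even though they differ in every respect that was measured, so the combined score no longer expresses a preference.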

Infrastructure Needed (Optional)