KEP-554: Volume Scheduling Limits
Table of Contents
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Implementation History
- Alternatives
Release Signoff Checklist
- kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- KEP approvers have set the KEP status to implementable
- Design details are appropriately documented
- Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- Graduation criteria is in place
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Note: Any PRs to move a KEP to implementable or significant changes once it is marked implementable should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).
Summary
The number of volumes of a given type that can be attached to a node should be easy to configure and should depend on the node type. This proposal implements dynamic attachable volume limits on a per-node basis, replacing the cluster-wide defaults that exist today. It also adds a way to configure volume limits for CSI volumes.
This proposal replaces #730 and integrates volume limits for in-tree volumes (AWS EBS, GCE PD, Azure DD, OpenStack Cinder) and CSI into one predicate. As a result, in-tree volumes and the corresponding CSI driver can share the same volume limit.
Motivation
Current scheduler predicates for scheduling pods with volumes are based on node.status.capacity and node.status.allocatable. This works well for the hardcoded volume limit predicates on AWS (MaxEBSVolumeCount), GCE (MaxGCEPDVolumeCount), Azure (MaxAzureDiskVolumeCount) and OpenStack (MaxCinderVolumeCount).
It is problematic for CSI (MaxCSIVolumeCountPred), as outlined in #730:
- ResourceName is limited to 63 characters. We must prefix ResourceName with a unique string (such as attachable-volumes-csi-<driver name>) so it cannot collide with existing resources like cpu or memory. But <driver name> itself can be up to 63 characters long, so we ended up using SHA-sums of the driver name to keep the ResourceName unique, which is not user readable.
- A CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the ebs.csi.aws.com CSI driver on the same node.
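The length problem described above can be illustrated with a short Go sketch. It only demonstrates the arithmetic: the truncated SHA-256 scheme in hashedKey is a hypothetical stand-in, not necessarily the hashing that was actually used, and the function names are invented for this example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// maxResourceNameLen is the 63-character ResourceName limit cited in the text.
const maxResourceNameLen = 63

// prefix keeps CSI limits from colliding with resources such as cpu or memory.
const prefix = "attachable-volumes-csi-"

// readableKey returns the prefix plus the plain driver name and reports
// whether the result still fits within the limit.
func readableKey(driver string) (string, bool) {
	key := prefix + driver
	return key, len(key) <= maxResourceNameLen
}

// hashedKey is a hypothetical workaround: it hashes the driver name and
// truncates the hex digest to the space left after the prefix.
func hashedKey(driver string) string {
	sum := sha256.Sum256([]byte(driver))
	digest := hex.EncodeToString(sum[:])
	return prefix + digest[:maxResourceNameLen-len(prefix)]
}

func main() {
	short := "ebs.csi.aws.com"
	long := strings.Repeat("d", 63) // longest driver name the text allows

	for _, d := range []string{short, long} {
		key, ok := readableKey(d)
		fmt.Printf("driver length %d -> key length %d, fits: %v\n", len(d), len(key), ok)
	}
	// The hashed key always fits, but no longer shows the driver name.
	fmt.Println("hashed key:", hashedKey(long))
}
```

A short driver name fits with the prefix, while a maximum-length name does not, which is why a readable key cannot be guaranteed under this scheme.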
Goals
- When a CSI driver is installed on the node, for in-tree drivers that are being considered for migration to CSI, the same predicate will handle volume limit counting for both in-tree and CSI volumes. Similarly, the same limit will apply whether the user is using CSI or in-tree volumes on the node.
- Existing predicates for in-tree volumes, MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount and MaxCinderVolumeCount (now deprecated), will be removed when in-tree to CSI migration is GA and enabled by default for that particular volume plugin.
  - When both the deprecated in-tree predicate and the CSI predicate are enabled, only MaxCSIVolumeCountPred does useful work and the other is a NOOP to save CPU. This requires the CSI driver to be installed on the node.
- Scheduler does not place pods that require CSI volumes on nodes that don’t have the CSI driver installed.
- Scheduler does not increase its CPU consumption. Any regression must be approved by sig-scheduling.
  - Scheduler benchmark must be extended to schedule pods with volumes as part of this KEP.
Note: Although this section and others say that existing predicates will become NOOP, they still have to look up the CSINode object and return early as applicable.
Non-Goals
- Heterogeneous clusters, i.e. clusters where access to storage is limited to only some nodes. Existing PV.spec.nodeAffinity handling, not modified by this KEP, will filter out nodes that don’t have access to the storage, so predicates changed in this KEP don’t need to worry about storage topology and can be simpler.
- Multiple plugins sharing the same volume limits. We expect that every CSI driver will have its own limits, not shared with other CSI drivers. In this KEP we support only in-tree volume plugins sharing their limits with one hard-coded CSI driver each.
- Multiple “units” per single volume. Each volume used on a node takes exactly 1 unit from allocatable.count, regardless of the volume size, its replica count, number of connections to remote servers or other underlying resources needed to use the volume. For example, a multipath iSCSI volume with three paths (and thus three iSCSI connections to three different servers) still takes 1 unit from CSINode.spec.drivers[xyz].allocatable.count.
- Maximum capacity per node. Some cloud environments limit both the number of attached volumes (covered in this KEP) and the total capacity of attached volumes (not covered in this KEP). For example, this KEP will ensure that the scheduler puts at most 128 GCE PD volumes on a typical GCE node, but it won’t ensure that the total capacity of the volumes is less than 64 TB.
- Volume limits do not yet integrate with the cluster autoscaler when all nodes in the cluster are running at their maximum volume limits.
Proposal
Track volume limits for CSI drivers in the CSINode object instead of Node, and update the scheduler to use the CSINode object to determine volume limits and the availability of a CSI driver.
The limit in CSINode is used instead of the limit from the Node object whenever it is available for the same in-tree volume type. This means the scheduler will always try to translate the in-tree driver name to the CSI driver name whenever the CSINode object has the same in-tree volume type (even if migration is off).
Moving the limits to CSINode serves two purposes:
- To get rid of prefix + SHA for ResourceName of CSI volumes.
- So an in-tree volume plugin can share limits with the CSI driver that uses the same storage backend.
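The lookup order described above (CSINode first, Node as a fallback) might be sketched as follows. The csiNode and node types are simplified, hypothetical stand-ins for the real API objects, volumeLimit is an invented helper name, and the resource key in main is used only for illustration.

```go
package main

import "fmt"

// csiNode is a simplified, hypothetical view of a CSINode object.
type csiNode struct {
	// limits maps a CSI driver name to its limit; a nil value means "no limit".
	limits map[string]*int32
}

// node is a simplified, hypothetical view of Node.status.allocatable.
type node struct {
	allocatable map[string]int64
}

// volumeLimit returns the limit for a driver and whether any limit applies.
// It prefers CSINode and falls back to Node only when CSINode has no entry
// for the driver.
func volumeLimit(cn *csiNode, n *node, driver, legacyKey string) (int64, bool) {
	if cn != nil {
		if limit, found := cn.limits[driver]; found {
			if limit == nil {
				return 0, false // driver is registered and reports no limit
			}
			return int64(*limit), true
		}
	}
	if n != nil {
		if v, found := n.allocatable[legacyKey]; found {
			return v, true
		}
	}
	return 0, false
}

func main() {
	ebs := int32(39)
	cn := &csiNode{limits: map[string]*int32{"ebs.csi.aws.com": &ebs}}
	n := &node{allocatable: map[string]int64{"attachable-volumes-aws-ebs": 25}}

	// CSINode has an entry, so its value wins over the Node value.
	fmt.Println(volumeLimit(cn, n, "ebs.csi.aws.com", "attachable-volumes-aws-ebs"))
	// Without a CSINode object the Node value is used.
	fmt.Println(volumeLimit(nil, n, "ebs.csi.aws.com", "attachable-volumes-aws-ebs"))
}
```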
API Change
CSINode is split into spec and status. spec contains the list of drivers installed on the node and those of their properties that do not change during the lifetime of a driver. status is missing right now, but it may be used later, e.g. for driver health that changes over time.
We expect that the limits of a CSI driver do not change during the lifetime of the driver, and therefore we put the resource limits into CSINodeSpec. The only way for a driver to change its limits is to deregister and register again, e.g. by restarting its container.
// Until further notice, this is existing API to introduce full context.
type CSINode struct {
	// ...
	// spec is the specification of CSINode
	Spec CSINodeSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"`
}
// CSINodeSpec holds information about the specification of all CSI drivers installed on a node
type CSINodeSpec struct {
	// drivers is a list of information of all CSI Drivers existing on a node.
	// If all drivers in the list are uninstalled, this can become empty.
	// +patchMergeKey=name
	// +patchStrategy=merge
	Drivers []CSINodeDriver `json:"drivers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,1,rep,name=drivers"`
}
// CSINodeDriver holds information about the specification of one CSI driver installed on a node
type CSINodeDriver struct {
	// ...
	// NEW API STARTS HERE
	// Allocatable represents the resources of a node that are available for scheduling for volumes of this driver.
	Allocatable VolumeLimits
}
// VolumeLimits is a set of resource limits for scheduling of volumes.
type VolumeLimits struct {
	// Count is maximum number of volumes provided by the CSI driver that can be used by the node
	// "nil" represents no limits - the node can handle any number of volumes of the driver.
	Count *int32 `json:"count,omitempty" protobuf:"varint,1,opt,name=count"`
	// Future proof: max. total size of volumes on the node can be added later
}
CSINode example:
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: ip-172-18-4-112.ec2.internal
spec:
  drivers:
  - name: ebs.csi.aws.com
    # Already existing fields
    nodeID: ip-172-18-4-112.ec2.internal
    topologyKeys:
    # ...
    # New API:
    allocatable:
      # AWS node can attach max. 40 volumes, 1 is reserved for the system
      count: 39
  - name: org.kernel.nfs
    allocatable:
      # NFS does not impose any limits of volumes mounted to the node
      count: # nil means "no limit"
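To make the nil semantics of count concrete, here is a minimal Go sketch of how a consumer of the proposed VolumeLimits struct could check whether more volumes fit on a node. The fits helper is hypothetical and not part of the proposed API; the struct is repeated without tags so the example is self-contained.

```go
package main

import "fmt"

// VolumeLimits mirrors the proposed struct (tags omitted for brevity).
type VolumeLimits struct {
	Count *int32
}

// fits reports whether newVolumes more volumes can be added on top of inUse
// without exceeding the limit. A nil Count means the driver has no limit.
func fits(limits VolumeLimits, inUse, newVolumes int) bool {
	if limits.Count == nil {
		return true
	}
	return inUse+newVolumes <= int(*limits.Count)
}

func main() {
	ebs := int32(39) // value from the ebs.csi.aws.com entry in the example

	fmt.Println(fits(VolumeLimits{Count: &ebs}, 38, 1)) // takes the last free slot
	fmt.Println(fits(VolumeLimits{Count: &ebs}, 39, 1)) // node is already full
	fmt.Println(fits(VolumeLimits{}, 1000, 1))          // nil count: no limit
}
```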
Implementation
The sections below describe the behaviour of the old predicates, the CSI predicate and the scheduler after the proposal has been implemented.
For brevity, “old predicates” refers to the now deprecated cloud-provider-specific predicates: MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount and MaxCinderVolumeCount.
Implementation Detail for all CSI Drivers
- Kubelet will create CSINode instance during initial CSI Driver registration.
  - Limits of each CSI volume plugin will be added to CSINode.spec.drivers[xyz].allocatable.
  - User may NOT change CSINode.spec.drivers[xyz].allocatable to override volume plugin / CSI driver values, e.g. to “reserve” some attachment to the operating system. Kubelet will periodically reconcile CSINode and overwrite the value.
    - Especially, kubelet --kube-reserved or --system-reserved cannot be used to “reserve” volumes for kubelet or the OS. It is not possible with existing kubelet and this KEP does not change it. We expect that CSI drivers will have configuration options / cmdline arguments to reserve some volumes and they will report their limit already reduced by that reserved amount.
- Scheduler will respect Node.status.allocatable and Node.status.capacity for CSI volumes if CSINode object is not available or has missing entry in CSINode.spec.drivers[xyz].allocatable during a deprecation period but kubelet will stop populating Node.status.allocatable and Node.status.capacity for CSI volumes.
  - After deprecation period for CSI volumes, limits coming from Node.status.allocatable and Node.status.capacity will be completely ignored by the scheduler.
- Scheduler won’t schedule pods with volumes (in-tree or CSI) for which migration has been enabled and driver is not installed on the node yet.
- Volumes for which there is no in-tree to CSI migration plan will follow a deprecation cycle before.
- Important: this can only be implemented once volume limits are integrated with the cluster autoscaler.
Implementation detail for in-tree Drivers with CSI migration disabled
When no CSI driver for the same underlying storage type is installed on the node:
- For Azure, GCEPD, AWS and Cinder, in-tree volume plugins will keep reporting their limits via the Node object and old predicates will work as expected until CSI migration has been enabled (and is GA) for the given volume plugin.
When a CSI driver for the same underlying storage type is installed on the node:
- For Azure, GCEPD, AWS and Cinder, in-tree volume plugins will report their limits via the Node object, same as before.
Implementation detail for in-tree Drivers with CSI migration enabled
- For Azure, GCEPD, AWS and Cinder, in-tree volume plugins will report their limits via the Node object, same as before.
- Old predicates will be modified to perform an additional check of CSINode. If they detect that CSI migration has been enabled for the volume, the old predicate will return early (with success) and MaxCSIVolumeCountPred will be responsible for counting both CSI and in-tree volumes of the same type.
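The early return of an old predicate could look roughly like the sketch below. The csiNodeInfo type and the way migration status is represented are assumptions made for illustration only; this KEP does not specify how a predicate reads the migration state from CSINode.

```go
package main

import "fmt"

// csiNodeInfo is a hypothetical, simplified view of what a predicate can
// learn from a node's CSINode object.
type csiNodeInfo struct {
	// migrated maps an in-tree plugin name to whether CSI migration is
	// enabled for it on this node.
	migrated map[string]bool
}

// oldPredicate sketches a deprecated in-tree predicate. When migration is
// enabled it returns success immediately and reports that counting is
// delegated to the CSI predicate; otherwise it applies its own limit.
func oldPredicate(info *csiNodeInfo, plugin string, inUse, requested, hardcodedLimit int) (fits, delegated bool) {
	if info != nil && info.migrated[plugin] {
		return true, true // NOOP: MaxCSIVolumeCountPred does the counting
	}
	return inUse+requested <= hardcodedLimit, false
}

func main() {
	const plugin = "kubernetes.io/aws-ebs"
	migratedNode := &csiNodeInfo{migrated: map[string]bool{plugin: true}}

	// Migration enabled: the old predicate passes regardless of the count.
	fmt.Println(oldPredicate(migratedNode, plugin, 39, 1, 39))
	// No CSINode information: the old predicate enforces its own limit.
	fmt.Println(oldPredicate(nil, plugin, 39, 1, 39))
}
```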
User Stories
Implementation Details/Notes/Constraints
The CSI migration library is used to find the CSI driver name and the VolumeHandle for volumes of in-tree volume plugins. This CSI driver name is used as the key in the CSINode.spec.drivers[xyz].allocatable list. The VolumeHandle is unique for each volume and will be used to de-duplicate volumes used by multiple pods on the same node.
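The de-duplication by VolumeHandle can be sketched as below. The volumeRef type and countUnique function are hypothetical; they assume that each volume has already been translated to a (driver name, VolumeHandle) pair, and the handles in main are made up.

```go
package main

import "fmt"

// volumeRef is a hypothetical (driver name, VolumeHandle) pair for one
// volume used by a pod, after any in-tree to CSI translation.
type volumeRef struct {
	Driver string
	Handle string
}

// countUnique returns, per driver, the number of distinct volumes used by
// the given pods. A volume shared by several pods is counted once.
func countUnique(pods [][]volumeRef) map[string]int {
	seen := make(map[volumeRef]struct{})
	counts := make(map[string]int)
	for _, vols := range pods {
		for _, v := range vols {
			if _, dup := seen[v]; dup {
				continue
			}
			seen[v] = struct{}{}
			counts[v.Driver]++
		}
	}
	return counts
}

func main() {
	pods := [][]volumeRef{
		{{Driver: "ebs.csi.aws.com", Handle: "vol-1"}},
		// The second pod shares vol-1 and adds vol-2.
		{{Driver: "ebs.csi.aws.com", Handle: "vol-1"}, {Driver: "ebs.csi.aws.com", Handle: "vol-2"}},
	}
	fmt.Println(countUnique(pods)["ebs.csi.aws.com"])
}
```

Keying the set on both driver and handle keeps two drivers that happen to use the same handle string from being merged.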
Risks and Mitigations
This KEP depends on the CSI migration library. It can happen that CSI migration is redesigned or cancelled.
- Countermeasure: CSI migration and this KEP should graduate together.
This KEP depends on the CSI migration library's ability to handle in-line in-tree volumes. The scheduler will need to get the CSI driver name + VolumeHandle from them to count them towards the limit.
Design Details
Existing feature gate AttachVolumeLimit will be re-used for implementation of this KEP. The feature is already beta and is enabled by default.
Test Plan
Scheduler benchmark must be extended to run pods with volumes as part of this KEP. The following matrix will be tested:
- Predicates:
  - All volume predicates enabled.
  - Only deprecated MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount and MaxCinderVolumeCount predicates enabled.
  - Only MaxCSIVolumeCountPred predicate enabled.
- API objects:
  - Both CSINode and Node containing spec/status.allocatable for a volume plugin (to simulate kubelet during deprecation period).
  - Only CSINode containing spec.drivers[xyz].allocatable for a volume plugin (to simulate kubelet after deprecation period).
  - Only Node containing status.allocatable for a volume plugin (to simulate old kubelet).
- Test results should be ideally the same as before the KEP.
  - Any deviation needs to be approved by sig-scheduling.
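One possible reading of the benchmark matrix is the cross product of the three predicate configurations and the three API object setups. The sketch below enumerates it; the benchCase type and the labels are illustrative and not taken from any existing benchmark code.

```go
package main

import "fmt"

// benchCase is a hypothetical description of one benchmark configuration.
type benchCase struct {
	Predicates string
	APIObjects string
}

// matrix returns every combination of predicate set and API object setup.
func matrix() []benchCase {
	predicates := []string{
		"all volume predicates",
		"only deprecated in-tree predicates",
		"only MaxCSIVolumeCountPred",
	}
	objects := []string{
		"CSINode and Node (kubelet during deprecation period)",
		"only CSINode (kubelet after deprecation period)",
		"only Node (old kubelet)",
	}
	var cases []benchCase
	for _, p := range predicates {
		for _, o := range objects {
			cases = append(cases, benchCase{Predicates: p, APIObjects: o})
		}
	}
	return cases
}

func main() {
	for i, c := range matrix() {
		fmt.Printf("%d: %s / %s\n", i+1, c.Predicates, c.APIObjects)
	}
}
```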
Run e2e tests and kubelet version skew tests to check that scheduler picks the right values from CSINode or Node.
Add e2e test that runs pods with both in-tree volumes and CSI driver for the same storage backend and check that they share the same volume limits.
Graduation Criteria
Alpha -> Beta Graduation
N/A (AttachVolumeLimit feature is already beta).
Beta -> GA Graduation
It must graduate together with CSI migration. We can enable caching of in-use volumes on a node to improve performance before going GA.
Upgrade / Downgrade / Version Skew Strategy
During upgrade, downgrade or version skew, kubelet may be older than the scheduler. Such a kubelet will not fill CSINode.spec with volume limits and will instead fill volume limits into Node.status. The scheduler must fall back to Node.status when CSINode is not available or its spec does not contain a volume plugin / CSI driver.
Interaction with old AttachVolumeLimit implementation
Due to version skew, the following situations are possible (the scheduler always has AttachVolumeLimit enabled and this KEP implemented):
- Kubelet has AttachVolumeLimit off:
  - Scheduler does not see any volume limits in CSINode nor Node.
  - In-tree volumes: since CSINode is missing, scheduler falls back to MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount and MaxCinderVolumeCount predicates and schedules in-tree volumes the old way with hardcoded limits.
  - CSI: from the scheduler's point of view, the node can handle any number of CSI volumes.
- Kubelet has old implementation of AttachVolumeLimit and the feature is on (kubelet fills Node.status.allocatable):
  - Scheduler does not see any volume limits in CSINode.
  - In-tree: since CSINode is missing, scheduler falls back to MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount and MaxCinderVolumeCount predicates and schedules in-tree volumes the old way.
  - CSI: scheduler falls back to the old implementation of MaxCSIVolumeCountPred for CSI volumes and uses limits from Node.status.
- Kubelet has new implementation of AttachVolumeLimit and the feature is on (kubelet fills CSINode):
  - No issue here, see this KEP.
  - Since CSINode is available, scheduler uses new implementation of MaxCSIVolumeCountPred.
As implied by the above, the scheduler needs to have both old and new implementation of MaxCSIVolumeCountPred and switch between them based on CSINode availability for a particular node until the old implementation is deprecated and removed (2 releases).
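The switch between the two implementations might be expressed as a small decision function, shown here as a sketch for CSI volumes. The limitSource type and pickSource are hypothetical names; the inputs stand for "CSINode has an entry for the driver" and "Node.status carries a limit for it".

```go
package main

import "fmt"

// limitSource names where the scheduler takes CSI volume limits from.
type limitSource int

const (
	sourceNone    limitSource = iota // kubelet reports no limits at all
	sourceNode                       // old kubelet: limits in Node.status
	sourceCSINode                    // new kubelet: limits in CSINode
)

func (s limitSource) String() string {
	switch s {
	case sourceCSINode:
		return "CSINode (new MaxCSIVolumeCountPred)"
	case sourceNode:
		return "Node.status (old MaxCSIVolumeCountPred)"
	default:
		return "none (any number of CSI volumes)"
	}
}

// pickSource prefers CSINode, then Node.status, then no limit.
func pickSource(hasCSINodeEntry, hasNodeLimit bool) limitSource {
	switch {
	case hasCSINodeEntry:
		return sourceCSINode
	case hasNodeLimit:
		return sourceNode
	default:
		return sourceNone
	}
}

func main() {
	fmt.Println("AttachVolumeLimit off:     ", pickSource(false, false))
	fmt.Println("old AttachVolumeLimit, on: ", pickSource(false, true))
	fmt.Println("new AttachVolumeLimit, on: ", pickSource(true, false))
}
```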
Implementation History
- K8s 1.11: Alpha
- K8s 1.12: Beta
- K8s 1.17: GA
Alternatives
In https://github.com/kubernetes/enhancements/pull/730 we tried to merge volume limits in Node.status.capacity and Node.status.allocatable. We discovered these issues:
- We cannot use the plain CSI driver name as a resource name in Node.status.allocatable, as it could collide with other resources (e.g. “memory”), so we added a volume-specific prefix.
- Since a CSI driver name can be up to 63 characters long, the prefix + driver name cannot fit into the 63-character resource name limit. We ended up hashing the driver name to save space.
By moving volume limit to CSINode we fix both issues.