KEP-5075: DRA Consumable Capacity
- Release Signoff Checklist
- Summary
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Identifying multi-allocatable Property of Device
- Selecting/Deselecting multi-allocatable Devices
- Preventing Same multi-allocatable Device from Being Allocated Multiple Times in the Same Claim
- Identifying shared device in the device status
- Defining a valid range for RequestPolicy
- Alternative words
- Future Possibilities
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Without this KEP, device sharing is done by having multiple pods (and/or containers) reference the same resource claim, and that resource claim has allocated the device. This can be considered exclusive or dedicated allocation. With this KEP, independent resource claims (and/or requests within a claim) can allocate shares of resources provided by the same underlying device. This enables resource sharing across pods that are completely unrelated, potentially even across different namespaces.
When a device is shared across multiple resource claims, this enhancement enables device resource allocation to be drawn from the device’s overall capacity.
It ensures that the total resources consumed by all claim requests remain within the device’s capacity and comply with any defined requestPolicy, such as minimum per-claim resource requirements, if specified.
This concept is referred to as consumable capacity.
If a request does not specify particular device resource requirements, it implies an expectation of full device capacity.
Notably, each of these independent resource claims can still be referenced by one or more pods. However, the device resources allocated to each request are shared without any isolation guarantees among the pods that reference the same request.
To achieve this, this KEP introduces
- a new device property field to distinguish between devices that can be allocated only once and those that can be allocated multiple times,
- a capacity-aware scheduling mechanism that allows limiting or guaranteeing the capacity of devices among the resource claims (or requests) that share them,
- a new capacity requirement field in the device request of the resource claim,
- a new consumed capacity field in the allocation result of the resource claim,
- a method to associate the allocated device status to the allocation result in the resource claim.
With those in place, a resource claim with multiple requests might allocate the same device multiple times. This may or may not be desired, so this KEP also introduces:
- a distinct attribute constraint to prevent allocating the same multi-allocatable device in the same claim multiple times.
Relations to other KEPs:
- KEP 4815: The partitioned devices can be a multi-allocatable device or have mutually exclusive partitions where one partition is multi-allocatable and the other is not. The partitioning constraints (the remaining SharedCounters) are only checked once during the allocation of a multi-allocatable partitioned device. Meanwhile, the constraints introduced in this KEP (the remainingCapacity of the multi-allocatable partitioned device) are checked during every allocation of the same device.
- KEP 5007: The allocated share can be provisioned at the pre-bind step.
- KEP 4817: A single network device can be shared across multiple pods, with each allocated share’s NetworkData identified by a unique Share ID.
- KEP 4816: The enhancement must be able to handle subrequests when the DRAPrioritizedList feature is enabled.
A motivating use case is to allocate a multi-allocatable network device in the CNI DRA driver which can be selected by more than one pod on demand during scheduling. The original discussion is in this PR’s comment thread. The limitation of the current implementation has been addressed here. The virtual network device is created and configured once the CNI is called based on the information of the master network device. The configured information specific to the generated device cannot be listed in the ResourceSlice in advance.
This feature is also beneficial for other multi-allocatable devices which are not within the scope of KEP-4815. For instance, this feature will allow reserving a fraction of the memory of a virtual GPU in the AWS virtual GPU device plugin. In other words, the device capacity allocation is determined by the user’s claim.
Goals
- Introduce an ability to allocate a multi-allocatable device via DRA multiple times in scenarios where pre-defined partitions are not viable, for example because there would be too many of them.
- Let a DRA driver declare which device-level resources it can guarantee or reserve for a specific request and which values are valid to reserve.
- Let users specify in device requests how much of certain device resources they require.
Non-Goals
- Define driver-specific attributes and configs (such as CNI parameter config).
- Support network security policy.
- Support aggregated resource consumption where multiple devices are allocated to satisfy a single capacity request.
  This is related to the comment about distinctAttributes.
- Support an extended use case where the resource guaranteeing behavior is determined by the first user request. For example, if the first request does not require a guarantee, the resource remains unguaranteed. However, if the first request requires a guarantee, the resource is marked as guaranteed, and all subsequent requests must adhere to that guarantee.
Proposal
User Stories (Optional)
Story 1
A DRA driver for networks advertises multi-allocatable devices for two interfaces, eth1 and eth2, which each connect to the same virtual LAN. The admin makes those devices available through a DeviceClass selecting only multi-allocatable devices, and users request access through a request which references that DeviceClass.
Story 2
When requesting two interfaces, the user requests two devices. To ensure that they don’t end up with the same multi-allocatable device for each request, they specify that the driver-specific “interfaceName” attribute must be different.
Story 3
A DRA driver for networks supports QoS-guaranteed bandwidth, which ensures that a specific amount of the multi-allocatable network device's bandwidth can be reserved exclusively for a resource request. The DRA driver also specifies the minimum and maximum amount of capacity that can be reserved for each resource request. When requesting the guaranteed network device, users specify their required guaranteed bandwidth. Otherwise, the default value defined by the DRA driver is applied.
Risks and Mitigations
The requested amount in the resource claim may not satisfy the capacity request policy, especially if the requested amount exceeds the maximum allowed consumption.
This scenario should be handled similarly to other scheduling issues, such as when the request exceeds the allocatable capacity. In such cases, the allocation fails and the pod remains pending.
The driver includes both multi-allocatable and dedicated (non-multi-allocatable) devices. There is a risk that a user may be allocated a multi-allocatable device (e.g., a multi-allocatable network device) and accidentally configure it with the HostDevice CNI plugin. This would move the device from the host into the user’s pod, preventing other users from accessing the multi-allocatable device.
Mitigation:
- Device drivers should define a concrete request policy. If the device is intended to be shared without capacity limits among requests, the request policy should set the consumable value to zero.
- Additionally, administrators should define clear device classes for multi-allocatable and dedicated devices to prevent such misallocations.
When a driver changes a device property from dedicated to multi-allocatable, existing resource claims that have no specified consumed capacity will adopt a default quantity based on the defined request policy. This default may represent a fraction of the device, potentially altering the behavior of existing claims.
Mitigation:
- The existing allocation result, which has no share ID (as it was previously a dedicated device), will be included in the allocated list. Scheduler must ensure that the device cannot be allocated for another resource claim during the scheduling process.
When a driver updates the request policy, the behavior of resource claims changes.
Mitigation:
- Device driver should avoid updating the request policy, or do so with caution, if any devices have already been allocated to the ResourceClaim or are under preparation.
Design Details
This enhancement introduces an AllowMultipleAllocations field within the Device of the ResourceSlice
to mark whether the device is multi-allocatable among multiple resource claims (or requests).
The multi-allocatable device can be assigned to more than one request if it satisfies the selection criteria and constraints.
The condition device.allowMultipleAllocations == true/false can be used in a CEL selector to select devices with or without the AllowMultipleAllocations property.
The enhancement also adds a RequestPolicy field to DeviceCapacity.
This field specifies how the device resources can be drawn from the device’s capacity for each claim request.
The request policy can either specify a range of valid values or a discrete set of them.
Each policy must have a default value.
If a device with the AllowMultipleAllocations property does not contain any Capacity, it can be allocated multiple times without device capacity constraints—that is, infinitely, as long as other scheduling conditions are met.
Users can define specific per-device resource requests using the newly added Capacity field,
which is available in each supported device request type under DeviceRequest.
Capacity contains a Requests map, where each entry specifies the required amount of a device resource.
The amount available for allocation is determined
by subtracting the aggregated allocation results of current claims from the device’s capacity as defined in the resource slice.
The remaining amount will be used solely by the allocator and will not be reflected in the resource slice.
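The subtraction described above can be sketched as follows. This is only an illustration: the `allocationResult` type and `remainingCapacity` function are hypothetical names, and quantities are simplified to `int64` instead of `resource.Quantity`.

```go
package main

import "fmt"

// allocationResult is a simplified, hypothetical stand-in for
// DeviceRequestAllocationResult. Quantities are plain int64 values here.
type allocationResult struct {
	Device   string
	Consumed map[string]int64 // capacity name -> consumed amount
}

// remainingCapacity subtracts every consumed amount recorded for the given
// device from its advertised capacity. The result lives only in memory,
// mirroring how the remaining amount is never written to the ResourceSlice.
func remainingCapacity(device string, capacity map[string]int64, results []allocationResult) map[string]int64 {
	remaining := make(map[string]int64, len(capacity))
	for name, value := range capacity {
		remaining[name] = value
	}
	for _, r := range results {
		if r.Device != device {
			continue
		}
		for name, used := range r.Consumed {
			if _, ok := remaining[name]; ok {
				remaining[name] -= used
			}
		}
	}
	return remaining
}

func main() {
	const gi = int64(1) << 30
	capacity := map[string]int64{"bandwidth": 10 * gi}
	results := []allocationResult{
		{Device: "eth1", Consumed: map[string]int64{"bandwidth": 5 * gi}},
		{Device: "eth1", Consumed: map[string]int64{"bandwidth": 2 * gi}},
		{Device: "eth2", Consumed: map[string]int64{"bandwidth": 1 * gi}},
	}
	left := remainingCapacity("eth1", capacity, results)
	fmt.Println(left["bandwidth"] / gi) // prints 3
}
```

The input capacity map is copied rather than modified, which matches the statement that the capacity value in the resource slice stays fixed.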
The calculation of capacity requirements will round the requested capacity up to the nearest valid amount,
based on the capacity’s request policy.
If users do not specify a capacity request, the consumed value will be:
- the device’s full capacity or
- the default value if it was specified by the request policy or
- none if the device usage is unlimited
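A minimal sketch of these three cases, assuming quantities are simplified to `int64` and using a hypothetical `deviceCapacity` type with only the two fields needed here. A device with no capacity entries yields an empty result, which corresponds to the unlimited case.

```go
package main

import "fmt"

// deviceCapacity is a hypothetical, reduced view of DeviceCapacity:
// Default is nil when no default is available from a requestPolicy.
type deviceCapacity struct {
	Value   int64
	Default *int64
}

// consumedWithoutRequest returns what a request consumes when it names no
// capacity at all: the policy default if one exists, else the full value.
func consumedWithoutRequest(capacities map[string]deviceCapacity) map[string]int64 {
	consumed := make(map[string]int64, len(capacities))
	for name, c := range capacities {
		if c.Default != nil {
			consumed[name] = *c.Default
		} else {
			consumed[name] = c.Value
		}
	}
	return consumed
}

func main() {
	def := int64(1)
	fmt.Println(consumedWithoutRequest(map[string]deviceCapacity{
		"bandwidth": {Value: 10, Default: &def}, // policy default applies
		"queues":    {Value: 4},                 // full capacity applies
	})) // map[bandwidth:1 queues:4]
	fmt.Println(len(consumedWithoutRequest(nil))) // 0: nothing is consumed
}
```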
A device with AllowMultipleAllocations property can only be allocated
when its consumability has been verified and its attributes match the request’s selectors and constraints.
The newly added ConsumedCapacity field in the DeviceRequestAllocationResult will be set to the calculated capacity upon a successful allocation.
This value may differ from the originally requested amount, as it is rounded up to the nearest valid value based on the device’s request policy.
API enhancement
To enable this enhancement, the following API updates are proposed.
ResourceSliceSpec’s Device
type Device struct {
...
// AllowMultipleAllocations marks whether the device is allowed to be allocated to multiple DeviceRequests.
//
// If AllowMultipleAllocations is set to true, the device can be allocated more than once,
// and all of its capacity is consumable, regardless of whether the requestPolicy is defined or not.
//
// +optional
// +featureGate=DRAConsumableCapacity
AllowMultipleAllocations *bool
}
type DeviceCapacity struct {
// Value defines how much of a certain capacity that device has.
//
// This field reflects the fixed total capacity and does not change.
// The consumed amount is tracked separately by scheduler
// and does not affect this value.
//
// +required
Value resource.Quantity
// RequestPolicy defines how this DeviceCapacity must be consumed
// when the device is allowed to be shared by multiple allocations.
//
// The Device must have allowMultipleAllocations set to true in order to set a requestPolicy.
//
// If unset, capacity requests are unconstrained:
// requests can consume any amount of capacity, as long as the total consumed
// across all allocations does not exceed the device's defined capacity.
// If request is also unset, default is the full capacity value.
//
// +optional
// +featureGate=DRAConsumableCapacity
RequestPolicy *CapacityRequestPolicy
}
// CapacityRequestPolicy defines how requests consume device capacity.
//
// Must not set more than one ValidRequestValues.
type CapacityRequestPolicy struct {
// Default specifies how much of this capacity is consumed by a request
// that does not contain an entry for it in DeviceRequest's Capacity.
//
// +optional
Default *resource.Quantity
// ValidValues defines a set of acceptable quantity values in consuming requests.
//
// Must not contain more than 10 entries.
// Must be sorted in ascending order.
//
// If this field is set,
// Default must be defined and it must be included in ValidValues list.
//
// If the requested amount does not match any valid value but smaller than some valid values,
// the scheduler calculates the smallest valid value that is greater than or equal to the request.
// That is: min(ceil(requestedValue) ∈ validValues), where requestedValue ≤ max(validValues).
//
// If the requested amount exceeds all valid values, the request violates the policy,
// and this device cannot be allocated.
//
// +optional
// +listType=atomic
// +oneOf=ValidRequestValues
ValidValues []resource.Quantity
// ValidRange defines an acceptable quantity value range in consuming requests.
//
// If this field is set,
// Default must be defined and it must fall within the defined ValidRange.
//
// If the requested amount does not fall within the defined range, the request violates the policy,
// and this device cannot be allocated.
//
// If the request doesn't contain this capacity entry, Default value is used.
//
// +optional
// +oneOf=ValidRequestValues
ValidRange *CapacityRequestPolicyRange
}
// CapacityRequestPolicyRange defines a valid range for consumable capacity values.
//
// - If the requested amount is less than Min, it is rounded up to the Min value.
// - If Step is set and the requested amount is between Min and Max but not aligned with Step,
// it will be rounded up to the next value equal to Min + (n * Step).
// - If Step is not set, the requested amount is used as-is if it falls within the range Min to Max (if set).
// - If the requested or rounded amount exceeds Max (if set), the request does not satisfy the policy,
// and the device cannot be allocated.
type CapacityRequestPolicyRange struct {
// Min specifies the minimum capacity allowed for a consumption request.
//
// Min must be greater than or equal to zero,
// and less than or equal to the capacity value.
// requestPolicy.default must be more than or equal to the minimum.
//
// +required
Min *resource.Quantity
// Max defines the upper limit for capacity that can be requested.
//
// Max must be less than or equal to the capacity value.
// Min and requestPolicy.default must be less than or equal to the maximum.
//
// +optional
Max *resource.Quantity
// Step defines the step size between valid capacity amounts within the range.
//
// Max (if set) and requestPolicy.default must be a multiple of Step.
// Min + Step must be less than or equal to the capacity value.
//
// +optional
Step *resource.Quantity
}
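The rounding rules documented on ValidValues and CapacityRequestPolicyRange can be sketched as below. This is an illustration under simplifying assumptions, not the scheduler's code: quantities are `int64`, and the `validRange` type and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// roundToValidValues returns the smallest entry of valid that is >= requested.
// valid must be sorted in ascending order, as the API comment requires.
// ok is false when the request exceeds every valid value.
func roundToValidValues(requested int64, valid []int64) (value int64, ok bool) {
	i := sort.Search(len(valid), func(i int) bool { return valid[i] >= requested })
	if i == len(valid) {
		return 0, false
	}
	return valid[i], true
}

// validRange is a hypothetical int64 version of CapacityRequestPolicyRange.
type validRange struct {
	Min  int64
	Max  *int64 // optional
	Step *int64 // optional
}

// roundToRange raises the request to Min, aligns it to Min + n*Step when a
// step is set, and rejects the result when it exceeds Max.
func roundToRange(requested int64, r validRange) (value int64, ok bool) {
	consumed := requested
	if consumed < r.Min {
		consumed = r.Min
	}
	if r.Step != nil && *r.Step > 0 {
		steps := (consumed - r.Min + *r.Step - 1) / (*r.Step) // ceiling division
		consumed = r.Min + steps*(*r.Step)
	}
	if r.Max != nil && consumed > *r.Max {
		return 0, false
	}
	return consumed, true
}

func main() {
	fmt.Println(roundToValidValues(3, []int64{1, 2, 4, 8})) // 4 true
	maxV, step := int64(10), int64(4)
	r := validRange{Min: 2, Max: &maxV, Step: &step}
	fmt.Println(roundToRange(3, r))  // 6 true
	fmt.Println(roundToRange(11, r)) // 0 false
}
```

In the range case the upper bound is checked after rounding, following the rule that a requested or rounded amount above Max does not satisfy the policy.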
CELDeviceSelector’s description
type CELDeviceSelector struct {
// ...
// The expression's input is an object named "device", which carries
// the following properties:
// - driver (string): the name of the driver which defines this device.
// - attributes (map[string]object): the device's attributes, grouped by prefix
// (e.g. device.attributes["dra.example.com"] evaluates to an object with all
// of the attributes which were prefixed by "dra.example.com".
// - capacity (map[string]object): the device's capacities, grouped by prefix.
// - allowMultipleAllocations (bool): the allowMultipleAllocations property of the device
// (v1.34+ with the DRAConsumableCapacity feature enabled).
// ...
// +required
Expression string
}
ResourceClaimSpec’s DeviceRequest
The Capacity field is defined within each supported device request type, such as DeviceSubRequest and ExactDeviceRequest.
type DeviceSubRequest struct {
// Capacity defines resource requirements against each capacity.
//
// If this field is unset and the device supports multiple allocations,
// the default value will be applied to each capacity according to requestPolicy.
// For the capacity that has no requestPolicy, default is the full capacity value.
//
// Applies to each device allocation.
// If Count > 1,
// the request fails if there aren't enough devices that meet the requirements.
// If AllocationMode is set to All,
// the request fails if there are devices that otherwise match the request,
// and have this capacity, with a value >= the requested amount, but which cannot be allocated to this request.
//
// +optional
// +featureGate=DRAConsumableCapacity
Capacity *CapacityRequirements
}
type ExactDeviceRequest struct {
// Capacity defines resource requirements against each capacity.
//
// If this field is unset and the device supports multiple allocations,
// the default value will be applied to each capacity according to requestPolicy.
// For the capacity that has no requestPolicy, default is the full capacity value.
//
// Applies to each device allocation.
// If Count > 1,
// the request fails if there aren't enough devices that meet the requirements.
// If AllocationMode is set to All,
// the request fails if there are devices that otherwise match the request,
// and have this capacity, with a value >= the requested amount, but which cannot be allocated to this request.
//
// +optional
// +featureGate=DRAConsumableCapacity
Capacity *CapacityRequirements
}
// CapacityRequirements defines the capacity requirements for a specific device request.
type CapacityRequirements struct {
// Requests represent individual device resource requests for distinct resources,
// all of which must be provided by the device.
//
// This value is used as an additional filtering condition against the available capacity on the device.
// This is semantically equivalent to a CEL selector with
// `device.capacity[<domain>].<name>.compareTo(quantity(<request quantity>)) >= 0`.
// For example, device.capacity['test-driver.cdi.k8s.io'].counters.compareTo(quantity('2')) >= 0.
//
// When a requestPolicy is defined, the requested amount is adjusted upward
// to the nearest valid value based on the policy.
// If the requested amount cannot be adjusted to a valid value—because it exceeds what the requestPolicy allows—
// the device is considered ineligible for allocation.
//
// For any capacity that is not explicitly requested:
// - If no requestPolicy is set, the default consumed capacity is equal to the full device capacity
// (i.e., the whole device is claimed).
// - If a requestPolicy is set, the default consumed capacity is determined according to that policy.
//
// If the device allows multiple allocation,
// the aggregated amount across all requests must not exceed the capacity value.
// The consumed capacity, which may be adjusted based on the requestPolicy if defined,
// is recorded in the resource claim’s status.devices[*].consumedCapacity field.
//
// +optional
Requests map[QualifiedName]resource.Quantity
}
type DeviceConstraint struct {
// DistinctAttribute requires that all devices in question have this
// attribute and that its type and value are unique across those devices.
//
// This acts as the inverse of MatchAttribute.
//
// This constraint is used to avoid allocating multiple requests to the same device
// by ensuring attribute-level differentiation.
//
// This is useful for scenarios where resource requests must be fulfilled by separate physical devices.
// For example, a container requests two network interfaces that must be allocated from two different physical NICs.
//
// +optional
// +oneOf=ConstraintType
// +featureGate=DRAConsumableCapacity
DistinctAttribute *FullyQualifiedName
}
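The DistinctAttribute check can be illustrated with a short sketch. It assumes string-valued attributes only (the real constraint also compares attribute types), and the `device` type and `satisfiesDistinct` function are hypothetical names.

```go
package main

import "fmt"

// device is a hypothetical, reduced device with string attributes only.
type device struct {
	Name       string
	Attributes map[string]string
}

// satisfiesDistinct reports whether every device carries the attribute and
// no two devices share the same value for it.
func satisfiesDistinct(devices []device, attribute string) bool {
	seen := make(map[string]struct{}, len(devices))
	for _, d := range devices {
		value, ok := d.Attributes[attribute]
		if !ok {
			return false // the attribute is required on every device
		}
		if _, dup := seen[value]; dup {
			return false // same value twice, e.g. the same shared device
		}
		seen[value] = struct{}{}
	}
	return true
}

func main() {
	eth1 := device{Name: "eth1", Attributes: map[string]string{"interfaceName": "eth1"}}
	eth2 := device{Name: "eth2", Attributes: map[string]string{"interfaceName": "eth2"}}
	fmt.Println(satisfiesDistinct([]device{eth1, eth2}, "interfaceName")) // true
	fmt.Println(satisfiesDistinct([]device{eth1, eth1}, "interfaceName")) // false
}
```

The second call shows the case this constraint exists for: two requests in one claim landing on the same multi-allocatable device.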
ResourceClaimStatus’s DeviceRequestAllocationResult
type DeviceRequestAllocationResult struct {
// ShareID uniquely identifies an individual allocation share of the device,
// used when the device supports multiple simultaneous allocations.
// It serves as an additional map key to differentiate concurrent shares
// of the same device.
//
// +optional
// +featureGate=DRAConsumableCapacity
ShareID *types.UID
// ConsumedCapacity tracks the amount of capacity consumed per device as part of the claim request.
// The consumed amount may differ from the requested amount: it is rounded up to the nearest valid
// value based on the device’s requestPolicy if applicable (i.e., may not be less than the requested amount).
//
// The total consumed capacity for each device must not exceed the DeviceCapacity's Value.
//
// This field is populated only for devices that allow multiple allocations.
// All capacity entries are included, even if the consumed amount is zero.
//
// +optional
// +featureGate=DRAConsumableCapacity
ConsumedCapacity map[QualifiedName]resource.Quantity
}
type ResourceClaimStatus struct {
// Devices contains the status of each device allocated for this
// claim, as reported by the driver. This can include driver-specific
// information. Entries are owned by their respective drivers.
//
// +optional
// +listType=map
// +listMapKey=driver
// +listMapKey=device
// +listMapKey=pool
// +listMapKey=shareID
// +featureGate=DRAResourceClaimDeviceStatus
Devices []AllocatedDeviceStatus
}
type AllocatedDeviceStatus struct {
// ShareID uniquely identifies an individual allocation share of the device.
//
// +optional
// +featureGate=DRAConsumableCapacity
ShareID *string
}
Scheduling enhancement
- When the scheduler invokes the Allocate function in the allocator, the total allocated capacity is calculated by aggregating the consumedCapacity from the DeviceRequestAllocationResult of all resource claims that have already been allocated.
- Before allocation proceeds, existing selection criteria (defined by alloc.isSelectable) are evaluated. These include the class selector and request selector.
- A new device.allowMultipleAllocations key is introduced in the CEL selector, enabling policies and constraints to recognize whether a device supports allocation by multiple requests.
- If a device is considered selectable, the CmpRequestOverCapacity function is invoked to verify whether the consumed capacity would exceed the device’s remaining capacity. The remaining capacity is calculated based on the sum of already allocated and currently allocating capacities.
  - The consumed capacity is derived from the requested amount specified in the resource claim, adjusted by the device’s capacity request policy, if defined.
  - This value may differ from the originally requested amount; it is rounded up to the nearest valid capacity according to the policy (e.g., using Min + ⌈(Requested - Min)/Step⌉ × Step logic).
- If the device has enough remaining capacity to satisfy the consumed amount, constraint checks are applied. In addition to the existing MatchAttribute, this proposal introduces a new constraint, DistinctAttribute, which ensures attribute uniqueness across allocated devices.
- Once all selection and constraint checks pass, the allocation is valid. The allocation result is updated with:
  - The share identifier (ShareID), which uniquely identifies the allocation on a device.
  - The consumed capacity. This consumed capacity is tracked as part of the device’s allocatingCapacity, allowing it to be included in remaining capacity calculations for future allocations within the same call.
- Finally, the share identifiers and consumed capacities from all internal results are propagated to the DeviceRequestAllocationResult.
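The capacity check inside a single Allocate call can be sketched as follows. This is a simplified illustration, not the allocator's implementation: the `deviceState` type and its methods are hypothetical, quantities are `int64`, and only one capacity per device is tracked. The `release` method assumes the allocator may undo a tentative reservation when it backtracks.

```go
package main

import "fmt"

// deviceState is a hypothetical per-device record kept during one Allocate call.
type deviceState struct {
	Capacity   int64 // fixed value from the ResourceSlice
	Allocated  int64 // aggregated from already allocated claims
	Allocating int64 // reserved so far within the current call
}

// reserve adds the consumed amount to the in-flight total if the device
// still has room for it, and reports whether the reservation succeeded.
func (d *deviceState) reserve(consumed int64) bool {
	if d.Allocated+d.Allocating+consumed > d.Capacity {
		return false
	}
	d.Allocating += consumed
	return true
}

// release undoes a tentative reservation made by reserve.
func (d *deviceState) release(consumed int64) {
	d.Allocating -= consumed
}

func main() {
	eth1 := &deviceState{Capacity: 10, Allocated: 5}
	fmt.Println(eth1.reserve(3)) // true: 5 + 0 + 3 <= 10
	fmt.Println(eth1.reserve(3)) // false: 5 + 3 + 3 > 10
	eth1.release(3)
	fmt.Println(eth1.reserve(5)) // true: 5 + 0 + 5 <= 10
}
```

Tracking the in-flight amount separately is what lets two requests in the same claim compete for one device without over-committing it.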
Handles Device Updates for AllowMultipleAllocations and RequestPolicy
- If a device is updated from dedicated (allowMultipleAllocations: false) to multi-allocatable (allowMultipleAllocations: true), it must continue to behave as a dedicated device and not allow sharing until all existing resource claims for that device are released.
- If a device is updated from multi-allocatable to dedicated, it should no longer be available for new allocations. However, already allocated devices should not be deallocated.
- If the request policy is later set, updated, or unset, the change will apply only to future allocations. No rollback or changes will be applied to shared devices that have already been allocated.
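The first two rules can be sketched as a single eligibility check. The sketch assumes, as the mitigation in Risks and Mitigations describes, that an allocation made while the device was dedicated carries no share ID. The `existingAllocation` type and `canAddShare` function are hypothetical, and capacity checks are left out.

```go
package main

import "fmt"

// existingAllocation is a hypothetical view of a current allocation result.
// An empty ShareID marks an allocation made while the device was dedicated.
type existingAllocation struct {
	ShareID string
}

// canAddShare reports whether a new allocation may be placed on the device,
// given its current allowMultipleAllocations value and existing allocations.
func canAddShare(allowMultipleAllocations bool, existing []existingAllocation) bool {
	if len(existing) == 0 {
		return true // an unallocated device is always a candidate
	}
	if !allowMultipleAllocations {
		return false // dedicated now: keep existing allocations, add none
	}
	for _, a := range existing {
		if a.ShareID == "" {
			return false // still held as a dedicated device
		}
	}
	return true
}

func main() {
	// Device switched to multi-allocatable while a dedicated claim holds it.
	fmt.Println(canAddShare(true, []existingAllocation{{ShareID: ""}})) // false
	// Device shared from the start.
	fmt.Println(canAddShare(true, []existingAllocation{{ShareID: "a671734a"}})) // true
}
```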
Examples
DeviceClass’s selector
selectors:
- cel:
    expression: |-
      device.allowMultipleAllocations == true
ResourceClaim with capacity requirement
kind: ResourceSlice
...
spec:
  driver: guaranteed-cni.dra.networking.x-k8s.io
  devices:
  - name: eth1
    basic:
      allowMultipleAllocations: true
      attributes:
        name:
          string: "eth1"
      capacity:
        bandwidth:
          requestPolicy:
            default: "1Mi"
            validRange:
              min: "1Mi"
              step: "8"
          value: "10Gi"
ResourceClaim’s request
kind: ResourceClaim
...
spec:
  devices:
    requests: # for devices
    - name: nic
      exactly:
        deviceClassName: qos-aware-shared.device.x-k8s.io
        capacity:
          requests: # for resources which must be provided by those devices
            bandwidth: 5Gi
ResourceClaim’s status
kind: ResourceClaim
...
status:
  allocation:
    devices:
      results:
      - consumedCapacity:
          bandwidth: 5Gi
        device: eth1
        shareID: "a671734a-e8e5-11e4-8fde-42010af09327"
        ...
  devices:
  - data:
      cniVersion: 1.1.0
      ips:
      - address: 10.0.103.49/16
    device: eth1
    shareID: "a671734a-e8e5-11e4-8fde-42010af09327"
    ...
ResourceClaim with distinctAttribute
kind: ResourceClaim
...
spec:
  devices:
    requests:
    - name: macvlan-1
      exactly:
        deviceClassName: simple-multialloc.networking.x-k8s.io
        allocationMode: ExactCount
        count: 1
    - name: macvlan-2
      exactly:
        deviceClassName: simple-multialloc.networking.x-k8s.io
        allocationMode: ExactCount
        count: 1
    constraints:
    - requests:
      - macvlan-1
      - macvlan-2
      distinctAttribute: interfaceName
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
API Validations
Sharing Policy (Device Capacity Test)
- Default value is required.
- The default must be included in the options for validValues, or fall within the specified validRange.
- validValues and validRange must not be defined at the same time within a single sharing policy.
- validValues must be a list of unique values.
- validValues must be in ascending order.
- The validValues size should be kept within 10 to avoid excessive growth.
- The minimum must be less than or equal to the maximum in the validRange.
- If a chunk size (Step) is defined, both the default and the maximum must be multiples of the chunk size.
- The minimum, maximum, and (minimum + chunk size) must each be less than or equal to the capacity value.
- If AllowMultipleAllocations of the device is not set or set to false, RequestPolicy for any of its capacity must not be defined.
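Most of the rules listed above can be expressed as one validation function. The sketch below is illustrative only: it uses a flattened, hypothetical `requestPolicy` type with `int64` quantities, treats the range as present when Min is set, and assumes Step must be positive. The rule tying RequestPolicy to AllowMultipleAllocations is left out because it needs the enclosing device.

```go
package main

import (
	"errors"
	"fmt"
)

// requestPolicy is a hypothetical flattened form of CapacityRequestPolicy.
// Min, Max, and Step belong to validRange; Min == nil means no range is set.
type requestPolicy struct {
	Default        *int64
	ValidValues    []int64
	Min, Max, Step *int64
}

// q returns a pointer to v, for building policies inline.
func q(v int64) *int64 { return &v }

func validatePolicy(capacity int64, p requestPolicy) error {
	if p.Default == nil {
		return errors.New("default is required")
	}
	hasValues, hasRange := len(p.ValidValues) > 0, p.Min != nil
	switch {
	case hasValues && hasRange:
		return errors.New("validValues and validRange are mutually exclusive")
	case hasValues:
		return validateValues(*p.Default, p.ValidValues)
	case hasRange:
		return validateRange(capacity, *p.Default, p)
	}
	return nil
}

func validateValues(def int64, values []int64) error {
	if len(values) > 10 {
		return errors.New("validValues must not contain more than 10 entries")
	}
	found := false
	for i, v := range values {
		if i > 0 && v <= values[i-1] {
			return errors.New("validValues must be unique and in ascending order")
		}
		found = found || v == def
	}
	if !found {
		return errors.New("default must be one of validValues")
	}
	return nil
}

func validateRange(capacity, def int64, p requestPolicy) error {
	if *p.Min > capacity || def < *p.Min {
		return errors.New("min must not exceed the capacity or the default")
	}
	if p.Max != nil && (*p.Max > capacity || *p.Min > *p.Max || def > *p.Max) {
		return errors.New("max must cover min and default and stay within capacity")
	}
	if p.Step != nil {
		// Assumption: a non-positive step is rejected to keep rounding well defined.
		if *p.Step <= 0 || (*p.Min)+(*p.Step) > capacity {
			return errors.New("step must be positive and min+step must fit the capacity")
		}
		if def%(*p.Step) != 0 || (p.Max != nil && (*p.Max)%(*p.Step) != 0) {
			return errors.New("default and max must be multiples of step")
		}
	}
	return nil
}

func main() {
	const mi, gi = int64(1) << 20, int64(1) << 30
	policy := requestPolicy{Default: q(mi), Min: q(mi), Step: q(8)}
	fmt.Println(validatePolicy(10*gi, policy))                                    // <nil>
	fmt.Println(validatePolicy(10*gi, requestPolicy{ValidValues: []int64{1, 2}})) // default is required
}
```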
Distinct Attribute
- Similar to the matchAttribute, check for a missing domain and required name (invalid request).
- If the feature gate is enabled, exactly one of matchAttribute or distinctAttribute must be provided.
- If the feature gate is disabled, matchAttribute is required.
Share ID
- When this feature gate and DRAResourceClaimDeviceStatus (KEP 4817) are enabled, the combination keys of driver, pool, device, and shareID in Status.Devices must be a one-to-one mapping with those keys in Status.Allocation.Devices.
- Must be a valid UID
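One way to read the one-to-one requirement is that each status entry must match exactly one allocation result on the composite key, and no two status entries may claim the same result. The sketch below follows that reading, which is an assumption; the `shareKey` type and `statusesMatchResults` function are hypothetical.

```go
package main

import "fmt"

// shareKey is the composite key (driver, pool, device, shareID).
// An empty ShareID stands for an allocation without a share ID.
type shareKey struct {
	Driver, Pool, Device, ShareID string
}

// statusesMatchResults reports whether allocation results have unique keys
// and every status entry pairs with a distinct allocation result.
func statusesMatchResults(results, statuses []shareKey) bool {
	available := make(map[shareKey]bool, len(results))
	for _, k := range results {
		if available[k] {
			return false // duplicate key among allocation results
		}
		available[k] = true
	}
	for _, k := range statuses {
		if !available[k] {
			return false // no matching result, or result already paired
		}
		delete(available, k)
	}
	return true
}

func main() {
	share1 := shareKey{"cni.example.com", "node-1", "eth1", "uid-1"}
	share2 := shareKey{"cni.example.com", "node-1", "eth1", "uid-2"}
	fmt.Println(statusesMatchResults([]shareKey{share1, share2}, []shareKey{share2})) // true
	fmt.Println(statusesMatchResults([]shareKey{share1}, []shareKey{share2}))         // false
}
```

The driver and pool names in the usage example are placeholders. Two shares of the same device differ only in their share ID, which is why it has to be part of the key.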
Create Strategy
- Keep fields if the feature is enabled for ResourceSlice, ResourceClaim, and ResourceClaimTemplate
- Drop fields if the feature is disabled for ResourceSlice, ResourceClaim, and ResourceClaimTemplate
Update Strategy
- Keep existing fields if the feature is enabled for ResourceSlice, ResourceClaim, and ResourceClaimTemplate
- Keep existing fields of ResourceSlice if any feature field is set in the old ResourceSlice.
- Keep existing fields of ResourceClaim and ResourceClaimTemplate if any feature field is set in the old ResourceClaim or ResourceClaimTemplate.
- Keep existing fields of ResourceClaim if any feature field is set in the old ResourceClaim.Status.
- Drop fields if the feature is disabled and the fields have not been used by the old resource as described above.
  - If the DRAPrioritizedList is enabled, the Capacity of DeviceSubRequest in FirstAvailable must be dropped as well.
- The same strategy for ResourceClaim must be followed regardless of DRAResourceClaimDeviceStatus feature enablement.
Allocator
Allow Multiple Allocations
- can allocate a device which allows multiple allocations multiple times
- must not allocate a device which does not allow multiple allocations more than once
- can exclude dedicated devices from allocation with CEL
- can limit allocation to multi-allocatable devices with CEL
- can work with the DRAPartitionableDevices feature.
Consumable Capacity
- can gather consumed capacity from allocated resource claims
- can add/remove consumed capacity of allocating devices
- can round up and compute user-requesting minimum capacity according to request policy range and chunk size
- requested capacity for non-consumable capacity acts like a >= filter
- can work with the DRAPrioritizedList feature’s subrequests.
Distinct Attribute
- can prevent allocating the same device in the same request with a distinct constraint
- can allocate different devices in the same request with a distinct constraint
Coverage
- k8s.io/dynamic-resource-allocation/structured/internal/experimental: 4/2/2026 - 93.1
- k8s.io/dynamic-resource-allocation/structured/internal/incubating: 4/2/2026 - 93.1
- k8s.io/kubernetes/pkg/apis/resource/validation: 4/2/2026 - 96.8
- k8s.io/kubernetes/pkg/registry/resource/resourceclaimtemplate: 4/2/2026 - 72.0
- k8s.io/kubernetes/pkg/registry/resource/resourceclaim: 4/2/2026 - 87.1
- k8s.io/kubernetes/pkg/registry/resource/resourceslice: 4/2/2026 - 77.7
- k8s.io/kubernetes/pkg/kubelet/cm/dra: 4/2/2026 - 83.5
- k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin: 4/2/2026 - 83.5
- k8s.io/kubernetes/pkg/kubelet/cm/dra/state: 4/2/2026 - 44.2
Integration tests
The existing integration tests for kube-scheduler which measure performance will be extended to cover the overhead of running the additional logic to support the features in this KEP.
We extend the test for creating large ResourceSlices to ensure that a ResourceSlice using the new fields satisfies the etcd limits.
e2e tests
We extend the DRA test driver to enable support for this feature and add tests to ensure they are handled by the scheduler as described in this KEP.
The following functionalities should be covered in E2E tests:
- ResourceSlice creation: The ResourceSlice must be created successfully with
AllowMultipleAllocationsand aRequestPolicy. - Pod Scheduling with Available Capacity: A Pod with a resource claim must run successfully when the requested capacity is available.
- Capacity Enforcement: A Pod must stay in Pending state if it requests more than the remaining capacity, even if the request is less than the total capacity.
- Capacity Release and Re-Scheduling: When a Pod is deleted, its reserved capacity must be released, and any pending Pod with a satisfied request must start running.
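The enforcement and release scenarios above amount to simple bookkeeping per device capacity. The sketch below is illustrative only: `Ledger` and its methods are hypothetical names, and `int64` values stand in for the quantities used by the real API.

```go
package main

import "fmt"

// Ledger is a hypothetical tracker for one capacity of one shared device.
type Ledger struct {
	total    int64
	consumed map[string]int64 // consumed amount keyed by claim identifier
}

// NewLedger creates a ledger for a device advertising the given capacity.
func NewLedger(total int64) *Ledger {
	return &Ledger{total: total, consumed: make(map[string]int64)}
}

// Remaining returns the capacity not yet consumed by any claim.
func (l *Ledger) Remaining() int64 {
	rem := l.total
	for _, c := range l.consumed {
		rem -= c
	}
	return rem
}

// Reserve records the amount for the claim if it fits in the remaining
// capacity and reports whether the reservation succeeded.
func (l *Ledger) Reserve(claim string, amount int64) bool {
	if amount < 0 || amount > l.Remaining() {
		return false
	}
	l.consumed[claim] += amount
	return true
}

// Release returns all capacity held by the claim.
func (l *Ledger) Release(claim string) {
	delete(l.consumed, claim)
}

func main() {
	l := NewLedger(10)
	fmt.Println(l.Reserve("pod-a", 5)) // fits
	fmt.Println(l.Reserve("pod-b", 8)) // below total, above remaining
	l.Release("pod-a")
	fmt.Println(l.Reserve("pod-b", 8)) // fits after release
}
```

The second reservation fails even though 8 is below the total of 10, which is the "Capacity Enforcement" case. It succeeds once the first claim is released, which is the "Capacity Release and Re-Scheduling" case.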
Graduation Criteria
Alpha
- Feature implemented behind feature gates (DRAConsumableCapacity). Feature Gates are disabled by default.
- Documentation provided
- Initial unit, integration and e2e tests completed and enabled.
Beta
- Feature Gates are enabled by default.
- No major outstanding bugs.
- 2 examples of real-world use cases.
- CNI DRA driver (kubernetes-sigs/cni-dra-driver) can use this feature to manage and limit bandwidth quota.
- DRA Driver for CPU (kubernetes-sigs/dra-driver-cpu) can use this feature to manage and limit CPU resources.
- Feedback collected from the community (developers and users) with adjustments provided, implemented and tested.
GA
- 2 examples of real-world use cases.
- CNI DRA driver (kubernetes-sigs/cni-dra-driver) can use this feature to manage and limit bandwidth quota.
- Accelerator DRA driver can use this feature for on-demand virtual memory allocation.
- Allowing time for feedback from developers and users.
Upgrade / Downgrade Strategy
In the context of this enhancement, the following strategy is proposed:
All introduced fields are optional and can be omitted if empty. This means that during the upgrade or downgrade process, if certain fields or configurations are not required, they can be left out without causing issues or disrupting the upgrade process.
The general upgrade and downgrade processes will follow the DRA strategy.
The upgrade and downgrade processes of shareID will follow the optional map keys strategy.
The upgrade and downgrade processes of allowMultipleAllocations CEL will follow the VersionOptions method.
Version Skew Strategy
During version skew, where the API server supports the feature but the scheduler does not, the scheduler returns an error if the ResourceClaim contains Capacity, to avoid allocating devices that do not meet the user’s request. If there is no Capacity, the scheduler continues scheduling and ignores the allowMultipleAllocations and requestPolicy fields in ResourceSlice.
If the feature is enabled on the scheduler but disabled on the API server, the scheduler continues scheduling as-is without the feature fields.
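The scheduler-side behavior described here can be summarized in a small sketch. The type and function names are hypothetical and do not come from the scheduler source; only the error text is taken from the log message quoted in the Troubleshooting section.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCapacityFeatureDisabled mirrors the log message quoted under Troubleshooting.
var ErrCapacityFeatureDisabled = errors.New(
	"has capacity requests, but the DRAConsumableCapacity feature is disabled")

// ClaimRequest is a hypothetical, simplified view of one request in a claim.
type ClaimRequest struct {
	Name        string
	HasCapacity bool // true if the request carries capacity requirements
}

// CheckClaim rejects claims with capacity requests when the scheduler does
// not support the feature, so that no under-provisioned device is allocated.
func CheckClaim(featureEnabled bool, requests []ClaimRequest) error {
	if featureEnabled {
		return nil
	}
	for _, r := range requests {
		if r.HasCapacity {
			return fmt.Errorf("request %q: %w", r.Name, ErrCapacityFeatureDisabled)
		}
	}
	return nil
}

// TreatAsMultiAllocatable shows that the slice flag is ignored when the
// feature is off: the device is then handled as exclusively allocatable.
func TreatAsMultiAllocatable(featureEnabled, allowMultipleAllocations bool) bool {
	return featureEnabled && allowMultipleAllocations
}

func main() {
	err := CheckClaim(false, []ClaimRequest{{Name: "nic", HasCapacity: true}})
	fmt.Println(err)
	fmt.Println(TreatAsMultiAllocatable(false, true))
}
```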
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: DRAConsumableCapacity
- Components depending on the feature gate:
- kube-scheduler
- kubelet
- kube-apiserver
- Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
No
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, this feature can be disabled once it has been enabled.
The AllowMultipleAllocations flag, RequestPolicy and Capacity fields will be dropped.
However, the ShareID, ConsumedCapacity, and renamed device (<device id>/<share id>) in device status need to remain to keep the existing allocation result reference valid.
What happens if we reenable the feature if it was previously rolled back?
The fields will be available again for read and write.
However, the previously dropped RequestPolicy, Capacity, and ConsumedCapacity will be missing.
Are there any tests for feature enablement/disablement?
The enablement and disablement of this feature are tested as part of the integration tests. Additionally, the feature enablement/disablement tests cover the scenario where the feature gate is switched from enabled to disabled after an allocation has already been made. In this case, the existing resource claim should remain valid, but the remaining device capacity must no longer be multi-allocatable.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
- Enabling the feature gate will enable the field to be written and therefore invoke validation of the field.
- Disabling the feature gate will drop the ability to consume the capacity in scheduling, so the ConsumedCapacity in the allocation result should also be dropped. If an external party relies on this field to manage QoS-aware devices, it may fail if it has no handler for the missing field.
- Disabling the feature gate is equivalent to unsetting AllowMultipleAllocations and RequestPolicy; the scheduler will handle this as described in this previous section.
What specific metrics should inform a rollback?
Consider a rollback when scheduler_unschedulable_pods{plugin="DynamicResources"} rises unexpectedly, or when the metric scheduler_plugin_execution_duration_seconds{plugin="DynamicResources"} in the kube-scheduler suddenly increases.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
The manual test was performed on a local Kind cluster by manually disabling and enabling the feature gate for all control plane components on the Kind node.
When this feature is enabled, a ResourceClaim with the added fields can be deployed and the driver can advertise 10G of bandwidth. Workloads that request 5G can request capacity from devices that allow multiple allocations, and the consumed capacity is updated in ResourceClaim.Status.
When the feature is disabled, existing workloads continue running, and there is no change to ResourceClaim, including the status of consumed capacity. However, new workloads, requesting 2G, that include a capacity request are rejected and remain in a pending state. Additionally, the fields added by this feature are removed when applying a new ResourceClaimTemplate.
When the feature is re-enabled, new ResourceClaimTemplates can be created with the added fields. The scheduler properly prevents over-provisioning of capacity when another workload requesting 8G is deployed, while allowing a workload requesting only 2G to run with its consumed capacity tracked in ResourceClaim.Status, as intended by this feature, without impacting the first workload that was already running.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
Check the allowMultipleAllocations flag in the resource slice.
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field: ResourceClaim.Status.Allocation.Devices.Results[].ShareID
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Existing DRA and related SLOs continue to apply.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric names:
    - apiserver_request with resource="resourceclaims"
    - scheduler_unschedulable_pods with plugin="DynamicResources"
    - scheduler_plugin_execution_duration_seconds with plugin="DynamicResources"
      - For state gathering, extension_point="PreFilter"
      - For allocation, extension_point="Filter"
      - For status update, extension_point="PostFilter"
  - [Optional] Aggregation method:
  - Components exposing the metric: kube-apiserver, kube-scheduler
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
This feature depends on the DRA structured parameters feature being enabled, and on DRA drivers that support the feature being deployed. This feature also works with DRA device status feature if it is enabled.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
The CapacityRequestPolicy and CapacityRequirements structs will be added to DeviceCapacity in ResourceSlice and to DeviceSubRequest/ExactDeviceRequest in ResourceClaim, respectively.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, when using this field, the user will add additional data in their ResourceSlice, ResourceClaim and ResourceClaimTemplate objects.
This is an incremental increase on top of the existing structures.
Estimated increase in size:
- ~ 10 bytes of boolean pointer per device
- ~ 200-1100 bytes per request policy (max 10 options)
- ~ 100 bytes per capacity per request and allocation result (ResourceSliceMaxAttributesAndCapacitiesPerDevice=32)
- ~ 40 bytes of share ID per resource allocation
- 7 bytes extended name in device name if the device status feature is enabled
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Scheduling a claim that uses this feature may take slightly longer if the scheduler has to aggregate consumed capacity before finding a suitable device. We will measure this in the beta timeframe.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting still applies. The only additional failure modes come from version skew in the cluster, and the troubleshooting steps provided through the link above should be sufficient to determine the cause.
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
kube-scheduler cannot allocate ResourceClaims.
The shared device may not have sufficient capacity to satisfy the request. The log message “Device capacity not enough” and the capacities field in the log “Allocating one device” can provide further clues for investigation (requires -v=7 on kube-scheduler).
If the feature is disabled but a ResourceClaim still requests capacity, the scheduler log will report: “has capacity requests, but the DRAConsumableCapacity feature is disabled”. However, when using the allocator in stable mode, no logs related to the DRAConsumableCapacity feature will be emitted.
What steps should be taken if SLOs are not being met to determine the problem?
N/A
Implementation History
Alpha 1.34:
- Initial implementation merged on 2025-07-30
Alpha 1.35:
- Fix scheduler perf test (simplified) merged on 2025-09-10
- Fix 133705 - failed to schedule next device PR 133706 has been pushed on 2025-08-26
- Fix 134100 - integration with partitionable device PR 134103 has been pushed on 2025-09-17
- Fix 134519 - add ShareID to kubelet plugin API PR 134520 has been pushed on 2025-10-10
- Increase test coverage PR 134615 has been pushed on 2025-10-15
Beta 1.36:
- Fix 136734 - missing GetSharedDeviceIDs bug in GatherAllocatedState has been pushed on 2026-02-04
- Promote DRAConsumableCapacity to Beta PR 136611 has been pushed on 2026-01-29
Drawbacks
This adds complexity to the scheduler.
Alternatives
Identifying multi-allocatable Property of Device
Current Approach:
Use a boolean to indicate whether a device can be shared among multiple resource claims (or requests).
Pros:
- Simple
Cons:
- Implicit infinite sharing if no consuming capacity defined
Alternatives:
- Use an enum, such as Allocatable, with defined values like:
  - AllocatableOnce: device can only be allocated once
  - AllocatableMultipleTimes: device can be allocated multiple times
Pros:
- Provides flexibility for future extension according to Kubernetes API conventions.
Cons:
- Increases the program’s memory footprint compared to a boolean when there is only a single binary option to serve the purpose.
- Use a count field to specify how many times a device can be reallocated to different resource requests.
Pros:
- Simple.
- No implicit infinite sharing.
Cons:
- Not equivalent to the legacy CNI, which places no limit on the number of master devices, as long as the Pod can be successfully created.
Selecting/Deselecting multi-allocatable Devices
Current Approach:
Extend the CEL selector to recognize device.shareable for filtering multi-allocatable devices.
Alternative:
Introduce explicit flags in the resource request:
- AllowShared: Opt-in to allow multi-allocatable devices.
- RequireShared: Only allow multi-allocatable devices.
(Default: multi-allocatable devices are excluded unless explicitly allowed.)
Pros:
- Does not affect dedicated device selection.
- Easier for users to understand and configure, reducing the risk of mistakes.
- More user-friendly than writing CEL expressions manually.
Cons:
- Adds complexity to the allocation logic for multi-allocatable devices.
- Introduces an additional field in resource requests.
- May require an abstraction layer if more device features are added in the future.
- Less explicit and expressive than CEL for advanced use cases.
Preventing Same multi-allocatable Device from Being Allocated Multiple Times in the Same Claim
Current Approach:
Introduce a new API-level constraint: DistinctAttribute, ensuring devices in a single claim have unique attribute values.
Alternative:
The scheduler enforces this behavior implicitly—never allocate the same multi-allocatable device multiple times to the same resource claim.
Pros:
- Avoids any API changes—logic handled internally.
Cons:
- Doesn’t support cases where a pod legitimately permits multiple fractions of capacity from the same multi-allocatable device. For example, when a pod uses two vGPUs for parallel processing, it may not require them to come from different devices. It can accept allocations from either the same or different multi-allocatable devices.
- Not configurable—users can’t override this behavior when needed.
Identifying shared device in the device status
When the same device is allocated multiple times to different requests, ShareID is required to differentiate between the allocations. This is especially useful when the allocations have different statuses set in the AllocatedDeviceStatus.
ShareID, which is intended to serve as part of a composite key for identifying devices, cannot be added to the map keys of the listType because fields used as map keys must be required. Making ShareID required is not an option, as the API feature gate should not introduce required fields.
Current Approach:
Update api-machinery to support adding new keys to listType=map - GitHub Issue
Alternative:
Append /<share id> to the Device name and work around the resulting format in the validation function.
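The <device id>/<share id> naming used by this alternative could be handled with helpers like the ones below. This is a sketch only: the function names are hypothetical, and it assumes that a plain device name never contains a slash.

```go
package main

import (
	"fmt"
	"strings"
)

// SharedDeviceName builds the composite name. An empty share ID leaves the
// device name unchanged, which keeps dedicated allocations as they are.
func SharedDeviceName(device, shareID string) string {
	if shareID == "" {
		return device
	}
	return device + "/" + shareID
}

// ParseSharedDeviceName splits a composite name at the first slash.
// shared is false when the name carries no share ID.
func ParseSharedDeviceName(name string) (device, shareID string, shared bool) {
	return strings.Cut(name, "/")
}

func main() {
	name := SharedDeviceName("eth0", "share-1")
	device, shareID, shared := ParseSharedDeviceName(name)
	fmt.Println(name, device, shareID, shared)
}
```

Validation that expects a plain device name would have to accept the composite form, which is the workaround this alternative refers to.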
Defining a valid range for RequestPolicy
Current Approach:
Define new minimum, maximum, and chunk size fields and implement a function that validates whether a value is within the range.
Alternative:
The range can be defined and validated using a LimitRange match.
Enforcing min/max/default values via LimitRange is a generally useful mechanism.
Alternative words
- Device.AllowMultipleAllocations: The alternative names proposed were: Shareable, Shared, and AllowShared.
- Step: The alternative names proposed were: ChunkSize, StepSize, UnitSize.
- DeviceRequest.Capacity: The alternative names proposed were: Capacities, Capacity.
- DeviceRequest.Capacity.Requests: The alternative names proposed were: Required, Reservation, Consumption, Minimum and Min. Requests was initially dropped because it is already used in the DRA API for device requests. Minimum was selected as an alternative because the actual consumed capacity can be rounded up based on the request policy, for example to match a defined chunk size or meet a minimum requirement. However, Requests was reselected during API review because it is more aligned with the container spec and matches the semantic definition used elsewhere in the API (minimum guaranteed, must be satisfied). This choice requires a clear description to distinguish between requests for devices and requests for resources which must be provided by those devices.
Future Possibilities
RequestPolicy
An allocation strategy can be introduced for each capacity attribute defined in the RequestPolicy. For example, a strategy field could be added to explicitly define the scheduling behavior for a specific capacity:
requestPolicy:
  strategy: ...
For example,
- AlwaysConsumed: The default behavior. A predefined default value is always applied if no capacity is explicitly requested.
- ConsumedOrNever: If the first consumer specifies a capacity request, that capacity becomes consumable. If not, it remains non-consumable until the first consumer releases it.
- BlockOrShare: The inverse of ConsumedOrNever. If the first consumer requests no capacity, it consumes the entire device (i.e., full capacity). If it does specify a capacity request, the device remains multi-allocatable up to the guaranteed amount.
The current default behavior is AlwaysConsumed.
A common RequestPolicy can be defined in the Device struct (similar to mixins) and referenced using new fields named RequestPolicyRef or RequestPolicyName, which are mutually exclusive with the Default field as discussed in this comment’s thread.
Defining an infinite requestPolicy zeroConsumption as a mutually exclusive definition to other valid value policies. This flag is equivalent to {default: 0, validValues{{0}}}. If the request doesn’t contain this capacity entry, zero value is used and Default must not be defined. See this comment for future discussion.
CapacityRequirements
Limits field to describe burstable consumption. Handling burstability would be the responsibility of the individual device driver, similar to how the CPU manager handles CPU burst behavior.