KEP-5517: DRA Node Allocatable Resources
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- API Changes
- Kubelet Admission Control
- Node Resource Enforcement and Isolation
- Use Case Walkthroughs
- Future Enhancements
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes a solution for managing node allocatable resources via Dynamic Resource Allocation (DRA). Node allocatable resources are resources currently reported in v1.Node status.allocatable that are not extended resources (examples include CPU, Memory, Ephemeral-storage, and Hugepages). Currently, when these node allocatable resources are managed via DRA, there is a fundamental disconnect across the control plane and the Node. In the scheduler, having two independent accounting systems (one for standard resources, one for DRA) managing the same underlying resource leads to resource overcommitment. On the node, the kubelet is completely unaware of DRA allocations, which may result in incorrect QoS class assignment and has many downstream implications. This forces users into fragile workarounds that do not work for all use cases.
The proposed solution in this KEP addresses node allocatable resource accounting in the kube-scheduler. The standard resource (NodeResourcesFit plugin) and DRA (DynamicResources plugin) will be enhanced to synchronize their accounting, creating a single, authoritative ledger. The kubelet will also be enhanced to consider the node allocatable resource requests made through both the pod spec and the DRA ResourceClaim to correctly calculate QoS, configure cgroups, and protect high-priority pods. This provides a robust, backward-compatible solution for advanced resource management in Kubernetes.
Motivation
Dynamic Resource Allocation (DRA) provides a powerful framework for managing specialized hardware resources such as GPUs, FPGAs, and high-performance network interfaces. It also enables fine-grained management of node allocatable resources like CPU and Memory, for example, through the dra-driver-cpu. However, although managing a node allocatable resource via DRA allows more detailed requirements to be specified, it creates a fundamental disconnect between the scheduler, the kubelet, and the DRA framework, which breaks the resource guarantees.
Additionally, specialized resources like accelerators often have implicit dependencies on node allocatable resources like CPU or Hugepages for the application to interact with it. Currently, users must manually research and declare these auxiliary node allocatable resource requirements, typically as additional requests in the PodSpec. This process is error-prone and adds complexity to workload configuration. Furthermore, there is no existing mechanism to express critical co-location requirements. For example, there is no way to ensure an accelerator allocated via DRA is NUMA-aligned with the specific hugepages or CPUs it needs, as the standard and DRA resource models are entirely independent.
Core Problem
The core problem is that the same underlying physical resource is advertised and consumed through two parallel, uncoordinated mechanisms.
Dual Publication: A node’s total CPU/Memory capacity is advertised in two different places:
- Via the Kubelet in the Node.Status.Allocatable field.
- Via the DRA driver in ResourceSlice objects.
Dual Consumption: Pods can consume this CPU capacity in two different ways:
- Via pod spec requests (pod.spec.containers[].resources.requests, pod.spec.initContainers[].resources.requests), which are considered in the NodeResourcesFit scheduler plugin to find a Node that fits.
- Via ResourceClaim, which is considered in the DynamicResources scheduler plugin to allocate devices.
Scheduler-Level Resource Oversubscription: The kubelet is the source of truth for a node’s
available resources. The scheduler continuously watches the Node object and uses
Node.Status.Allocatable to maintain an internal, in-memory cache (NodeInfo) of each node’s
capacity. This cache is the baseline for all its scheduling decisions, ensuring it does not place
more pods on a node than the node reports it can handle.
The scheduler, however, is completely blind to the fact that a DRA allocation (such as a CPU
ResourceClaim) draws from the same physical resource as a standard request. This gap leads to the
scheduler overcommitting a node’s CPU resources by scheduling more pods than the node’s resource
capacity allows.
Kubelet-Level Guarantee Failure: The kubelet is the component that enforces resource guarantees on
the node. It determines a pod’s Quality of Service (QoS) class, configures its cgroups, and makes
critical lifecycle decisions like eviction based only on the pod.spec. Because it is unaware of
resources allocated via DRA, it will:
- Misclassify QoS: A pod with a guaranteed CPU ResourceClaim may be misclassified as BestEffort. This would have downstream effects like
  - Apply Incorrect Cgroups: It will set the wrong cpu.shares and cpu.quota, potentially throttling high-performance workloads.
  - Make Incorrect Eviction Decisions: The misclassified pod will be the first to be evicted under node pressure.
  - Incorrect OOM Score calculation.
Current workarounds for DRA-managed node allocatable resources (like CPU DRA driver) force users to
duplicate resource requests in both the ResourceClaim and the standard pod.spec.containers.resources.
However, this approach is fragile, error-prone, and difficult to manage, especially for complex pods
with shared resource claims. It is also incompatible with advanced DRA features like Prioritized Lists.
This KEP proposes to solve this problem by creating a single, unified resource model that spans the entire control plane, from the scheduler to the kubelet. The goal is not just to fix an accounting issue in the scheduler, but to provide a complete, native way for Kubernetes to handle core resources that are backed by DRA.
Goals
- To create a unified accounting model within the kube-scheduler that prevents overcommitment of core resources (like CPU) when they are allocated via both standard pod.spec requests and DRA ResourceClaims.
- To ensure the solution is compatible with different ways node allocatable resources can be represented and allocated within DRA, including as individual devices, consumable capacities (KEP-5075), and partitionable devices (KEP-4815).
- To enable specialized devices, such as accelerators, to declare any auxiliary node allocatable resource requirements (e.g., CPU, Memory) they depend on for their operation.
- To maintain backward compatibility with existing workloads and ecosystem tools that rely on node.status.allocatable and the scheduler’s view of node resource utilization.
Non-Goals
- To move all resource management logic into the DRA driver. The Kubelet will remain the primary agent for cgroup management and QoS enforcement, ensuring that the benefits of its existing stability and lifecycle management features are preserved.
- To replace the standard pod.spec.containers.resources API for requesting node allocatable resources. This KEP aims to enhance the system by adding a clear path for node allocatable resource requests via DRA while ensuring it works coherently with the existing PodSpec-based requests.
- Changes to the Kubelet for QoS classification, cgroup management, and eviction logic based on DRA node allocatable resource allocations are not in scope for the Alpha release of this KEP.
- Interaction with In-Place Pod Resizing and Pod Level Resources will be a non-goal for Alpha. More details are in the Future Enhancements section.
Proposal
This KEP introduces a unified accounting model within the kube-scheduler to integrate node allocatable resources managed
by Dynamic Resource Allocation (DRA) with the scheduler’s standard resource tracking. By bridging the gap
between pod.spec.resources and DRA ResourceClaim allocations, we can achieve consistent resource accounting
and prevent node overcommitment.
Background
To understand the proposed solution, it is essential to first understand how kube-scheduler currently manages standard resource requests and DRA ResourceClaims.
The Kubernetes scheduler is built on a plugin-based framework that executes a series of stages to place
a pod. This KEP is primarily concerned with the interaction between the NodeResourcesFit and
DynamicResources plugins at the PreFilter, Filter, and Bind stages of the scheduling framework.
Standard Resource Accounting
The Kubelet is the source of truth for a node’s available resources. It inspects the machine’s total
capacity, subtracts resources reserved for the operating system (--system-reserved) and Kubernetes
system daemons (--kube-reserved), and reports the result in the Node.Status.Allocatable field. The
scheduler continuously watches for updates to this field and uses it to maintain its internal, in-memory
cache (NodeInfo) of each node’s capacity. This cache is the baseline for all its scheduling decisions.
Kube-Scheduler Resource Accounting
- The scheduler maintains an in-memory NodeInfo object for each node, which stores Allocatable, the capacity of the node, and Requested, an aggregated sum of the resources requested by all pods assumed to be on that node.
- During the Filter stage of scheduling, the NodeResourcesFit plugin checks if a pod’s requested resources can fit on the node (NodeInfo.Allocatable - NodeInfo.Requested >= Pod request).
- The NodeInfo.Requested value is updated by the scheduler framework when a pod is “assumed” on the node. This happens after a node is selected in the Scoring phase, and before the actual binding to the API server, ensuring the cache is accurate for subsequent scheduling decisions.
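The fit check and the cache update described above can be illustrated with a minimal sketch. It is not the scheduler's actual code: the `Resources` and `NodeInfo` types below are simplified, hypothetical stand-ins that track only CPU (in millicores) and memory (in bytes) as plain integers rather than `resource.Quantity` values.

```go
package main

import "fmt"

// Resources is a simplified, hypothetical stand-in for the scheduler's
// resource vector. CPU is in millicores, memory in bytes.
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// NodeInfo mirrors only the two fields discussed in the text.
type NodeInfo struct {
	Allocatable Resources
	Requested   Resources
}

// Fits reports whether Allocatable - Requested >= pod request for every resource.
func (n *NodeInfo) Fits(pod Resources) bool {
	return n.Allocatable.MilliCPU-n.Requested.MilliCPU >= pod.MilliCPU &&
		n.Allocatable.Memory-n.Requested.Memory >= pod.Memory
}

// Assume adds the pod's requests to Requested, as happens once a node is selected.
func (n *NodeInfo) Assume(pod Resources) {
	n.Requested.MilliCPU += pod.MilliCPU
	n.Requested.Memory += pod.Memory
}

func main() {
	node := &NodeInfo{Allocatable: Resources{MilliCPU: 4000, Memory: 8 << 30}}
	pod := Resources{MilliCPU: 3000, Memory: 2 << 30}

	fmt.Println(node.Fits(pod)) // true: the node is empty
	node.Assume(pod)
	fmt.Println(node.Fits(pod)) // false: only 1000m CPU remain
}
```

Because `Requested` is updated at assume time, the second identical pod is filtered out even though nothing has been written to the API server yet.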
Dynamic Resource Allocation (DRA) Accounting
The DynamicResources plugin manages resources requested via pod.spec.resourceClaims. Its accounting
system is entirely separate from the standard resources.
- The DRA drivers on the node report resource availability through ResourceSlice objects.
- During the Filter stage, the DynamicResources plugin determines if the inventory in the ResourceSlice objects is sufficient to satisfy the pod’s ResourceClaim, after accounting for devices already allocated to other claims.
- When a pod is scheduled, the DynamicResources plugin, in its PreBind stage, makes an API call to update the ResourceClaim object’s status. This update makes the allocation permanent and visible to the rest of the cluster.
These standard resources and the dynamic resources accounting systems are completely independent. The
NodeInfo cache is not aware of allocations recorded in ResourceClaim objects, which is the root
cause of the accounting gap for node allocatable resources when they are managed through DRA.
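To make the accounting gap concrete, the following sketch models the two ledgers as independent counters over the same 8 physical CPUs. The type names (`standardLedger`, `draLedger`) and the numbers are hypothetical and chosen only for illustration; each ledger admits a request by checking only its own bookkeeping.

```go
package main

import "fmt"

// standardLedger stands in for NodeInfo: it only sees pod spec requests.
type standardLedger struct{ allocatable, requested int64 }

// draLedger stands in for ResourceSlice/ResourceClaim bookkeeping:
// it only sees CPUs allocated through claims.
type draLedger struct{ devices, allocated int64 }

// admit accepts a pod spec request if the standard ledger has room.
func (l *standardLedger) admit(cpus int64) bool {
	if l.allocatable-l.requested < cpus {
		return false
	}
	l.requested += cpus
	return true
}

// admit accepts a claim if enough CPU devices are still unallocated.
func (l *draLedger) admit(cpus int64) bool {
	if l.devices-l.allocated < cpus {
		return false
	}
	l.allocated += cpus
	return true
}

func main() {
	const physicalCPUs = 8
	std := &standardLedger{allocatable: physicalCPUs}
	dra := &draLedger{devices: physicalCPUs}

	okSpec := std.admit(6)  // pod A: 6 CPUs via pod.spec requests
	okClaim := dra.admit(6) // pod B: 6 CPUs via a ResourceClaim

	total := std.requested + dra.allocated
	fmt.Println(okSpec, okClaim)            // true true: each ledger sees free capacity
	fmt.Println(total, total > physicalCPUs) // 12 true: the node is overcommitted
}
```

Neither check is wrong in isolation; the overcommitment appears only because neither ledger subtracts what the other has handed out.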
User Stories
Story 1 (Resource Alignment): An HPC workload needs a certain number of exclusive CPUs and memory
that are aligned on the same NUMA node as a specific NIC for maximum performance. The user creates a
ResourceClaim with co-location constraints to enforce this. The scheduler correctly accounts for the
CPU and memory requests made through the claim, adding them to the node’s total requested resources, so
the node is not oversubscribed.
Story 2 (Dedicated and Shared resources): A Telco application has some high-priority application
containers and some lower-priority sidecar containers. The user wants to dedicate some CPU cores
exclusively to the application containers for low latency, while allowing sidecar containers to run on
the node’s general shared CPU pool. They use DRA to request exclusive cores and standard pod.spec
requests for the shared CPU portion. The scheduler should correctly account for both dedicated and shared
requests made through these different mechanisms.
Story 3 (Accelerator with Node Allocatable Resource Dependency): An AI inference job requests a GPU through
a ResourceClaim. The specific GPU model also requires a certain number of CPUs and Hugepages for the
application to interact with the accelerator. Instead of requiring the user to know
about these auxiliary CPU and HugePages requests and add them to their PodSpec, the GPU Device can be
configured to declare these dependencies. The Kubernetes scheduler accounts for both the CPU/HugePages
needs for the GPU device and the standard pod spec requests, ensuring the pod lands on a node with
sufficient capacity for all requirements. The user experience is simplified, as they only need to ask
for the primary device they care about.
Story 4 (Fungibility): An ML inference job can use either a full GPU or, if none is available, a
slice of 8 exclusive CPUs. The user creates a ResourceClaim with a firstAvailable list to
represent this fungible need. The scheduler evaluates both paths against a node’s available
resources. It finds a node with 8 available CPUs, correctly reserves them in its central NodeInfo
cache, and schedules the pod. The user did not need to guess which resource to put in the pod.spec.
Risks and Mitigations
- Increased API and user complexity by having two ways to request node allocatable resources (PodSpec and ResourceClaim). To mitigate, the documentation would be enhanced with clear guidelines and use cases for DRA for Node Allocatable Resources.
- Bugs in the kube-scheduler’s new accounting logic would lead to incorrect node resource calculations and node oversubscription. Extensive unit and integration tests covering various resource claim and standard request combinations should help mitigate this. The feature will also be rolled out gradually, beginning with an alpha release to gather feedback and address potential concerns.
- Until Kubelet is made DRA-aware for node allocatable resources (a non-goal for Alpha), QoS and node-level enforcement will not fully reflect DRA allocations. This is an accepted limitation for the initial Alpha scope.
Design Details
The proposal here is to enhance the kube-scheduler to implement a “Unified Accounting” model for node allocatable resources requested through the standard pod Spec or through Dynamic Resource Allocation (DRA) claims. This involves modifications in NodeResourcesFit and DynamicResources plugins in how they track resource usage on the node. This also includes updates to the DRA API for drivers to declare node allocatable resource implications in Device objects, and Pod Status to record DRA-based node allocatable resource allocations. The core principle is that, when a Pod has a node allocatable resource requested through a DRA claim, the responsibility for checking the node resource fit is delegated to the DynamicResources plugin, and standard checks in NodeResourcesFit are bypassed. The delegation should ensure correct resource accounting irrespective of the execution order of these plugins.
API Changes
To support unified accounting for node allocatable resources, this KEP proposes API extensions to the Device object and PodStatus.
Device API Extensions
The new field NodeAllocatableResourceMappings within the ResourceSlice.Device spec is used to define the node allocatable resource quantities.
```go
// In k8s.io/api/resource/v1/types.go
type Device struct {
	// ... existing fields

	// NodeAllocatableResourceMappings defines the mapping of node resources
	// that are managed by the DRA driver exposing this device. This includes resources currently
	// reported in v1.Node `status.allocatable` that are not extended resources
	// (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources).
	// Examples include "cpu", "memory", "ephemeral-storage", and hugepages.
	// In addition to standard requests made through the Pod `spec`, these resources
	// can also be requested through claims and allocated by the DRA driver.
	// For example, a CPU DRA driver might allocate exclusive CPUs or auxiliary node memory
	// dependencies of an accelerator device.
	// The keys of this map are the node-allocatable resource names (e.g., "cpu", "memory").
	// Extended resource names are not permitted as keys.
	// +optional
	// +featureGate=DRANodeAllocatableResources
	NodeAllocatableResourceMappings map[v1.ResourceName]NodeAllocatableResourceMapping `json:"nodeAllocatableResourceMappings,omitempty" protobuf:"bytes,13,opt,name=nodeAllocatableResourceMappings"`
}

// NodeAllocatableResourceMapping defines the translation between the DRA device/capacity
// units requested to the corresponding quantity of the node allocatable resource.
type NodeAllocatableResourceMapping struct {
	// CapacityKey references a capacity name defined as a key in the
	// `spec.devices[*].capacity` map. When this field is set, the value associated with
	// this key in the `status.allocation.devices.results[*].consumedCapacity` map
	// (for a specific claim allocation) determines the base quantity for
	// the node allocatable resource. If `allocationMultiplier` is also set, it is
	// multiplied with the base quantity.
	// For example, if `spec.devices[*].capacity` has an entry "dra.example.com/memory": "128Gi",
	// and this field is set to "dra.example.com/memory", then for a claim allocation
	// that consumes { "dra.example.com/memory": "4Gi" } the base quantity for the
	// node allocatable resource mapping will be "4Gi", and `allocationMultiplier` should
	// be omitted or set to "1".
	// +optional
	CapacityKey *QualifiedName `json:"capacityKey,omitempty" protobuf:"bytes,1,opt,name=capacityKey"`

	// AllocationMultiplier is used as a multiplier for the allocated device count or the allocated capacity in the claim.
	// It defaults to 1 if not specified. How the field is used also depends on whether `capacityKey` is set.
	// 1. If `capacityKey` is NOT set: `allocationMultiplier` multiplies the device count allocated to the claim.
	//    a. A DRA driver representing each CPU core as a device would have
	//       {ResourceName: "cpu", allocationMultiplier: "2"} in its
	//       `nodeAllocatableResourceMappings`. If 4 devices are allocated to the claim,
	//       4 * 2 CPUs would be considered as allocated and subtracted from the node's capacity.
	//    b. A GPU device that needs additional node memory per GPU allocation would
	//       have {ResourceName: "memory", allocationMultiplier: "2Gi"}. Each allocated
	//       GPU device instance of this type will account for 2Gi of memory.
	//
	// 2. If `capacityKey` IS set: `allocationMultiplier` is multiplied by the amount of that capacity consumed.
	//    The final node allocatable resource amount is `consumedCapacity[capacityKey]` * `allocationMultiplier`.
	//    For example, if a Device's capacity "dra.example.com/cores" is consumed,
	//    and each "core" provides 2 "cpu"s, the mapping would be:
	//    {ResourceName: "cpu", capacityKey: "dra.example.com/cores", allocationMultiplier: "2"}.
	//    If a claim consumes 8 "dra.example.com/cores", the CPU footprint is 8 * 2 = 16.
	// +optional
	AllocationMultiplier *resource.Quantity `json:"allocationMultiplier,omitempty" protobuf:"bytes,2,opt,name=allocationMultiplier"`
}
```
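The two translation rules documented on `AllocationMultiplier` can be sketched as a single function. This is only an illustration under simplifying assumptions: `Mapping`, `Allocation`, and `Footprint` are hypothetical names, quantities are plain `int64` values instead of `resource.Quantity`, and an unset field is modelled as the zero value.

```go
package main

import "fmt"

// Mapping is a simplified, hypothetical stand-in for NodeAllocatableResourceMapping.
type Mapping struct {
	CapacityKey          string // empty means "not set"
	AllocationMultiplier int64  // 0 means "not set" and is treated as 1
}

// Allocation describes what one claim received from devices sharing a mapping.
type Allocation struct {
	DeviceCount      int64            // number of device instances allocated
	ConsumedCapacity map[string]int64 // capacity consumed, keyed by capacity name
}

// Footprint returns the node allocatable quantity implied by one mapping:
// consumedCapacity[capacityKey] * multiplier when capacityKey is set,
// otherwise deviceCount * multiplier.
func Footprint(m Mapping, a Allocation) int64 {
	mult := m.AllocationMultiplier
	if mult == 0 {
		mult = 1
	}
	if m.CapacityKey != "" {
		return a.ConsumedCapacity[m.CapacityKey] * mult
	}
	return a.DeviceCount * mult
}

func main() {
	// Case 1a: each device is a core with 2 CPUs; 4 devices allocated.
	fmt.Println(Footprint(Mapping{AllocationMultiplier: 2}, Allocation{DeviceCount: 4})) // 8

	// Case 2: 8 "dra.example.com/cores" consumed, each providing 2 CPUs.
	perCore := Mapping{CapacityKey: "dra.example.com/cores", AllocationMultiplier: 2}
	claim := Allocation{ConsumedCapacity: map[string]int64{"dra.example.com/cores": 8}}
	fmt.Println(Footprint(perCore, claim)) // 16
}
```

Both printed values match the worked examples in the field comments (4 * 2 = 8 and 8 * 2 = 16).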
Resource Representation Examples
The Device API Extension model is flexible enough to support various ways of representing node allocatable resources.
- Node allocatable resource represented as individual devices
```yaml
# DeviceClass
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: cpu-core
spec:
  selectors:
  - cel: 'device.driver == "dra.example.com"'
---
# ResourceSlice
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cpu-slice
spec:
  driver: dra.example.com
  nodeName: my-node
  pool: { name: "node-pool", generation: 1, resourceSliceCount: 1 }
  devices:
  - name: cpu0
    attributes:
      numaNode: 0
    nodeAllocatableResourceMappings:
      cpu:
        allocationMultiplier: "1"
  - name: cpu1
    attributes:
      numaNode: 0
    nodeAllocatableResourceMappings:
      cpu:
        allocationMultiplier: "1"
  # ... other cpu devices
```
  - Each device instance (like cpu0) in the ResourceSlice represents a single unit of CPU.
  - Each Device uses nodeAllocatableResourceMappings to specify its impact on node allocatable resources. The allocationMultiplier field (when capacityKey is not set) indicates the amount of a node allocatable resource per device instance. For example, if cpu0 represents a single CPU thread, this would be “1”. If a device represents a physical CPU core (e.g., with 2 threads), allocationMultiplier would be “2”.
- Node allocatable resource represented as Consumable Pool
  - In this model, a Device in the ResourceSlice acts as a host for a pool of node allocatable resources (e.g., a CPU socket providing 128 cores).
  - By setting allowMultipleAllocations: true on the device, the DRA framework allows multiple ResourceClaims to be allocated against that same device instance simultaneously.
  - This example uses the capacityKey field to link to device.capacity for the resource represented as consumable capacity.
  - When a ResourceClaim is allocated against this device, it might only request a small slice, e.g., 8 CPUs from the 128 CPUs available in dra.example.com/cpu. The nodeAllocatableResourceMappings["cpu"] entry tells the scheduler to look for the 'dra.example.com/cpu' key within that specific claim’s allocation to determine the claim’s CPU footprint. This ensures only the allocated slice, rather than the entire device capacity, is accounted for on the node.
```yaml
# DeviceClass
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: additional-cpu-memory
spec:
  selectors:
  - cel: 'device.driver == "dra.example.com"'
---
# ResourceSlice
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: native-resource-slice
spec:
  driver: dra.example.com
  nodeName: my-node
  pool: { name: "node-pool", generation: 1, resourceSliceCount: 1 }
  devices:
  - name: socket0
    attributes:
      "dra.example.com/type": "socket"
    allowMultipleAllocations: true
    capacity:
      "dra.example.com/cpu": "128"
      "dra.example.com/memory": "256Gi"
    nodeAllocatableResourceMappings:
      cpu:
        capacityKey: "dra.example.com/cpu"
      memory:
        capacityKey: "dra.example.com/memory"
```
- Partitionable Devices
  - In the example below, CPU is represented as a partitionable device with NUMA Node and L3 cache partitions.
  - The node-cpu-counters CounterSet holds the total 128 CPUs.
  - Allocating socket-0-numa-0 would notionally reserve 32 CPUs from the node-cpu-counters counter set.
  - Allocating socket-0-l3-0 consumes 8 CPUs from the same node-cpu-counters.
  - nodeAllocatableResourceMappings.capacityKey links the node allocatable resource accounting to this device-specific capacity.
```yaml
# DeviceClass
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: dra-l3-caches
spec:
  selectors:
  - cel: 'device.driver == "dra.example.com"'
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cpu-counters-slice
spec:
  driver: dra.example.com
  sharedCounters:
  - name: node-cpu-counters
    counters:
      "dra.example.com/cpu": { value: "128" }
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
# ...
spec:
  # ...
  devices:
  - name: socket-0-l3-0
    attributes:
      dra.example.com/type: l3cache
      dra.example.com/numaID: "0"
    capacity:
      "dra.example.com/cpu": "8" # This L3 cache contains 8 CPUs
    consumesCounters:
    - counterSet: node-cpu-counters
      counters:
        "dra.example.com/cpu": "8"
    nodeAllocatableResourceMappings:
      cpu:
        capacityKey: "dra.example.com/cpu"
  # ...
  - name: socket-0-numa-0
    attributes:
      dra.example.com/type: numa
      dra.example.com/numaID: "0"
    capacity:
      "dra.example.com/cpu": "32" # This numa node contains 32 CPUs
    consumesCounters:
    - counterSet: node-cpu-counters
      counters:
        "dra.example.com/cpu": "32"
    nodeAllocatableResourceMappings:
      cpu:
        capacityKey: "dra.example.com/cpu"
```
- Auxiliary node allocatable resource requests for Accelerators
  - The accelerator device uses NodeAllocatableResourceMappings to indicate it needs additional CPU and Memory. These amounts will be added to the pod’s total requests.
  - Importantly, the node allocatable resources specified in NodeAllocatableResourceMappings are not necessarily managed by the DRA driver in the same way as the accelerator itself. Instead, this mechanism primarily serves as an accounting system for the kube-scheduler to not overcommit the node.
```yaml
# DeviceClass
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: ai-accelerators
spec:
  selectors:
  - cel: 'device.driver == "xpu.example.com"'
---
# ResourceSlice
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: my-node-xpus
spec:
  driver: xpu.example.com
  nodeName: my-node
  # ...
  devices:
  - name: xpu-model-x-001
    attributes:
      example.com/model: "model-x"
    nodeAllocatableResourceMappings:
      cpu:
        allocationMultiplier: "2"
      memory:
        allocationMultiplier: "8Gi"
```
Pod API Changes
We add a new field NodeAllocatableResourceClaimStatuses to PodStatus as a way to pass the allocation details from the DynamicResources plugin to the kube-scheduler accounting logic.
```go
// In k8s.io/api/core/v1/types.go

// PodStatus represents information about the status of a pod.
type PodStatus struct {
	// ... existing fields

	// NodeAllocatableResourceClaimStatuses contains the status of node-allocatable resources
	// that were allocated for this pod through DRA claims. This includes resources currently
	// reported in v1.Node `status.allocatable` that are not extended resources
	// (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources).
	// Examples include "cpu", "memory", "ephemeral-storage", and hugepages.
	// +featureGate=DRANodeAllocatableResources
	// +optional
	// +listType=atomic
	NodeAllocatableResourceClaimStatuses []NodeAllocatableResourceClaimStatus
}

// NodeAllocatableResourceClaimStatus describes the status of node allocatable resources allocated via DRA.
type NodeAllocatableResourceClaimStatus struct {
	// ResourceClaimName is the resource claim referenced by the pod that resulted in this node allocatable resource allocation.
	// +required
	ResourceClaimName string `json:"resourceClaimName" protobuf:"bytes,1,opt,name=resourceClaimName"`

	// Containers lists the names of all containers in this pod that reference the claim.
	// +optional
	// +listType=set
	Containers []string `json:"containers,omitempty" protobuf:"bytes,2,rep,name=containers"`

	// Resources is a map of the node-allocatable resource name to the aggregate quantity allocated to the claim.
	// +required
	Resources map[ResourceName]resource.Quantity `json:"resources" protobuf:"bytes,3,rep,name=resources"`
}
```
Kube-Scheduler Workflow
The scheduling process for a Pod involves several stages. The following describes how the NodeResourcesFit and
DynamicResources plugins interact within the kube-scheduler framework to achieve unified accounting for node allocatable resources
managed by DRA. The key goal is to ensure that the delegation mechanism works regardless of the execution order of these
plugins.
PreFilter Stage:
- DynamicResources Plugin: Validates the ResourceClaim and its associated DeviceClass. It ensures that the referenced classes exist.
- NodeResourcesFit Plugin: Calculates and caches the pod’s total standard resource requests (summing up containers). It does not perform fit checks or filter nodes at this stage. For Alpha, as node allocatable resource claims can only add to standard requests, the delegation mechanism between the plugins is optional. Without delegation there is a dual resource fit check in both the NodeResourcesFit and the DynamicResources plugins, but the DynamicResources plugin’s check is the authoritative check. The delegation may become a strict requirement if we introduce non-additive accounting policies.
Filter Stage: This stage performs the node-level checks to determine if a pod fits on a specific node.
- NodeResourcesFit Plugin: In the Alpha stage, this plugin would continue to do the resource fit check based on standard requests.
- DynamicResources Plugin: This plugin takes on the authoritative role for checking node allocatable resource fit if any of the pod’s ResourceClaims request node allocatable resources.
  - The plugin tries to allocate devices to all the resource claims of the pod.
  - Claim Resource Calculation: For each allocated device, we check the Device.NodeAllocatableResourceMappings and determine the amount of each node allocatable resource (CPU, Memory, etc.) associated with the allocated device instance using the CapacityKey or AllocationMultiplier fields.
    - If CapacityKey is not set and AllocationMultiplier is specified, that multiplier is applied to the allocated device count.
    - If CapacityKey is set, the node allocatable resource quantity is derived by looking at the consumed capacity in the claim allocation.
  - The plugin calculates the total effective demand for each node allocatable resource by
    - Summing up container requests from the pod spec requests and the amounts determined from DRA claims.
    - If a claim is referenced by multiple containers, it is accounted for only once.
    - If pod level resources are also specified, that takes precedence and determines the resource footprint of the pod.
  - Validation: the plugin validates the following scenarios
    - If Pod Level Resources are defined, the plugin will validate that the sum of effective requests (standard + DRA claims) does not exceed the budget set at the pod level in pod.spec.resources (details).
    - For Alpha, the plugin would reject a pod with a node allocatable resource claim if the claim is referenced by an existing pod (details).
  - This total effective demand is checked against the node’s allocatable resources, and the node is filtered out if it does not have enough capacity.
  - The calculated node allocatable resource allocations for the pod on this specific node (NodeAllocatableResourceClaimStatus) are stored in the CycleState. This is needed for passing the node-specific allocation details to the later Assume and PreBind stages.
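The effective-demand rule from the Filter stage (sum container requests, then add each claim's footprint exactly once even when several containers reference it) can be sketched as follows. This is a simplified illustration, not the plugin's implementation: `Container` and `EffectiveCPU` are hypothetical names, only CPU in millicores is tracked, and init containers and pod level resources are deliberately left out.

```go
package main

import "fmt"

// Container is a hypothetical, reduced view of a pod spec container.
type Container struct {
	Name     string
	MilliCPU int64    // standard request from the pod spec
	Claims   []string // names of ResourceClaims the container references
}

// EffectiveCPU sums the containers' standard requests and adds the CPU
// footprint of every referenced claim once, regardless of how many
// containers share that claim.
func EffectiveCPU(containers []Container, claimMilliCPU map[string]int64) int64 {
	var total int64
	seen := map[string]bool{}
	for _, c := range containers {
		total += c.MilliCPU
		for _, name := range c.Claims {
			if !seen[name] {
				seen[name] = true
				total += claimMilliCPU[name]
			}
		}
	}
	return total
}

func main() {
	containers := []Container{
		{Name: "app", MilliCPU: 500, Claims: []string{"exclusive-cores"}},
		{Name: "sidecar", MilliCPU: 250, Claims: []string{"exclusive-cores"}},
	}
	// The shared claim was allocated 4 CPUs (4000m) by the DRA driver.
	claims := map[string]int64{"exclusive-cores": 4000}

	demand := EffectiveCPU(containers, claims)
	fmt.Println(demand)         // 4750, not 8750: the shared claim is counted once
	fmt.Println(demand <= 4000) // false: a node with 4 allocatable CPUs is filtered out
}
```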
Scheduler Internal Cache Update: After a node is selected, the scheduler updates its internal cache to reflect the resources consumed by the new pod. This stage is critical for keeping the internal cache consistent. The scheduler framework “assumes” the pod will run on the selected node and updates its cache without waiting for bind (updating the API server) to succeed. Without an “assume” step, the scheduler might try to place other pods on the same node using stale resource information, potentially leading to oversubscription. The Assume phase reserves the resources in the scheduler’s in-memory cache immediately.
- The scheduler framework retrieves the node-specific NodeAllocatableResourceClaimStatus from CycleState, which was populated during the DynamicResources Filter stage.
- This is then applied to the in-memory copy of the Pod object’s status (pod.status.nodeAllocatableResourceClaimStatuses) that the scheduler is about to “assume”. This is passed to the NodeInfo cache update (update()).
- The pod’s effective node allocatable resource demand is calculated based on standard pod requests and node allocatable resource claims as detailed in the Resource Calculation section. This is added to nodeInfo.Requested.
PreBind Stage: This stage performs actions right before the pod is immutably bound to the node.
- DynamicResources Plugin: The plugin updates the ResourceClaim.Status to reflect the allocated devices. It also patches the Pod.Status to add the NodeAllocatableResourceClaimStatuses field, persisting the information calculated during the Filter stage (NodeAllocatableResourceClaimStatus) and making this information available for components (like kubelet).
Bind Stage: This stage executes asynchronously after the main scheduling cycle has decided on a node. The scheduler listens for pod Update events, and transitions the pod from the “assumed” state to “bound” if the bind process succeeded. The resource accounting on the NodeInfo does not change at this point (as the resources were previously accounted for during the “Assume” step). If the bind fails, or if the Kubelet later rejects the Pod, the scheduler detects this and reverts the resource allocation in its cache, decrementing nodeInfo.Requested.
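The assume-then-revert flow described in the stages above can be pictured with a minimal sketch. The type and method names below (nodeCache, assume, forget, fits) are hypothetical stand-ins and not the scheduler's actual cache API; quantities are CPU millicores only.

```go
package main

import "fmt"

// nodeCache is a hypothetical, pared-down stand-in for the scheduler's
// per-node cache entry. It tracks requested CPU in millicores only.
type nodeCache struct {
	allocatable int64
	requested   int64
	assumed     map[string]int64 // pod UID -> effective demand reserved at Assume
}

func newNodeCache(allocatable int64) *nodeCache {
	return &nodeCache{allocatable: allocatable, assumed: map[string]int64{}}
}

// fits reports whether an additional demand still fits on the node.
func (n *nodeCache) fits(demand int64) bool {
	return n.requested+demand <= n.allocatable
}

// assume reserves the pod's effective demand (standard requests plus
// DRA node allocatable resources) before the bind has completed.
func (n *nodeCache) assume(podUID string, demand int64) {
	n.assumed[podUID] = demand
	n.requested += demand
}

// forget reverts the reservation, for example when the bind fails
// or the Kubelet later rejects the pod.
func (n *nodeCache) forget(podUID string) {
	if demand, ok := n.assumed[podUID]; ok {
		n.requested -= demand
		delete(n.assumed, podUID)
	}
}

func main() {
	node := newNodeCache(8000)
	node.assume("pod-a", 4100)   // 100m from the spec + 4 CPUs from a claim
	fmt.Println(node.fits(4100)) // a second identical pod no longer fits
	node.forget("pod-a")         // bind failed: the reservation is reverted
	fmt.Println(node.fits(4100))
}
```

Because the reservation happens at Assume time, a second pod evaluated before the first bind completes already sees the reduced headroom.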
Resource Calculation
To ensure consistent resource accounting across multiple consumers, the core logic for calculating a pod’s total
resource footprint, including DRA-managed node allocatable resources, will be centralized in the PodRequests function within the
k8s.io/component-helpers/resource package. This helper function is currently used by various components, including scheduler
plugins like NodeResourcesFit, NodeInfo cache update, and Kubelet’s admission handler.
The total node allocatable resource requirements for a pod are determined by aggregating the following:
- If pod level resources are specified for a resource, that determines the overall footprint for the pod. The individual container level requests, including requests made through claims, are not considered.
- It iterates through all containers (init and regular) in the pod and determines the aggregate resource request based on existing logic.
- If DRANodeAllocatableResources is enabled and the pod’s status.nodeAllocatableResourceClaimStatuses is populated:
  - Iterate through each claim and obtain the node allocatable resource quantities allocated from nodeAllocatableResourceClaimStatuses[].resources.
  - For each resource, the resources.quantity is added to the pod’s total request.
- If pod overheads are specified in
pod.spec.overhead, they are added to the final sum.
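The aggregation rules above can be sketched as follows. This is a simplified illustration, not the actual PodRequests implementation: the types (resourceList, claimStatus, podInfo) are hypothetical, quantities are plain int64 values (millicores for cpu) rather than resource.Quantity, and container requests are assumed to be already aggregated by the existing init/regular container logic.

```go
package main

import "fmt"

// resourceList maps a resource name to a quantity in base units
// (millicores for "cpu"). Hypothetical simplification of v1.ResourceList.
type resourceList map[string]int64

// claimStatus loosely mirrors one entry of
// pod.status.nodeAllocatableResourceClaimStatuses.
type claimStatus struct {
	resourceClaimName string
	resources         resourceList
}

// podInfo is a pared-down stand-in for the fields of v1.Pod used here.
type podInfo struct {
	podLevelRequests  resourceList   // pod.spec.resources.requests
	containerRequests []resourceList // assumed pre-aggregated per container
	claimStatuses     []claimStatus
	overhead          resourceList // pod.spec.overhead
}

// podRequests computes the pod's total footprint following the rules above.
func podRequests(p podInfo, draEnabled bool) resourceList {
	total := resourceList{}
	for _, c := range p.containerRequests {
		for name, q := range c {
			total[name] += q
		}
	}
	if draEnabled {
		seen := map[string]bool{}
		for _, cs := range p.claimStatuses {
			if seen[cs.resourceClaimName] {
				continue // each claim is accounted for only once per pod
			}
			seen[cs.resourceClaimName] = true
			for name, q := range cs.resources {
				total[name] += q
			}
		}
	}
	// Pod-level requests, where set, replace the aggregated value.
	for name, q := range p.podLevelRequests {
		total[name] = q
	}
	// Overhead is added to the final sum.
	for name, q := range p.overhead {
		total[name] += q
	}
	return total
}

func main() {
	pod := podInfo{
		containerRequests: []resourceList{{"cpu": 1000}, {"cpu": 2000}},
		claimStatuses: []claimStatus{
			{resourceClaimName: "ClaimA", resources: resourceList{"cpu": 4000}},
			{resourceClaimName: "ClaimB", resources: resourceList{"cpu": 2000}},
		},
	}
	fmt.Println(podRequests(pod, true)["cpu"])  // containers plus claims
	fmt.Println(podRequests(pod, false)["cpu"]) // containers only
}
```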
Integration with Pod Level Resources
When Pod Level Resources are specified (pod.spec.resources), it continues to set the overall budget for the pod.
Node allocatable resources added to individual containers via DRA claims must be accounted for within this pod-level budget.
The effective resource request for a container is the sum of its base request specified in spec.containers[].resources.requests
and any additional resources allocated through DRA claims.
Currently, with pod level resources, an admission time validation ensures that the sum of container requests does not
exceed pod level requests. However, this is insufficient for pods with node allocatable resource claims, as their exact quantities
are only determined after the DynamicResources scheduler plugin allocates devices. This allocation can be dynamic,
especially with claims with prioritized lists
(fungibility use cases).
Therefore, the DynamicResources plugin must perform an additional validation step during its Filter stage. After allocating
devices to claims and calculating the node allocatable resources added, the plugin will verify that the total effective pod demand
(standard container requests + DRA node allocatable resources) does not surpass the limits set in pod.spec.Resources.
If a pod requests a specific set of devices via DRA claims, and the resulting node allocatable resource footprint
(base container + DRA additions) exceeds the pod.spec.Resources budget, this failure is global to the pod.
The DynamicResources plugin would return UnschedulableAndUnresolvable.
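As a rough illustration of this Filter-stage check, the sketch below compares the effective demand with the pod-level budget per resource. The function name checkPodLevelBudget and the string status values are assumptions made for the example; the real plugin returns framework status objects and works with resource.Quantity values.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	statusSuccess                      = "Success"
	statusUnschedulableAndUnresolvable = "UnschedulableAndUnresolvable"
)

// checkPodLevelBudget verifies, for every resource that has a pod-level
// budget, that standard container requests plus DRA node allocatable
// resources stay within that budget. Quantities are in base units
// (millicores for cpu). Hypothetical helper for illustration only.
func checkPodLevelBudget(budget, containerRequests, claimResources map[string]int64) (string, string) {
	names := make([]string, 0, len(budget))
	for name := range budget {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order for the failure message
	for _, name := range names {
		effective := containerRequests[name] + claimResources[name]
		if effective > budget[name] {
			return statusUnschedulableAndUnresolvable,
				fmt.Sprintf("%s: effective demand %d exceeds pod-level budget %d", name, effective, budget[name])
		}
	}
	return statusSuccess, ""
}

func main() {
	budget := map[string]int64{"cpu": 11000}
	claims := map[string]int64{"cpu": 10000}

	status, _ := checkPodLevelBudget(budget, nil, claims)
	fmt.Println(status)

	status, reason := checkPodLevelBudget(budget, map[string]int64{"cpu": 2000}, claims)
	fmt.Println(status, reason)
}
```

The failure is reported as unresolvable because no other node can change the outcome: the mismatch is between the pod's own budget and its own claims.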
Handling Shared Claims
Intra-Pod Sharing:
Containers within the same pod can reference the same ResourceClaim. The node allocatable resources associated with the claim are accounted for
only once for the entire pod, as described in the Resource Calculation section. The resource calculation shared library function
PodRequests() can effectively handle de-duplication for claims shared within a single pod, as all necessary information is self-contained
within the Pod scope (standard requests in Spec and DRA requests in status.nodeAllocatableResourceClaimStatuses).
Inter-Pod Sharing:
In the current Alpha scope, sharing ResourceClaims that manage node allocatable resources between different pods is not supported. The DynamicResources plugin
rejects (UnschedulableAndUnresolvable) a pod that references a claim already referenced by an existing pod, for the following reasons:
- When multiple pods, each potentially having its own Pod Level Resources budget (pod.spec.resources), reference the same node allocatable DRA claim, it’s ambiguous how to attribute the cost of these shared node allocatable resources against each pod’s individual resource footprint.
- Node-level cgroups enforcement would be challenging if node allocatable resources can be shared between pods. Dynamically adjusting cgroup settings for all consumer pods as pods referencing the same shared claim start/stop would be extremely complex and hard to support.
A new field NodeAllocatableDRAClaimStates is added in NodeInfo to track the state of node allocatable resource DRA claims on this node. The DynamicResources
plugin will use NodeInfo.NodeAllocatableDRAClaimStates during the Filter stage (validation step) to check if the ResourceClaim is assigned to an existing pod.
// In pkg/scheduler/framework/types.go
type NodeInfo struct {
	// ... existing fields

	// NodeAllocatableDRAClaimStates tracks the state of claims requesting node allocatable resources.
	// The key is the NamespacedName of the ResourceClaim.
	NodeAllocatableDRAClaimStates map[types.NamespacedName]*NodeAllocatableDRAClaimState
}

// NodeAllocatableDRAClaimState holds information about a node allocatable resource DRA claim's allocation on a node.
type NodeAllocatableDRAClaimState struct {
	// Pods using this claim on this node.
	ConsumerPods sets.Set[types.UID]
}
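A minimal sketch of how the Filter-stage check might consult this state is shown below. It uses only standard-library types: namespacedName and claimState are simplified stand-ins for types.NamespacedName and NodeAllocatableDRAClaimState, and usedByAnotherPod is a hypothetical helper name, not part of the proposal.

```go
package main

import "fmt"

// namespacedName is a simplified stand-in for types.NamespacedName.
type namespacedName struct{ Namespace, Name string }

// claimState is a simplified stand-in for NodeAllocatableDRAClaimState;
// a plain map plays the role of sets.Set[types.UID].
type claimState struct {
	consumerPods map[string]struct{}
}

// usedByAnotherPod reports whether the claim is already consumed on this
// node by a pod other than podUID. In the Alpha scope such a pod would be
// rejected with UnschedulableAndUnresolvable.
func usedByAnotherPod(states map[namespacedName]*claimState, claim namespacedName, podUID string) bool {
	state, ok := states[claim]
	if !ok {
		return false
	}
	for uid := range state.consumerPods {
		if uid != podUID {
			return true
		}
	}
	return false
}

func main() {
	claim := namespacedName{Namespace: "default", Name: "cpu-claim"}
	states := map[namespacedName]*claimState{
		claim: {consumerPods: map[string]struct{}{"pod-a": {}}},
	}
	fmt.Println(usedByAnotherPod(states, claim, "pod-a")) // same pod: allowed
	fmt.Println(usedByAnotherPod(states, claim, "pod-b")) // different pod: rejected
}
```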
Multiple Claims per Container
A single container can reference multiple DRA claims. The node allocatable resources from each distinct claim are summed up to contribute to the pod’s total resource requirements.
Example:
- Combining additive policies.
  ClaimA - requests 4 CPUs
  ClaimB - requests 2 CPUs
  - Pod 1
    - Container “c1”
      - Spec: requests 1 CPU
      - claims: ClaimA, ClaimB
    - Container “c2”
      - Spec: requests 2 CPUs
      - claims: ClaimA
  - Result:
    - Pod Effective CPU = 1 (c1 PodSpec) + 4 (ClaimA) + 2 (ClaimB) + 2 (c2 PodSpec) = 9 CPUs.
    - ClaimA is accounted for only once.
Unreferenced Claims
If a ResourceClaim is listed in pod.spec.resourceClaims but not referenced by any container in pod.spec.containers[*].resources.claims,
the resources associated with this claim are still accounted for once against the node’s capacity. This is because the DRA allocator allocates
the devices to the claim, making them unavailable to others (e.g., exclusive CPUs requested through a claim). This will be enforced in the
PodRequests() helper function when computing the pod resource footprint.
Kubelet Admission Control
The Kubelet has its own admission check (AdmissionCheck)
to ensure a pod can run on the node, even after the scheduler has placed it. It utilizes the PodRequests() function from
the k8s.io/component-helpers/resource package. This shared helper has been enhanced to support unified accounting. When
calculating a pod’s requirements, it aggregates the standard requests from pod Spec with the DRA allocations recorded in
pod.status.nodeAllocatableResourceClaimStatuses. Because the scheduler populates this status field during the PreBind stage, the
Kubelet validates the pod’s comprehensive resource footprint.
Node Resource Enforcement and Isolation
In the Alpha phase, the Kubelet does not account for node allocatable resources requested through DRA for QoS class determination, cgroup management,
and eviction decisions. These mechanisms solely rely on the requests and limits specified in the pod.spec.containers[*].resources
or pod.spec.initContainers[*].resources. This creates a discrepancy where a user may specify node allocatable resource requests through ResourceClaims,
but the Kubelet enforces runtime limits based solely on the pod.spec.
- A pod requesting CPU/Memory via DRA claims may be classified as BestEffort (no CPU/Memory requests or limits in its pod spec), or as Burstable (limits greater than request), as the DRA-provided resources are not considered in the QoS calculation. The QoS class directly determines the pod’s parent directory within the cgroup filesystem hierarchy. This hierarchical directory structure is critical for enforcing resource controls in the Linux kernel.
- Kubelet currently sets CPU and memory cgroup settings only based on pod spec. This would result in incorrect runtime enforcement. For CPU, the container could get low CPU shares or could be incorrectly throttled. For memory, if the memory allocation exceeds the limit in the spec, the container could be OOM killed.
- To prevent a critical system daemon from failing to start, the Kubelet will preempt pods on its node to free up the required resources. This decision is based primarily on QoS Class. Pods with DRA node allocatable resource requests but a low QoS class (BestEffort or Burstable) would have a higher risk of being evicted under node resource pressure.
Mitigation:
- Define an overall pod budget using pod.spec.resources. The Kubelet uses this to compute the QoS class and set the overall cgroup limits for the pod. The pod’s actual runtime usage on the node is bounded by the pod level limits.
- If using container level requests and limits, the user must increase the container limits to be equal to or greater than the sum of the base container request in the spec and the DRA claim request. This would result in the pod being classified as Burstable (limit > request). This ensures the Kubelet sets the cgroup limit high enough to allow full usage of the DRA resource, preventing throttling or OOMs. The request in the spec need not include the claim request, as it is already accounted for by the scheduler.
- For critical infrastructure (e.g., the DRA driver DaemonSet itself), set priorityClassName in the pod.spec to system-node-critical or system-cluster-critical to reduce the risk of eviction. The high priority class ensures the pod is evaluated last for eviction among all workloads exceeding their requests.
In a future Alpha or Beta stage, the Kubelet will natively calculate effective requests and limits by combining the standard request from the pod spec with the DRA claim, and will configure node level settings such as the QoS class and cgroup settings accordingly.
Use Case Walkthroughs
Use Case 1: Pod with Standard and DRA CPU and Memory Request
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node1-slice
spec:
  driver: dra.example.com
  nodeName: node1
  devices:
  - name: socket0
    attributes: {"dra.example.com/type": "socket"}
    allowMultipleAllocations: true
    capacity:
      "dra.example.com/cpu": "128"
      "dra.example.com/memory": "256Gi"
    nodeAllocatableResourceMappings:
      cpu:
        quantityFrom:
          capacity: "dra.example.com/cpu"
      memory:
        quantityFrom:
          capacity: "dra.example.com/memory"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: cpu-mem-claim
spec:
  devices:
    requests:
    - name: cpu-mem-req
      exactly:
        deviceClassName: cpu-mem-socket
        capacity:
          requests:
            "dra.example.com/cpu": "4"
            "dra.example.com/memory": "8Gi"
---
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: dra-pod
spec:
  containers:
  - name: my-app1
    image: my-image
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      claims:
      - name: "my-cpu-mem-claim"
  - name: my-app2
    image: my-image-2
    resources:
      claims:
      - name: "my-cpu-mem-claim"
  resourceClaims:
  - name: "my-cpu-mem-claim"
    resourceClaimName: cpu-mem-claim
Expected behavior:
- NodeResourcesFit: Checks node capacity against standard container requests {cpu: 100m, memory: 100Mi}.
- DynamicResources: Allocates from the socket0 device in node1-slice.
  - DRA Node Allocatable Resources: {cpu: 4, memory: 8Gi} from claim my-cpu-mem-claim.
  - Standard Container Requests: {cpu: 100m, memory: 100Mi} from my-app1.
  - Effective Pod Demand: {cpu: 4100m, memory: 8292Mi}
  - Checks node capacity against Effective Pod Demand.
- Scheduler Cache Update: Node’s requested resources increase by 4.1 CPU and 8292Mi (about 8.1Gi) Memory.
Use Case 2: Pod with Fungible Resource Claim (GPU or CPU)
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node1-slice
spec:
  driver: dra.example.com
  nodeName: node1
  pool: {name: node1-pool, generation: 1, resourceSliceCount: 1}
  devices:
  - name: socket0
    attributes: {"dra.example.com/type": "socket"}
    allowMultipleAllocations: true
    capacity:
      "dra.example.com/cpu": "128"
    nodeAllocatableResourceMappings:
      cpu:
        quantityFrom:
          capacity: "dra.example.com/cpu"
---
# ResourceSlice for GPUs
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: node1-gpus
spec:
  driver: gpu.example.com
  nodeName: node1
  pool: {name: node1-pool, generation: 1, resourceSliceCount: 1}
  devices:
  - name: gpu0
---
# ResourceClaimTemplate for Fungibility
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-or-cpu-template
spec:
  spec:
    devices:
      requests:
      - name: gpu-or-cpu-req
        firstAvailable:
        - name: gpu
          deviceClassName: gpu-class
          count: 1
        - name: cpu
          deviceClassName: cpu-class
          capacity:
            requests:
              "dra.example.com/cpu": "30"
---
apiVersion: v1
kind: Pod
metadata:
  name: fungible-pod
spec:
  containers:
  - name: my-app
    image: my-image
    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      claims:
      - name: "gpu-or-cpu"
  resourceClaims:
  - name: "gpu-or-cpu"
    resourceClaimTemplateName: gpu-or-cpu-template
status:
  nodeAllocatableResourceClaimStatuses: # Populated only if CPU device is selected
  - claimInfo:
      name: gpu-or-cpu
    containers:
    - my-app
    resources:
    - resourceName: cpu
      quantity: 30
Expected behavior:
- NodeResourcesFit: Checks node capacity against standard container requests {cpu: 1, memory: 1Gi}.
- DynamicResources:
  - Scenario A: GPU Selected
    - DRA Node Allocatable Resources: None
    - Standard Container Requests: {cpu: 1, memory: 1Gi}
    - Effective Pod Demand: {cpu: 1, memory: 1Gi}
    - Checks node capacity against Effective Pod Demand.
    - Scheduler Cache Update: Node requested increases by 1 CPU, 1Gi Memory.
  - Scenario B: CPU Selected
    - DRA Node Allocatable Resources: {cpu: 30} from claim gpu-or-cpu.
    - Standard Container Requests: {cpu: 1, memory: 1Gi}
    - Effective Pod Demand: {cpu: 31, memory: 1Gi}
    - Checks node capacity against Effective Pod Demand.
    - Scheduler Cache Update: Node requested increases by 31 CPU, 1Gi Memory.
Use Case 3: Combined Native (DRA CPU) and Auxiliary Request (GPU)
# --- gpu-claim, cpu-claim, and ResourceSlices defined as before ---
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: combined-dra-pod
spec:
  containers:
  - name: my-app1
    image: my-image1
    resources:
      requests:
        cpu: "100m"
        memory: "1Gi"
      claims:
      - name: "my-cpu-claim"
      - name: "my-gpu-claim"
  - name: my-app2
    image: my-image2
    resources:
      requests:
        cpu: "200m"
        memory: "2Gi"
      claims:
      - name: "my-cpu-claim"
      - name: "my-gpu-claim"
  resourceClaims:
  - name: "my-cpu-claim"
    resourceClaimName: cpu-claim
  - name: "my-gpu-claim"
    resourceClaimName: gpu-claim
Expected Behavior:
- NodeResourcesFit: Checks node capacity against standard container requests {cpu: 300m, memory: 3Gi}.
- DynamicResources:
  - DRA Node Allocatable Resources:
    - my-cpu-claim: {cpu: 10}
    - my-gpu-claim: {cpu: 2, memory: 4Gi} (Auxiliary)
  - Standard Container Requests:
    - my-app1: {cpu: 100m, memory: 1Gi}
    - my-app2: {cpu: 200m, memory: 2Gi}
  - Effective Pod Demand:
    - CPU: 100m + 200m + 10 (my-cpu-claim) + 2 (my-gpu-claim) = 12.3 CPU
    - Memory: 1Gi + 2Gi + 4Gi (my-gpu-claim) = 7Gi
  - Checks node capacity against Effective Pod Demand.
- Scheduler Cache Update: Node’s requested increases by 12.3 CPU and 7Gi Memory.
Use Case 4: Pod Level Resources with shared CPU DRA Claim and sidecars
This use case demonstrates using PLR to set the overall CPU budget for a pod, where two containers share a DRA claim for 10 dedicated CPUs, and two sidecar containers run with best-effort in-pod placement.
# --- cpu-req-10-cpus ResourceClaim defined as before ---
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: dra-pod-with-plr-besteffort-sidecars
spec:
  resources:
    requests:
      cpu: "11" # 10 from shared claims + additional 1 for all the sidecars
      memory: "10Gi"
    limits:
      cpu: "11"
      memory: "10Gi"
  containers:
  - name: my-app1
    image: my-image1
    resources:
      claims:
      - name: "cpu-req-10-cpus"
  - name: my-app2
    image: my-image-2
    resources:
      claims:
      - name: "cpu-req-10-cpus"
  - name: sidecar-container-1
    image: my-image-3
  - name: sidecar-container-2
    image: my-image-4
  resourceClaims:
  - name: "cpu-req-10-cpus"
    resourceClaimName: cpu-req-10-cpus
Expected Behavior:
- NodeResourcesFit: Checks node capacity against PLR {cpu: 11, memory: 10Gi}.
- DynamicResources:
  - DRA Node Allocatable Resources: {cpu: 10} from cpu-req-10-cpus.
  - Standard Container Requests: {cpu: 11, memory: 10Gi} (pod level requests take precedence).
  - Total effective demand for PLR check: {cpu: 11, memory: 10Gi} (pod level requests take precedence).
  - Checks node capacity against PLR {cpu: 11, memory: 10Gi} (This check is redundant as NodeResourcesFit already does it).
- Scheduler Cache Update: Node’s requested resources increase by 11 CPU and 10Gi Memory.
Future Enhancements
Kubelet QoS and Cgroup Management
As noted in the Non-Goals, full Kubelet awareness of DRA node allocatable resources for QoS classification and cgroup management is not in scope for the first Alpha. This work will involve:
- Updating Kubelet’s QoS class calculation to include node allocatable resources from pod.status.nodeAllocatableResourceClaimStatuses.
- Ensuring Kubelet’s cgroup manager correctly configures CPU and Memory limits/shares based on the sum of PodSpec requests and DRA-provided node allocatable resources.
- Aligning eviction thresholds with the true resource footprint, including DRA.
Kube-Scheduler Scoring and Resource Quota
Scoring: In the current Alpha, the NodeResourcesFit plugin’s scoring only considers node allocatable resources requested directly in the pod Spec.
It does not yet account for node allocatable resources allocated through DRA claims. The DynamicResources plugin’s scoring is based on the
DRA allocation decisions themselves and is independent of the node allocatable resource quantities involved. It may be desirable to unify the
scoring for node allocatable resources instead of independently scoring in two different plugins.
Quota: Currently, ResourceQuota only accounts for resources defined in the pod.spec. Including node allocatable resources allocated via DRA
in ResourceQuota enforcement is not in the Alpha scope.
Enhancing Scoring and ResourceQuota to be aware of DRA node allocatable resources should be considered for a future milestone.
These components rely on the same PodRequests() helper function (from k8s.io/component-helpers/resource) used by the scheduler
framework and plugins to calculate resource footprints. Integrating DRA node allocatable resources would involve ensuring this helper is called
with the appropriate options to include pod.status.nodeAllocatableResourceClaimStatuses. The implications of this change need to be discussed.
Integration with In-Place Pod Vertical Scaling
In-Place Pod Vertical Scaling allows updating a container’s resource requests and limits without restarting the pod. The Kubelet actuates these changes by updating the container’s cgroup settings to match the new values in the PodSpec.
Kubelet, in this Alpha, does not account for node allocatable resources allocated via DRA (i.e., from pod.status.nodeAllocatableResourceClaimStatuses)
when setting container-level cgroups. For alpha, any attempt to use the In-Place Pod Resizing /resize subresource on a Pod that
has entries in pod.status.nodeAllocatableResourceClaimStatuses will be rejected by the API server. Validation will be added to the /resize
subresource handler to enforce this. Integration of In-Place Pod Resizing with DRA Node Allocatable Resources will be addressed during future KEP
iterations along with the Kubelet enhancements to consider pod.status.nodeAllocatableResourceClaimStatuses when calculating and enforcing
container and pod level cgroup settings.
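The Alpha restriction on /resize could look roughly like the sketch below. The podStatus type and the validateResize function are hypothetical names for illustration; the real validation would live in the API server's /resize subresource handler and operate on v1.Pod.

```go
package main

import (
	"errors"
	"fmt"
)

// podStatus is a pared-down stand-in for v1.PodStatus; only the names of
// the claims recorded in nodeAllocatableResourceClaimStatuses are kept.
type podStatus struct {
	nodeAllocatableResourceClaimStatuses []string
}

var errResizeNotSupported = errors.New(
	"in-place resize is not supported for pods with DRA node allocatable resource claims")

// validateResize rejects a resize request when the pod has any entries in
// status.nodeAllocatableResourceClaimStatuses (hypothetical helper).
func validateResize(status podStatus) error {
	if len(status.nodeAllocatableResourceClaimStatuses) > 0 {
		return errResizeNotSupported
	}
	return nil
}

func main() {
	plain := podStatus{}
	withClaims := podStatus{nodeAllocatableResourceClaimStatuses: []string{"cpu-claim"}}

	fmt.Println(validateResize(plain))      // no DRA entries: resize allowed
	fmt.Println(validateResize(withClaims)) // DRA entries present: rejected
}
```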
Accounting Policies
The Alpha release of this KEP implements an implicit node allocatable resource accounting policy: any node allocatable resource
quantities specified in the NodeAllocatableResourceMappings of allocated devices are added to the pod’s total resource
requirements, accounted for once per ResourceClaim.
Future enhancements could introduce explicit Node Allocatable Resource Accounting Policies to provide more control over
how DRA-based node allocatable resources are aggregated with standard PodSpec requests. This would likely involve adding new
fields, such as AccountingPolicy, to the NodeAllocatableResourceMapping struct to specify the desired policy. The impact of
these accounting policies on existing features like Pod Level Resources and In-Place Pod Vertical Scaling also
needs more consideration.
API with Accounting Policy
Device Class
// NodeAllocatableResourceAccountingPolicy defines how node allocatable resource quantities like CPU, Memory
// allocated via DRA are aggregated with standard resource requests in the PodSpec.
type NodeAllocatableResourceAccountingPolicy string

const (
	// PolicyAddPerClaim indicates that the node allocatable resource quantity in the DRA claim
	// is treated as additional to the pod spec requests. This quantity is accounted
	// for exactly once per claim instance, regardless of the number of containers referencing it.
	// This applies whether those referencing containers belong to a single pod or are across different pods.
	PolicyAddPerClaim NodeAllocatableResourceAccountingPolicy = "AddPerClaim"
	// PolicyAddPerReference indicates that the node allocatable resource quantity in the DRA
	// claim is treated as additional to the pod spec requests. This quantity is
	// accounted for cumulatively for every reference to the claim.
	// Each container that references the claim adds the claim's quantity to its
	// node allocatable resource request in the pod spec.
	PolicyAddPerReference NodeAllocatableResourceAccountingPolicy = "AddPerReference"
	// PolicyMax indicates that the effective request is the greater value between the standard container
	// request and the DRA claim for the same resource.
	PolicyMax NodeAllocatableResourceAccountingPolicy = "Max"
	// PolicyConsumeFrom indicates that a DRA claim is defined to represent the node
	// resource pool capacity. All containers or pods referencing the claim are satisfied from the capacity pool defined by the DRA
	// claim. Pods access this pool by referencing the corresponding `ResourceClaim` in their
	// `spec.containers[].resources.claims`. The scheduler ensures that the sum of requests from all
	// containers sharing this claim on a node does not exceed the pool's capacity. The entire pool
	// capacity is reserved on the node, making it unavailable for other pods outside this pool.
	PolicyConsumeFrom NodeAllocatableResourceAccountingPolicy = "ConsumeFrom"
)

// In k8s.io/api/resource/v1/types.go
type DeviceClassSpec struct {
	// ManagesNativeResources indicates if devices of this class manage node allocatable resources like cpu, memory and/or hugepages.
	// +optional
	// +featureGate=DRANodeAllocatableResources
	ManagesNativeResources bool
	// NodeAllocatableResourceAccountingPolicies defines how the node allocatable resource represented by the devices
	// in this class should be accounted for and aggregated with any standard request for the same resource
	// in the pod spec (`pod.spec.containers[].resources.requests` or `pod.spec.initContainers[].resources.requests`).
	// If an accounting policy is also defined in a Device mapping, that device-specific policy takes
	// precedence. The map's key is the node allocatable resource name (e.g., "cpu", "memory", "hugepages-1Gi").
	// +optional
	// +featureGate=DRANodeAllocatableResources
	NodeAllocatableResourceAccountingPolicies map[ResourceName]NodeAllocatableResourceAccountingPolicy
}
Device
// In k8s.io/api/resource/v1/types.go
type Device struct {
	// ... existing fields

	// NodeAllocatableResourceMappings defines the mapping of node resources
	// that are managed by the DRA driver exposing this device. This includes resources currently
	// reported in v1.Node `status.allocatable` that are not extended resources
	// (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources).
	// Examples include "cpu", "memory", "ephemeral-storage", and hugepages.
	// In addition to standard requests made through the Pod `spec`, these resources
	// can also be requested through claims and allocated by the DRA driver.
	// For example, a CPU DRA driver might allocate exclusive CPUs or auxiliary node memory
	// dependencies of an accelerator device.
	// The keys of this map are the node-allocatable resource names (e.g., "cpu", "memory").
	// Extended resource names are not permitted as keys.
	// +optional
	// +featureGate=DRANodeAllocatableResources
	NodeAllocatableResourceMappings map[v1.ResourceName]NodeAllocatableResourceMapping `json:"nodeAllocatableResourceMappings,omitempty" protobuf:"bytes,13,opt,name=nodeAllocatableResourceMappings"`
}

type NodeAllocatableResourceMapping struct {
	CapacityKey          *QualifiedName     `json:"capacityKey,omitempty" protobuf:"bytes,1,opt,name=capacityKey"`
	AllocationMultiplier *resource.Quantity `json:"allocationMultiplier,omitempty" protobuf:"bytes,2,opt,name=allocationMultiplier"`
}
Pod Status
// In k8s.io/api/core/v1/types.go
// PodStatus represents information about the status of a pod.
type PodStatus struct {
	// ... existing fields

	// NodeAllocatableResourceClaimStatuses contains the status of node-allocatable resources
	// that were allocated for this pod through DRA claims. This includes resources currently
	// reported in v1.Node `status.allocatable` that are not extended resources
	// (see https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#extended-resources).
	// Examples include "cpu", "memory", "ephemeral-storage", and hugepages.
	// +featureGate=DRANodeAllocatableResources
	// +optional
	// +listType=atomic
	NodeAllocatableResourceClaimStatuses []NodeAllocatableResourceClaimStatus `json:"nodeAllocatableResourceClaimStatuses,omitempty" protobuf:"bytes,21,rep,name=nodeAllocatableResourceClaimStatuses"`
}

// NodeAllocatableResourceClaimStatus describes the status of node allocatable resources allocated via DRA.
type NodeAllocatableResourceClaimStatus struct {
	ResourceClaimName string                             `json:"resourceClaimName" protobuf:"bytes,1,opt,name=resourceClaimName"`
	Containers        []string                           `json:"containers,omitempty" protobuf:"bytes,2,rep,name=containers"`
	Resources         map[ResourceName]resource.Quantity `json:"resources" protobuf:"bytes,3,rep,name=resources"`
}
Accounting Policy Precedence
When determining the AccountingPolicy for a node allocatable resource from a DRA claim:
- The AccountingPolicy specified within the Device.NodeAllocatableResourceMappings for the specific ResourceName takes highest precedence.
- If the AccountingPolicy is not set in the Device mapping, the policy is taken from the DeviceClass.Spec.NodeAllocatableResourceAccountingPolicies map for the matching ResourceName.
- If no policy is found in either location for a ResourceName that has a quantity defined in the Device mapping, it is considered an error, and the device will not be allocatable for the claim.
This model supports both Admin-Defined Policy and Driver-Defined Policy:
- Admin-Defined Policy (e.g., CPU/Memory): A CPU/Memory DRA driver can publish Device objects with NodeAllocatableResourceMappings containing only the QuantityFrom, leaving the AccountingPolicy field unset. The cluster administrator then defines the desired accounting behavior (e.g., AddPerReference, AddPerClaim) by creating DeviceClass objects with appropriate entries in NodeAllocatableResourceAccountingPolicies. This allows different consumption models for the same underlying CPU resources, controlled by the admin.
- Driver-Defined Policy (e.g., Accelerators): An accelerator driver (e.g., for GPUs) often knows the exact auxiliary resources (like CPU or Memory) required and the most appropriate accounting method. The driver can specify both the QuantityFrom and the AccountingPolicy (e.g., AddPerReference) directly in the Device.NodeAllocatableResourceMappings.
This combined approach provides flexibility, allowing the policy to be defined at the most appropriate level.
If a NodeAllocatableResourceMapping entry exists for a resource but AccountingPolicy is missing from both the Device mapping and the DeviceClass, this is an invalid configuration. The scheduler will fail to schedule the pod referencing the claim.
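The precedence order can be sketched as a small lookup. The function name resolvePolicy and the use of plain strings for policies and resource names are assumptions made for this example; an empty devicePolicy stands for an unset AccountingPolicy in the Device mapping.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoPolicy = errors.New("no accounting policy found for resource")

// resolvePolicy applies the precedence described above:
//  1. the policy set in the Device mapping, if any;
//  2. otherwise the DeviceClass entry for the resource name;
//  3. otherwise an error, and the device is not allocatable for the claim.
func resolvePolicy(resourceName, devicePolicy string, classPolicies map[string]string) (string, error) {
	if devicePolicy != "" {
		return devicePolicy, nil
	}
	if p, ok := classPolicies[resourceName]; ok && p != "" {
		return p, nil
	}
	return "", fmt.Errorf("%w: %s", errNoPolicy, resourceName)
}

func main() {
	classPolicies := map[string]string{"cpu": "AddPerClaim"}

	p, _ := resolvePolicy("cpu", "AddPerReference", classPolicies)
	fmt.Println(p) // driver-defined policy wins

	p, _ = resolvePolicy("cpu", "", classPolicies)
	fmt.Println(p) // falls back to the admin-defined DeviceClass policy

	_, err := resolvePolicy("memory", "", classPolicies)
	fmt.Println(err) // neither level defines a policy
}
```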
Resource Representation
- Node Allocatable Resource as a Consumable Pool in ResourceClaim
The device in ResourceSlice represents a consumable pool with AccountingPolicy set to ConsumeFrom.
When the device is assigned to a ResourceClaim, the request from the pod’s pod.spec.containers[].resources.requests is consumed out of the claim’s pool.
# DeviceClass
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: shared-cpu-pool
spec:
  selectors:
  - cel: 'device.driver == "dra.example.com"'
  managesNodeAllocatableResources: true
  nodeAllocatableResourceAccountingPolicies:
    cpu: "ConsumeFrom"
---
# ResourceSlice
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: shared-cpu-pool-slice
spec:
  devices:
  - name: shared-pool-instance-1
    allowMultipleAllocations: true
    capacity:
      "dra.example.com/cpu": "128"
    nodeAllocatableResourceMappings:
      cpu: # Accounting policy specified in the device class
        quantityFrom:
          capacity: "dra.example.com/cpu"
Accounting Policy Compatibility and Validation
Since Max and ConsumeFrom policies are not additive, we could have complex interactions between
different claims of a container and the pod spec. Validation rules become necessary to ensure
predictable behavior and prevent conflicting resource requests.
The following rules would need to be enforced by the scheduler, within the DynamicResources plugin’s
Filter stage to handle these interactions.
- If multiple claims affect the same node allocatable resource in the same container using Max, they must all be from the same DRA driver. The sum of all the claim requests would be considered while comparing with the container spec.
- If multiple claims affect the same node allocatable resource in the same container using ConsumeFrom, they must all be from the same DRA driver.
- A container cannot have claims requesting devices with PolicyConsumeFrom for a node allocatable resource if it also has claims using PolicyMax.
- A container can use a claim with PolicyMax for a node allocatable resource (e.g., from a CPU DRA driver) to set its base request, while simultaneously using other claims for the same node allocatable resource with PolicyAddPerClaim or PolicyAddPerReference (e.g., from a GPU driver for auxiliary CPU). The scheduler will sum the overridden value with the rest of the additive policies while accounting for node resources.
- A container can use a claim with PolicyConsumeFrom for a node allocatable resource to set its base request, while using other claims for the same node allocatable resource with PolicyAddPerClaim or PolicyAddPerReference (e.g., from a GPU driver for auxiliary CPU). The container’s resources.requests are still drawn from the ConsumeFrom pool, and the PolicyAddPerClaim/PolicyAddPerReference quantities are accounted for against the node’s general allocatable resources.
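The arithmetic implied by the last two rules can be sketched as follows. This is a minimal illustration, not the scheduler's implementation: the `ClaimRequest` type and `NodeChargeMilliCPU` function are hypothetical, quantities are reduced to milliCPU, and it assumes that Max means "the larger of the container spec request and the summed Max claims" and that a ConsumeFrom claim moves the container's spec request entirely into the pool.

```go
package main

import "fmt"

// Policy names mirror the accounting policies named in this section.
type Policy string

const (
	PolicyMax             Policy = "Max"
	PolicyConsumeFrom     Policy = "ConsumeFrom"
	PolicyAddPerClaim     Policy = "AddPerClaim"
	PolicyAddPerReference Policy = "AddPerReference"
)

// ClaimRequest is a hypothetical, simplified view of what one claim
// contributes to a single node allocatable resource (here: milliCPU).
type ClaimRequest struct {
	Policy   Policy
	MilliCPU int64
}

// NodeChargeMilliCPU returns the milliCPU charged against the node's general
// allocatable pool for one container, under the assumptions stated above.
func NodeChargeMilliCPU(specMilliCPU int64, claims []ClaimRequest) int64 {
	var maxSum, additive int64
	consumeFrom := false
	for _, c := range claims {
		switch c.Policy {
		case PolicyMax:
			maxSum += c.MilliCPU // Max claims are summed before comparing with the spec
		case PolicyConsumeFrom:
			consumeFrom = true
		case PolicyAddPerClaim, PolicyAddPerReference:
			additive += c.MilliCPU
		}
	}
	base := specMilliCPU
	if consumeFrom {
		// Assumption: the spec request is drawn from the claim's pool,
		// so it is not charged to the node a second time.
		base = 0
	} else if maxSum > base {
		base = maxSum
	}
	return base + additive
}

func main() {
	claims := []ClaimRequest{{PolicyMax, 4000}, {PolicyAddPerClaim, 1000}}
	fmt.Println(NodeChargeMilliCPU(2000, claims))
}
```

With a 2 CPU spec request, a 4 CPU Max claim, and a 1 CPU additive claim, the sketch charges 5 CPUs to the node; replacing the Max claim with a ConsumeFrom claim leaves only the 1 CPU additive amount charged to the node.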
Invalid Scenarios:
- A container cannot have multiple Max or ConsumeFrom policies for the same resource backed by different drivers
  - Container “c1”:
    - ClaimA: {cpu (DriverX), Max, 4 CPU}
    - ClaimB: {cpu (DriverY), ConsumeFrom, 8 CPU}
- A container cannot have multiple ConsumeFrom policies for the same resource from different drivers
  - Container “c1”:
    - ClaimA: {cpu (DriverX), ConsumeFrom, 100 CPU Pool}
    - ClaimB: {cpu (DriverY), ConsumeFrom, 50 CPU Pool}
- A container cannot have multiple Max policies for the same resource from different drivers
  - Container “c1”:
    - ClaimA: {cpu (DriverX), Max, 100 CPU Pool}
    - ClaimB: {cpu (DriverY), Max, 50 CPU Pool}
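The compatibility rules and invalid scenarios above could be checked per container roughly as follows. This is only a sketch under stated assumptions: the `ClaimPolicy` type, the `ValidateContainerClaims` function, and the error values are hypothetical names, not part of the proposed API, and the real check would live in the DynamicResources plugin's Filter stage and operate on allocated devices.

```go
package main

import (
	"errors"
	"fmt"
)

type Policy string

const (
	PolicyMax             Policy = "Max"
	PolicyConsumeFrom     Policy = "ConsumeFrom"
	PolicyAddPerClaim     Policy = "AddPerClaim"
	PolicyAddPerReference Policy = "AddPerReference"
)

// ClaimPolicy is a hypothetical summary of how one claim affects one
// node allocatable resource of a container.
type ClaimPolicy struct {
	Resource string // e.g. "cpu"
	Driver   string // DRA driver backing the claim
	Policy   Policy
}

var (
	ErrMixedBasePolicies = errors.New("cannot combine Max and ConsumeFrom for the same resource")
	ErrMultipleDrivers   = errors.New("Max or ConsumeFrom claims for the same resource must use one driver")
)

// ValidateContainerClaims applies the rules listed above to the claims of a
// single container. Additive policies are always allowed alongside others.
func ValidateContainerClaims(claims []ClaimPolicy) error {
	type base struct {
		policy Policy
		driver string
	}
	seen := map[string]base{} // resource name -> first non-additive claim seen
	for _, c := range claims {
		if c.Policy != PolicyMax && c.Policy != PolicyConsumeFrom {
			continue
		}
		prev, ok := seen[c.Resource]
		if !ok {
			seen[c.Resource] = base{c.Policy, c.Driver}
			continue
		}
		if prev.policy != c.Policy {
			return fmt.Errorf("resource %q: %w", c.Resource, ErrMixedBasePolicies)
		}
		if prev.driver != c.Driver {
			return fmt.Errorf("resource %q: %w", c.Resource, ErrMultipleDrivers)
		}
	}
	return nil
}

func main() {
	// Second invalid scenario: two ConsumeFrom claims from different drivers.
	err := ValidateContainerClaims([]ClaimPolicy{
		{"cpu", "DriverX", PolicyConsumeFrom},
		{"cpu", "DriverY", PolicyConsumeFrom},
	})
	fmt.Println(err)
}
```

All three invalid scenarios are rejected by this sketch, while a Max claim from a CPU driver combined with an AddPerClaim claim from a GPU driver passes.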
Use Case: Pod Consuming from a Shared CPU Pool
# ResourceSlice with 128 CPU consumable capacity
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: shared-cpu-pool-slice
spec:
  devices:
  - name: shared-pool-instance-1
    capacity:
      "dra.example.com/cpu": "128"
    nodeAllocatableResourceMappings:
      cpu:
        accountingPolicy: "ConsumeFrom"
        quantityFrom:
          capacity: "dra.example.com/cpu"
---
# ResourceClaim for the shared pool of 100 CPUs
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: shared-cpu-claim
spec:
  devices:
    requests:
    - name: pool
      exactly:
        deviceClassName: shared-cpu-pool
        capacity:
          requests:
            "dra.example.com/cpu": "100"
---
# Pod 1 consumes 10 CPUs from the shared pool
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  containers:
  - name: container-a
    resources:
      requests:
        cpu: "10"
      claims:
      - name: my-pool
  resourceClaims:
  - name: my-pool
    resourceClaimName: shared-cpu-claim
---
# Pod 2 consumes 20 CPUs from the shared pool
apiVersion: v1
kind: Pod
metadata:
  name: pod2
spec:
  containers:
  - name: container-b
    resources:
      requests:
        cpu: "20"
      claims:
      - name: my-pool
  resourceClaims:
  - name: my-pool
    resourceClaimName: shared-cpu-claim
Expected Behavior & Accounting:
Scheduling Pod1:
- NodeResourcesFit: Skips node allocatable resource node fit check as the DeviceClass has managesNodeAllocatableResources: true.
- DynamicResources: Sees ConsumeFrom policy. The claim requested 100 CPUs from the pool. Checks if container-a’s request of 10 CPU fits within the 100 CPUs. It does.
- NodeInfo Update: NodeAllocatableDRAClaimStates for shared-cpu-claim is created. Allocated is set to {cpu: 100}. Consumed is set to {cpu: 10}. NodeInfo.Requested increases by 100 CPUs.
Scheduling Pod2:
- NodeResourcesFit: Skips node allocatable resource node fit check as the DeviceClass has managesNodeAllocatableResources: true.
- DynamicResources: Sees ConsumeFrom. Retrieves NodeAllocatableDRAClaimStates. Allocated (Pool Capacity) is 100, Consumed is 10. Remaining pool capacity: 100 - 10 = 90. Checks if container-b’s request of 20 CPU fits: 20 <= 90. It fits.
- NodeInfo Update: NodeAllocatableDRAClaimStates for shared-cpu-claim has Consumed updated to {cpu: 30}. Allocated and NodeInfo.Requested.MilliCPU remain unchanged.
Pod Deletion:
- If Pod1 is deleted: NodeInfo.update subtracts 10 from NodeAllocatableDRAClaimStates[].Consumed. NodeInfo.Requested is unchanged.
- If Pod2 is then deleted: NodeInfo.update subtracts 20 from NodeAllocatableDRAClaimStates[].Consumed. Consumers becomes empty. The entire 100 CPU pool capacity is subtracted from NodeInfo.Requested. The NodeAllocatableDRAClaimStates entry for shared-cpu-claim is removed.
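The bookkeeping in this walkthrough can be modeled with a small sketch. The types below are hypothetical stand-ins for the scheduler's NodeInfo and one NodeAllocatableDRAClaimStates entry, tracking whole CPUs only; they are meant to illustrate the accounting steps, not the actual cache implementation.

```go
package main

import "fmt"

// claimState is a hypothetical stand-in for one NodeAllocatableDRAClaimStates
// entry, tracking CPU only.
type claimState struct {
	Allocated int64            // pool capacity allocated to the claim
	Consumed  int64            // sum of consumer requests drawn from the pool
	Consumers map[string]int64 // pod name -> CPUs drawn from the pool
}

// nodeInfo is a simplified model of the scheduler's per-node cache.
type nodeInfo struct {
	RequestedCPU int64
	ClaimStates  map[string]*claimState
}

func newNodeInfo() *nodeInfo {
	return &nodeInfo{ClaimStates: map[string]*claimState{}}
}

// AddPod draws cpu from the claim's pool. The pool itself is charged to
// RequestedCPU once, when the first consumer appears.
func (n *nodeInfo) AddPod(claim string, poolCPU int64, pod string, cpu int64) error {
	st, ok := n.ClaimStates[claim]
	if !ok {
		st = &claimState{Allocated: poolCPU, Consumers: map[string]int64{}}
	}
	if remaining := st.Allocated - st.Consumed; cpu > remaining {
		return fmt.Errorf("pod %s requests %d CPUs, only %d left in pool", pod, cpu, remaining)
	}
	if !ok {
		n.ClaimStates[claim] = st
		n.RequestedCPU += poolCPU
	}
	st.Consumed += cpu
	st.Consumers[pod] = cpu
	return nil
}

// RemovePod returns the pod's share to the pool. When the last consumer
// leaves, the whole pool is released from RequestedCPU and the entry dropped.
func (n *nodeInfo) RemovePod(claim, pod string) {
	st, ok := n.ClaimStates[claim]
	if !ok {
		return
	}
	cpu, ok := st.Consumers[pod]
	if !ok {
		return
	}
	st.Consumed -= cpu
	delete(st.Consumers, pod)
	if len(st.Consumers) == 0 {
		n.RequestedCPU -= st.Allocated
		delete(n.ClaimStates, claim)
	}
}

func main() {
	n := newNodeInfo()
	_ = n.AddPod("shared-cpu-claim", 100, "pod1", 10)
	_ = n.AddPod("shared-cpu-claim", 100, "pod2", 20)
	st := n.ClaimStates["shared-cpu-claim"]
	fmt.Println(st.Allocated, st.Consumed, n.RequestedCPU) // 100 30 100
	n.RemovePod("shared-cpu-claim", "pod1")
	fmt.Println(st.Consumed, n.RequestedCPU) // 20 100
	n.RemovePod("shared-cpu-claim", "pod2")
	fmt.Println(len(n.ClaimStates), n.RequestedCPU) // 0 0
}
```

Charging the full pool once, rather than each consumer's request, is what keeps NodeInfo.Requested unchanged when Pod2 is scheduled and when Pod1 is deleted.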
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
Unit tests will be added for all new and modified logic within the kube-scheduler components.
- Ensuring the new fields in DeviceClass and Device are validated correctly.
- Scheduler Plugin Logic (NodeResourcesFit, DynamicResources):
  - Verifying the correct deferral of node allocatable resource checks in NodeResourcesFit.
  - Verify the accurate calculation of a pod’s total node allocatable resource demand, ensuring it correctly combines standard pod.spec requests with DRA-based allocations. These tests must cover all supported ways to model node allocatable resources, including scenarios involving consumable capacity, partitionable devices, and auxiliary resource requests for other devices.
  - Validating that pod.status.nodeAllocatableResourceClaimStatuses is updated correctly.
- Scheduler Framework:
  - Verify NodeInfo cache updates correctly in the Assume stage and reflects resources allocated to node allocatable resource claims.
  - Verify that when a pod using DRA node allocatable resources is deleted, the resources are correctly released and become available for other pods in the scheduler’s cache.
- Component helper (k8s.io/component-helpers/resource)
  - Testing the PodRequests helper function’s updated logic to include DRA node allocatable resources.
    - Ensure existing calculations for pods without DRA claims or pod-level resources (PLR) remain correct, properly aggregating init and regular container requests.
    - Verify that pod-level resources, when specified for a resource, continue to take precedence over per-container requests, including node allocatable claim requests.
    - Verify that the node allocatable resources from pod.status.nodeAllocatableResourceClaimStatuses are correctly added to the pod’s effective standard resource requests.
    - Test that existing logic for different PodResourcesOptions (e.g., ExcludeOverhead, SkipPodLevelResources) continues to work as expected when DRA node allocatable resources are present, including correct handling of pod.spec.overhead.
- Kubelet Admission Check
  - Verifying that the admission check correctly uses the DRA node allocatable resource from the pod’s status.nodeAllocatableResourceClaimStatuses field.
Current Test Coverage:
- pkg/scheduler/framework/plugins/dynamicresources: 20260203 - 79.2
- pkg/scheduler/framework/plugins/noderesources: 20260203 - 89.6
- pkg/scheduler/schedule_one.go: 20260203 - 86.6
- pkg/scheduler/framework/types.go: 20260203 - 66.4
- pkg/scheduler/eventhandlers.go: 20260203 - 71.4
- staging/src/k8s.io/component-helpers/resource/helpers.go: 20260203 - 82.4
Integration tests
Integration tests will be added in test/integration/dynamicresource to cover the end-to-end scheduling flow:
Kube-Scheduler:
- Tests to ensure correct interaction between NodeResourcesFit and DynamicResources plugins.
- Test that the scheduler’s internal cache (NodeInfo.Requested) is accurately updated to reflect the resources consumed by pods with DRA node allocatable resource claims.
- Ensure that resources are correctly released in the scheduler cache when a pod with DRA node allocatable resources is deleted.
- Validate that fungible claims resulting in different node allocatable resource footprints are accounted for correctly on a per-node basis.
- Tests to validate the pod.status.nodeAllocatableResourceClaimStatuses is populated correctly and the kubelet admission check correctly computes the effective pod resource request.
Kubelet:
- Test that the Kubelet’s admission handler correctly factors in the node allocatable resources specified in pod.status.nodeAllocatableResourceClaimStatuses when deciding whether to admit a pod.
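The admission behavior these tests target can be illustrated with a simplified model. The sketch below assumes the claim statuses reduce to per-resource quantities that are added to the pod's standard requests; the `Resources` type and the `effectiveRequests` and `admit` functions are hypothetical and do not reflect the kubelet's actual code or the final shape of the status field.

```go
package main

import "fmt"

// Resources maps a resource name to a quantity in base units
// (milliCPU for "cpu", bytes for "memory").
type Resources map[string]int64

// effectiveRequests adds the quantities recorded for node allocatable
// resource claims to the pod's standard requests.
func effectiveRequests(spec Resources, claimStatuses []Resources) Resources {
	out := Resources{}
	for name, qty := range spec {
		out[name] = qty
	}
	for _, status := range claimStatuses {
		for name, qty := range status {
			out[name] += qty
		}
	}
	return out
}

// admit reports whether every effective request fits into what is still free
// on the node. On rejection it returns the name of an insufficient resource.
func admit(allocatable, alreadyRequested, podEffective Resources) (bool, string) {
	for name, want := range podEffective {
		free := allocatable[name] - alreadyRequested[name]
		if want > free {
			return false, name
		}
	}
	return true, ""
}

func main() {
	allocatable := Resources{"cpu": 8000}
	inUse := Resources{"cpu": 5000}
	spec := Resources{"cpu": 2000}
	claims := []Resources{{"cpu": 2000}} // quantity contributed by a DRA claim

	ok, _ := admit(allocatable, inUse, spec)
	fmt.Println("spec only admitted:", ok)

	ok, short := admit(allocatable, inUse, effectiveRequests(spec, claims))
	fmt.Println("with claims admitted:", ok, short)
}
```

In this example the pod fits when only its spec request is counted, but is rejected for CPU once the claim-derived quantity is included, which is the case the kubelet test needs to cover.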
e2e tests
E2E tests will be added to test/e2e/dra:
- Verify that pods using DRA node allocatable resource claims are scheduled onto nodes with sufficient capacity, considering both the pod’s standard requests and the DRA-added node allocatable resources.
These tests should cover various DRA modeling scenarios:
- Node allocatable resources as individual devices.
- Node allocatable resources as consumable capacity from a pool.
- Node allocatable resources from partitionable devices.
- Auxiliary node allocatable resources required by other devices (e.g., additional memory for an accelerator).
- Fungible claims involving node allocatable resources
Graduation Criteria
Alpha
- Feature implemented behind the DRANodeAllocatableResources feature gate and disabled by default.
- Core API changes for DeviceClass, Device, and PodStatus introduced.
- Kube-Scheduler:
  - The DynamicResources plugin is updated to calculate and enforce node resource fit based on standard requests and node allocatable resource claims.
  - The scheduler’s internal cache update logic is enhanced to incorporate DRA node allocatable resource allocations.
- The k8s.io/component-helpers/resource shared library is enhanced to compute effective pod resource footprint.
- The Kubelet’s admission handler is updated to consider node allocatable resource claims in Pod.Status.
- All unit and integration tests outlined in the Test Plan are implemented and verified.
Alpha2 / Beta
- Gather feedback from alpha.
- Enhance Kubelet to utilize pod.status.nodeAllocatableResourceClaimStatuses for accurate QoS classification and cgroup management.
- Design and implement support for different accounting policies with node allocatable resource claims and standard requests.
- Define the interactions between DRA node allocatable resources and In-Place Pod Vertical Scaling.
- Add E2E tests for kube-scheduler and Kubelet changes, including correct QoS and cgroup enforcement.
Upgrade / Downgrade Strategy
Upgrade: Enabling the feature gate on an existing cluster is safe. The new accounting logic will apply to any newly scheduled pods or pods that are re-scheduled. Existing pods with node allocatable resource claims would continue to run, but their claim requests will not be reflected in the scheduler’s NodeInfo cache, as these pods lack the pod.status.nodeAllocatableResourceClaimStatuses field. To fully resynchronize the accounting, the pods with node allocatable resource claims must be restarted.
Downgrade: Disabling the feature gate requires a kube-scheduler restart. Upon startup, the scheduler rebuilds the NodeInfo cache without considering DRA node allocatable resources. The scheduler’s view of resource usage for existing pods will be incomplete (underestimated) as it does not consider claim-based requests. This could potentially lead to oversubscription of the node if new pods are scheduled.
Version Skew Strategy
An older scheduler will not understand the new API fields or perform unified accounting. If DeviceClass or
ResourceSlice objects contain the new fields, they will be ignored.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: DRANodeAllocatableResources
  - Components depending on the feature gate: kube-scheduler, kubelet, kube-apiserver.
Does enabling the feature change any default behavior?
No. This feature only takes effect if users create Pods that request node allocatable resources via
pod.spec.resourceClaims and DRA drivers are installed and configured to expose node allocatable resources via
nodeAllocatableResourceMappings in ResourceSlice objects. Existing pods are unaffected.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling the feature gate DRANodeAllocatableResources will prevent the scheduler from performing the unified accounting.
Pods already scheduled using DRA node allocatable resource accounting will continue to run. However, when new pods are scheduled
while the gate is disabled, any node allocatable resources specified in their DRA claims will not be considered by the scheduler.
This can lead to node oversubscription as the scheduler’s view of available resources on the node will be incomplete.
What happens if we reenable the feature if it was previously rolled back?
The scheduler will resume its unified accounting logic for pods with DRA node allocatable resource claims. API
validation for the new fields will be re-enabled. The NodeInfo cache may be incorrect as it’s not
retroactively updated to consider node allocatable resource claims for previously scheduled pods. This inconsistent
state would persist until kube-scheduler restarts or all pods with node allocatable resource claims are restarted.
Are there any tests for feature enablement/disablement?
Unit tests in kube-scheduler and kube-apiserver will verify the behavior of the scheduler plugins
(NodeResourcesFit, DynamicResources) and API validation with the feature gate enabled and disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
- DeviceClass objects with spec.managesNodeAllocatableResources: true.
- Device objects within ResourceSlice having non-empty nodeAllocatableResourceMappings.
- Pods with status.nodeAllocatableResourceClaimStatuses populated.
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Other field: pod.status.nodeAllocatableResourceClaimStatuses
- Details: Pods referencing node allocatable resource claims should have the pod status updated with nodeAllocatableResourceClaimStatuses.
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
No
Scalability
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No. This KEP proposes extensions to existing types, but does not introduce a new type.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes. With the API changes proposed in this KEP, individual DeviceClass, ResourceSlice and Pod objects would have additional fields, thus increasing their overall size.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes. The time to schedule a pod would increase if it references claims with node allocatable resources. The DynamicResources
scheduler plugin would need to allocate the device to the pod and would also need to perform additional
validations and a node resource fit check.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
Drawbacks
Alternatives
DeviceClass API Extension for NodeAllocatableResourceMappings
In this option, the primary information about how a DeviceClass relates to node allocatable resources is contained within the DeviceClassSpec.
// In k8s.io/api/resource/v1/types.go
type DeviceClassSpec struct {
	// ... existing fields

	// NodeAllocatableResourceMappings lists the node allocatable resources that this DeviceClass can provide or depend on.
	// +optional
	// +featureGate=DRANodeAllocatableResources
	NodeAllocatableResourceMappings []NodeAllocatableResourceMapping `json:"nodeAllocatableResourceMappings,omitempty"`
}

// NodeAllocatableResourceAccountingPolicy, NodeAllocatableResourceQuantity
// are defined the same as in the main proposal.
Reason for Not Choosing:
While defining NodeAllocatableResourceMappings in the DeviceClass is simpler, it lacks the granularity needed for many real-world scenarios. The Device API Extension approach allows these mappings to be specified per-Device instance within the ResourceSlice. This is advantageous because:
- Heterogeneous Devices: Even within the same DeviceClass, individual device instances can have different node allocatable resource implications. For example, different GPU models or even the same model on different parts of the system topology might have varying CPU/memory overheads. The DeviceClass-level approach cannot express this.
- Complex Resources: Some resources use Partitionable Devices to model hierarchies (e.g., sockets, NUMA nodes, caches, cores). The node allocatable resource capacity (e.g., number of CPUs) is associated with specific instances in the hierarchy, and this is best represented in individual Device entries.