KEP-5004: DRA Extended Resource

KEP-5004 : DRA: Handle extended resource requests via DRA Driver

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Extended resource provides a simple, concise approach to describe resource capacity, and resource consumption. In constrast, Dynamic Resource Allocation (DRA) provides a more expressive, flexible approach, yet more complicated, and harder to use.

This KEP provides a solution to enable cluster administrators to advertise the dynamic resources (in ResourceSlice) as extended resource via DeviceClass. and enables the application developers, and operators to continue using extended resource to request for such resources.

This KEP provides dynamic allocation of resources to requests made through either extended resource, or DRA resource claim.

Motivation

There are three major motivations for the solution in this KEP.

Enable existing applications to run without modification.
Enable application developers and operators to transition to DRA gradually at their own pace.
Enable cluster administrators to transition to DRA gradually at their own pace, possibly one node a time, which means supporting clusters where some nodes use device plugins and some nodes use DRA drivers for the same hardware at the same time.

For example, the following Deployment can be installed without modification on a cluster with DRA ResourceSlice,DeviceClass and Node below. The 1 GPU out of the 8 GPUs on the node is dynamically allocated to the pod, with the remaining 7 GPUs left for allocation for future requests from either extended resource, or DRA resource claim.

Note that another node in the same cluster has installed device plugin, which may have advertised e.g. ’example.com/gpu: 2’ in its Node’s Capacity. The same Deployment can possibly be scheduled and run on that node too.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nvidia/cuda:8.0-runtime
        command: ["/bin/sh", "-c"]
        args: ["nvidia-smi && tail -f /dev/null"]
        resources:
          limits:
            example.com/gpu: 1

apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == 'gpu.example.com' && device.attributes['gpu.example.com'].type
        == 'gpu'
  extendedResourceName: example.com/gpu

apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.coqj92d
spec:
  devices:
  - basic:
    name: gpu-0
  - basic:
    name: gpu-1
  - basic:
    name: gpu-2
  - basic:
    name: gpu-3
  - basic:
    name: gpu-4
  - basic:
    name: gpu-5
  - basic:
    name: gpu-6
  - basic:
    name: gpu-7
  driver: gpu.example.com
  nodeName: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2

apiVersion: v1
kind: Node
metadata:
  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
status:
  capacity:
    cpu: "4"
    ephemeral-storage: 101430960Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15335536Ki
    pods: "110"

apiVersion: v1
kind: Node
metadata:
  name: gke-drabeta-n1-standard-4-2xt4-346fe653-xyz8
status:
  capacity:
    cpu: "4"
    ephemeral-storage: 101430960Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 15335536Ki
    pods: "110"
    example.com/gpu: 2

With this motivating example in mind. We define the following goals and non-goals of this KEP.

Goals

Enable cluster administrators to specify devices advertised by DRA drivers to satisfy extended resource requests.
Enable application operators to use the existing extended resource request in pod spec to request DRA resources.
Extended resource support is not added just for easing the transition to DRA for the short term. Its ease of use is one big advantage to keep it remaining useful for the long term.
Device plugin API must not change. The existing device plugin drivers must continue working without change.
DRA driver API must not change. Core Kubernetes (kube-scheduler, kubelet) is preferred over DRA driver for any change needed to support the feature.
Keep advertising only extended resources backed by device plugin in node.status.Capacity for Alpha. It will be revisited for Beta, based on Alpha feedback.

Non-Goals

Minimize kubelet or kube-scheduler changes. The feature requires necessary changes in both scheduling and actuation.
One node has both extended resource backed by DRA, and the same named extended resource backed by device plugin at the same time.

Proposal

The basic idea is the following:

Introduce an extended resource backed by DRA concept. It is like the current extended resource backed by device plugin, in that, it has a string name, and a discrete countable quantity. Its capacity can be derived from DRA ResourceSlice, its consumption is specified through pod’s extended resource request.
Introduce a field ExtendedResourceName to DeviceClass to allow cluster administrators to treat certain class of devices as an extended resource.
Introduce a special ResourceClaim object to keep track of device allocations. It is special only in the sense that it is created by the scheduler. No semantic changes are needed in the ResourceClaim API for it. kube-scheduler uses DRA scheduling algorithm to fit pod’s extended resource request to a node that advertises the extended resource in DRA ResorceSlice or extended resources backed by device plugin. When using DRA devices, it creates a special ResourceClaim for the pod with the allocation result recording which devices were picked. More details on this special ResourceClaim follow below. When using extended resources advertised for a node by device plugin, the existing resource tracking reserves them.
Introduce a field ExtendedResourceClaimStatus to pod’s Status, such that:
- the kubelet can find the special ResourceClaim while looking for claims to prepare
- the kubelet can pass the devices to containers in the pod with the extended resource requests, based on the container/extended resource to device request mapping in the ExtendedResourceClaimStatus. Containers can be initContainers, regular containers, but cannot be ephemeral containers.

Some quick clarifications around the basic concepts: extended resource backed by device plugin, extended resource backed by DRA, and dynamic resource.

extended resource backed by device plugin uses pod’s spec.containers[].resources.requests to request for resources, it consumes the capacity from node’s status.capacity. It is of type (string, int64)
dynamic resource uses ResourceClaim to request resources, and ResourceSlice to provide resource capacity. A pod asks for resources through resource claim requests in pod’s spec.resources.claims. Dynamic resource type is described in resource slice, simply speaking, it is a list of devices, with each device being described as structured parameters.
extended resource backend by DRA is a combination of the two above. It uses pods' spec.containers[].resources.requests to request for resources, and uses ResourceSlice to provide resource capacity. Hence, it is of type (string, int64) on the consumption side, and list of devices with a common ExtendedResourceName on the capacity side.

With these additions in place, the DRA devices can be consumed by extended resource requests, or by DRA resouce claims. The scheduler has everything it needs to support the dynamic allocation of devices to requests made through extended resource and resource claims. No static partition of resources between extended resources and resource claims is needed. The kubelet and DRA driver has everything they need to admit a pod and pass the allocated devices to the containers in the pod to run.

Note the following cluster setup configuration and constraint:

One node in a cluster can have an extended resource backed by DRA, and another node in the same cluster can have the same named extended resource backed by Device Plugin.
One node in a cluster cannot have both extended resource backed by DRA, and same named extended resource backed by device plugin at the same time. This implies that either the resource is advertised through ResourceSlice, or Node’s status.capacity.

Design Details

Device Class API

The extended resource name to DRA device mapping can be specified at DeviceClassSpec. The same extended resource name should be given to at most one device class. If there are more than one device classes, the one created later is picked at scheduling time, if two are created at the same time, the name lexicographically sorted first is picked, this gives a non-disruptive non-error way to transition an extended resource from being backed by one device class to being backed by another (create new device class, update old device class to clear the extended resource field).

Cluster administrator is soly responsible for creating device classes, and the mapping between the class of devices and the extended resource name. DeviceClass is cluster scoped, application developers and operators cannot change it.

The mapping of DRA devices and extended resources is stored in k8s data store (e.g. etcd). An application using the extended resources can only request the devices from DRA after the device class with the mapping is created. Before that, the application can request the devices from device plugin only.

// DeviceClassSpec is used in a DeviceClass to define what can be allocated
// and how to configure it.
type DeviceClassSpec struct {
	// ExtendedResourceName defines a mapping to the extended resource API.
	// All devices matched by the device class can be used to satisfy extended resource requests in pod's spec using this name.
	//
	// +optional
	ExtendedResourceName *string
}

Implicit Extended Resource Name

In addition to this optional extended resource name that is explicitly defined, every device class can be accessed as an extended resource using the name deviceclass.resource.kubernetes.io/<device-class-name>. This implicit extended resource name allows the simpler API to be used for DRA resource when no special DRA features beyond those available via DeviceClass are needed.

There is a mismatch between what the API server allows to be a valid device class name and extended resource name:

DeviceClass metadata.name must match IsDNS1123Subdomain, can be 253 characters long with dots
extended resource name must match IsQualifiedName, name part can be 63 characters, with dots

As a result, cluster admin must pick a DeviceClass name that conforms to the extended resource name requirement, to be able to use it as implicit extended resource name. Failing that, cluster admin can still set the extened resource name field explicitly in the DeviceClass.

Resource Claim API

A special resource claim object is created to keep track of device allocations for extended resource. The resource claim object has the following properties:

It is namespace scoped, like other resource claim objects.
It is owned by a pod, like other resource claim objects.
It has Spec of device.requests, with each request name being an encoding of the container name and the extended resource backed by DRA name inside the container.
Its status.allocation.devices and status.allocation.reservedFor are used.
It does not have annotation resource.kubernetes.io/pod-claim-name: as it is created for the extended resource request(s) in a pod spec, not for a claim in the pod spec.
It does have annotation resource.kubernetes.io/extended-resource-claim: pod-name as it is created, deleted, updated by the scheduler. It is used by scheduler to find the resource claim it has created, and ensure at most one such claim per pod.
At most one such claim object is created per pod. For example, if a pod requests for foo1.domain/bar and foo2.domain/bar, the allocation of devices for each are recorded in DeviceResourceRequestAllocationResult, and just one claim object with allocation Results that lists all allocated devices is created for the pod.

The special resource claim object lifecycle is managed by the scheduler and garbage collector.

It is created in a namespace when there is a pod with extended resource request, and the extended resource is advertised by ResourceSlice and scheduler has fit the pod to a node with the ResourceSlice.
It is created by the scheduler dynamic resource plugin during preBind phase. The in-memory one in the assumed cache is created earlier during Reserve phase.
It is deleted
- either together with the owning pod’s deletion.
- or by the scheduler dynamic resource plugin during unReserve phase.
- or by the scheduler dynamic resource plugin during postFilter phase.
It is discovered by the kubelet via pod.Status.ExtendedResourceClaimStatus
It is read by the kubelet DRA device driver to prepare the devices listed therein when preparing to run the pod.

type DeviceRequest struct {
	// Name can be used to reference this request in a pod.spec.containers[].resources.claims
	// entry and in a constraint of the claim.
	//
	// Must be a DNS label.
	//
	// +required
	Name string
}

To enable the kubelet to map devices back to the containers which requested them, the kube-scheduler creates one DeviceRequest per extended resource backed by DRA per container in the pod. containers can be initContainers, regular containers, but cannot be ephemeral containers. The name of the DeviceRequest has the form “container-%d-request-%d”, where the first %d is the index of the container in the pod. The second %d is the index of the extended resource inside the container resource requests. For example, if the first container in the pod has an extended resource backed by DRA which is the 3rd such request in the container, then the name of the DeviceRequest is “container-0-request-2”.

Documenting this naming is merely informational, it is not part of the API. The kubelet must not rely on it. Instead, the ContainerExtendedResourceRequest field below specifies the mapping.

Pod API

A new field extendedResourceClaimStatus is added to Pod’s status to track the special ResourceClaim object created for the extended resource requests in the pod. This is needed for kubelet to pass the devices allocated by driver to the containers in the pod. containers can be initContainers, regular containers, but cannot be ephemeral containers.

// PodExtendedResourceClaimStatus is stored in the PodStatus for each extended
// resource requests backed by DRA. It stores the generated name for 
// the corresponding special ResourceClaim created by scheduler.
type PodExtendedResourceClaimStatus struct {
        // ResourceClaimName is the name of the ResourceClaim that was
        // generated for the Pod in the namespace of the Pod.
        ResourceClaimName string

        // RequestMapping identifies the mapping of <container, extended resource backed by DRA> to  device request.
        // +patchMergeKey=requestName
        // +patchStrategy=merge,retainKeys
        // +listType=atomic
        // +listMapKey=requestName
        // +featureGate=DynamicResourceAllocation
        RequestMapping []ContainerExtendedResourceRequest
}

type ContainerExtendedResourceRequest struct {
        // ContainerName is the unique container name within the pod.
        ContainerName string
        // ExtendedResourceName is the extended resource name backed by DRA inside
        // the container's requests.
        ExtendedResourceName string
        // RequestName is the device request name in the special resource claim
        // created for extended resource requests backed by DRA.
        RequestName string
}

type PodStatus struct {
    ...

    // Status of extended resource claim backed by DRA.
    // +featureGate=DynamicResourceAllocation
    // +optional
    ExtendedResourceClaimStatus *PodExtendedResourceClaimStatus
}

For example, if a pod has requested for foo.domain/bar, and it is scheduled to run on a node where foo.domain/bar was mapped to devices in a DeviceClass, then the pod’s status is like below:

status:
  extendedResourceClaimStatus:
    resourceClaimName: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
    requestMapping:
    - containerName: container-name
      extendedResourceName: foo.domain/bar
      requestName: container-0-request-2

where deviceRequest name is “container-0-request-2”, and container-name is the first container in the pod, foo.domain/bar is the 3rd extended resource in the container’s requests.

Note the validations for extendedResourceClaimStatus are different from the validations for resourceClaimStatuses.

resourceClaimStatuses requires name must be DNS label, extendedResourceClaimStatus’s requestMapping’s containerName and RequestName must be a DNS label, while the extendedResourceName is not a DNS label.
resourceClaimStatuses requires name must be one of the claim’s name in the pod spec. extendedResourceClaimStatus requires containerName must be one of the container name in the pod spec, and extendedResourceName must be one of the extended resource name in that container.

Resource Quota

Currently, there are two different applicable quotas, one is device-class-name.deviceclass.resource.k8s.io/devices that limits the resource claims in a namespace as described in KEP . The other is the extended resource quota .

As there is a one to one mapping between device class, and extended resource, the two quota mechanisms above should keep track of the usages of the same class of devices the same way.

But currently, the extended resource quota keeps track of the devices provided from device plugin, and DRA resource slice requested from pod’s extended resource requests. The resource claim quota currently keeps track of the devices provided from DRA resource slice requested from resource claims.

The extended resource quota usage needs to be adjusted to account for the device requests from resource claims. On the other side, resource claim quota has alreadys accounted for the devices requests from pod’s extendeded resources, as scheduler would create a special resource claim for the extended resource requests.

For example, before the adjustment, the quota is as below. The explicit extended resource quota requests.example.com/gpu counts 1 device (e.g. gpu-0) from device plugin, and 1 device (e.g. gpu-1) from DRA resource slice. The implicit extended resource quota request.deviceclass.resource.kubernetes.io/mygpuclass counts 1 device (e.g. gpu-2) from DRA resource slice. The resource claim quota gpu.example.com.deviceclass.resource.k8s.io/devices counts 1 device (e.g. gpu-3) from a pod resource claim, and 1 device (e.g. gpu-4) from a resource claim template, in addition it also counts gpu-1 and gpu-2 in, as scheduler generates extended resource claims for them.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu
spec:
  hard:
    requests.example.com/gpu: 10
    request.deviceclass.resource.kubernetes.io/mygpuclass: 10
    gpu.example.com.deviceclass.resource.k8s.io/devices: 10
  used:
    requests.example.com/gpu: 2
    request.deviceclass.resource.kubernetes.io/mygpuclass: 1
    gpu.example.com.deviceclass.resource.k8s.io/devices: 4

Provided that the device class mygpuclass is mapped to the extended resource example.com/gpu.

apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mygpusclass
spec:
  extendedResourceName: example.com/gpu

For the same example, the explicit extended resource quota requests.example.com/gpu needs to be adjusted to count in the devices requested from implicit extended resource (e.g. gpu-2) and from resoure claims (e.g gpu-3 and gpu-4). The implicit extended resource quota request.deviceclass.resource.kubernetes.io/mygpuclass needs to be adjusted to count in the devices requested from resource claims (e.g. gpu-3 and gpu-4), and the DRA devices requested from explicit extended resources (e.g. gpu-1), but not the device plugin devices (e.g. gpu-0). The adjusted quota is as below.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu
spec:
  hard:
    requests.example.com/gpu: 10
    request.deviceclass.resource.kubernetes.io/mygpuclass: 10
    gpu.example.com.deviceclass.resource.k8s.io/devices: 10
  used:
    requests.example.com/gpu: 5
    request.deviceclass.resource.kubernetes.io/mygpuclass: 4
    gpu.example.com.deviceclass.resource.k8s.io/devices: 4

Scheduling for Extended Resource backed by DRA

A new field DynamicResources is added to Resource , it works similar to ScalarResources. It is used to keep track of the extended resources backed by DRA, i.e. those that are advertised by ResourceSlice, and mapped via DeviceClass extendedResourceName field.

type Resource struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64

	// NEW!
	// DynamicResources: keep track of extended resources backed by DRA to device class
	// The map's key is the extended resource name that has exactly one device
	// class advertises it.
	DynamicResources map[v1.ResourceName]string
}

type NodeInfo is used by scheduler to keep track of the information for each node in memory. Its Allocatable field is used to keep track of the allocatable resources in memory. At the beginning of each scheduling cycle, scheduler takes a snapshot of all the nodes in the cluster, and updates their corresponding NodeInfo.

For the scheduler with DRA enabled, right after taking the node snapshot, the scheduler also takes a snapshot of DeviceClass, and updates NodeInfo.DynamicResources if there is an extended resource backed by DRA.

For a node with extended resources from device plugin, its NodeInfo’s Allocatable.ScalarResources is updated with the k8s Node’s object. For a node with extended resources backed by DRA, its NodeInfo’s Allocatable.DynamicResources is updated based on DRA DeviceClass objects.

The existing ’noderesources’ plugin needs to be modified, such that a pod’s extended resource request is checked against a NodeInfo’s ScalarResources if the node uses device plugin, and checked against a NodeInfo’s DynamicResources, if the request is for extended resources backed by DRA, then ’noderesources’ plugin would pass and leave it to ‘dynamicresource’ plugin to check if it can be satisfied.

The existing ‘dynamicresources’ plugin needs to be modified to account for the extended resource backed by DRA requests.

EventsToRegister

This registers all cluster events that might make an unschedulable pod schedulable, like finishing the allocation of a claim, or resource slice updates.

The existing dynamicresource plugin has registered almost all the events needed for extended resource backed by DRA, with one addition framework.UpdateNodeAllocatable for node action.

PreFilter

It checks if the pod has any container requests for extended resources backed by DRA. If not, and no claims in the pod, then the plugin can return early, as there is nothing to do.

If the pod still needs to be considered by the plugin, then it checks if the special resource claim for extended resources backed by DRA has been created before by scheduler, by checking resource claim name having pod name in the annotation resource.kubernetes.io/extended-resource-claim: pod-name.

If found, scheduler would reuse it. If not found, scheduler would create a special resource claim that has empty spec. The exact spec needs to be decided during Filter phase, as some node may have device plugin provide the capacity for the extended resource, some other node may have DRA provide the capacity. The requests in the special resource claim need to vary for each node.

Filter

If a pod has an extended resource backed by DRA, and the node does not have device plugin to provide the capacity for the resource, then the dynamicresource plugin needs to try to allocate the resource by filling in the special claim’s Spec.Devices.Requests field.

One request is created per container, and per extended resource backed by DRA in the container. The DeviceClass in the request is the device class that has the matching ExtendedResourceName field (one extended resource name can be in at most one device class). The Name of the request is determined by the container name and the extended resource name.

The allocator needs to be modified to allow for the special resource claim for extended resource backed by DRA, which could vary by node. The Allocate method takes the claim as a parameter, in adddition to node parameter. The algorithm uses the passed in special claim whenever it processes the last claim in the claims slice, which is an instantiation of the special claim template created during the preFilter phase.

If there is an allocation for a node, the allocation, and the claim are recorded in cyclestate.

PostFilter

If the special resource claim is not available, i.e., the claim cannot be bound to the node, then scheduler would deallocate it, and delete it during PostFilter phase.

Reserve

Reserve the in-memory ResourceClaim and its allocation results in the assume cache, a map of in-flight claims.

Unreserve

The plugin deletes the special ResourceClaim for extended resource backed by DRA, because it cannot be scheduled after all.

Prebind

This is called in a separate goroutine. The plugin makes API call to create the ResourceClaim and updates the pod’s status ExtendedResourceClaimStatus. If some API request fails now, PreBind fails and the pod must be retried.

Failure handling

The special resourceclaim for extended resources backed by DRA may fail to be created, updated in kuberentes API server. Below discusses the possible failures in the write API calls added in this KEP.

During Prebind phase, if the special resourceclaim is new, i.e. it is not written to API server before, then it is first created in API server. Then followed by claim finalizer field update, claim status update if needed. These updates have local retries in case there is a conflict. (Note these updates logic is not new, they apply to other regular claims too). After that, Pod.Status.ExtendedResourceClaimStatus is updated if needed. Both the claim create, and claim finalizer update, and claim status, or pod status update could fail, which will fail the Prebind phase, framework.Error code is returned. The scheduler framework will first call Unreserve phase to clean up, then requeue the pod to activeQ/backoffQ soon.

During Unreserve phase, the special resourcelclaim’s finalizer is first removed with an update API call, then the claim is deleted, then Pod.Status.ExtendedResourceClaimStatus is updated. If the update or delete fails, the failure is logged, and continued. Unreserve needs to be idempotent, scheduler framework will retry later if there is failure.

During Postfilter phase, if the special resourceclaim is picked to be deleted, then the special resourcelclaim’s finalizer is first removed with an update API call, then the claim is deleted, then Pod.Status.ExtendedResourceClaimStatus is updated. If the update or delete fails, then framework.Error code is returned. The scheduler framework will requeue the pod to activeQ/backoffQ soon.

During Prefilter phase, if the special resource claim has already been created before, it is validated, and reused if still valid. If not valid, then return framework.Unschedulable , then the invalid resourceclaim may be picked during PostFilter phase for deletion. If not found, then a new in-memory resource claim template is created, which will be instantiated at Filter phase, persisted at PreBind phase.

During Filter phase, if the special resource claim allocation’s node selector does not match, then return framework.Unschedulable , then the resourceclaim may be picked for deletion during PostFilter phase.

Actuation for Extended Resource backed by DRA

When a pod with extended resources requests is picked up by the kubelet on the node it is scheduled to run, the following are particularly important:

Kubelet tries to admit the pod, the pod’s extended resources requests should not be checked against the Node’s allocatable, as the resources are in ResourceSlice, not in Node. In reality, the current predicate.go has already removed the missing extended resources from node info for cluster-level resources, hence there is no extra logic needed to admit the extended resources backed by DRA.
Kubelet (DRA manager) passes the special ResoureClaim to DRA driver to prepare the devices, in the same way as that for normal ResourceClaim.
Kubelet passes the device IDs through CDI to the containers with the extended resource requests. This is different from actuation of a pod with resource claim, as the pod does not have claim requests in containers or pods. Instead, the pod.status.extendedResourceClaimStatus has the mapping of container name and extended resource name to request in claim.spec.devices.requests, DRA manager uses this status information to pass the proper allocated device IDs to the proper container.

Cluster Autoscaler integration

The new NodeInfo.Allocatable.DynamicResources field inside NodeInfo may need to be correctly set in cluster autoscaler, based on its own internal cluster state, which means there may be a need to expose a public method to set it.

The special resource claim created in PreFilter has to go through the ResourceClaimTracker from SharedDRAManager so that cluster autoscaler can reflect the claim in-memory. The special claim is currently reserved by calling SignalClaimPendingAllocation() in Reserve phase and persisted to API server in PreBind phase. There might be a need to expand ResourceClaimTracker to integrate with cluster autoscaler.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Start of v1.34 development cycle (v1.33.0):

k8s.io/dynamic-resource-allocation/cel: 88.2%
k8s.io/dynamic-resource-allocation/structured: 90.5%
k8s.io/kubernetes/pkg/controller/resourceclaim: 74.6%
k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 65.4%

Integration tests

The existing integration tests for kube-scheduler which measure performance will be extended to cover the overhead of running the additional logic to support the features in this KEP. These also serve as correctness tests as part of the normal Kubernetes “integration” jobs which cover the dynamic resource controller .

e2e tests

End-to-end testing depends on a working resource driver and a container runtime with CDI support. A test driver was developed as part of the overall DRA development effort. We will add tests to ensure ExtendedResourceNames are handled by the scheduler as described in this KEP.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial e2e tests completed and enabled

Beta

The basic scoring in NodeResourcesFit has to be implemented and that the queueing hints have to work efficiently.
Keep the Alpha behavior to create the special resource claim in scheduler.
Gather feedback from developers and surveys
Scalability tests that mirror real-world usage as determined by user feedback
Additional tests are in Testgrid and linked in KEP
All functionality completed
All security enforcement completed
All testing requirements completed
All known pre-release issues and gaps resolved

GA

Allowing time for feedback
All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

The usual Kubernetes upgrade and downgrade strategy applies for in-tree components. Vendors must take care that upgrades and downgrades work with the drivers that they provide to customers.

Version Skew Strategy

All of the API extensions proposed in this KEP is the optional ExtendedResourceName in DeviceClass, and ExtendedResourceClaimStatus in Pod. There is no risk for version skew downgrades because these DeviceClass and Pod will never have existed in older clusters.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DRAExtendedResource
- Components depending on the feature gate:
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler
  - kubelet

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already deployed and are running will continue to work. They will continue to work when restarted because the CDI devices that have been prepared for them won’t change across the restart.

The DRA driver itself should also be able to survive a rollback, as there is no DRA driver change in this KEP.

What happens if we reenable the feature if it was previously rolled back?

The scheduler may lose track of what devices it has allocated to what pods. Any pods that had previously allocated devices with the feature enabled will need to be deleted to ensure they are freed back to their corresponding driver and the accounting for them is updated in the scheduler.

Are there any tests for feature enablement/disablement?

Unit tests will be written in the scheduler and kubelet to verify that enabling / disabling of the DRAExtendedResource feature gate is non-disruptive to the scheduler and kubelet.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Workloads that do not use the DRA Extended Resource feature should not be impacted, since the functionality is unchanged.

If the feature is being used in pods before support for it has been fully rolled out across the cluster, api server, scheduler in control plane, and kubelet in nodes, it can cause a failure to schedule pods or a failure to run the pods on the nodes. This will not affect already running workloads unless they have to be restarted.

Device plugin drivers can be replaced with DRA drivers for the same devices on a per-node basis, one node at a time.

What specific metrics should inform a rollback?

One indicator are unexpected restarts of the cluster control plane components (kube-scheduler, apiserver) or kubelet.

If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, it can suggest that pods are no longer getting scheduled which might be due to a problem with the DRA scheduler plugin. Another are an increase in the number of pods that fail to start, as indicated by the kubelet_started_containers_errors_total metric.

If the node.status.Capacity for the extended resources for the devices do not decrease to zero, or a pod fails to be scheduled, or run on the node, it may indicate that the device plugin driver on the node for the devices is not properly replaced by the DRA driver.

In all cases further analysis of logs and pod events is needed to determine whether errors are related to this feature.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Upgrade, downgrade, and upgrade->downgrade->upgrade paths would be tested using an automated testing framework.

Testing Framework

A new testing framework has been developed for DRA upgrade/downgrade testing (see kubernetes/kubernetes#135664 and kubernetes/kubernetes#136156 ).

Testing Strategy for Beta

Given the historical difficulty of implementing comprehensive upgrade/downgrade tests and the need for a clear path forward, the testing approach balances required coverage for beta graduation with comprehensive long-term testing goals. The scenarios below are organized accordingly.

DRAExtendedResources Test Scenarios

For the DRAExtendedResources feature, the following test scenarios are planned for implementation in test/e2e_dra/extendedresources_test.go:

REQUIRED FOR BETA:

Note: All test scenarios below must be executed for both explicit and implicit extended resources:

Explicit resources: Use a custom resource name (e.g., example.com/gpu) with DeviceClass.spec.extendedResourceName set
Implicit resources: Use the format deviceclass.resource.kubernetes.io/<device-class-name> with DeviceClass.spec.extendedResourceName unset

Basic Feature Enablement
- Deploy workloads requesting extended resources via pod spec (e.g., example.com/gpu: 1)
- Configure DeviceClass with appropriate extendedResourceName configuration
- Create ResourceSlice advertising devices
- Verify pods are scheduled and devices are allocated correctly
- Validate that the special ResourceClaim is created by the scheduler
- Check pod.status.extendedResourceClaimStatus contains correct mapping
Enablement Testing (Feature Gate: OFF → ON)
- Pre-enablement state: Cluster running with DRAExtendedResource feature disabled
  - Workloads using device plugin extended resources function normally
- Enablement: Enable DRAExtendedResource feature gate on API server, kube controller manager, scheduler, and the kubelet
- Post-enablement validation:
  - Existing pods with device plugin resources continue to run without disruption
  - New pods can request extended resources backed by DRA
  - DeviceClass configurations are processed correctly
  - Scheduler creates special ResourceClaims for new pods
  - Resource quota accounting works for both device plugin and DRA-backed resources
- Workload transition:
  - Create DeviceClass mapping to DRA devices on specific nodes
  - Deploy new pods requesting the extended resource name
  - Verify pods are scheduled on the appropriate nodes
  - Verify device allocation through DRA driver
Upgrade and Downgrade Testing (version n-1 → n → n-1)
- Initial state (version n-1): Cluster running version n-1 with DRAExtendedResource feature enabled
  - Install test DRA driver with devices on cluster nodes
  - Create DeviceClass with appropriate extendedResourceName configuration
  - Deploy workloads requesting extended resources backed by DRA
  - Verify special ResourceClaims are created by scheduler
  - Step 1 validation: Run test workloads and verify device allocation works correctly
- Upgrade (n-1 → n): Upgrade cluster to version n while keeping DRAExtendedResource feature gate enabled
- Post-upgrade validation (version n):
  - Existing pods with DRA-backed extended resources continue to run without disruption
  - Special ResourceClaims created in n-1 remain valid and functional
  - Step 2 validation: Deploy new pods requesting extended resources backed by DRA
  - Scheduler in version n correctly creates special ResourceClaims
  - Resource quota accounting continues to work correctly
  - No API compatibility issues with DeviceClass or Pod.status.extendedResourceClaimStatus
  - Verify backward compatibility of special ResourceClaim format
- Downgrade (n → n-1): Downgrade cluster back to version n-1 while keeping feature gate enabled
- Post-downgrade validation (version n-1):
  - Pods and ResourceClaims created in version n continue to function correctly
  - Step 3 validation: Verify scheduler in downgraded version can still handle existing allocations
  - New pods can be scheduled with extended resources backed by DRA
  - Device allocation and cleanup continue to work properly
API Object Persistence
- Verify DeviceClass.spec.extendedResourceName field persists across upgrades
- Verify pod.status.extendedResourceClaimStatus persists correctly
- Roundtripping of API types is covered by unit tests

COMPREHENSIVE COVERAGE (Expanded testing for GA and long-term robustness):

The following scenarios provide thorough coverage of edge cases and complex state transitions. While valuable for ensuring long-term feature robustness, they are not hard requirements for beta graduation. These tests will be implemented based on available resources and prioritized for GA.

Disablement Testing (Feature Gate: ON → OFF)
- Pre-disablement state: Cluster with DRAExtendedResource enabled and active workloads
  - Some pods using DRA-backed extended resources
  - Special ResourceClaims exist
- Disablement: Disable DRAExtendedResource feature gate
- Post-disablement validation:
  - Existing pods with DRA-backed resources continue running (CDI devices remain prepared)
  - New pods with extended resource requests can only use device plugin resources
  - Scheduler no longer creates special ResourceClaims
  - API server accepts but ignores extendedResourceName in DeviceClass
  - Existing special ResourceClaims are not deleted (handled by garbage collector)
Enablement→Disablement→Enablement Path
- Initial enablement: OFF → ON (as described in scenario 2)
- Disablement: ON → OFF
  - Delete all pods using DRA-backed extended resources
  - Verify special ResourceClaims are cleaned up by garbage collector
  - Verify resource accounting is updated correctly
- Re-enablement: OFF → ON
  - Scheduler can recreate special ResourceClaims for new workloads
  - No stale state from previous enable/disable cycles
  - Resource allocation works correctly
Mixed Node Scenarios
- Cluster with some nodes using device plugin and others using DRA for the same resource name
- Verify pods can be scheduled to either type of node appropriately
- Ensure one node cannot have both device plugin and DRA for same resource simultaneously
- Validate node transition: remove device plugin, add DRA driver on same node

Test Implementation Location

Tests can be implemented in:

test/e2e_dra/extendedresources_test.go - End-to-end upgrade/rollback tests
Unit tests in scheduler and kubelet packages - Component-level validation

The testing approach ensures core functionality is validated for beta graduation through the required scenarios (1-4), while comprehensive coverage scenarios (5-7) provide thorough edge case testing that will be prioritized for GA based on available resources and implementation feasibility.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

kube_pod_resource_limit and kube_pod_resource_request (label: namespace, pod, node, scheduler, priority, resource, unit) can be used to determine if the feature is in use by workloads though it doesn’t differentiate between extended resources backed by DRA or device plugin.

We will add a new source label to resourceclaim_controller_resource_claims (label: admin_access, allocated), which can determine if the resource claim is created by extended resource or resource claim template. It should be a good metric to determine if the resource claim is created by extended resource backed by DRA.

How can someone using this feature know that it is working for their instance?

API .status
- Other field: Pod’s .status.extendedResourceClaimStatus will have a list of resource claims that are created for DRA extended resources.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Existing DRA and kube-scheduler SLOs continue to apply and must be maintained. Pod scheduling duration with this feature should be as fast as existing DRA. Since this feature implicitly affects the filtering phase of the NodeResourcesFit plugin, the performance should be similar with no visible degradation compared to the baseline scheduling performance.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics Values of each label are not exhaustive; we are providing some example values that are related to this feature’s SLI. Existing metrics:
- Metric name: workqueue
  - Type: Gauge/Counter (multiple workqueue metrics)
  - Labels: name (“resource_claim”)
  - SLI Usage: Monitor workqueue depth and duration to detect resource claim processing bottlenecks. High depth or duration values indicate potential performance issues in resource claim handling that could affect pod scheduling times.
- Metric name: scheduler_pending_pods
  - Type: Gauge
  - Labels: queue (“active”, “backoff”, “unschedulable”, “gated”)
  - SLI Usage: Track increases in ‘unschedulable’ queues to identify when extended resource availability is preventing pod scheduling. Sustained high values may indicate resource constraint issues or misconfigurations.
- Metric name: scheduler_plugin_execution_duration_seconds
  - Type: Histogram
  - Labels: plugin (“NodeResourcesFit”, “DynamicResources”), extension_point, status
  - SLI Usage: Monitor latencies for NodeResourcesFit and DynamicResources plugins to ensure the extended resource integration doesn’t introduce performance regressions.
  - We need to monitor NodeResourcesFit because this feature implicitly affects its filtering phase.
- Metric name: scheduler_pod_scheduling_sli_duration_seconds
  - Type: Histogram
  - Labels: attempts
  - SLI Usage: Track end-to-end scheduling performance for pods using extended resources.

Updating metrics:

Metric name: resourceclaim_controller_resource_claims
- Type: Gauge
- Labels: admin_access, allocated, source (“extended-resource”, “resource-claim-template”)
- SLI Usage: Monitor the ratio of allocated vs. total resource claims filtered by source="extended-resource" to track resource utilization. A low ratio of allocated claims may indicate DRA driver or resource claim controller issues.
- The source label is newly added. It can be determined based on the resource.kubernetes.io/extended-resource-claim annotation of resource claims.

New metrics:

Metric name: scheduler_resourceclaim_creates_total
- Type: Counter
- Labels: status (“failure”, “success”)
- SLI Usage: Calculate success rate to monitor the reliability of automatic resource claim creation. High failure rates indicate potential issues with extended resource configuration.
- Because the resource claim is created in the scheduler PreBind phase by making k8s API call, we need a different metric from resourceclaim_controller_creates_total.
- The metric is incremented accordingly based on the API call outcome, either success or failure.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

The container runtime must support CDI.

A third-party DRA driver is required for publishing resource information and preparing resources on a node.

These are not new requirements from this feature, rather, they are required by DRA structured parameters.

Scalability

Will enabling / using this feature result in any new API calls?

Yes. scheduler make new API calls to create, update, and delete the special resource claim for extended resource backed by DRA.

Will enabling / using this feature result in introducing new API types?

No. The this KEP proposes extensions to an existing type, but not a new type itself.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes. With the extensions proposed in this KEP, individual DeviceClass and Pod have additional fields, thus increasing their overall signature. In addition, there is the special resource claim for extended resource by DRA, there is at most one such claim per pod.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes. The time to allocate a device to a pod with extended resource request will be affected.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

The troubleshooting section in https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4381-dra-structured-parameters#troubleshooting still applies.

How does this feature react if the API server and/or etcd is unavailable?

The Kubernetes control plane will be down, so no new Pods get scheduled. kubelet may still be able to start or restart containers if it already received all the relevant updates (Pod, ResourceClaim, etc.).

What are other known failure modes?

[Pod pending due to extended resource backed by DRA requests no less than 128 devices]
- Detection: inspect pod status ‘Pending’
- Mitigations: reduce the number of devices requested in one extended resource backed by DRA requests
- Diagnostics: scheduler logs at level 5 show the reason for the scheduling failure.
- Testing: this is known, determinstic failure mode due to defined system limit, i.e., DRA requests must be no more than 128 devices.
[API server priority & fairness limits extended resource claim creation requests]
- Detection: inspect metric scheduler_resourceclaim_creates_total, and API server priority & fairness limits
- Mitigations: adjust API sever priority and fairness limits if too low, to allow extended resource claim creation
- Diagnostics: API server and scheduler logs level 5 show the reason for the extended resource claim creation failure.
- Testing: creating pods with DRA extended resource requests at high rate, and at the same time, API server priority and fairness limit too low, could trigger extended resource claim creation failure at scheduler.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Kubernetes 1.34: KEP accepted.
Kubernetes 1.35: Feature in alpha.
Kubernetes 1.36: Promotion to beta.
Kubernetes 1.37: Promotion to ga.

Drawbacks

It adds complexity to the scheduler.

Alternatives

Many different approaches were considered.

Specifically, the following two alternative proposals were considered:

Option 1: webhook rewrite extended resource requests in pod spec

This approach requires cluster administrator deploy a mutation webhook to the cluster, and configure the webhook with rewrite rules that can rewrite the extended resource requests and node selectors. This approach is not taken due to the webhook’s extra configuration, and maintenance overhead.

Option 2: client CLI tool to rewrite extended resource requests in pod spec

This approach requires application developers, operators to run the client CLI tool to rewrite the application YAML with extended resources to DRA resource claims. This adds extra overhead to the application deployment flow.