KEP-5729: DRA: ResourceClaim Support for Workloads

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Increase the size limit on the status.reservedFor field
- Allow ResourceClaims to be reserved for any object

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This enhancement describes additions to the Workload API and PodGroup API which make it possible to associate ResourceClaims and ResourceClaimTemplates with those objects to better facilitate sharing DRA resources between the Pods they contain.

A ResourceClaim referenced by a PodGroup will be reserved for that PodGroup as a whole instead of its individual Pods, addressing the limit on the number of entries in a ResourceClaim’s status.reservedFor list.

A ResourceClaimTemplate referenced by a PodGroup will cause a ResourceClaim to be generated once for that PodGroup, like how ResourceClaimTemplates work today when referenced by a Pod. Whereas ResourceClaims today can only be shared between Pods by name, this will allow ResourceClaims to be shared by Pods in the same PodGroup where the exact name of the ResourceClaim is not known ahead of time.

Motivation

AI/ML workloads are particularly sensitive to network latency and bandwidth between closely related Pods. Certain groups of Pods must be placed as closely together as possible to achieve maximum performance.

“Modeling Topology and Multi-Node Logical Devices” describes where existing mechanisms like Pod and Node affinity break down for these use cases and how Dynamic Resource Allocation (DRA) can fulfill those requirements by scheduling Pods within strict topological boundaries:

A ResourceSlice lists “devices” which represent nested topological units and form a tree of arbitrary depth. These units could be as small as a single host or as large as an entire datacenter, or perhaps even larger.
A ResourceClaim requests one of these topological units. Pods which reference that same ResourceClaim are scheduled within the same topological boundary.
Additionally, the allocation of the ResourceClaim may trigger a controller to reprogram the datacenter fabric to match the selected topological unit.

Large-scale workloads orchestrated by specialized APIs like JobSet and LeaderWorkerSet cannot practically express granular topological constraints with the current Kubernetes APIs. Today, ResourceClaims to be shared by multiple Pods must be created one by one and referenced by name in the Pod spec. Those APIs which define replicable groups of Pods are left to manage shared ResourceClaims themselves.

Moreover, the current limit of 256 entries in a ResourceClaim’s status.reservedFor list limits a device to being shared by up to that number of Pods. Production-scale workloads require larger numbers of Pods to share a single claim.

The Workload API defines a common representation of these related sets of Pods as a PodGroup. Associating ResourceClaimTemplates with PodGroups allows Kubernetes to manage the lifecycle of the generated ResourceClaims generically for all types implementing the Workload API.

Goals

Allow users to express sets of DRA resources to be replicated for each PodGroup, and shared by each Pod in the PodGroup.
Automatically create and delete PodGroups’ ResourceClaims as needed.
Reduce the burden of each true workload controller implementing ResourceClaim generation separately (e.g. JobSet, LWS).
Allow claims to be allocated for more than 256 Pods.

Non-Goals

Associate ResourceClaims or ResourceClaimTemplates with Workload objects (future work).
Influence how Pods are placed onto Nodes based on the ResourceClaimTemplates and ResourceClaims associated with a PodGroup or Pod (see KEP-5732).

Proposal

User Stories

As a workload author administering large deployments, I want to be able to share a single ResourceClaim among more than 256 Pods. That opens up the possibility for DRA to orchestrate scheduling large groups of Pods that all share a large device, such as a virtual device representing a topological domain.

Shareable and replicable ResourceClaims

As a workload author administering a deployment composed of multiple groups of Pods, I want to be able to express DRA resources which are replicated once for each group and can be shared by all of the Pods within a particular group.

Currently, ResourceClaims generated for an individual Pod from a ResourceClaimTemplate cannot be declaratively shared among other Pods, and standalone ResourceClaims would need to be managed separately from the rest of the workload.

Integrating DRA with high-level APIs

As a maintainer of a high-level workload API like LWS or JobSet, I want to manage the lifecycle of ResourceClaims associated with the groups of Pods defined by my API.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Higher memory usage by the device_taint_eviction controller

The device_taint_eviction controller will need to keep an index of which Pods are referenced from each ResourceClaim, so it can evict the correct Pods when devices are tainted. This will require some additional memory.

Lifting the limit of 256 Pods per ResourceClaim does not mean that the number of Pods that can share a ResourceClaim will be unlimited. New scale tests will determine how many Pods can practically share a single ResourceClaim.

Design Details

Background

The status.reservedFor field of ResourceClaims is currently used for two purposes:

Deallocation

Devices are allocated to a ResourceClaim when the first Pod referencing the claim is scheduled. Other Pods can also share the ResourceClaim, in which case they share the devices. Once no Pods are consuming the claim, the devices should be deallocated so they can be allocated to other claims. The status.reservedFor list is used to keep track of Pods consuming a ResourceClaim. Pods are added to the list by the DRA scheduler plugin during scheduling and removed from the list by the ResourceClaim controller when Pods are deleted or finish running. An empty list means there are no current consumers of the claim and it can be deallocated.

Finding Pods Using a ResourceClaim

status.reservedFor is read by the DRA scheduler plugin, the kubelet, and the device_taint_eviction controller to find Pods that are using a ResourceClaim:

The kubelet uses this to make sure it only runs Pods where the claims have been allocated to the Pod. It can verify this by checking that the Pod is listed in the status.reservedFor list.
The DRA scheduler plugin uses the list to find claims that are used by zero or only a single Pod and are therefore candidates for deallocation in the PostFilter function.
The device_taint_eviction controller uses the status.reservedFor list to find the Pods that need to be evicted when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim does not have a toleration).

So the solution needs to:

Give the ResourceClaim controller a way to know when there are no more consumers of a ResourceClaim so it can be deallocated.
Give controllers a way to list the Pods consuming or referencing a ResourceClaim.

API

The following API changes will be made:

The Workload and PodGroup APIs will be updated to include references to ResourceClaims and ResourceClaimTemplates, like Pods.
The Pod API will include new semantics for the existing API fields to refer to claims made by its PodGroup.

Workload

The Workload API changes are modeled after the existing Pod API to reference ResourceClaims, adding a spec.podGroupTemplates[].resourceClaims field:

type PodGroupTemplate struct {
	...

	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods consume the devices allocated to a PodGroup's claim by
	// defining a claim in their own Spec.ResourceClaims that matches the
	// PodGroup's claim exactly. The claim must have the same name and refer to
	// the same ResourceClaim or ResourceClaimTemplate.
	//
	// This is a beta-level field and requires that the
	// DRAWorkloadResourceClaims feature gate is enabled.
	//
	// This field is immutable.
	//
	// +optional
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +k8s:optional
	// +k8s:listType=map
	// +k8s:listMapKey=name
	// +k8s:maxItems=4
	// +featureGate=DRAWorkloadResourceClaims
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name"`
}

// PodGroupResourceClaim references exactly one ResourceClaim, either directly
// or by naming a ResourceClaimTemplate which is then turned into a ResourceClaim
// for the PodGroup.
//
// It adds a name to it that uniquely identifies the ResourceClaim inside the PodGroup.
// Pods that need access to the ResourceClaim define a matching reference in their
// own Spec.ResourceClaims. The Pod's claim must match all fields of the
// PodGroup's claim exactly.
type PodGroupResourceClaim struct {
	// Name uniquely identifies this resource claim inside the PodGroup.
	// This must be a DNS_LABEL.
	//
	// +required
	// +k8s:required
	// +k8s:format=k8s-short-name
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`

	// ResourceClaimName is the name of a ResourceClaim object in the same
	// namespace as this PodGroup. The ResourceClaim will be reserved for the
	// PodGroup instead of its individual pods.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	//
	// +optional
	// +k8s:optional
	// +k8s:unionMember
	// +k8s:format=k8s-long-name
	ResourceClaimName *string `json:"resourceClaimName,omitempty"`

	// ResourceClaimTemplateName is the name of a ResourceClaimTemplate
	// object in the same namespace as this PodGroup.
	//
	// The template will be used to create a new ResourceClaim, which will
	// be bound to this PodGroup. When this PodGroup is deleted, the ResourceClaim
	// will also be deleted. The PodGroup name and resource name, along with a
	// generated component, will be used to form a unique name for the
	// ResourceClaim, which will be recorded in podgroup.status.resourceClaimStatuses.
	//
	// This field is immutable and no changes will be made to the
	// corresponding ResourceClaim by the control plane after creating the
	// ResourceClaim.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	//
	// +optional
	// +k8s:optional
	// +k8s:unionMember
	// +k8s:format=k8s-long-name
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}

PodGroup

The PodGroup API will be updated similarly to contain the ResourceClaim references from its template defined in the Workload:

type PodGroupSpec struct {
	...

	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods consume the devices allocated to a PodGroup's claim by
	// defining a claim in their own Spec.ResourceClaims that matches the
	// PodGroup's claim exactly. The claim must have the same name and refer to
	// the same ResourceClaim or ResourceClaimTemplate.
	//
	// This is a beta-level field and requires that the
	// DRAWorkloadResourceClaims feature gate is enabled.
	//
	// This field is immutable.
	//
	// +optional
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +k8s:optional
	// +k8s:listType=map
	// +k8s:listMapKey=name
	// +k8s:maxItems=4
	// +k8s:immutable
	// +featureGate=DRAWorkloadResourceClaims
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name"`
}

Similar to Pods, PodGroups will include a new status.resourceClaimStatuses field to resolve ResourceClaimTemplate references in spec.resourceClaims to the exact ResourceClaim generated for the PodGroup:

// PodGroupStatus represents information about the status of a pod group.
type PodGroupStatus struct {
	...

	// Status of resource claims.
	// +optional
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +k8s:optional
	// +k8s:listType=map
	// +k8s:listMapKey=name
	// +k8s:maxItems=4
	// +featureGate=DRAWorkloadResourceClaims
	ResourceClaimStatuses []PodGroupResourceClaimStatus `json:"resourceClaimStatuses,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name"`
}

// PodGroupResourceClaimStatus is stored in the PodGroupStatus for each
// PodGroupResourceClaim which references a ResourceClaimTemplate. It stores the
// generated name for the corresponding ResourceClaim.
type PodGroupResourceClaimStatus struct {
	// Name uniquely identifies this resource claim inside the PodGroup. This
	// must match the name of an entry in podgroup.spec.resourceClaims, which
	// implies that the string must be a DNS_LABEL.
	//
	// +required
	Name string `json:"name" protobuf:"bytes,1,name=name"`

	// ResourceClaimName is the name of the ResourceClaim that was generated for
	// the PodGroup in the namespace of the PodGroup. If this is unset, then
	// generating a ResourceClaim was not necessary. The
	// podgroup.spec.resourceClaims entry can be ignored in this case.
	//
	// +optional
	// +k8s:optional
	// +k8s:format=k8s-long-name
	ResourceClaimName *string `json:"resourceClaimName,omitempty"`
}
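As a rough illustration of how a consumer might resolve a PodGroup claim to a concrete ResourceClaim name using status.resourceClaimStatuses, consider the sketch below. The groupClaim and groupClaimStatus types are simplified, hypothetical stand-ins for PodGroupResourceClaim and PodGroupResourceClaimStatus, and resolveClaimName and errNotGenerated are illustrative names, not part of the proposed API.

```go
package main

import (
	"errors"
	"fmt"
)

// groupClaim and groupClaimStatus are simplified, hypothetical stand-ins for
// PodGroupResourceClaim and PodGroupResourceClaimStatus.
type groupClaim struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

type groupClaimStatus struct {
	Name              string
	ResourceClaimName *string
}

// errNotGenerated signals that no status entry exists yet for a template claim.
var errNotGenerated = errors.New("ResourceClaim not generated yet")

// resolveClaimName returns the name of the ResourceClaim behind a PodGroup
// claim. An empty name with a nil error means no ResourceClaim was needed.
func resolveClaimName(c groupClaim, statuses []groupClaimStatus) (string, error) {
	if c.ResourceClaimName != nil {
		return *c.ResourceClaimName, nil // direct reference, nothing to look up
	}
	for _, s := range statuses {
		if s.Name != c.Name {
			continue
		}
		if s.ResourceClaimName == nil {
			return "", nil // status entry says generation was not necessary
		}
		return *s.ResourceClaimName, nil
	}
	return "", errNotGenerated
}

func main() {
	tmpl := "pg-claim-template"
	generated := "my-podgroup-1-pg-claim-x7k2p" // hypothetical generated name
	c := groupClaim{Name: "pg-claim", ResourceClaimTemplateName: &tmpl}

	// Before the controller has recorded a status entry.
	_, err := resolveClaimName(c, nil)
	fmt.Println("before:", err)

	// After the controller has recorded the generated name.
	name, _ := resolveClaimName(c, []groupClaimStatus{{Name: "pg-claim", ResourceClaimName: &generated}})
	fmt.Println("after:", name)
}
```

The three outcomes (direct name, generated name, not yet generated) mirror how Pod-level status.resourceClaimStatuses is described in this KEP. The error handling is only one possible design.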

Pod

The existing DRA API fields for Pods remain unchanged. To request a ResourceClaim reserved for its PodGroup, a Pod specifies a claim in spec.resourceClaims that exactly matches a claim made in its PodGroup’s spec.resourceClaims. The name, resourceClaimName, and resourceClaimTemplateName fields must match between the Pod’s and the PodGroup’s claim. If a claim made by a Pod does not exactly match one made by its PodGroup, then the ResourceClaim is reserved (and for ResourceClaimTemplates, generated) for the Pod.

The status.resourceClaimStatuses field continues to include the names of ResourceClaims generated from ResourceClaimTemplates referenced in spec.resourceClaims whether they are generated for the Pod or the PodGroup. Status for ResourceClaims generated for the PodGroup is also recorded in the PodGroup’s status.resourceClaimStatuses.

The following example demonstrates the matching semantics:

apiVersion: scheduling.k8s.io/v1beta1
kind: PodGroup
metadata:
  name: podgroup
spec:
  resourceClaims:
  - name: pg-claim
    resourceClaimName: my-claim
  - name: pg-claim-template
    resourceClaimTemplateName: my-claim-template
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
spec:
  resourceClaims:
  # Matches the PodGroup's claim, reserved for the PodGroup:
  - name: pg-claim
    resourceClaimName: my-claim

  # Matches the PodGroup's claim, reserved for the PodGroup, ResourceClaim is
  # generated for the PodGroup and shared by its Pods:
  - name: pg-claim-template
    resourceClaimTemplateName: my-claim-template

  # Does not match any PodGroup claim, reserved for the Pod:
  - name: pod-claim
    resourceClaimName: my-claim

  # Does not match any PodGroup claim, generated and reserved for the Pod:
  - name: pod-claim-template
    resourceClaimTemplateName: my-other-claim-template
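
The exact-match rule illustrated above could be expressed roughly as follows. This is a minimal sketch: claimRef is a hypothetical type standing in for the shared shape of the Pod and PodGroup claim entries, and matchesPodGroupClaim is an illustrative helper, not a function from the Kubernetes code base.

```go
package main

import "fmt"

// claimRef is a simplified, hypothetical stand-in for the common fields of a
// Pod's and a PodGroup's spec.resourceClaims entries.
type claimRef struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

// equalPtr reports whether two optional strings are both unset or both set
// to the same value.
func equalPtr(a, b *string) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// matchesPodGroupClaim reports whether a Pod's claim exactly matches one of
// its PodGroup's claims: same name and same ResourceClaim or template.
func matchesPodGroupClaim(podClaim claimRef, groupClaims []claimRef) bool {
	for _, gc := range groupClaims {
		if gc.Name == podClaim.Name &&
			equalPtr(gc.ResourceClaimName, podClaim.ResourceClaimName) &&
			equalPtr(gc.ResourceClaimTemplateName, podClaim.ResourceClaimTemplateName) {
			return true
		}
	}
	return false
}

func strPtr(s string) *string { return &s }

func main() {
	group := []claimRef{
		{Name: "pg-claim", ResourceClaimName: strPtr("my-claim")},
		{Name: "pg-claim-template", ResourceClaimTemplateName: strPtr("my-claim-template")},
	}
	pod := []claimRef{
		{Name: "pg-claim", ResourceClaimName: strPtr("my-claim")},
		{Name: "pg-claim-template", ResourceClaimTemplateName: strPtr("my-claim-template")},
		{Name: "pod-claim", ResourceClaimName: strPtr("my-claim")},
	}
	for _, c := range pod {
		fmt.Printf("%s reserved for PodGroup: %v\n", c.Name, matchesPodGroupClaim(c, group))
	}
}
```

A claim that fails the match is simply treated as a Pod-level claim, so the helper returns a boolean rather than an error.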

Example

The following example demonstrates the relationships between the new fields. It describes the more common case where some higher-level true workload controller (e.g. LWS, JobSet) orchestrates the Workload and PodGroup objects, rather than the user managing those directly.

Here, a user defines a high-level workload with two logical groups of Pods. Each of the two groups also requests one device to be shared by the Pods in its group.

The user creates the following objects to request DRA devices which will be referenced by Pods through their PodGroup:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: pg-claim-template
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: my-device
        exactly:
          deviceClassName: example
---
apiVersion: example.com/v1
kind: MyWorkload
metadata:
  name: my-workload
  namespace: default
spec:
  ...

The true workload API defines how ResourceClaims and ResourceClaimTemplates relate to groups of Pods. If the user is responsible for defining the Pods' spec.resourceClaims in a Pod template, then the PodGroups' spec.resourceClaims must be deterministic for the user to be able to define matching claims in the Pod spec.

The true workload controller then creates the following Workload API resources based on the true workload’s definition:

apiVersion: scheduling.k8s.io/v1beta1
kind: Workload
metadata:
  name: my-workload
  namespace: default
spec:
  podGroupTemplates:
  - name: group
    schedulingPolicy:
      basic: {}
    resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PodGroup
metadata:
  name: my-podgroup-1
  namespace: default
spec:
  workloadRef:
    workloadName: my-workload
    templateName: group
  schedulingPolicy:
    basic: {}
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1beta1
kind: PodGroup
metadata:
  name: my-podgroup-2
  namespace: default
spec:
  workloadRef:
    workloadName: my-workload
    templateName: group
  schedulingPolicy:
    basic: {}
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-1
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-1
  template:
    metadata:
      labels:
        app: wl-claim-example-1
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: pg-claim
      resourceClaims:
      - name: pg-claim
        resourceClaimTemplateName: pg-claim-template
      schedulingGroup:
        podGroupName: my-podgroup-1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-2
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-2
  template:
    metadata:
      labels:
        app: wl-claim-example-2
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: pg-claim
      resourceClaims:
      - name: pg-claim
        resourceClaimTemplateName: pg-claim-template
      schedulingGroup:
        podGroupName: my-podgroup-2

Here, a Workload organizes Pods managed by two different Deployments into two different PodGroups. Each group refers to the same ResourceClaimTemplate, pg-claim-template. This single ResourceClaimTemplate forms the basis of two different ResourceClaims which will be created by the ResourceClaim controller: one for each PodGroup. The Pod templates in the Deployments include a claim matching their PodGroup’s claim, which ultimately resolves to the ResourceClaim generated for the PodGroup. The result is that with a single ResourceClaimTemplate, Pods in the same group all share the exact same allocated device, while Pods in the other group use an equivalent, but separately allocated, device.

ResourceClaim Lifecycle

The DynamicResources scheduler plugin and the ResourceClaim controller will cooperate to manage key points in the life of a ResourceClaim or ResourceClaimTemplate claimed by a PodGroup. Referenced ResourceClaimTemplates will replicate into one ResourceClaim per PodGroup. Those generated ResourceClaims and ResourceClaims referenced by name by a PodGroup will be allocated and deallocated by the Kubernetes control plane.

Create

When a PodGroup is created which references a ResourceClaimTemplate, the ResourceClaim controller will create a ResourceClaim from that template if one does not already exist for that PodGroup. Generated ResourceClaims will be owned (through metadata.ownerReferences) by the PodGroup and annotated with resource.kubernetes.io/pod-claim-name where the value is the name of the claim from the PodGroup’s spec.resourceClaims[].name to facilitate mapping a single PodGroup claim to the ResourceClaim generated for its PodGroup. When a Pod is created which requests a claim from its PodGroup, the name of the ResourceClaim generated for the PodGroup’s claim will be recorded in the Pod’s status.resourceClaimStatuses like ResourceClaims generated for Pods.
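
The ownership and annotation described above might look roughly like the following sketch. The ownerReference and claimMeta types are hypothetical, trimmed-down stand-ins for the real metadata types, and the GenerateName format is an assumption based on the field documentation (PodGroup name, claim name, and a generated component), not a confirmed implementation detail.

```go
package main

import "fmt"

// ownerReference and claimMeta are hypothetical, trimmed-down stand-ins for
// metav1.OwnerReference and metav1.ObjectMeta.
type ownerReference struct {
	APIVersion, Kind, Name, UID string
	Controller                  bool
}

type claimMeta struct {
	GenerateName    string
	Namespace       string
	Annotations     map[string]string
	OwnerReferences []ownerReference
}

const podClaimNameAnnotation = "resource.kubernetes.io/pod-claim-name"

// generatedClaimMeta builds metadata for a ResourceClaim generated for a
// PodGroup claim. The GenerateName prefix format is an assumption.
func generatedClaimMeta(namespace, groupName, groupUID, claimName string) claimMeta {
	return claimMeta{
		GenerateName: groupName + "-" + claimName + "-",
		Namespace:    namespace,
		Annotations:  map[string]string{podClaimNameAnnotation: claimName},
		OwnerReferences: []ownerReference{{
			APIVersion: "scheduling.k8s.io/v1beta1",
			Kind:       "PodGroup",
			Name:       groupName,
			UID:        groupUID,
			Controller: true,
		}},
	}
}

// isClaimFor reports whether existing metadata belongs to the given PodGroup
// claim, which lets a controller avoid generating a second ResourceClaim.
func isClaimFor(m claimMeta, groupUID, claimName string) bool {
	if m.Annotations[podClaimNameAnnotation] != claimName {
		return false
	}
	for _, o := range m.OwnerReferences {
		if o.Kind == "PodGroup" && o.UID == groupUID {
			return true
		}
	}
	return false
}

func main() {
	m := generatedClaimMeta("default", "my-podgroup-1", "uid-1", "pg-claim")
	fmt.Printf("%+v\n", m)
	fmt.Println(isClaimFor(m, "uid-1", "pg-claim"))
}
```

Matching on the owner UID rather than the owner name means a recreated PodGroup with the same name would not adopt a stale ResourceClaim.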

Delete

The resource.kubernetes.io/delete-protection finalizer added to a generated ResourceClaim by kube-scheduler serves the same purpose as for other ResourceClaims, preventing the ResourceClaim from being deleted until it is deallocated. Like other generated ResourceClaims, the ResourceClaim controller will unlock deletion of PodGroup-owned claims by removing the finalizer when they become deallocated. The garbage collector will then be responsible for deleting the ResourceClaim once its owning PodGroup is deleted.

Allocate

Generated and standalone ResourceClaims referenced by a PodGroup remain unallocated until kube-scheduler allocates the ResourceClaim by setting status.allocation for the first Pod in the PodGroup that references the PodGroup’s claim. When a Pod’s claim matches a claim made by its PodGroup, the ResourceClaim’s status.reservedFor list will reference the PodGroup instead of each individual Pod.

The names of all ResourceClaims associated with a Pod continue to be represented in the Pod, no matter if those ResourceClaims are reserved for the Pod or its PodGroup. The names of ResourceClaims referenced via resourceClaimName continue to match that value, and the names of ResourceClaims generated from ResourceClaimTemplates continue to be recorded in the Pod’s status.resourceClaimStatuses. The kubelet does not need to look up a Pod’s PodGroup to find all of the Pod’s ResourceClaims.

Deallocate

The ResourceClaim controller will continue to deallocate claims when there are no entries in the ResourceClaim’s status.reservedFor. References to PodGroups in status.reservedFor are removed after the PodGroup is deleted. PodGroup deletion is gated by a finalizer managed by kube-controller-manager to prevent the PodGroup from being removed from status.reservedFor before all of its Pods are done using the ResourceClaim. When no more Pods in the group are expected to run, the creator of the PodGroup is responsible for deleting it to free up the devices allocated by its ResourceClaims.

Claims reserved for a PodGroup can also be deallocated by the scheduler in the DynamicResources plugin’s Unreserve phase when scheduling of a PodGroup failed and they are not reserved for any other resources (like other PodGroups). In that phase, the DynamicResources plugin uses the scheduler’s internal view of PodGroups to determine if any of the group’s Pods are scheduled. When no Pods in the group are scheduled, the scheduler removes the PodGroup from the claim’s status.reservedFor.

The DynamicResources plugin will also implement the PostFilter phase of the PodGroup scheduling cycle which will perform a similar function to the Pod-level implementation. When a PodGroup fails to schedule, PostFilter will deallocate and unreserve the PodGroup’s ResourceClaims which are reserved only for that PodGroup or for nobody.
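
The "reserved only for that PodGroup or for nobody" condition can be sketched as a small predicate. The consumerRef type below is a hypothetical simplification of an entry in status.reservedFor, and reservedOnlyFor is an illustrative name rather than a function in the scheduler plugin.

```go
package main

import "fmt"

// consumerRef is a hypothetical simplification of one status.reservedFor entry.
type consumerRef struct {
	APIGroup, Resource, Name, UID string
}

// reservedOnlyFor reports whether the claim is reserved for nobody, or only
// for the given PodGroup. Only then may PostFilter deallocate the claim
// without affecting other consumers.
func reservedOnlyFor(reservedFor []consumerRef, group consumerRef) bool {
	for _, ref := range reservedFor {
		if ref != group {
			return false
		}
	}
	return true
}

func main() {
	// The APIGroup and Resource values are assumptions for illustration.
	g1 := consumerRef{APIGroup: "scheduling.k8s.io", Resource: "podgroups", Name: "my-podgroup-1", UID: "uid-1"}
	g2 := consumerRef{APIGroup: "scheduling.k8s.io", Resource: "podgroups", Name: "my-podgroup-2", UID: "uid-2"}

	fmt.Println(reservedOnlyFor(nil, g1))
	fmt.Println(reservedOnlyFor([]consumerRef{g1}, g1))
	fmt.Println(reservedOnlyFor([]consumerRef{g1, g2}, g1))
}
```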

Determining Allowed Pods for a ResourceClaim

Currently, any Pod allowed to utilize a ResourceClaim is listed explicitly in the claim’s status.reservedFor. When the list instead references a PodGroup, only the name in the reference must match a Pod’s spec.schedulingGroup.podGroupName. Since a finalizer will protect a PodGroup from being deleted before any of its Pods, a reference to the name of a PodGroup in a Pod will always refer to the exact same PodGroup. That is, the PodGroup cannot be deleted and recreated with the same name unless all of its Pods have also been deleted or have terminated in the meantime, or its finalizer has been manually removed.
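
A check along these lines, as the kubelet or another component might perform it, is sketched below. The consumerRef and podInfo types are hypothetical simplifications, and the "pods" and "podgroups" resource strings used for PodGroup references are assumptions for illustration.

```go
package main

import "fmt"

// consumerRef is a hypothetical simplification of one status.reservedFor entry.
type consumerRef struct {
	Resource string // assumed to be "pods" or "podgroups"
	Name     string
	UID      string
}

// podInfo is a hypothetical, trimmed-down view of a Pod.
type podInfo struct {
	Name         string
	UID          string
	PodGroupName string // from spec.schedulingGroup.podGroupName; empty if none
}

// podMayUseClaim reports whether the Pod is covered by the claim's
// reservedFor list, either directly by UID or through its PodGroup's name.
func podMayUseClaim(p podInfo, reservedFor []consumerRef) bool {
	for _, ref := range reservedFor {
		switch ref.Resource {
		case "pods":
			if ref.UID == p.UID {
				return true
			}
		case "podgroups":
			// Only the name is compared, per the matching rule described above.
			if p.PodGroupName != "" && ref.Name == p.PodGroupName {
				return true
			}
		}
	}
	return false
}

func main() {
	reservedFor := []consumerRef{{Resource: "podgroups", Name: "my-podgroup-1", UID: "uid-g1"}}
	member := podInfo{Name: "pod-a", UID: "uid-a", PodGroupName: "my-podgroup-1"}
	outsider := podInfo{Name: "pod-b", UID: "uid-b"}

	fmt.Println(podMayUseClaim(member, reservedFor))
	fmt.Println(podMayUseClaim(outsider, reservedFor))
}
```

Comparing only the PodGroup name is safe here because the finalizer keeps a PodGroup from being replaced while its Pods exist.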

Finding Pods Using a ResourceClaim

If the reference in the status.reservedFor list is to a PodGroup, controllers can no longer use the list to directly find all Pods consuming the ResourceClaim. Instead, they will look up all Pods referencing the PodGroup by watching Pods and maintaining an index from each PodGroup to the Pods referencing it. The informer cache can provide this index.

The list of Pods making up a PodGroup for which a ResourceClaim is reserved is not exactly the same as the list of Pods consuming a ResourceClaim. The status.reservedFor list only references Pods, or Pods' PodGroups, that have been processed by the DRA scheduler plugin and are scheduled to use the ResourceClaim. It is possible to have Pods that reference a PodGroup that has been allocated a claim, but haven’t yet been scheduled. This distinction is important for some of the usages of the status.reservedFor list described above:

The device_taint_eviction controller will use the list of Pods referencing the PodGroup to determine the list of Pods that need to be evicted. In this situation, it is acceptable if the list includes Pods that haven’t yet been scheduled.
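
The PodGroup-to-Pods index described in this section can be sketched with a plain map. This is only an illustration: podInfo and buildPodGroupIndex are hypothetical names, and a real controller would more likely register an index function with its shared informer than rebuild a map from a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical, trimmed-down view of a Pod.
type podInfo struct {
	Namespace    string
	Name         string
	PodGroupName string // from spec.schedulingGroup.podGroupName; empty if none
}

// buildPodGroupIndex maps "namespace/podGroupName" to the sorted names of the
// Pods that reference that PodGroup. Pods without a PodGroup are skipped.
func buildPodGroupIndex(pods []podInfo) map[string][]string {
	index := make(map[string][]string)
	for _, p := range pods {
		if p.PodGroupName == "" {
			continue
		}
		key := p.Namespace + "/" + p.PodGroupName
		index[key] = append(index[key], p.Name)
	}
	for key := range index {
		sort.Strings(index[key])
	}
	return index
}

func main() {
	pods := []podInfo{
		{Namespace: "default", Name: "pod-b", PodGroupName: "my-podgroup-1"},
		{Namespace: "default", Name: "pod-a", PodGroupName: "my-podgroup-1"},
		{Namespace: "default", Name: "pod-c", PodGroupName: "my-podgroup-2"},
		{Namespace: "default", Name: "standalone"},
	}
	index := buildPodGroupIndex(pods)
	// Candidate Pods to evict if a device of my-podgroup-1's claim is tainted.
	fmt.Println(index["default/my-podgroup-1"])
}
```

Because the index is keyed by PodGroup rather than by claim, it may include Pods that are not yet scheduled, which the eviction use case tolerates.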

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None needed.

Unit tests

k8s.io/dynamic-resource-allocation/resourceclaim: 2026-01-29 - 89.3%
k8s.io/kubernetes/pkg/apis/core/v1: 2026-01-29 - 79.0%
k8s.io/kubernetes/pkg/apis/core/validation: 2026-01-29 - 85.3%
k8s.io/kubernetes/pkg/apis/scheduling/v1alpha1: 2026-01-29 - 83.3%
k8s.io/kubernetes/pkg/apis/scheduling/validation: 2026-01-29 - 96.6%
k8s.io/kubernetes/pkg/controller/devicetainteviction: 2026-01-29 - 86.7%
k8s.io/kubernetes/pkg/controller/resourceclaim: 2026-01-29 - 74.6%
k8s.io/kubernetes/pkg/kubelet/cm/dra: 2026-01-29 - 83.6%
k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 2026-01-29 - 79.2%

Integration tests

New integration tests will verify:

New API fields in Pod and PodGroup are persisted or rejected correctly depending on the value of the DRAWorkloadResourceClaims feature gate.
ResourceClaimTemplates specified for PodGroups result in the correct ResourceClaims being allocated for the correct Pods.
No inconsistent state is reached when PodGroups rapidly come and go.
- ResourceClaims should continue to be created and deleted with their owning PodGroups such that Pods still schedule and no ResourceClaims are orphaned.
- At most one generated ResourceClaim should exist for a claim made by a PodGroup at any given time.

Additionally, scheduler_perf tests will be added, aiming for the same thresholds as existing DRA tests.

Integration tests added for alpha:

TestDRA/all/WorkloadResourceClaims: integration master, triage search

e2e tests

New e2e tests will verify correct behavior at key points in the lifecycle of a PodGroup.

When a PodGroup referencing a ResourceClaimTemplate is created, a ResourceClaim is generated and remains unallocated.
When the first Pod is created for the PodGroup, the ResourceClaim is allocated.
When subsequent Pods in the PodGroup are created, no additional ResourceClaims are generated and the Pods are all allocated the same existing ResourceClaim.
When all Pods in the PodGroup are deleted, the ResourceClaim is not deleted and remains allocated.
When the PodGroup has been deleted, then the ResourceClaim is deallocated, and eventually deleted.

e2e tests added for alpha: SIG Node, triage search

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial e2e tests completed and enabled

Beta

Gather feedback from developers and surveys
Additional tests are in Testgrid and linked in KEP
More rigorous forms of testing—e.g., downgrade tests and scalability tests
All functionality completed
All security enforcement completed
All monitoring requirements completed
All testing requirements completed
All known pre-release issues and gaps resolved

GA

Integration with at least 2 widely used APIs for complex workload orchestration (e.g. Jobset, LeaderWorkerSet)
Allowing time for feedback
All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

The feature will no longer work if downgrading to a release without support for it. The API server will no longer accept the new fields and the other components will not know what to do with them. So the result is that the status.reservedFor list will only have references to Pod resources like today.

Any ResourceClaims that have already been allocated when the feature was active will have PodGroup references in the status.reservedFor list after a downgrade, but the controllers will not know how to handle them. Two problems will arise as a result:

The ResourceClaim controller will also have been downgraded, meaning that it will not remove references to PodGroups from the status.reservedFor list, thus leading to a situation where the claim will never be deallocated.
For new Pods that get scheduled, the scheduler will add Pod references to the status.reservedFor list, despite there being a PodGroup reference already. So the list ends up with both Pod and PodGroup references. We can manage both Pod and PodGroup references in the list by adding the PodGroup reference even if Pod references exist and making sure that the ResourceClaim controller removes Pod references even if there are PodGroup references in the list. Deallocation is only safe when no Pods are consuming the claim, so both PodGroup and Pod references should be removed once that is true.
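
One way to picture the mixed-reference handling described above is a pruning pass that treats Pod and PodGroup entries uniformly. This is a sketch under assumptions: consumerRef is a hypothetical simplification of a status.reservedFor entry, and the active set (UIDs of Pods still running and PodGroups that still exist) is assumed to be supplied by the caller.

```go
package main

import "fmt"

// consumerRef is a hypothetical simplification of one status.reservedFor entry.
type consumerRef struct {
	Resource string // assumed to be "pods" or "podgroups"
	Name     string
	UID      string
}

// pruneReservedFor keeps only the entries whose consumer is still active.
// Pod and PodGroup references are handled the same way, so a Pod reference is
// removed even when a PodGroup reference remains in the list, and vice versa.
func pruneReservedFor(reservedFor []consumerRef, active map[string]bool) []consumerRef {
	kept := make([]consumerRef, 0, len(reservedFor))
	for _, ref := range reservedFor {
		if active[ref.UID] {
			kept = append(kept, ref)
		}
	}
	return kept
}

func main() {
	reservedFor := []consumerRef{
		{Resource: "podgroups", Name: "my-podgroup-1", UID: "uid-g1"},
		{Resource: "pods", Name: "pod-a", UID: "uid-a"},
	}

	// pod-a has finished, the PodGroup still exists: only the Pod entry goes.
	kept := pruneReservedFor(reservedFor, map[string]bool{"uid-g1": true})
	fmt.Println(len(kept), "entries remain; safe to deallocate:", len(kept) == 0)

	// The PodGroup has been deleted as well: the list empties out.
	kept = pruneReservedFor(kept, map[string]bool{})
	fmt.Println(len(kept), "entries remain; safe to deallocate:", len(kept) == 0)
}
```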

We will also provide explicit recommendations for how users can manage downgrades or disabling this feature. This means manually updating the status.reservedFor list to reference only Pods and not PodGroups. We don’t plan on providing automation for this.

Version Skew Strategy

If the kubelet is on a version that doesn’t support the feature but the rest of the components do, Pods referencing a PodGroup will be scheduled, but the kubelet will refuse to run those Pods since it will still check whether the Pods are referenced in the status.reservedFor list.

If the API server is on a version that supports the feature, but the scheduler is not, the scheduler will not know how to match a Pod’s claim with a claim made by its PodGroup, so it will put the reference to the Pod in the status.reservedFor list rather than the PodGroup. It will do this even if there is already a PodGroup reference in the status.reservedFor list. This leads to the challenge described in the previous section.

If the API server is on a version that supports the feature, but kube-controller-manager is not, then the ResourceClaim controller may observe PodGroups that define spec.resourceClaims. When Pods contain matching claims, the intent is that those claims are generated for the PodGroup instead of each Pod. Even when this feature is disabled, the ResourceClaim controller will check a Pod’s claims against its PodGroup. If the controller would have created a ResourceClaim for the PodGroup if the feature was enabled, then it will return an error. Users are expected to restart kube-controller-manager with the feature enabled to generate a ResourceClaim for the PodGroup. If the user intended to generate a ResourceClaim for that Pod, then the user has to recreate the PodGroup without the resource claim and all of its member Pods with the resource claim.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DRAWorkloadResourceClaims
- Components depending on the feature gate:
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler
  - kubelet

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If the kubelet restarts with the feature disabled, existing containers continue to run with all of their allocated devices, including those from claims made by their PodGroup when the feature was enabled.

If a DRA device is allocated to a ResourceClaim reserved for a PodGroup and the feature is disabled, the PodGroup will continue to be listed in the status.reservedFor of the ResourceClaim, and the ResourceClaim will not be deallocated.

What happens if we reenable the feature if it was previously rolled back?

If the kubelet restarts with the feature enabled, then containers similarly continue to run with all of the devices with which they were first started.

Since no other state is lost when the feature is disabled, other components once again operate as described.

Are there any tests for feature enablement/disablement?

Unit and integration tests will verify behavior both when the feature is enabled and when it is disabled. They will also exercise cases where the feature is toggled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The ResourceClaim controller run by kube-controller-manager may see a PodGroup with spec.resourceClaims even when the feature is disabled if a rollout has already enabled the feature for at least one kube-apiserver instance. When a Pod makes a claim for one of its PodGroup’s ResourceClaimTemplates, whether the ResourceClaim controller should create a ResourceClaim once for the PodGroup or for each individual Pod is unclear. In Kubernetes 1.36, the controller creates one ResourceClaim for each Pod when the feature is disabled. In Kubernetes 1.37, the controller will not create a ResourceClaim for a Pod’s claim which matches its PodGroup’s when the feature is disabled.

If a PodGroup has lingering spec.resourceClaims references to ResourceClaimTemplates meant to be replicated for each Pod, then an upgrade to Kubernetes 1.37 will not create new ResourceClaims for new Pods in the PodGroup with a matching claim.

When downgrading to Kubernetes 1.36 with the feature disabled or disabling the feature on a 1.36 cluster, a ResourceClaimTemplate meant for a PodGroup will be replicated for each Pod while the feature is enabled for kube-apiserver and disabled for kube-controller-manager.
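The version-dependent behavior described above can be summarized as a small decision function. This is a sketch of the rules as stated in this document, not the controller's implementation. The names Action and controller_action are hypothetical, and the "refuse" outcome stands for the error described under Version Skew Strategy.

```python
from enum import Enum


class Action(Enum):
    CREATE_FOR_POD = "create one ResourceClaim per Pod"
    CREATE_FOR_PODGROUP = "create one ResourceClaim for the PodGroup"
    REFUSE = "return an error and create nothing"


def controller_action(minor_version: int, feature_enabled: bool,
                      matches_podgroup_claim: bool) -> Action:
    """Model what the ResourceClaim controller does with a Pod's template claim."""
    if not matches_podgroup_claim:
        # Claims that do not match the PodGroup's are unaffected by the feature.
        return Action.CREATE_FOR_POD
    if feature_enabled:
        return Action.CREATE_FOR_PODGROUP
    if minor_version <= 36:
        # 1.36 with the feature disabled: the template is replicated per Pod.
        return Action.CREATE_FOR_POD
    # 1.37 with the feature disabled: no ResourceClaim is created.
    return Action.REFUSE
```

For example, controller_action(36, False, True) models the downgrade case in which a template meant for a PodGroup is replicated for each Pod, while controller_action(37, False, True) models the case in which new Pods stay without a ResourceClaim.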

What specific metrics should inform a rollback?

A sudden increase in the dynamic_resource_allocation_resourceclaim_creates_total metric could mean that a ResourceClaimTemplate meant to be replicated once for each PodGroup is instead being replicated for each Pod.

An increase in the scheduler_pending_pods metric may indicate that the controller is not creating ResourceClaims that grouped Pods need in order to be schedulable.

An increase in the workqueue_retries_total{name="resource_claim"} metric may indicate that the ResourceClaim controller is repeatedly running into errors.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

New automated upgrade/downgrade tests will exercise the following scenario where a Kubernetes 1.36 cluster is upgraded to 1.37 and then rolled back to 1.36:

1. Create a cluster on the older version with the DRAWorkloadResourceClaims feature enabled.
2. Create a PodGroup group-1 with member Pods pod-1 and pod-2. The PodGroup and Pods define matching resource claims for the ResourceClaimTemplate named template-1.
3. Verify that one ResourceClaim is generated from template-1 and is reserved for PodGroup group-1. Verify that Pods pod-1 and pod-2 are using the same generated ResourceClaim.
4. Upgrade the cluster to the new version.
5. Create Pod pod-3 as a member of PodGroup group-1.
6. Verify that the same generated ResourceClaim from template-1 stays reserved only for PodGroup group-1 and that no other ResourceClaims are generated.
7. Verify that pod-3 uses the same generated ResourceClaim as pod-1 and pod-2.
8. Delete Pod pod-2.
9. Verify that the generated ResourceClaim stays allocated and reserved only for PodGroup group-1.
10. Roll back the cluster upgrade.
11. Create Pod pod-4 as a member of PodGroup group-1.
12. Verify that the same generated ResourceClaim from template-1 stays reserved only for PodGroup group-1 and that no other ResourceClaims are generated.
13. Verify that pod-4 uses the same generated ResourceClaim as pod-1 and pod-3.
14. Delete Pod pod-3.
15. Verify that the generated ResourceClaim stays allocated and reserved only for PodGroup group-1.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

New owner_api_group and owner_api_kind labels will be added to the dynamic_resource_allocation_resourceclaim_creates_total metric to distinguish between claims created for a PodGroup or a Pod. A query for dynamic_resource_allocation_resourceclaim_creates_total{owner_api_group="scheduling.k8s.io", owner_api_kind="PodGroup"} shows how many ResourceClaims have been created for PodGroups.
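Outside of a Prometheus query, the same breakdown can be derived from the text exposition of the kube-controller-manager metrics endpoint. The sketch below is an illustration only: the function name creates_by_owner_kind is hypothetical, and it assumes the owner_api_kind label is present as proposed here. Any other labels on the metric are summed over.

```python
import re
from typing import Dict

METRIC = "dynamic_resource_allocation_resourceclaim_creates_total"

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def creates_by_owner_kind(metrics_text: str) -> Dict[str, float]:
    """Sum the creates counter per owner_api_kind label value."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        kind = labels.get("owner_api_kind", "")
        totals[kind] = totals.get(kind, 0.0) + float(match.group("value"))
    return totals
```

A non-zero total under the PodGroup key indicates that ResourceClaims have been created for PodGroups.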

Pods using the feature can be identified by looking for ResourceClaims in their status.resourceClaimStatuses whose status.reservedFor lists any item with apiGroup set to scheduling.k8s.io and resource set to podgroups.

How can someone using this feature know that it is working for their instance?

API
- Condition name: None
- Other field:
  - When one of a Pod’s spec.resourceClaims matches one of its PodGroup’s spec.resourceClaims, the ResourceClaim referenced in the Pod’s status.resourceClaimStatuses for that claim contains the PodGroup in its status.reservedFor and not the Pod.
  - ResourceClaims created from ResourceClaimTemplates for a PodGroup list the PodGroup in their metadata.ownerReferences.
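The two field checks above can be expressed as a small verification helper. This is a sketch for illustration, assuming the ResourceClaim has already been fetched and is available as a plain dictionary. The function name check_group_claim is hypothetical.

```python
from typing import Any, Dict, List


def check_group_claim(claim: Dict[str, Any], pod_uid: str, pod_group_uid: str,
                      from_template: bool) -> List[str]:
    """Return a list of problems; an empty list means the claim looks as expected."""
    problems = []
    reserved_for = claim.get("status", {}).get("reservedFor", [])
    group_listed = any(
        ref.get("apiGroup") == "scheduling.k8s.io"
        and ref.get("resource") == "podgroups"
        and ref.get("uid") == pod_group_uid
        for ref in reserved_for)
    if not group_listed:
        problems.append("PodGroup is missing from status.reservedFor")
    if any(ref.get("resource") == "pods" and ref.get("uid") == pod_uid
           for ref in reserved_for):
        problems.append("Pod is listed in status.reservedFor")
    if from_template:
        # Claims generated from a template should be owned by the PodGroup.
        owners = claim.get("metadata", {}).get("ownerReferences", [])
        if not any(o.get("kind") == "PodGroup" and o.get("uid") == pod_group_uid
                   for o in owners):
            problems.append("PodGroup is missing from metadata.ownerReferences")
    return problems
```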

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature does not affect the existing SLOs for Pods using ResourceClaims as described by KEP-4381.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: dynamic_resource_allocation_resourceclaim_creates_total{owner_api_group="scheduling.k8s.io", owner_api_kind="PodGroup"}, workqueue_*{name="resource_claim"}
- Components exposing the metric:
  - kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

kube-controller-manager will list and watch PodGroup resources.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

This feature adds a new spec.resourceClaims list to the PodGroup API. It will be limited to 4 items.

The size of a ResourceClaim’s status.reservedFor list will be reduced significantly when many Pods sharing the same claim make that claim through a common PodGroup.
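A rough way to see both effects is the arithmetic below. It is a simplified model with hypothetical function names: it counts one reservedFor entry per Pod without the feature and one per PodGroup with it, and it checks the four-item limit on a PodGroup's spec.resourceClaims stated above.

```python
from typing import List

MAX_PODGROUP_RESOURCE_CLAIMS = 4  # limit on spec.resourceClaims stated in this KEP


def reserved_for_entries(pods_per_group: List[int], via_pod_group: bool) -> int:
    """Count reservedFor entries on a claim shared by several PodGroups."""
    if via_pod_group:
        return len(pods_per_group)  # one entry per PodGroup
    return sum(pods_per_group)      # one entry per Pod


def within_claims_limit(resource_claims: List[dict]) -> bool:
    """Check a PodGroup's spec.resourceClaims against the item limit."""
    return len(resource_claims) <= MAX_PODGROUP_RESOURCE_CLAIMS
```

For two PodGroups of 500 Pods each, the model yields 1000 entries when each Pod is listed and 2 entries when the PodGroups are listed instead.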

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

kube-controller-manager will run a new informer for PodGroup resources and index them by ResourceClaims they reference.
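Such an index can be pictured as a mapping from a referenced claim or template to the PodGroups that reference it. The sketch below is a simplified model rather than the informer indexer itself. It assumes PodGroups are plain dictionaries and that entries in spec.resourceClaims use resourceClaimName and resourceClaimTemplateName keys like a Pod's claims do, which is an assumption here.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set, Tuple

IndexKey = Tuple[str, str, str]  # (reference field, namespace, referenced name)


def index_pod_groups_by_claim(pod_groups: List[Dict[str, Any]]) -> Dict[IndexKey, Set[str]]:
    """Map each referenced claim or template to the names of referencing PodGroups."""
    index: Dict[IndexKey, Set[str]] = defaultdict(set)
    for pod_group in pod_groups:
        namespace = pod_group["metadata"]["namespace"]
        name = pod_group["metadata"]["name"]
        for entry in pod_group.get("spec", {}).get("resourceClaims", []):
            # Assumed field names, mirroring a Pod's spec.resourceClaims entries.
            for field in ("resourceClaimName", "resourceClaimTemplateName"):
                if entry.get(field):
                    index[(field, namespace, entry[field])].add(name)
    return dict(index)
```

The extra memory grows with the number of PodGroups and with the at most four entries each one may carry.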

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

When the API server is unavailable, the ResourceClaim controller is unable to see new Pods and PodGroups and cannot create ResourceClaims. Pods requiring those ResourceClaims in order to become schedulable will remain unschedulable until the API server becomes available again.

Updates by kube-scheduler to ResourceClaims’ status.reservedFor fields will also fail while the API server is unavailable. It will retry those updates with backoff until they succeed.
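Retrying with backoff follows a common pattern, sketched generically below. This is not kube-scheduler code: the function name, the delay parameters, and the use of ConnectionError to represent an unavailable API server are all assumptions made for the illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(update: Callable[[], T],
                       initial_delay: float = 1.0,
                       factor: float = 2.0,
                       max_delay: float = 60.0,
                       max_attempts: Optional[int] = None,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call update() until it succeeds, waiting longer after each failure."""
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            return update()
        except ConnectionError:
            # Stand-in for "API server unavailable"; give up only if a cap is set.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            delay = min(delay * factor, max_delay)
```

Passing a custom sleep function makes the waiting behavior easy to observe in tests without actually pausing.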

The kubelet is still able to start Pods when they have already been scheduled, their status.resourceClaimStatuses are up to date, and their ResourceClaims' statuses are up to date.

What are other known failure modes?

No other known failure modes.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

1.36:

2025-12-12: KEP first draft published for review
2026-01-28: Combined with KEP-5194
2026-03-23: Initial implementation merged

1.37:

2026-05-11: Beta promotion proposed

Drawbacks

This complicates the allocation and deallocation logic somewhat as there will be two separate ways to manage the allocation and deallocation process for ResourceClaims.

It also leads to additional work for the device_taint_eviction controller since it needs to maintain an index to find all Pods using a ResourceClaim rather than just looking at the list of Pods in the status.reservedFor list.

Alternatives

Increase the size limit on the `status.reservedFor` field

To allow more Pods to share a single claim, the simplest solution would be to increase the size limit on the status.reservedFor field. However, a large list of Pod references is an unwieldy way to track a claim’s consumers and could, at least in theory, run into the size limit of Kubernetes resources. We would also still need some limit on the size, and whatever number we choose might still be too small for the largest workloads.

Allow ResourceClaims to be reserved for any object

KEP-5194 originally described the addition of new spec.reservedFor and status.reservedForAnyPod fields for ResourceClaims, to enable references to arbitrary objects in status.reservedFor. This approach shifts the responsibility to remove non-Pod objects from the status.reservedFor list to each true workload controller supporting DRA.

With the addition of the Workload and PodGroup APIs, the ResourceClaim API no longer needs to be as flexible since true workloads can integrate with those common APIs. In order to integrate with this feature, true workload controllers create and delete PodGroup objects (which will also provide many additional features) and don’t have to explicitly manage ResourceClaims.
