KEP-5729: DRA: ResourceClaim Support for Workloads

Implementation History
ALPHA Implementable
Created 2025-12-12
Latest v1.36
Milestones
Alpha v1.36
Ownership
Owning SIG
SIG Scheduling
Participating SIGs


Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This enhancement describes additions to the Workload API and PodGroup API which make it possible to associate ResourceClaims and ResourceClaimTemplates with those objects to better facilitate sharing DRA resources between the Pods they contain.

A ResourceClaim referenced by a PodGroup will be reserved for that PodGroup as a whole instead of its individual Pods, addressing the limit on the number of entries in a ResourceClaim’s status.reservedFor list.

A ResourceClaimTemplate referenced by a PodGroup will cause a ResourceClaim to be generated once for that PodGroup, like how ResourceClaimTemplates work today when referenced by a Pod. Whereas ResourceClaims today can only be shared between Pods by name, this will allow ResourceClaims to be shared by Pods in the same PodGroup where the exact name of the ResourceClaim is not known ahead of time.

Motivation

AI/ML workloads are particularly sensitive to network latency and bandwidth between closely related Pods. Certain groups of Pods must be placed as closely together as possible to achieve maximum performance.

“Modeling Topology and Multi-Node Logical Devices” describes where existing mechanisms like Pod and Node affinity break down for these use cases and how Dynamic Resource Allocation (DRA) can fulfill those requirements by scheduling Pods within strict topological boundaries:

  • A ResourceSlice lists “devices” which represent nested topological units and form a tree of arbitrary depth. These units could be as small as a single host or as large as an entire datacenter, or perhaps even larger.
  • A ResourceClaim requests one of these topological units. Pods which reference that same ResourceClaim are scheduled within the same topological boundary.
  • Additionally, the allocation of the ResourceClaim may trigger a controller to reprogram the datacenter fabric to match the selected topological unit.

Large-scale workloads orchestrated by specialized APIs like JobSet and LeaderWorkerSet cannot practically express granular topological constraints with the current Kubernetes APIs. Today, ResourceClaims to be shared by multiple Pods must be created one by one and referenced by name in the Pod spec. APIs which define replicable groups of Pods are left to manage shared ResourceClaims themselves.

Moreover, the current limit of 256 entries in a ResourceClaim’s status.reservedFor list limits a device to being shared by up to that number of Pods. Production-scale workloads require larger numbers of Pods to share a single claim.

The Workload API defines a common representation of these related sets of Pods as a PodGroup. Associating ResourceClaimTemplates with PodGroups allows Kubernetes to manage the lifecycle of the generated ResourceClaims generically for all types implementing the Workload API.

Goals

  • Allow users to express sets of DRA resources to be replicated for each PodGroup, and shared by each Pod in the PodGroup.
  • Automatically create and delete PodGroups’ ResourceClaims as needed.
  • Reduce the burden of each true workload controller implementing ResourceClaim generation separately (e.g. JobSet, LWS).
  • Allow claims to be allocated for more than 256 Pods.

Non-Goals

  • Associate ResourceClaims or ResourceClaimTemplates with Workload objects (future work).
  • Influence how Pods are placed onto Nodes based on the ResourceClaimTemplates and ResourceClaims associated with a PodGroup or Pod (see KEP-5732).

Proposal

User Stories

Sharing a ResourceClaim among many Pods

As a workload author administering large deployments, I want to be able to share a single ResourceClaim among more than 256 Pods. That opens up the possibility for DRA to orchestrate scheduling large groups of Pods that all share a large device, such as a virtual device representing a topological domain.

Shareable and replicable ResourceClaims

As a workload author administering a deployment composed of multiple groups of Pods, I want to be able to express DRA resources which are replicated once for each group and can be shared by all of the Pods within a particular group.

Currently, ResourceClaims generated for an individual Pod from a ResourceClaimTemplate cannot be declaratively shared among other Pods, and standalone ResourceClaims would need to be managed separately from the rest of the workload.

Integrating DRA with high-level APIs

As a maintainer of a high-level workload API like LWS or JobSet, I want to manage the lifecycle of ResourceClaims associated with the groups of Pods defined by my API.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Higher memory usage by the device_taint_eviction controller

The device_taint_eviction controller will need to keep an index of which Pods are referenced from each ResourceClaim, so it can evict the correct Pods when devices are tainted. This will require some additional memory.

The number of Pods that can share a ResourceClaim will not be unlimited

Lifting the 256-Pod limit imposed by the status.reservedFor list does not mean that an unlimited number of Pods can share a ResourceClaim. New scale tests will determine how many Pods can practically share a single ResourceClaim.

Design Details

Background

The status.reservedFor field of ResourceClaims is currently used for two purposes:

Deallocation

Devices are allocated to a ResourceClaim when the first Pod referencing the claim is scheduled. Other Pods can also share the ResourceClaim, in which case they share the devices. Once no Pods are consuming the claim, the devices should be deallocated so they can be allocated to other claims. The status.reservedFor list is used to keep track of Pods consuming a ResourceClaim. Pods are added to the list by the DRA scheduler plugin during scheduling and removed from the list by the ResourceClaim controller when Pods are deleted or finish running. An empty list means there are no current consumers of the claim and it can be deallocated.

Finding Pods Using a ResourceClaim

status.reservedFor is read by the DRA scheduler plugin, the kubelet, and the device_taint_eviction controller to find Pods that are using a ResourceClaim:

  1. The kubelet uses this to make sure it only runs Pods where the claims have been allocated to the Pod. It can verify this by checking that the Pod is listed in the status.reservedFor list.

  2. The DRA scheduler plugin uses the list to find claims that have zero or only a single Pod using them and are therefore candidates for deallocation in the PostFilter function.

  3. The device_taint_eviction controller uses the status.reservedFor list to find the Pods that need to be evicted when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim does not have a toleration).

So the solution needs to:

  • Give the ResourceClaim controller a way to know when there are no more consumers of a ResourceClaim so it can be deallocated.
  • Give controllers a way to list the Pods consuming or referencing a ResourceClaim.

API

The following API changes will be made:

  • The Workload and PodGroup APIs will be updated to include references to ResourceClaims and ResourceClaimTemplates, like Pods.
  • The Pod API will be updated to include references to claims listed in its PodGroup.

Workload

The Workload API changes are modeled after the existing Pod API to reference ResourceClaims, adding a spec.podGroupTemplates[].resourceClaims field:

type PodGroupTemplate struct {
	...

	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods must reference these claims in order to consume the
	// allocated devices.
	//
	// This is an alpha-level field and requires that the
	// WorkloadPodGroupResourceClaimTemplate feature gate is enabled.
	//
	// This field is immutable.
	//
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +featureGate=WorkloadPodGroupResourceClaimTemplate
	// +optional
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty"`
}

// PodGroupResourceClaim references exactly one ResourceClaim, either directly
// or by naming a ResourceClaimTemplate which is then turned into a ResourceClaim
// for the PodGroup.
//
// It adds a name to it that uniquely identifies the ResourceClaim inside the PodGroup.
// Pods that need access to the ResourceClaim reference it with this name.
type PodGroupResourceClaim struct {
	// Name uniquely identifies this resource claim inside the PodGroup.
	// This must be a DNS_LABEL.
	Name string `json:"name"`

	// ResourceClaimName is the name of a ResourceClaim object in the same
	// namespace as this PodGroup. The ResourceClaim will be reserved for the
	// PodGroup instead of its individual pods.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	ResourceClaimName *string `json:"resourceClaimName,omitempty"`

	// ResourceClaimTemplateName is the name of a ResourceClaimTemplate
	// object in the same namespace as this PodGroup.
	//
	// The template will be used to create a new ResourceClaim, which will
	// be bound to this PodGroup. When this PodGroup is deleted, the ResourceClaim
	// will also be deleted. The PodGroup name and resource name, along with a
	// generated component, will be used to form a unique name for the
	// ResourceClaim, which will be recorded in pod.status.resourceClaimStatuses.
	//
	// This field is immutable and no changes will be made to the
	// corresponding ResourceClaim by the control plane after creating the
	// ResourceClaim.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}

PodGroup

The PodGroup API will be updated similarly to contain the ResourceClaim references from its template defined in the Workload:

type PodGroupSpec struct {
	...

	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods must reference these claims in order to consume the
	// allocated devices.
	//
	// This is an alpha-level field and requires that the
	// WorkloadPodGroupResourceClaimTemplate feature gate is enabled.
	//
	// This field is immutable.
	//
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +featureGate=WorkloadPodGroupResourceClaimTemplate
	// +optional
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty"`
}

Pod

When a PodGroup includes claims, Pods in the group can use the name of a claim in the PodGroup to reference the PodGroup’s dedicated ResourceClaim. This complements existing references to ResourceClaims and ResourceClaimTemplates.

// PodResourceClaim references exactly one ResourceClaim, either directly,
// by naming a ResourceClaimTemplate which is then turned into a ResourceClaim
// for the pod, or by naming a claim made for a PodGroup.
//
// It adds a name to it that uniquely identifies the ResourceClaim inside the Pod.
// Containers that need access to the ResourceClaim reference it with this name.
type PodResourceClaim struct {
	...

	// PodGroupResourceClaim refers to the name of a claim associated
	// with this pod's PodGroup.
	//
	// Exactly one of ResourceClaimName, ResourceClaimTemplateName,
	// or PodGroupResourceClaim must be set.
	PodGroupResourceClaim *string `json:"podGroupResourceClaim,omitempty"`
}
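
The "exactly one of" rule in the comment above can be checked by counting which reference fields are set. The following is a minimal sketch under stated assumptions: the struct mirrors only the reference fields shown in this section, and validatePodResourceClaim is a hypothetical helper, not the actual API validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// PodResourceClaim mirrors only the reference fields discussed above.
type PodResourceClaim struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
	PodGroupResourceClaim     *string
}

// validatePodResourceClaim is a hypothetical helper. It returns an error
// unless exactly one of the three reference fields is set.
func validatePodResourceClaim(c PodResourceClaim) error {
	set := 0
	refs := []*string{c.ResourceClaimName, c.ResourceClaimTemplateName, c.PodGroupResourceClaim}
	for _, ref := range refs {
		if ref != nil {
			set++
		}
	}
	if set != 1 {
		return errors.New("exactly one of resourceClaimName, resourceClaimTemplateName, or podGroupResourceClaim must be set")
	}
	return nil
}

func main() {
	pgClaim := "pg-claim"
	valid := PodResourceClaim{Name: "resource", PodGroupResourceClaim: &pgClaim}
	invalid := PodResourceClaim{Name: "resource", ResourceClaimName: &pgClaim, PodGroupResourceClaim: &pgClaim}

	fmt.Println("valid claim error:", validatePodResourceClaim(valid))
	fmt.Println("invalid claim error:", validatePodResourceClaim(invalid))
}
```

The same counting approach would apply to the two-field PodGroupResourceClaim type.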

Example

The following example demonstrates the relationships between the new fields. It describes the more common case where some higher-level true workload controller (e.g. LWS, JobSet) orchestrates the Workload and PodGroup objects, rather than the user managing those directly.

Here, a user defines a high-level workload with two logical groups of Pods. Each of the two groups also requests one device to be shared by the Pods in that group.

The user creates the following objects to request DRA devices which will be referenced by Pods through their PodGroup:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: pg-claim-template
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: my-device
        exactly:
          deviceClassName: example
---
apiVersion: example.com/v1
kind: MyWorkload
metadata:
  name: my-workload
  namespace: default
spec:
  ...

The true workload API defines how ResourceClaims and ResourceClaimTemplates relate to groups of Pods. If the user is responsible for defining the Pods' spec.resourceClaims in a Pod template, then each PodGroup's spec.resourceClaims[].name must be deterministic for the user to be able to reference it in the Pod spec.

The true workload controller then creates the following Workload API resources based on the true workload’s definition:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
  namespace: default
spec:
  podGroupTemplates:
  - name: group-1
    schedulingPolicy:
      basic: {}
    resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: pg-claim-template
  - name: group-2
    schedulingPolicy:
      basic: {}
    resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: my-podgroup-1
  namespace: default
spec:
  podGroupTemplateRef:
    workloadName: my-workload
    podGroupTemplateName: group-1
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: my-podgroup-2
  namespace: default
spec:
  podGroupTemplateRef:
    workloadName: my-workload
    podGroupTemplateName: group-2
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-1
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-1
  template:
    metadata:
      labels:
        app: wl-claim-example-1
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: resource
      resourceClaims:
      - name: resource
        podGroupResourceClaim: pg-claim
      schedulingGroup:
        podGroupName: my-podgroup-1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-2
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-2
  template:
    metadata:
      labels:
        app: wl-claim-example-2
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: resource
      resourceClaims:
      - name: resource
        podGroupResourceClaim: pg-claim
      schedulingGroup:
        podGroupName: my-podgroup-2

Here, a Workload organizes Pods managed by two different Deployments into two different PodGroups. Each group refers to the same ResourceClaimTemplate, pg-claim-template. This single ResourceClaimTemplate forms the basis of two different ResourceClaims which will be created by the ResourceClaim controller: one for each PodGroup. The Pod templates in the Deployments include a reference to the claim listed for the PodGroup, which ultimately resolves to its PodGroup’s ResourceClaim. The result is that with a single ResourceClaimTemplate, Pods in the same group all share the exact same allocated device, while Pods in the other group use an equivalent, but separately allocated, device.

ResourceClaim Lifecycle

The DynamicResources scheduler plugin and the ResourceClaim controller will cooperate to manage key points in the life of a ResourceClaim or ResourceClaimTemplate claimed by a PodGroup. Referenced ResourceClaimTemplates will replicate into one ResourceClaim per PodGroup. Those generated ResourceClaims and ResourceClaims referenced by name by a PodGroup will be allocated and deallocated by the Kubernetes control plane.

Create

When a PodGroup is created which references a ResourceClaimTemplate, the ResourceClaim controller will create a ResourceClaim from that template if one does not already exist for that PodGroup. Generated ResourceClaims will be owned (through metadata.ownerReferences) by the PodGroup and annotated with resource.kubernetes.io/podgroup-claim-name, where the value is the name of the claim from the PodGroup’s spec.resourceClaims[].name, to facilitate mapping a single PodGroup claim to the ResourceClaim generated for its PodGroup. When a Pod is created which requests a claim from its PodGroup, the name of the ResourceClaim generated for the PodGroup’s claim will be recorded in the Pod’s status.resourceClaimStatuses, like ResourceClaims generated for Pods. Like the resource.kubernetes.io/pod-claim-name annotation, resource.kubernetes.io/podgroup-claim-name is only to be used by the controller and will not be documented as part of the public API.
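
The mapping from a PodGroup claim to its generated ResourceClaim, as described above, combines the owner reference with the annotation. The sketch below illustrates that lookup, assuming a simplified ResourceClaim type that keeps only a name, the owner UIDs, and annotations. The findGeneratedClaim helper and the generated names in main are hypothetical and do not reflect the controller's actual code or naming scheme.

```go
package main

import "fmt"

// Annotation key taken from the text above.
const podGroupClaimAnnotation = "resource.kubernetes.io/podgroup-claim-name"

// ResourceClaim is a simplified stand-in that keeps only what the lookup needs.
type ResourceClaim struct {
	Name        string
	OwnerUIDs   []string // UIDs from metadata.ownerReferences
	Annotations map[string]string
}

// findGeneratedClaim is a hypothetical helper. It returns the name of the
// ResourceClaim that is owned by the given PodGroup and annotated with the
// given PodGroup claim name.
func findGeneratedClaim(claims []ResourceClaim, podGroupUID, claimName string) (string, bool) {
	for _, c := range claims {
		if c.Annotations[podGroupClaimAnnotation] != claimName {
			continue
		}
		for _, uid := range c.OwnerUIDs {
			if uid == podGroupUID {
				return c.Name, true
			}
		}
	}
	return "", false
}

func main() {
	// The generated names below are made up for illustration.
	claims := []ResourceClaim{
		{
			Name:        "my-podgroup-1-pg-claim-aaaaa",
			OwnerUIDs:   []string{"pg-uid-1"},
			Annotations: map[string]string{podGroupClaimAnnotation: "pg-claim"},
		},
		{
			Name:        "my-podgroup-2-pg-claim-bbbbb",
			OwnerUIDs:   []string{"pg-uid-2"},
			Annotations: map[string]string{podGroupClaimAnnotation: "pg-claim"},
		},
	}

	name, found := findGeneratedClaim(claims, "pg-uid-2", "pg-claim")
	fmt.Println(name, found)
}
```

Because both PodGroups in the example use the same claim name, the owner reference is what distinguishes the two generated ResourceClaims.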

Delete

The resource.kubernetes.io/delete-protection finalizer added to a generated ResourceClaim by kube-scheduler serves the same purpose as for other ResourceClaims, preventing the ResourceClaim from being deleted until it is deallocated. Like other generated ResourceClaims, the ResourceClaim controller will unlock deletion of PodGroup-owned claims by removing the finalizer when they become deallocated. The garbage collector will then be responsible for deleting the ResourceClaim once its owning PodGroup is deleted.

Allocate

Generated and standalone ResourceClaims referenced by a PodGroup remain unallocated until kube-scheduler allocates the ResourceClaim by setting status.allocation for the first Pod in the PodGroup that references the PodGroup’s claim. When a Pod’s claim is requested through podGroupResourceClaim, the ResourceClaim’s status.reservedFor list will reference the PodGroup instead of each individual Pod.

The name of a ResourceClaim referenced by a PodGroup via resourceClaimName will be recorded in the status.resourceClaimStatuses of each Pod that requests that PodGroup’s claim. Along with names of ResourceClaims generated from templates (for the Pod or its PodGroup), this keeps all information about exactly which ResourceClaims are requested by the Pod in the Pod itself so the kubelet does not need to look up a Pod’s PodGroup.
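
The effect on status.reservedFor can be illustrated with a small sketch. It assumes a simplified ConsumerRef type modeled on the entries of status.reservedFor and a Pod type that already carries its PodGroup's UID. Both types and the consumerFor and reserve helpers are hypothetical, not the scheduler plugin's implementation.

```go
package main

import "fmt"

// ConsumerRef is a simplified stand-in for an entry in status.reservedFor.
type ConsumerRef struct {
	APIGroup string
	Resource string
	Name     string
	UID      string
}

// Pod carries only the fields this sketch needs. PodGroupUID is assumed to
// have been looked up from the PodGroup object beforehand.
type Pod struct {
	Name         string
	UID          string
	PodGroupName string
	PodGroupUID  string
}

// consumerFor returns the reference to record for a Pod's claim. Claims
// requested through podGroupResourceClaim are reserved for the PodGroup.
func consumerFor(pod Pod, viaPodGroup bool) ConsumerRef {
	if viaPodGroup {
		return ConsumerRef{APIGroup: "scheduling.k8s.io", Resource: "podgroups", Name: pod.PodGroupName, UID: pod.PodGroupUID}
	}
	return ConsumerRef{Resource: "pods", Name: pod.Name, UID: pod.UID}
}

// reserve appends ref unless an identical entry is already present.
func reserve(reservedFor []ConsumerRef, ref ConsumerRef) []ConsumerRef {
	for _, existing := range reservedFor {
		if existing == ref {
			return reservedFor
		}
	}
	return append(reservedFor, ref)
}

func main() {
	var reservedFor []ConsumerRef
	// 300 Pods in one PodGroup, more than the 256-entry limit would allow
	// if each Pod were listed individually.
	for i := 0; i < 300; i++ {
		pod := Pod{
			Name:         fmt.Sprintf("worker-%d", i),
			UID:          fmt.Sprintf("pod-uid-%d", i),
			PodGroupName: "my-podgroup-1",
			PodGroupUID:  "pg-uid-1",
		}
		reservedFor = reserve(reservedFor, consumerFor(pod, true))
	}
	fmt.Println("entries in reservedFor:", len(reservedFor))
}
```

All 300 Pods map to the same PodGroup reference, so the list holds a single entry regardless of group size.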

Deallocate

The ResourceClaim controller will continue to deallocate claims when there are no entries in the ResourceClaim’s status.reservedFor. References to PodGroups in status.reservedFor are removed after the PodGroup is deleted. PodGroup deletion should be gated by a finalizer managed by the creator of the PodGroup to prevent the PodGroup from being removed from status.reservedFor before all of its Pods are done using the ResourceClaim. When no more Pods in the group are expected to run, the creator of the PodGroup is responsible for removing the finalizer and deleting the PodGroup.
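
A minimal sketch of this deallocation rule follows. It assumes a simplified ConsumerRef type and a set of UIDs of consumers that still exist (Pods that are neither deleted nor finished, and PodGroups that are not deleted). The pruneReservedFor and canDeallocate helpers are hypothetical and only illustrate the rule, not the controller's code.

```go
package main

import "fmt"

// ConsumerRef is a simplified stand-in for an entry in status.reservedFor.
type ConsumerRef struct {
	Resource string // "pods" or "podgroups"
	Name     string
	UID      string
}

// pruneReservedFor keeps only the entries whose consumer still exists.
// live is a hypothetical set of UIDs of existing consumers.
func pruneReservedFor(reservedFor []ConsumerRef, live map[string]bool) []ConsumerRef {
	kept := make([]ConsumerRef, 0, len(reservedFor))
	for _, ref := range reservedFor {
		if live[ref.UID] {
			kept = append(kept, ref)
		}
	}
	return kept
}

// canDeallocate reports whether the claim has no remaining consumers.
func canDeallocate(reservedFor []ConsumerRef) bool {
	return len(reservedFor) == 0
}

func main() {
	reservedFor := []ConsumerRef{{Resource: "podgroups", Name: "my-podgroup-1", UID: "pg-uid-1"}}
	live := map[string]bool{"pg-uid-1": true}

	// All Pods may be gone, but the PodGroup still exists: stay allocated.
	reservedFor = pruneReservedFor(reservedFor, live)
	fmt.Println("deallocate while PodGroup exists:", canDeallocate(reservedFor))

	// Once the PodGroup is deleted, its reference is removed.
	delete(live, "pg-uid-1")
	reservedFor = pruneReservedFor(reservedFor, live)
	fmt.Println("deallocate after PodGroup deletion:", canDeallocate(reservedFor))
}
```

Pruning treats Pod and PodGroup references uniformly, so a list that contains both kinds is only emptied once every referenced consumer is gone.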

Determining Allowed Pods for a ResourceClaim

Currently, any Pod allowed to utilize a ResourceClaim is listed explicitly in the claim’s status.reservedFor. When the list instead references a PodGroup, only the name in the reference must match a Pod’s spec.schedulingGroup.podGroupName. Since a finalizer will protect a PodGroup from being deleted before any of its Pods, a reference to the name of a PodGroup in a Pod will always refer to the exact same PodGroup. That is, the PodGroup cannot be deleted and recreated with the same name unless all of its Pods are also deleted in the meantime or its finalizer is manually removed.
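
The check described above could look roughly like the following sketch. It assumes simplified ConsumerRef and Pod types, matches Pod entries by UID and PodGroup entries by name, and omits the API group comparison a real implementation would also need. The isReservedFor helper is hypothetical.

```go
package main

import "fmt"

// ConsumerRef is a simplified stand-in for an entry in status.reservedFor.
type ConsumerRef struct {
	Resource string // "pods" or "podgroups"
	Name     string
	UID      string
}

// Pod carries only the fields this sketch needs. PodGroupName stands for
// spec.schedulingGroup.podGroupName and is empty for Pods without a group.
type Pod struct {
	Name         string
	UID          string
	PodGroupName string
}

// isReservedFor reports whether the Pod may use a claim with the given
// reservedFor list, either directly or through its PodGroup.
func isReservedFor(reservedFor []ConsumerRef, pod Pod) bool {
	for _, ref := range reservedFor {
		switch ref.Resource {
		case "pods":
			if ref.UID == pod.UID {
				return true
			}
		case "podgroups":
			if pod.PodGroupName != "" && ref.Name == pod.PodGroupName {
				return true
			}
		}
	}
	return false
}

func main() {
	reservedFor := []ConsumerRef{{Resource: "podgroups", Name: "my-podgroup-1", UID: "pg-uid-1"}}

	member := Pod{Name: "worker-0", UID: "pod-uid-0", PodGroupName: "my-podgroup-1"}
	outsider := Pod{Name: "other", UID: "pod-uid-9", PodGroupName: "my-podgroup-2"}

	fmt.Println("member allowed:", isReservedFor(reservedFor, member))
	fmt.Println("outsider allowed:", isReservedFor(reservedFor, outsider))
}
```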

Finding Pods Using a ResourceClaim

If the reference in the status.reservedFor list is to a PodGroup, controllers can no longer use the list to directly find all Pods consuming the ResourceClaim. Instead they will look up all Pods referencing the PodGroup, which can be done by using a watch on Pods and maintaining an index of PodGroup to Pods referencing it. This can be done using the informer cache.

The list of Pods making up a PodGroup for which a ResourceClaim is reserved is not exactly the same as the list of Pods consuming a ResourceClaim. The status.reservedFor list only references Pods, or Pods' PodGroups, that have been processed by the DRA scheduler plugin and are scheduled to use the ResourceClaim. It is possible to have Pods that reference a PodGroup that has been allocated a claim, but haven’t yet been scheduled. This distinction is important for some of the usages of the status.reservedFor list described above:

  1. If the DRA scheduler plugin is trying to find candidates for deallocation in the PostFilter function and sees a ResourceClaim with a non-Pod reference, it will not attempt to deallocate. Without an explicit list of Pods in status.reservedFor, the plugin has no way to know how many Pods are actually consuming the ResourceClaim, so deallocating would not be safe.

  2. The device_taint_eviction controller will use the list of Pods referencing the PodGroup to determine the list of Pods that need to be evicted. In this situation, it is acceptable if the list includes Pods that haven’t yet been scheduled.
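
The PodGroup-to-Pods index mentioned above can be sketched with a plain map. This is an illustration under stated assumptions: a real controller would more likely register an index function with its Pod informer, while the Pod type and the indexByPodGroup helper here are hypothetical.

```go
package main

import "fmt"

// Pod carries only the fields this sketch needs. PodGroupName stands for
// spec.schedulingGroup.podGroupName and is empty for Pods without a group.
type Pod struct {
	Namespace    string
	Name         string
	PodGroupName string
}

// indexByPodGroup maps "namespace/podGroupName" to the names of the Pods
// referencing that PodGroup, in input order.
func indexByPodGroup(pods []Pod) map[string][]string {
	index := make(map[string][]string)
	for _, p := range pods {
		if p.PodGroupName == "" {
			continue
		}
		key := p.Namespace + "/" + p.PodGroupName
		index[key] = append(index[key], p.Name)
	}
	return index
}

func main() {
	pods := []Pod{
		{Namespace: "default", Name: "wl-claim-example-1-a", PodGroupName: "my-podgroup-1"},
		{Namespace: "default", Name: "wl-claim-example-1-b", PodGroupName: "my-podgroup-1"},
		{Namespace: "default", Name: "wl-claim-example-2-a", PodGroupName: "my-podgroup-2"},
		{Namespace: "default", Name: "standalone"},
	}
	index := indexByPodGroup(pods)

	// Pods to consider for eviction when a device allocated to the claim
	// reserved for my-podgroup-1 is tainted.
	fmt.Println(index["default/my-podgroup-1"])
}
```

The index includes Pods that are not yet scheduled, which matches the behavior the text describes as acceptable for eviction.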

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None needed.

Unit tests
  • k8s.io/dynamic-resource-allocation/resourceclaim: 2026-01-29 - 89.3%
  • k8s.io/kubernetes/pkg/apis/core/v1: 2026-01-29 - 79.0%
  • k8s.io/kubernetes/pkg/apis/core/validation: 2026-01-29 - 85.3%
  • k8s.io/kubernetes/pkg/apis/scheduling/v1alpha1: 2026-01-29 - 83.3%
  • k8s.io/kubernetes/pkg/apis/scheduling/validation: 2026-01-29 - 96.6%
  • k8s.io/kubernetes/pkg/controller/devicetainteviction: 2026-01-29 - 86.7%
  • k8s.io/kubernetes/pkg/controller/resourceclaim: 2026-01-29 - 74.6%
  • k8s.io/kubernetes/pkg/kubelet/cm/dra: 2026-01-29 - 83.6%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 2026-01-29 - 79.2%
Integration tests

New integration tests will verify:

  • New API fields in Pod and PodGroup are persisted or rejected correctly depending on the value of the WorkloadPodGroupResourceClaimTemplate feature gate.
  • ResourceClaimTemplates specified for PodGroups result in the correct ResourceClaims being allocated for the correct Pods.
  • No inconsistent state is reached when PodGroups rapidly come and go.
    • ResourceClaims should continue to be created and deleted with their owning PodGroups such that Pods still schedule and no ResourceClaims are orphaned.
    • At most one generated ResourceClaim should exist for a claim made by a PodGroup at any given time.

Additionally, scheduler_perf tests will be added, aiming for the same thresholds as existing DRA tests.

e2e tests

New e2e tests will verify correct behavior at key points in the lifecycle of a PodGroup.

  • When a PodGroup referencing a ResourceClaimTemplate is created, a ResourceClaim is generated and remains unallocated.
  • When the first Pod is created for the PodGroup, the ResourceClaim is allocated.
  • When subsequent Pods in the PodGroup are created, no additional ResourceClaims are generated and the Pods are all allocated the same existing ResourceClaim.
  • When all Pods in the PodGroup are deleted, the ResourceClaim is not deleted and remains allocated.
  • When the PodGroup has been deleted, then the ResourceClaim is deallocated, and eventually deleted.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Initial e2e tests completed and enabled

Beta

  • Gather feedback from developers and surveys
  • Additional tests are in Testgrid and linked in KEP
  • More rigorous forms of testing—e.g., downgrade tests and scalability tests
  • All functionality completed
  • All security enforcement completed
  • All monitoring requirements completed
  • All testing requirements completed
  • All known pre-release issues and gaps resolved

GA

  • Integration with at least 2 widely used APIs for complex workload orchestration (e.g. JobSet, LeaderWorkerSet)
  • Allowing time for feedback
  • All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

The feature will no longer work if downgrading to a release without support for it. The API server will no longer accept the new fields and the other components will not know what to do with them. So the result is that the status.reservedFor list will only have references to Pod resources like today.

Any ResourceClaims that have already been allocated when the feature was active will have PodGroup references in the status.reservedFor list after a downgrade, but the controllers will not know how to handle it. There are two problems that will arise as a result of this:

  • The ResourceClaim controller will also have been downgraded, meaning that it will not remove references to PodGroups from the status.reservedFor list, thus leading to a situation where the claim will never be deallocated.

  • For new Pods that get scheduled, the scheduler will add Pod references to the status.reservedFor list even though a PodGroup reference is already there, so the list ends up with both Pod and PodGroup references. We can manage both kinds of reference in the list by adding the PodGroup reference even if Pod references exist and by making sure that the ResourceClaim controller removes Pod references even if there are PodGroup references in the list. Deallocation is only safe when no Pods are consuming the claim, so both PodGroup and Pod references should be removed once that is true.

We will also provide explicit recommendations for how users can manage downgrades or disabling this feature. This means manually updating the status.reservedFor list to reference only Pods and not PodGroups. We don’t plan on providing automation for this.

Version Skew Strategy

If the kubelet is on a version that doesn’t support the feature but the rest of the components do, Pods referencing a PodGroup will be scheduled, but the kubelet will refuse to run those Pods because it will still check whether each Pod is listed individually in the status.reservedFor list.

If the API server is on a version that supports the feature, but the scheduler is not, the scheduler will not know about the new fields added, so it will put the reference to the Pod in the status.reservedFor list rather than the PodGroup. It will do this even if there is already a PodGroup reference in the status.reservedFor list. This leads to the challenge described in the previous section.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: WorkloadPodGroupResourceClaimTemplate
    • Components depending on the feature gate:
      • kube-apiserver
      • kube-controller-manager
      • kube-scheduler
      • kubelet
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If the kubelet restarts with the feature disabled, existing containers continue to run with all of their allocated devices, including those from claims made by their PodGroup when the feature was enabled.

If a DRA device is allocated to a ResourceClaim reserved for a PodGroup and the feature is disabled, the PodGroup will continue to be listed in the status.reservedFor of the ResourceClaim, and the ResourceClaim will not be deallocated.

What happens if we reenable the feature if it was previously rolled back?

If the kubelet restarts with the feature enabled, then containers similarly continue to run with all of the devices with which they were first started.

Since no other state is lost when the feature is disabled, other components once again operate as described.

Are there any tests for feature enablement/disablement?

Unit and integration tests will verify behavior both when the feature is enabled and when it is disabled. They will also exercise cases where the feature is toggled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
  • kube-controller-manager will list and watch PodGroup resources.
Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

This feature adds a new spec.resourceClaims list to the PodGroup API. It will have the same limits as the Pod API’s spec.resourceClaims.

The Pod API adds a new spec.resourceClaims[].podGroupResourceClaim field which is mutually exclusive with its sibling resourceClaimName and resourceClaimTemplateName fields, so it will not meaningfully impact the size of a Pod.

The size of a ResourceClaim’s status.reservedFor list will be reduced significantly when many Pods sharing the same claim make that claim through a common PodGroup.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
  • kube-controller-manager will run a new informer for PodGroup resources and index them by ResourceClaims they reference.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

1.36:

  • 2025-12-12: KEP first draft published for review
  • 2026-01-28: Combined with KEP-5194

Drawbacks

This complicates the allocation and deallocation logic somewhat as there will be two separate ways to manage the allocation and deallocation process for ResourceClaims.

It also leads to additional work for the device_taint_eviction controller since it needs to maintain an index to find all Pods using a ResourceClaim rather than just looking at the list of Pods in the status.reservedFor list.

Alternatives

Increase the size limit on the status.reservedFor field

To allow more Pods to share a single claim, the simplest solution would be to increase the size limit on the status.reservedFor field. However, a large list of Pod references is unwieldy and could, at least in theory, run into the size limit of Kubernetes objects. Also, we would need to have some limit on the size, and whatever number we choose might still be too small for the largest workloads.

Allow ResourceClaims to be reserved for any object

KEP-5194 originally described the addition of new spec.reservedFor and status.reservedForAnyPod fields for ResourceClaims, to enable references to arbitrary objects in status.reservedFor. This approach shifts the responsibility to remove non-Pod objects from the status.reservedFor list to each true workload controller supporting DRA.

With the addition of the Workload and PodGroup APIs, the ResourceClaim API no longer needs to be as flexible since true workloads can integrate with those common APIs. In order to integrate with this feature, true workload controllers create and delete PodGroup objects (which will also provide many additional features) and don’t have to explicitly manage ResourceClaims.

Infrastructure Needed (Optional)