KEP-5729: DRA: ResourceClaim Support for Workloads
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This enhancement describes additions to the Workload API and PodGroup API which make it possible to associate ResourceClaims and ResourceClaimTemplates with those objects to better facilitate sharing DRA resources between the Pods they contain.
A ResourceClaim referenced by a PodGroup will be reserved for that
PodGroup as a whole instead of its individual Pods, addressing the
limit on the number of entries in a ResourceClaim’s status.reservedFor list.
A ResourceClaimTemplate referenced by a PodGroup will cause a ResourceClaim to be generated once for that PodGroup, like how ResourceClaimTemplates work today when referenced by a Pod. Whereas ResourceClaims today can only be shared between Pods by name, this will allow ResourceClaims to be shared by Pods in the same PodGroup where the exact name of the ResourceClaim is not known ahead of time.
Motivation
AI/ML workloads are particularly sensitive to network latency and bandwidth between closely related Pods. Certain groups of Pods must be placed as closely together as possible to achieve maximum performance.
“Modeling Topology and Multi-Node Logical Devices” describes where existing mechanisms like Pod and Node affinity break down for these use cases and how Dynamic Resource Allocation (DRA) can fulfill those requirements by scheduling Pods within strict topological boundaries:
- A ResourceSlice lists “devices” which represent nested topological units and form a tree of arbitrary depth. These units could be as small as a single host or as large as an entire datacenter, or perhaps even larger.
- A ResourceClaim requests one of these topological units. Pods which reference that same ResourceClaim are scheduled within the same topological boundary.
- Additionally, the allocation of the ResourceClaim may trigger a controller to reprogram the datacenter fabric to match the selected topological unit.
Large-scale workloads orchestrated by specialized APIs like JobSet and LeaderWorkerSet cannot practically express granular topological constraints with the current Kubernetes APIs. Today, ResourceClaims to be shared by multiple Pods must be created one by one and referenced by name in the Pod spec. APIs which define replicable groups of Pods are left to manage shared ResourceClaims themselves.
Moreover, the current limit of 256 entries in a ResourceClaim’s
status.reservedFor list limits a device to being shared by up to that number
of Pods. Production-scale workloads require larger numbers of Pods to share a
single claim.
The Workload API defines a common representation of these related sets of Pods as a PodGroup. Associating ResourceClaimTemplates with PodGroups allows Kubernetes to manage the lifecycle of the generated ResourceClaims generically for all types implementing the Workload API.
Goals
- Allow users to express sets of DRA resources to be replicated for each PodGroup, and shared by each Pod in the PodGroup.
- Automatically create and delete PodGroups’ ResourceClaims as needed.
- Reduce the burden of each true workload controller implementing ResourceClaim generation separately (e.g. JobSet, LWS).
- Allow claims to be allocated for more than 256 Pods.
Non-Goals
- Associate ResourceClaims or ResourceClaimTemplates with Workload objects (future work).
- Influence how Pods are placed onto Nodes based on the ResourceClaimTemplates and ResourceClaims associated with a PodGroup or Pod (see KEP-5732).
Proposal
User Stories
Sharing a ResourceClaim among many Pods
As a workload author administering large deployments, I want to be able to share a single ResourceClaim among more than 256 Pods. That opens up the possibility for DRA to orchestrate scheduling large groups of Pods that all share a large device, such as a virtual device representing a topological domain.
Shareable and replicable ResourceClaims
As a workload author administering a deployment composed of multiple groups of Pods, I want to be able to express DRA resources which are replicated once for each group and can be shared by all of the Pods within a particular group.
Currently, ResourceClaims generated for an individual Pod from a ResourceClaimTemplate cannot be declaratively shared among other Pods, and standalone ResourceClaims would need to be managed separately from the rest of the workload.
Integrating DRA with high-level APIs
As a maintainer of a high-level workload API like LWS or JobSet, I want to manage the lifecycle of ResourceClaims associated with the groups of Pods defined by my API.
Notes/Constraints/Caveats (Optional)
Risks and Mitigations
Higher memory usage by the device_taint_eviction controller
The device_taint_eviction controller will need to keep an index of which Pods are referenced from each ResourceClaim, so it can evict the correct Pods when devices are tainted. This will require some additional memory.
The number of Pods that can share a ResourceClaim will not be unlimited
Lifting the 256-Pod limit imposed by status.reservedFor does not mean that the number of Pods that can share a ResourceClaim will be unlimited. New scale tests will determine how many Pods can practically share a single ResourceClaim.
Design Details
Background
The status.reservedFor field of ResourceClaims is currently used for two
purposes:
Deallocation
Devices are allocated to a ResourceClaim when the first Pod referencing the
claim is scheduled. Other Pods can also share the ResourceClaim in which case
they share the devices. Once no Pods are consuming the claim, the devices should
be deallocated so they can be allocated to other claims. The
status.reservedFor list is used to keep track of Pods consuming a
ResourceClaim. Pods are added to the list by the DRA scheduler plugin during
scheduling and removed from the list by the ResourceClaim controller when Pods
are deleted or finish running. An empty list means there are no current
consumers of the claim and it can be deallocated.
Finding Pods Using a ResourceClaim
status.reservedFor is read by the DRA scheduler plugin, the kubelet, and the
device_taint_eviction controller to find Pods that are using a ResourceClaim:
- The kubelet uses this to make sure it only runs Pods where the claims have been allocated to the Pod. It can verify this by checking that the Pod is listed in the status.reservedFor list.
- The DRA scheduler plugin uses the list to find claims that have zero or only a single Pod using them and are therefore candidates for deallocation in the PostFilter function.
- The device_taint_eviction controller uses the status.reservedFor list to find the Pods that need to be evicted when one or more of the devices allocated to a ResourceClaim is tainted (and the ResourceClaim does not have a toleration).
So the solution needs to:
- Give the ResourceClaim controller a way to know when there are no more consumers of a ResourceClaim so it can be deallocated.
- Give controllers a way to list the Pods consuming or referencing a ResourceClaim.
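The two current uses of status.reservedFor can be pictured with a small Go sketch. The ConsumerRef and Claim types below are simplified stand-ins invented for this illustration, not the real resource.k8s.io API types, and the helper names are hypothetical.

```go
package main

// ConsumerRef is a simplified, hypothetical stand-in for one entry in a
// ResourceClaim's status.reservedFor list.
type ConsumerRef struct {
	Resource string // for example "pods"
	Name     string
	UID      string
}

// Claim holds only the ResourceClaim state this sketch needs.
type Claim struct {
	Allocated   bool
	ReservedFor []ConsumerRef
}

// ReservedForPod mirrors the kubelet-style check described above: a Pod may
// only use the claim if it is listed explicitly in ReservedFor.
func ReservedForPod(c Claim, podUID string) bool {
	for _, ref := range c.ReservedFor {
		if ref.Resource == "pods" && ref.UID == podUID {
			return true
		}
	}
	return false
}

// CanDeallocate mirrors the deallocation rule: an allocated claim whose
// ReservedFor list is empty has no consumers left.
func CanDeallocate(c Claim) bool {
	return c.Allocated && len(c.ReservedFor) == 0
}
```

Any replacement for the per-Pod list has to keep both of these questions answerable, which is what the two requirements above capture.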
API
The following API changes will be made:
- The Workload and PodGroup APIs will be updated to include references to ResourceClaims and ResourceClaimTemplates, like Pods.
- The Pod API will be updated to include references to claims listed in its PodGroup.
Workload
The Workload API changes are modeled after the existing Pod API to reference
ResourceClaims, adding a spec.podGroupTemplates[].resourceClaims field:
type PodGroupTemplate struct {
	...
	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods must reference these claims in order to consume the
	// allocated devices.
	//
	// This is an alpha-level field and requires that the
	// WorkloadPodGroupResourceClaimTemplate feature gate is enabled.
	//
	// This field is immutable.
	//
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +featureGate=WorkloadPodGroupResourceClaimTemplate
	// +optional
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty"`
}
// PodGroupResourceClaim references exactly one ResourceClaim, either directly
// or by naming a ResourceClaimTemplate which is then turned into a ResourceClaim
// for the PodGroup.
//
// It adds a name to it that uniquely identifies the ResourceClaim inside the PodGroup.
// Pods that need access to the ResourceClaim reference it with this name.
type PodGroupResourceClaim struct {
	// Name uniquely identifies this resource claim inside the PodGroup.
	// This must be a DNS_LABEL.
	Name string `json:"name"`

	// ResourceClaimName is the name of a ResourceClaim object in the same
	// namespace as this PodGroup. The ResourceClaim will be reserved for the
	// PodGroup instead of its individual pods.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	ResourceClaimName *string `json:"resourceClaimName,omitempty"`

	// ResourceClaimTemplateName is the name of a ResourceClaimTemplate
	// object in the same namespace as this PodGroup.
	//
	// The template will be used to create a new ResourceClaim, which will
	// be bound to this PodGroup. When this PodGroup is deleted, the ResourceClaim
	// will also be deleted. The PodGroup name and resource name, along with a
	// generated component, will be used to form a unique name for the
	// ResourceClaim, which will be recorded in pod.status.resourceClaimStatuses.
	//
	// This field is immutable and no changes will be made to the
	// corresponding ResourceClaim by the control plane after creating the
	// ResourceClaim.
	//
	// Exactly one of ResourceClaimName and ResourceClaimTemplateName must
	// be set.
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}
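The constraints stated in the field comments (names unique within the list, exactly one of the two references set) could be checked along the following lines. This is a minimal sketch, not the proposed apiserver validation: the function name is hypothetical, the struct is redeclared locally without JSON tags, and the DNS_LABEL check on Name is left out.

```go
package main

import "fmt"

// PodGroupResourceClaim is a local copy of the type above, reduced to the
// fields the checks need.
type PodGroupResourceClaim struct {
	Name                      string
	ResourceClaimName         *string
	ResourceClaimTemplateName *string
}

// ValidatePodGroupResourceClaims is a hypothetical helper that enforces the
// documented rules: names act as the list map key, and each entry sets
// exactly one of ResourceClaimName and ResourceClaimTemplateName.
func ValidatePodGroupResourceClaims(claims []PodGroupResourceClaim) error {
	seen := make(map[string]bool, len(claims))
	for i, c := range claims {
		if c.Name == "" {
			return fmt.Errorf("resourceClaims[%d]: name is required", i)
		}
		if seen[c.Name] {
			return fmt.Errorf("resourceClaims[%d]: duplicate name %q", i, c.Name)
		}
		seen[c.Name] = true

		hasClaim := c.ResourceClaimName != nil
		hasTemplate := c.ResourceClaimTemplateName != nil
		if hasClaim == hasTemplate {
			return fmt.Errorf("resourceClaims[%d]: exactly one of resourceClaimName and resourceClaimTemplateName must be set", i)
		}
	}
	return nil
}
```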
PodGroup
The PodGroup API will be updated similarly to contain the ResourceClaim references from its template defined in the Workload:
type PodGroupSpec struct {
	...
	// ResourceClaims defines which ResourceClaims may be shared among Pods in
	// the group. Pods must reference these claims in order to consume the
	// allocated devices.
	//
	// This is an alpha-level field and requires that the
	// WorkloadPodGroupResourceClaimTemplate feature gate is enabled.
	//
	// This field is immutable.
	//
	// +patchMergeKey=name
	// +patchStrategy=merge,retainKeys
	// +listType=map
	// +listMapKey=name
	// +featureGate=WorkloadPodGroupResourceClaimTemplate
	// +optional
	ResourceClaims []PodGroupResourceClaim `json:"resourceClaims,omitempty"`
}
Pod
When a PodGroup includes claims, the name of a claim in the
PodGroup can be used on Pods in the group to reference the PodGroup’s dedicated
ResourceClaim. This complements existing references to ResourceClaims and
ResourceClaimTemplates.
// PodResourceClaim references exactly one ResourceClaim, either directly,
// by naming a ResourceClaimTemplate which is then turned into a ResourceClaim
// for the pod, or by naming a claim made for a PodGroup.
//
// It adds a name to it that uniquely identifies the ResourceClaim inside the Pod.
// Containers that need access to the ResourceClaim reference it with this name.
type PodResourceClaim struct {
	...
	// PodGroupResourceClaim refers to the name of a claim associated
	// with this pod's PodGroup.
	//
	// Exactly one of ResourceClaimName, ResourceClaimTemplateName,
	// or PodGroupResourceClaim must be set.
	PodGroupResourceClaim *string `json:"podGroupResourceClaim,omitempty"`
}
Example
The following example demonstrates the relationships between the new fields. It describes the more common case where some higher-level true workload controller (e.g. LWS, JobSet) orchestrates the Workload and PodGroup objects, rather than the user managing those directly.
Here, a user defines a high-level workload with two logical groups of Pods. Each of the two groups of Pods also requests one device to be shared by the Pods in its group.
The user creates the following objects to request DRA devices which will be referenced by Pods through their PodGroup:
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: pg-claim-template
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: my-device
        exactly:
          deviceClassName: example
---
apiVersion: example.com/v1
kind: MyWorkload
metadata:
  name: my-workload
  namespace: default
spec:
  ...
The true workload API defines how ResourceClaims and ResourceClaimTemplates
relate to groups of Pods. If the user is responsible for defining the Pods'
spec.resourceClaims in a Pod template, then the name values in the PodGroups'
spec.resourceClaims[] must be deterministic for the user to be able to
reference them in the Pod spec.
The true workload controller then creates the following Workload API resources based on the true workload’s definition:
apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: my-workload
  namespace: default
spec:
  podGroupTemplates:
  - name: group-1
    schedulingPolicy:
      basic: {}
    resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: pg-claim-template
  - name: group-2
    schedulingPolicy:
      basic: {}
    resourceClaims:
    - name: pg-claim
      resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: my-podgroup-1
  namespace: default
spec:
  podGroupTemplateRef:
    workloadName: my-workload
    podGroupTemplateName: group-1
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: my-podgroup-2
  namespace: default
spec:
  podGroupTemplateRef:
    workloadName: my-workload
    podGroupTemplateName: group-2
  resourceClaims:
  - name: pg-claim
    resourceClaimTemplateName: pg-claim-template
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-1
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-1
  template:
    metadata:
      labels:
        app: wl-claim-example-1
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: resource
      resourceClaims:
      - name: resource
        podGroupResourceClaim: pg-claim
      schedulingGroup:
        podGroupName: my-podgroup-1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wl-claim-example-2
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wl-claim-example-2
  template:
    metadata:
      labels:
        app: wl-claim-example-2
    spec:
      containers:
      - name: pause
        image: "registry.k8s.io/pause:3.6"
        resources:
          claims:
          - name: resource
      resourceClaims:
      - name: resource
        podGroupResourceClaim: pg-claim
      schedulingGroup:
        podGroupName: my-podgroup-2
Here, a Workload organizes Pods managed by two different Deployments into two
different PodGroups.
Each group refers to the same ResourceClaimTemplate,
pg-claim-template. This single ResourceClaimTemplate forms the basis of two
different ResourceClaims which will be created by the ResourceClaim controller:
one for each PodGroup. The Pod templates in the Deployments include a reference
to the claim listed for the PodGroup, which ultimately resolves to its
PodGroup’s ResourceClaim. The result is that with a single
ResourceClaimTemplate, Pods in the same group all share the exact same allocated
device, while Pods in the other group use an equivalent, but separately
allocated, device.
ResourceClaim Lifecycle
The DynamicResources scheduler plugin and the ResourceClaim controller will cooperate to manage key points in the life of a ResourceClaim or ResourceClaimTemplate claimed by a PodGroup. Referenced ResourceClaimTemplates will replicate into one ResourceClaim per PodGroup. Those generated ResourceClaims and ResourceClaims referenced by name by a PodGroup will be allocated and deallocated by the Kubernetes control plane.
Create
When a PodGroup is created which references a ResourceClaimTemplate, the
ResourceClaim controller will create a ResourceClaim from that template if one
does not already exist for that PodGroup. Generated ResourceClaims will be owned (through
metadata.ownerReferences) by the PodGroup and annotated with
resource.kubernetes.io/podgroup-claim-name where the value is the name of the
claim from the PodGroup’s spec.resourceClaims[].name to facilitate mapping a
single PodGroup claim to the ResourceClaim generated for its PodGroup. When a
Pod is created which requests a claim from its PodGroup, the name of the
ResourceClaim generated for the PodGroup’s claim
will be recorded in the Pod’s status.resourceClaimStatuses like
ResourceClaims generated for Pods. Like the
resource.kubernetes.io/pod-claim-name annotation,
resource.kubernetes.io/podgroup-claim-name is only to be used by the
controller and will not be documented as part of the public API.
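The create step can be sketched as a pure function that decides which ResourceClaims are still missing for a PodGroup. The types and the MissingClaims name are hypothetical simplifications: OwnerUID stands in for metadata.ownerReferences, and GenerateName assumes the "generated component" of the name is a suffix appended by the API server, which the text above does not spell out.

```go
package main

// Annotation key named in the text above.
const podGroupClaimAnnotation = "resource.kubernetes.io/podgroup-claim-name"

// GeneratedClaim is a hypothetical, pared-down ResourceClaim.
type GeneratedClaim struct {
	GenerateName string // name prefix; a generated suffix is assumed to follow
	OwnerUID     string // stand-in for metadata.ownerReferences
	TemplateName string // template the claim spec is copied from
	Annotations  map[string]string
}

// TemplateRef is one spec.resourceClaims[] entry that names a template.
type TemplateRef struct {
	Name         string
	TemplateName string
}

// PodGroup holds only the fields the sketch needs.
type PodGroup struct {
	Name   string
	UID    string
	Claims []TemplateRef
}

// MissingClaims returns the ResourceClaims that still have to be created for
// pg. A PodGroup claim is skipped when a ResourceClaim owned by this PodGroup
// already carries its name in the annotation, so repeated syncs never
// produce a second ResourceClaim for the same PodGroup claim.
func MissingClaims(pg PodGroup, existing []GeneratedClaim) []GeneratedClaim {
	have := make(map[string]bool)
	for _, rc := range existing {
		if rc.OwnerUID == pg.UID {
			have[rc.Annotations[podGroupClaimAnnotation]] = true
		}
	}
	var out []GeneratedClaim
	for _, ref := range pg.Claims {
		if have[ref.Name] {
			continue
		}
		out = append(out, GeneratedClaim{
			GenerateName: pg.Name + "-" + ref.Name + "-",
			OwnerUID:     pg.UID,
			TemplateName: ref.TemplateName,
			Annotations:  map[string]string{podGroupClaimAnnotation: ref.Name},
		})
	}
	return out
}
```

Keying the lookup on owner plus annotation, rather than on the generated name, is what lets the controller find the right ResourceClaim again after a restart.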
Delete
The resource.kubernetes.io/delete-protection finalizer added to a generated
ResourceClaim by kube-scheduler serves the same purpose as for other
ResourceClaims, preventing the ResourceClaim from being deleted until it is
deallocated. Like other generated ResourceClaims, the ResourceClaim controller
will unlock deletion of PodGroup-owned claims by removing the finalizer when
they become deallocated. The garbage collector will then be responsible for
deleting the ResourceClaim once its owning PodGroup is deleted.
Allocate
Generated and standalone ResourceClaims referenced by a PodGroup remain
unallocated until kube-scheduler allocates the ResourceClaim by setting
status.allocation for the first Pod in the PodGroup that references the
PodGroup’s claim. When a Pod’s claim is requested through
podGroupResourceClaim, the ResourceClaim’s status.reservedFor list will
reference the PodGroup instead of each individual Pod.
The name of a ResourceClaim referenced by a PodGroup via resourceClaimName
will be recorded in the status.resourceClaimStatuses of each Pod that
requests that PodGroup’s claim. Along with names of ResourceClaims generated
from templates (for the Pod or its PodGroup), this keeps all information about
exactly which ResourceClaims are requested by the Pod in the Pod itself so the
kubelet does not need to look up a Pod’s PodGroup.
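The effect on status.reservedFor can be illustrated with a short sketch. All names here are hypothetical, the "scheduling.k8s.io" group and "podgroups" resource strings are assumptions about how a PodGroup reference would be spelled, and the UID of the PodGroup is omitted for brevity.

```go
package main

// ConsumerRef is a simplified stand-in for a status.reservedFor entry.
type ConsumerRef struct {
	APIGroup string
	Resource string
	Name     string
	UID      string
}

// Pod holds only the fields the sketch needs.
type Pod struct {
	Name         string
	UID          string
	PodGroupName string // spec.schedulingGroup.podGroupName
}

// Claim holds only the reservation list.
type Claim struct {
	ReservedFor []ConsumerRef
}

// Reserve records a consumer for the claim. When the Pod requested the claim
// through podGroupResourceClaim (viaPodGroup), the PodGroup is recorded once
// for the whole group; otherwise the Pod itself is recorded.
func Reserve(c *Claim, pod Pod, viaPodGroup bool) {
	ref := ConsumerRef{Resource: "pods", Name: pod.Name, UID: pod.UID}
	if viaPodGroup {
		// Assumed spelling of a PodGroup reference; UID left out here.
		ref = ConsumerRef{APIGroup: "scheduling.k8s.io", Resource: "podgroups", Name: pod.PodGroupName}
	}
	for _, existing := range c.ReservedFor {
		if existing == ref {
			return // already reserved for this consumer
		}
	}
	c.ReservedFor = append(c.ReservedFor, ref)
}
```

With this shape, scheduling 300 Pods of one PodGroup leaves a single entry in the list, whereas per-Pod reservation would need 300 entries and exceed the current limit of 256.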
Deallocate
The ResourceClaim controller will continue to deallocate claims when there are
no entries in the ResourceClaim’s status.reservedFor. References to PodGroups
in status.reservedFor are removed after the PodGroup is deleted. PodGroup
deletion should be gated by a finalizer managed by the creator of the PodGroup
to prevent the PodGroup from being removed from status.reservedFor before all
of its Pods are done using the ResourceClaim. When no more Pods in the group are
expected to run, the creator of the PodGroup is responsible for removing the
finalizer and deleting the PodGroup.
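A minimal sketch of the deallocation rule for PodGroup references follows. The SyncClaim helper and its types are hypothetical, and the livePodGroups map stands in for an informer lookup; a PodGroup that is still held by its finalizer counts as live.

```go
package main

// ConsumerRef is a simplified stand-in for a status.reservedFor entry.
type ConsumerRef struct {
	Resource string // "pods" or "podgroups"
	Name     string
}

// Claim holds only the ResourceClaim state this sketch needs.
type Claim struct {
	Allocated   bool
	ReservedFor []ConsumerRef
}

// SyncClaim drops references to PodGroups that no longer exist and
// deallocates the claim once no consumers remain. Pod references are left
// untouched here; their cleanup follows the existing per-Pod rules.
func SyncClaim(c *Claim, livePodGroups map[string]bool) {
	kept := c.ReservedFor[:0]
	for _, ref := range c.ReservedFor {
		if ref.Resource == "podgroups" && !livePodGroups[ref.Name] {
			continue // owning PodGroup is gone, drop the reference
		}
		kept = append(kept, ref)
	}
	c.ReservedFor = kept
	if len(c.ReservedFor) == 0 {
		c.Allocated = false
	}
}
```

Note that the number of running Pods never enters the decision: the claim stays allocated for as long as the PodGroup exists, which is why the PodGroup's creator must hold the finalizer until the group is done.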
Determining Allowed Pods for a ResourceClaim
Currently, any Pod allowed to utilize a ResourceClaim is listed explicitly in
the claim’s status.reservedFor. When the list instead references a PodGroup,
only the name in the reference must match a Pod’s
spec.schedulingGroup.podGroupName. Since a finalizer will protect a PodGroup from
being deleted before any of its Pods, a reference to the name of a PodGroup in a
Pod will always refer to the exact same PodGroup. That is, the PodGroup cannot be
deleted and recreated with the same name unless all of its Pods have also been
deleted in the meantime or its finalizer has been manually removed.
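The matching rule can be sketched as follows. The helper name and types are hypothetical, both objects are assumed to live in the same namespace, and the check that the Pod actually requests the claim through podGroupResourceClaim is left out.

```go
package main

// ConsumerRef is a simplified stand-in for a status.reservedFor entry.
type ConsumerRef struct {
	Resource string // "pods" or "podgroups"
	Name     string
	UID      string
}

// Pod holds only the fields the sketch needs.
type Pod struct {
	UID          string
	PodGroupName string // spec.schedulingGroup.podGroupName, empty if unset
}

// PodMayUseClaim reports whether pod is covered by the reservation list:
// either it is listed explicitly by UID, or the list references the
// PodGroup that the Pod names in its spec.
func PodMayUseClaim(reservedFor []ConsumerRef, pod Pod) bool {
	for _, ref := range reservedFor {
		switch ref.Resource {
		case "pods":
			if ref.UID == pod.UID {
				return true
			}
		case "podgroups":
			// Only the name is compared; the finalizer on the PodGroup is
			// what makes a name match sufficient.
			if pod.PodGroupName != "" && ref.Name == pod.PodGroupName {
				return true
			}
		}
	}
	return false
}
```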
Finding Pods Using a ResourceClaim
If the reference in the status.reservedFor list is to a PodGroup,
controllers can no longer use the list to directly find all Pods consuming the
ResourceClaim. Instead they will look up all Pods referencing the
PodGroup, which can be done by using a watch on Pods and maintaining an index of
PodGroup to Pods referencing it. This can be done using the informer
cache.
The list of Pods making up a PodGroup for which a ResourceClaim is
reserved is not exactly the same as the list of Pods consuming a ResourceClaim.
The status.reservedFor list only references Pods, or Pods'
PodGroups, that have been processed by the DRA scheduler plugin and
are scheduled to use the ResourceClaim. It is possible to have Pods that
reference a PodGroup that has been allocated a claim, but haven’t
yet been scheduled. This distinction is important for some of the usages of the
status.reservedFor list described above:
- If the DRA scheduler plugin is trying to find candidates for deallocation in the PostFilter function and sees a ResourceClaim with a non-Pod reference, it will not attempt to deallocate. The plugin has no way to know how many Pods are actually consuming the ResourceClaim without the explicit list in status.reservedFor, and therefore it will not be safe to deallocate.
- The device_taint_eviction controller will use the list of Pods referencing the PodGroup to determine the list of Pods that need to be evicted. In this situation, it is OK if the list includes Pods that haven’t yet been scheduled.
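An index from PodGroup to Pods, as a controller might maintain from Pod watch events, could look roughly like this. The PodGroupIndex type and its methods are hypothetical and keep only names in memory; a real controller would more likely register an indexer on its shared informer.

```go
package main

import "sort"

// Pod holds only the fields the sketch needs.
type Pod struct {
	Name         string
	PodGroupName string // spec.schedulingGroup.podGroupName, empty if unset
	NodeName     string // empty while the Pod is unscheduled
}

// PodGroupIndex maps a PodGroup name to the set of Pod names referencing it.
type PodGroupIndex map[string]map[string]struct{}

// Add records the Pod under its PodGroup. Pods without a PodGroup are
// ignored. Unscheduled Pods are indexed too, which is acceptable for
// eviction as noted above.
func (idx PodGroupIndex) Add(p Pod) {
	if p.PodGroupName == "" {
		return
	}
	if idx[p.PodGroupName] == nil {
		idx[p.PodGroupName] = make(map[string]struct{})
	}
	idx[p.PodGroupName][p.Name] = struct{}{}
}

// Delete removes the Pod and drops the PodGroup entry once it is empty.
func (idx PodGroupIndex) Delete(p Pod) {
	pods := idx[p.PodGroupName]
	delete(pods, p.Name)
	if len(pods) == 0 {
		delete(idx, p.PodGroupName)
	}
}

// PodsFor returns the names of all Pods referencing the PodGroup, sorted
// for deterministic output.
func (idx PodGroupIndex) PodsFor(group string) []string {
	names := make([]string, 0, len(idx[group]))
	for name := range idx[group] {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

A controller that finds a PodGroup reference in status.reservedFor would call PodsFor with the referenced name to obtain its candidate Pods.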
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None needed.
Unit tests
- k8s.io/dynamic-resource-allocation/resourceclaim: 2026-01-29 - 89.3%
- k8s.io/kubernetes/pkg/apis/core/v1: 2026-01-29 - 79.0%
- k8s.io/kubernetes/pkg/apis/core/validation: 2026-01-29 - 85.3%
- k8s.io/kubernetes/pkg/apis/scheduling/v1alpha1: 2026-01-29 - 83.3%
- k8s.io/kubernetes/pkg/apis/scheduling/validation: 2026-01-29 - 96.6%
- k8s.io/kubernetes/pkg/controller/devicetainteviction: 2026-01-29 - 86.7%
- k8s.io/kubernetes/pkg/controller/resourceclaim: 2026-01-29 - 74.6%
- k8s.io/kubernetes/pkg/kubelet/cm/dra: 2026-01-29 - 83.6%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources: 2026-01-29 - 79.2%
Integration tests
New integration tests will verify:
- New API fields in Pod and PodGroup are persisted or rejected correctly depending on the value of the WorkloadPodGroupResourceClaimTemplate feature gate.
- ResourceClaimTemplates specified for PodGroups result in the correct ResourceClaims being allocated for the correct Pods.
- No inconsistent state is reached when PodGroups rapidly come and go.
- ResourceClaims should continue to be created and deleted with their owning PodGroups such that Pods still schedule and no ResourceClaims are orphaned.
- At most one generated ResourceClaim should exist for a claim made by a PodGroup at any given time.
Additionally, scheduler_perf tests will be added, aiming for the same thresholds as existing DRA tests.
e2e tests
New e2e tests will verify correct behavior at key points in the lifecycle of a PodGroup.
- When a PodGroup referencing a ResourceClaimTemplate is created, a ResourceClaim is generated and remains unallocated.
- When the first Pod is created for the PodGroup, the ResourceClaim is allocated.
- When subsequent Pods in the PodGroup are created, no additional ResourceClaims are generated and the Pods are all allocated the same existing ResourceClaim.
- When all Pods in the PodGroup are deleted, the ResourceClaim is not deleted and remains allocated.
- When the PodGroup has been deleted, then the ResourceClaim is deallocated, and eventually deleted.
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
Beta
- Gather feedback from developers and surveys
- Additional tests are in Testgrid and linked in KEP
- More rigorous forms of testing—e.g., downgrade tests and scalability tests
- All functionality completed
- All security enforcement completed
- All monitoring requirements completed
- All testing requirements completed
- All known pre-release issues and gaps resolved
GA
- Integration with at least 2 widely used APIs for complex workload orchestration (e.g. JobSet, LeaderWorkerSet)
- Allowing time for feedback
- All issues and gaps identified as feedback during beta are resolved
Upgrade / Downgrade Strategy
The feature will no longer work if downgrading to a release without support for
it. The API server will no longer accept the new fields and the other components
will not know what to do with them. So the result is that the
status.reservedFor list will only have references to Pod resources like today.
Any ResourceClaims that have already been allocated when the feature was active
will have PodGroup references in the status.reservedFor list after a
downgrade, but the controllers will not know how to handle it. There are two
problems that will arise as a result of this:
- The ResourceClaim controller will also have been downgraded, meaning that it will not remove references to PodGroups from the status.reservedFor list, thus leading to a situation where the claim will never be deallocated.
- For new Pods that get scheduled, the scheduler will add Pod references in the status.reservedFor list, despite there being a PodGroup reference there. So it ends up with both Pod and PodGroup references in the list.
We can manage both Pod and PodGroup references in the list by adding the PodGroup reference even if Pod references exist and making sure that the ResourceClaim controller removes Pod references even if there are PodGroup references in the list. Deallocation is only safe when no Pods are consuming the claim, so both PodGroup and Pod references should be removed once that is true.
We will also provide explicit recommendations for how users can manage
downgrades or disabling this feature. This means manually updating the
status.reservedFor list to reference only Pods and not PodGroups. We don’t
plan on providing automation for this.
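One way to picture that manual rewrite is the sketch below, which replaces each PodGroup reference with references to the group's scheduled Pods. It is only an illustration of the idea, not a recommended procedure: the helper and types are hypothetical, every scheduled Pod in the group is assumed to consume the claim, and the result is rejected if it would exceed the current limit of 256 entries.

```go
package main

import "fmt"

// maxReservedFor is the current limit on status.reservedFor entries.
const maxReservedFor = 256

// ConsumerRef is a simplified stand-in for a status.reservedFor entry.
type ConsumerRef struct {
	Resource string // "pods" or "podgroups"
	Name     string
	UID      string
}

// Pod holds only the fields the sketch needs.
type Pod struct {
	Name         string
	UID          string
	PodGroupName string
	Scheduled    bool
}

// ExpandPodGroupRefs returns a list that references only Pods: existing Pod
// references are kept, and each PodGroup reference is expanded into its
// scheduled Pods. Duplicate UIDs are added once.
func ExpandPodGroupRefs(reservedFor []ConsumerRef, pods []Pod) ([]ConsumerRef, error) {
	seen := make(map[string]bool)
	var out []ConsumerRef
	add := func(ref ConsumerRef) {
		if !seen[ref.UID] {
			seen[ref.UID] = true
			out = append(out, ref)
		}
	}
	for _, ref := range reservedFor {
		if ref.Resource != "podgroups" {
			add(ref)
			continue
		}
		for _, p := range pods {
			if p.Scheduled && p.PodGroupName == ref.Name {
				add(ConsumerRef{Resource: "pods", Name: p.Name, UID: p.UID})
			}
		}
	}
	if len(out) > maxReservedFor {
		return nil, fmt.Errorf("%d Pod references exceed the limit of %d", len(out), maxReservedFor)
	}
	return out, nil
}
```

The error path shows why a downgrade may not be possible without first scaling down: a claim shared by more than 256 Pods cannot be represented with Pod references alone.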
Version Skew Strategy
If the kubelet is on a version that doesn’t support the feature but the rest of
the components do, Pods referencing a PodGroup will be scheduled, but the
kubelet will refuse to run those Pods since it will still check whether the
Pods are referenced in the status.reservedFor list.
If the API server is on a version that supports the feature, but the scheduler
is not, the scheduler will not know about the new fields added, so it will put
the reference to the Pod in the status.reservedFor list rather than the
PodGroup. It will do this even if there is already a PodGroup reference in the
status.reservedFor list. This leads to the challenge described in the previous
section.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: WorkloadPodGroupResourceClaimTemplate
- Components depending on the feature gate:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- kubelet
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
If the kubelet restarts with the feature disabled, existing containers continue to run with all of their allocated devices, including those from claims made by their PodGroup when the feature was enabled.
If a DRA device is allocated to a ResourceClaim reserved for a PodGroup and the
feature is disabled, the PodGroup will continue to be listed in the
status.reservedFor of the ResourceClaim, and the ResourceClaim will not be deallocated.
What happens if we reenable the feature if it was previously rolled back?
If the kubelet restarts with the feature enabled, then containers similarly continue to run with all of the devices with which they were first started.
Since no other state is lost when the feature is disabled, other components once again operate as described.
Are there any tests for feature enablement/disablement?
Unit and integration tests will verify behavior both when the feature is enabled and when it is disabled. They will also exercise cases where the feature is toggled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
- kube-controller-manager will list and watch PodGroup resources.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
This feature adds a new spec.resourceClaims list to the PodGroup API. It will
have the same limits as the Pod API’s spec.resourceClaims.
The Pod API adds a new spec.resourceClaims[].podGroupResourceClaim field which
is mutually exclusive with its sibling resourceClaimName and
resourceClaimTemplateName fields so it will not meaningfully impact the size of a
Pod.
The size of a ResourceClaim’s status.reservedFor list will be reduced
significantly when many Pods sharing the same claim make that claim through a
common PodGroup.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
- kube-controller-manager will run a new informer for PodGroup resources and index them by ResourceClaims they reference.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
1.36:
- 2025-12-12: KEP first draft published for review
- 2026-01-28: Combined with KEP-5194
Drawbacks
This complicates the allocation and deallocation logic somewhat as there will be two separate ways to manage the allocation and deallocation process for ResourceClaims.
It also leads to additional work for the device_taint_eviction controller since
it needs to maintain an index to find all Pods using a ResourceClaim rather than
just looking at the list of Pods in the status.reservedFor list.
Alternatives
Increase the size limit on the status.reservedFor field
To allow more Pods to share a single claim, the simplest solution would be to
increase the size limit on the status.reservedFor field. However, a large
list of Pod references is not a good way to handle this and could, at least in
theory, run into the size limit of Kubernetes resources. Also, we would need to
have some limit on the size, and whatever number we choose might still be too
small for the largest workloads.
Allow ResourceClaims to be reserved for any object
KEP-5194 originally described the addition of new spec.reservedFor and
status.reservedForAnyPod fields for ResourceClaims, to enable references to
arbitrary objects in status.reservedFor. This approach shifts the
responsibility to remove non-Pod objects from the status.reservedFor list to
each true workload controller supporting DRA.
With the addition of the Workload and PodGroup APIs, the ResourceClaim API no longer needs to be as flexible since true workloads can integrate with those common APIs. In order to integrate with this feature, true workload controllers create and delete PodGroup objects (which will also provide many additional features) and don’t have to explicitly manage ResourceClaims.