KEP-5945: DRA Optional Node Preparation
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP introduces Optional Node Preparation to Dynamic Resource Allocation
(DRA), allowing resource drivers to declare that node preparation and/or node
unpreparation is not required for their devices. Currently, the kubelet assumes
it must always coordinate with a node-local DRA driver via gRPC to prepare
allocated devices before container start (NodePrepareResources), and to
unprepare them during pod termination (NodeUnprepareResources).
In some cases, node preparation or cleanup is a pure no-op. Requiring it forces administrators and vendors to deploy and maintain empty node-local drivers on every node, which introduces unnecessary operational complexity and risk.
By introducing a SkipNodeOperations field to the
ResourceSliceSpec and propagating it to the device allocation results at
scheduling time, the kubelet can safely skip driver lookup and gRPC calls for
devices that do not require these node-local actions.
Motivation
In Dynamic Resource Allocation (DRA), the kubelet coordinates with a node-local
driver via gRPC to prepare allocated devices before container start
(NodePrepareResources) and to unprepare them upon pod termination
(NodeUnprepareResources). For node-local accelerators (such as PCIe GPUs or
local FPGAs), this node-level setup is critical to check device health,
partition memory, and configure mount paths.
However, there is an emerging class of resources whose lifecycles are managed
entirely in the control plane and published centrally by a controller as
ResourceSlice objects. These resources require absolutely zero node-local
setup. Under the current architecture, the kubelet still assumes a node-local
driver exists, forcing administrators to deploy and maintain wasteful “no-op”
node DaemonSets just to answer gRPC calls with empty success responses. If one
of these dummy helper plugins crashes or is missing, the kubelet’s unprepare
hook fails and retries indefinitely, leaving terminating pods permanently “stuck
in Terminating” and blocking cluster upgrades and node drains.
To resolve this architectural mismatch and accommodate modern deployments, we
need a way for resource drivers to declare that node preparation and cleanup
can be skipped. Bypassing these gRPC hooks directly at the ResourceSlice level
allows vendors to deploy central-only controllers with zero worker node
footprints. It also provides the flexibility to support mixed hardware
topologies—where a single driver manages some devices requiring node-level
preparation and others that do not—without splitting the driver or forcing
unnecessary footprints onto worker nodes.
Goals
- Allow resource drivers to declare that node-local operations (preparation and clean-up) are optional for devices.
- Propagate this configuration from the ResourceSlice to the final allocated ResourceClaim.Status.Allocation result.
- Update the kubelet to skip driver lookup and gRPC preparation/unpreparation steps when node operations are explicitly configured as skipped.
- Maintain backward compatibility: by default, all existing DRA drivers must continue to require node-local preparation and unpreparation.
Non-Goals
- Eliminate node-local preparation entirely.
- Enable users to override this infrastructure requirement at the individual ResourceClaimSpec level.
Proposal
We propose adding a boolean field SkipNodeOperations to
ResourceSliceSpec and DeviceRequestAllocationResult.
- API Definition: The driver/controller publisher sets SkipNodeOperations: true in ResourceSlice resources if the published devices do not require node-local setup or cleanup.
- Control Plane Resolution: The allocator/scheduler resolves the referenced ResourceSlice during allocation, and copies this configuration into ResourceClaim.Status.Allocation.Devices.Results[i].SkipNodeOperations.
- Node Execution: The kubelet reads this field from the ResourceClaim’s allocation results. If all allocated devices for a given driver within a claim set SkipNodeOperations: true, the kubelet bypasses the corresponding gRPC calls (NodePrepareResources and NodeUnprepareResources) to the node-local resource driver.
User Stories
Deploying controller-managed resources without node-local drivers
As a cluster administrator or vendor using a central driver controller, I want
to offer resources (e.g., cluster-wide shared resource pools, logically
partitioned network services, or pure control-plane gating drivers like
dra-driver-image-configurator) where availability is discovered and published as
ResourceSlice resources centrally by the controller. Because the devices
require no node-local plumbing or mount operations on worker nodes, there is no
node driver deployed. The controller publishes these resources with
SkipNodeOperations: true. When users request these
devices, the kubelet launches the pods immediately and cleanly, without
complaining about missing node-local drivers, and without requiring any node
driver DaemonSet to be present in the cluster.
Risks and Mitigations
- Dynamic ResourceSlice Changes: An administrator or controller could update SkipNodeOperations in a ResourceSlice while claims are already allocated.
  Mitigation: While freezing the allocation configuration into the ResourceClaim status ensures consistent execution for already running pods, it also means that if a driver’s requirements are updated in-place, existing claims will still use the older configuration. Specifically:
  - If a driver changes from skipping to requiring node preparation, existing claims will still have node preparation skipped by the kubelet, causing pods to run without the required hardware setup.
  - If a driver changes from requiring to skipping node preparation, and the node-local driver is decommissioned, existing claims will still require node preparation, causing the kubelet to fail or hang waiting for the missing driver plugin.
  Because this skew is inherent to the decoupled nature of scheduling and runtime, this risk must be managed operationally: cluster administrators must perform driver upgrades and migrations carefully, ensuring no active claims/pods exist for the driver before changing its configuration or decommissioning node-local driver components.
- Backward Compatibility & Out-of-Tree / Custom Allocators: Old scheduler clients or out-of-tree custom driver controllers/allocators might write allocation results without setting the new SkipNodeOperations field.
  - Mitigation: The behavior depends on whether the driver uses optional node preparation:
    - For drivers that do not use optional node preparation (i.e., require node-local setup): The pointer fields default to nil when omitted, which is treated as false (not skipped). The kubelet will execute node preparation and clean-up as normal. This guarantees 100% backward compatibility with all existing schedulers, custom controllers, and running workloads.
    - For drivers that use optional node preparation (and do not deploy a node-local driver): If an old or out-of-tree allocator fails to copy the skip fields from the ResourceSlice to the ResourceClaim status, the kubelet will default to executing preparation and fail because no node-local driver is running. To mitigate this:
      - Custom allocators/schedulers must be upgraded to support and copy the new fields before they can be used with optional-preparation drivers.
      - Alternatively, during transitions, operators can deploy a minimal, “no-op” node-local daemon for the driver to satisfy the kubelet’s gRPC calls until the allocator is upgraded.
Design Details
API Changes
ResourceSliceSpec:

```go
type ResourceSliceSpec struct {
	...
	// SkipNodeOperations indicates that node-local resource operations
	// (NodePrepareResources and NodeUnprepareResources gRPC calls)
	// are not required for the devices in this slice. Defaults to nil (false).
	// +optional
	SkipNodeOperations *bool `json:"skipNodeOperations,omitempty" protobuf:"varint,9,opt,name=skipNodeOperations"`
}
```

DeviceRequestAllocationResult:

```go
type DeviceRequestAllocationResult struct {
	...
	// SkipNodeOperations indicates that node-local operations are not required for this allocated device.
	// Typically copied from the corresponding ResourceSliceSpec by the allocator/scheduler. Defaults to nil (false).
	// +optional
	SkipNodeOperations *bool `json:"skipNodeOperations,omitempty" protobuf:"varint,11,opt,name=skipNodeOperations"`
}
```
API Server Handling and Ratcheting Validation
The kube-apiserver validates the new SkipNodeOperations field against the
state of the DRAOptionalNodePreparation feature gate to prevent workloads from
entering a broken state. To support safe cluster rollbacks and downgrades, this
feature gate enforcement is implemented using Ratcheting Validation (in
accordance with the Kubernetes API Changes Guide):
- When the feature gate is disabled:
  - New Resources (POST): Any attempt to create a ResourceSlice or allocate a ResourceClaim (via its status) with SkipNodeOperations set to true is rejected with a validation error.
  - Existing Resources (PUT): The API server allows updates to existing resources that already have this field set to true (e.g., persisted while the feature gate was enabled before a downgrade), provided the update does not attempt to newly enable or modify this field. Any transition of this field from nil/false to true is rejected.
- When the feature gate is enabled:
  - The fields are validated and persisted normally.
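The ratcheting rule above can be illustrated with a minimal sketch. It is not the actual kube-apiserver validation code: the type sliceSpec and the helper validateSkip are hypothetical stand-ins, and the sketch only models the nil/false to true transition that the text says is rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// sliceSpec is a hypothetical stand-in for the part of ResourceSliceSpec
// (or DeviceRequestAllocationResult) that carries the new field.
type sliceSpec struct {
	SkipNodeOperations *bool
}

// isTrue treats nil as false, matching the defaulting described in this KEP.
func isTrue(p *bool) bool { return p != nil && *p }

var errGateDisabled = errors.New("skipNodeOperations may not be set to true while DRAOptionalNodePreparation is disabled")

// validateSkip is a hypothetical helper. old is nil on create (POST).
func validateSkip(gateEnabled bool, old, cur *sliceSpec) error {
	if gateEnabled || !isTrue(cur.SkipNodeOperations) {
		return nil
	}
	// Gate disabled and the new object has the field set to true:
	// tolerate it only if the stored object already had it set (ratcheting).
	if old != nil && isTrue(old.SkipNodeOperations) {
		return nil
	}
	return errGateDisabled
}

func main() {
	yes := true
	// Create with the field set while the gate is disabled.
	fmt.Println("create:", validateSkip(false, nil, &sliceSpec{SkipNodeOperations: &yes}))
	// Update of an object that already had the field set.
	stored := &sliceSpec{SkipNodeOperations: &yes}
	fmt.Println("update:", validateSkip(false, stored, &sliceSpec{SkipNodeOperations: &yes}))
}
```

Passing the stored object alongside the new one is what lets a downgraded API server keep accepting unrelated updates to objects persisted while the gate was on.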
Allocator Changes
During scheduling, the structured parameters allocator resolves ResourceSlices
that contain the allocated devices.
If the DRAOptionalNodePreparation feature gate is enabled:
- The allocator extracts the SkipNodeOperations boolean value from the corresponding ResourceSliceSpec and copies it directly into each DeviceRequestAllocationResult under ResourceClaim.Status.Allocation.Devices.Results.
If the DRAOptionalNodePreparation feature gate is disabled:
- If any resolved ResourceSlice has SkipNodeOperations set to true, the allocator will fail the allocation of this claim. This prevents scheduling pods when node operations cannot be safely bypassed or properly requested.
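As a rough illustration of the two allocator branches described here, the sketch below copies the field when the gate is enabled and fails allocation when the gate is disabled but the slice requests skipping. The types resourceSlice and allocationResult and the function buildResult are hypothetical simplifications, not the structured allocator's real API.

```go
package main

import "fmt"

// resourceSlice and allocationResult are hypothetical, trimmed-down stand-ins
// for ResourceSliceSpec and DeviceRequestAllocationResult.
type resourceSlice struct {
	Driver             string
	SkipNodeOperations *bool
}

type allocationResult struct {
	Driver             string
	Device             string
	SkipNodeOperations *bool
}

// buildResult is a hypothetical helper showing the propagation rule.
func buildResult(gateEnabled bool, s resourceSlice, device string) (allocationResult, error) {
	skip := s.SkipNodeOperations != nil && *s.SkipNodeOperations
	if skip && !gateEnabled {
		// Fail early instead of producing a claim the kubelet cannot serve.
		return allocationResult{}, fmt.Errorf("driver %q: slice sets skipNodeOperations but the feature gate is disabled", s.Driver)
	}
	res := allocationResult{Driver: s.Driver, Device: device}
	if gateEnabled && s.SkipNodeOperations != nil {
		v := *s.SkipNodeOperations // copy the value rather than aliasing the slice's pointer
		res.SkipNodeOperations = &v
	}
	return res, nil
}

func main() {
	yes := true
	s := resourceSlice{Driver: "example.com/driver", SkipNodeOperations: &yes}

	res, err := buildResult(true, s, "dev-0")
	fmt.Println(res.SkipNodeOperations != nil && *res.SkipNodeOperations, err)

	_, err = buildResult(false, s, "dev-0")
	fmt.Println(err)
}
```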
Kubelet Changes
When Kubelet prepares resources for an allocated claim, it evaluates the allocated devices’ status:
- Aggregation: Because Kubelet invokes preparation and clean-up per-claim, Kubelet can only bypass node operations if all devices for a given driver allocated in a claim have SkipNodeOperations set to true.
- Checkpointing: Kubelet caches this aggregated property inside its checkpointed, claim-specific state (ClaimInfo) so it is safely preserved across Kubelet restarts. To ensure robust upgrade/downgrade compatibility, the checkpoint serialization must be forward and backward compatible.
- Bypassing: During PrepareResources and UnprepareResources, the DRA manager checks the claim’s cached properties. If skipping is enabled for the driver under that claim (meaning all allocated devices have SkipNodeOperations explicitly set to true in the allocation result), it bypasses driver registry lookup and the respective gRPC calls (NodePrepareResources or NodeUnprepareResources), allowing container startup/pod termination to proceed immediately. If any device has a nil or false value, the aggregate is false (do not skip).
- Disabled Feature Gate Behavior: If the DRAOptionalNodePreparation feature gate is disabled on the kubelet:
  - Early Admission Failure: If the NodeDeclaredFeatures framework is active, the Kubelet’s pod admission handler will use the shared library to infer the pod’s requirements. If a pod requires DraOptionalNodePreparation (because its allocated claims specify skipping node operations) but the Kubelet has the feature gate disabled (meaning it does not declare the feature in its status), the Kubelet will reject the pod during admission. This prevents the pod from attempting to run and failing later.
  - Defense-in-Depth for PrepareResources: If a pod somehow bypasses the Kubelet’s admission handler, the DRA manager’s existing check acts as a secondary defense: if a claim’s allocation result specifies SkipNodeOperations: true, the DRA manager fails PrepareResources immediately with a DRAOptionalNodePreparationDisabled error, preventing the pod from running with uninitialized hardware.
  - Safe Rollback for UnprepareResources: Since we already validate and fail during admission or PrepareResources, we do not need any additional checks or errors during UnprepareResources if the feature gate is disabled. Specifically, if a pod with SkipNodeOperations: true was already processed (e.g., when the feature gate was enabled) but the feature gate is subsequently disabled, the Kubelet will still skip the unprepare call and allow the pod to terminate cleanly. This honors the original intent and prevents the pod from getting permanently stuck in the Terminating state. Since the pod is already running, it is not subject to new admission checks during termination.
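The per-driver aggregation rule can be sketched as follows. This is only an illustration under simplifying assumptions: deviceResult and skipByDriver are hypothetical names, and the real kubelet DRA manager keeps this state inside ClaimInfo rather than in a plain map.

```go
package main

import "fmt"

// deviceResult is a hypothetical stand-in for one entry of
// ResourceClaim.Status.Allocation.Devices.Results.
type deviceResult struct {
	Driver             string
	SkipNodeOperations *bool
}

// skipByDriver reports, for each driver in a claim, whether every device
// allocated from that driver has SkipNodeOperations explicitly set to true.
// A single nil or false value makes the aggregate false for that driver.
func skipByDriver(results []deviceResult) map[string]bool {
	skip := make(map[string]bool)
	for _, r := range results {
		deviceSkips := r.SkipNodeOperations != nil && *r.SkipNodeOperations
		prev, seen := skip[r.Driver]
		skip[r.Driver] = deviceSkips && (!seen || prev)
	}
	return skip
}

func main() {
	yes := true
	results := []deviceResult{
		{Driver: "central.example.com", SkipNodeOperations: &yes},
		{Driver: "central.example.com", SkipNodeOperations: &yes},
		{Driver: "mixed.example.com", SkipNodeOperations: &yes},
		{Driver: "mixed.example.com"}, // nil: this device still needs node preparation
	}
	for driver, skip := range skipByDriver(results) {
		fmt.Printf("%s skip=%v\n", driver, skip)
	}
}
```

In this sketch a driver with mixed devices still receives its gRPC calls, which matches the requirement that skipping applies only when all of a driver's devices in the claim opt out.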
Node Declared Features Integration
To manage version skew safely during rolling upgrades, this KEP integrates with the Node Declared Features framework. This allows the control plane to discover whether a node’s Kubelet supports optional node preparation before scheduling workloads, preventing pods from being scheduled to incompatible nodes.
We register a new declared feature:
- Feature Name: DraOptionalNodePreparation
- Associated Feature Gate: DRAOptionalNodePreparation
- Discovery Logic (Kubelet): A node declares support for DraOptionalNodePreparation in its node.status.declaredFeatures if and only if the DRAOptionalNodePreparation feature gate is enabled on the Kubelet.
- Inference Logic (Scheduler & Admission): The control plane infers that a Pod requires the DraOptionalNodePreparation feature if:
  - The Pod references one or more ResourceClaims.
  - At least one of those claims is allocated (has an AllocationResult).
  - Within the allocation result, any device has SkipNodeOperations set to true.
  If these conditions are met, the Pod is marked as requiring DraOptionalNodePreparation on the target node.
- Max Version: This feature ceases to be a scheduling constraint once the DRAOptionalNodePreparation feature graduates to GA and the minimum supported Kubelet version in the cluster skew policy guarantees support.
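A minimal sketch of the inference and matching logic listed above, assuming simplified local types. The names resourceClaim, podRequiresFeature, and nodeFits are hypothetical and do not reflect the nodedeclaredfeatures library's actual interfaces.

```go
package main

import "fmt"

const featureName = "DraOptionalNodePreparation"

// Hypothetical, trimmed-down stand-ins for the API types involved.
type deviceResult struct{ SkipNodeOperations *bool }
type allocationResult struct{ Results []deviceResult }
type resourceClaim struct{ Allocation *allocationResult }

// podRequiresFeature returns true if any allocated claim referenced by the
// pod contains a device with SkipNodeOperations set to true.
func podRequiresFeature(claims []resourceClaim) bool {
	for _, c := range claims {
		if c.Allocation == nil {
			continue // unallocated claims impose no requirement
		}
		for _, r := range c.Allocation.Results {
			if r.SkipNodeOperations != nil && *r.SkipNodeOperations {
				return true
			}
		}
	}
	return false
}

// nodeFits checks the requirement against a node's declared features.
func nodeFits(claims []resourceClaim, declaredFeatures []string) bool {
	if !podRequiresFeature(claims) {
		return true
	}
	for _, f := range declaredFeatures {
		if f == featureName {
			return true
		}
	}
	return false
}

func main() {
	yes := true
	claims := []resourceClaim{{Allocation: &allocationResult{
		Results: []deviceResult{{SkipNodeOperations: &yes}},
	}}}
	fmt.Println("old node fits:", nodeFits(claims, nil))
	fmt.Println("new node fits:", nodeFits(claims, []string{featureName}))
}
```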
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None.
Unit tests
- API Server Validation Unit Tests: In pkg/apis/resource/validation/validation_test.go:
  - Verify the four ratcheting validation cases when the feature gate is disabled:
    - New valid (field nil/false) -> Succeeds.
    - New invalid (field true) -> Fails.
    - Old valid (field nil/false) -> Succeeds.
    - Old invalid (field true in old object, unchanged in new) -> Succeeds.
- Allocator Unit Tests: In staging/src/k8s.io/dynamic-resource-allocation/structured/allocator_test.go:
  - Verify that SkipNodeOperations in ResourceSliceSpec is correctly propagated to AllocationResult (covering combinations of true, false, and omitted cases).
- Kubelet DRA Manager Unit Tests: In pkg/kubelet/cm/dra/manager_test.go:
  - Mock claims with valid values of SkipNodeOperations (true, false, nil).
  - Assert that prepareResources and unprepareResources behave accordingly:
    - If SkipNodeOperations is true, prepareResources and unprepareResources bypass the plugin manager and succeed immediately.
    - If SkipNodeOperations is false or nil, prepareResources and unprepareResources attempt to call the local driver.
- Shared Library Unit Tests: In staging/src/k8s.io/component-helpers/nodedeclaredfeatures/features/draoptionalnodepreparation_test.go:
  - Verify the node declared feature for DraOptionalNodePreparation behaves correctly.
- Kubelet Admission Unit Tests: In Kubelet pod admission tests:
  - Verify that the Kubelet’s pod admission handler rejects a pod requiring DraOptionalNodePreparation if the feature gate is disabled on the Kubelet.
- Kubelet Checkpoint State Unit Tests: In pkg/kubelet/cm/dra/claiminfo_test.go:
  - Verify backward and forward compatibility of the serialized ClaimInfo checkpoint state:
    - Forward Compatibility (Downgrade/Rollback): Verify that a checkpoint file written by a Kubelet running version N (containing the new SkipNodeOperations field in ClaimInfo) can be successfully parsed and deserialized by a Kubelet running version N-1 (or with the feature gate disabled) without parsing errors or crashes, with unrecognized fields being safely ignored.
    - Backward Compatibility (Upgrade): Verify that an older checkpoint file written by a Kubelet running version N-1 (which completely lacks the new field) is successfully parsed and deserialized by Kubelet version N, with the field defaulting to false/nil (ensuring we do not skip preparation/unpreparation for legacy claims).
Integration tests
- Scheduler Filtering Integration Tests: In test/integration/scheduler/filters/:
  - Verify that a pod requiring DraOptionalNodePreparation (having a claim allocated with skip fields set to true) is successfully scheduled to a node that advertises the feature.
  - Verify that the scheduler filters out (rejects) nodes that do not advertise the feature (representing older Kubelets or nodes with the feature gate disabled).
  - Verify that if no compatible nodes are available, the pod remains in the Pending state with a FailedScheduling event indicating the missing DraOptionalNodePreparation feature on nodes.
e2e tests
We will add new End-to-End test cases inside test/e2e/dra/dra.go to validate
SkipNodeOperations configurations using different
driver configurations.
Scenario 1: Driver without node-local components (Pure Control-Plane)
This scenario validates that we can run workloads using drivers that do not deploy any node-local components.
- Setup: Deploy a DRA test driver without node gRPC components running on worker nodes (WithKubelet = false).
- Test Case 1.1: Fully skipped node operations (SkipNodeOperations: true)
  - API Configuration: Publish ResourceSlices with SkipNodeOperations set to true.
  - Workload: Deploy a Pod referencing this resource.
  - Assertions:
    - The Pod reaches the Running phase successfully.
    - The allocated device can be accessed.
    - No FailedPrepareDynamicResources warnings are posted to the Pod events.
    - Pod deletion completes cleanly and immediately (does not hang in Terminating waiting for unprepare).
- Test Case 1.2: Missing node component failure (SkipNodeOperations: false)
  - API Configuration: Publish ResourceSlices with SkipNodeOperations set to false (or omitted).
  - Workload: Deploy a Pod referencing this resource.
  - Assertions:
    - The Pod gets stuck in ContainerCreating with FailedPrepareDynamicResources errors because the kubelet tries to contact the non-existent node driver.
Scenario 2: Driver with node-local components (Standard Driver)
This scenario validates that the kubelet invokes the node-local
driver when SkipNodeOperations is false.
- Setup: Deploy a standard DRA test driver that includes node-local gRPC components.
- Test Case 2.1: Standard Node Execution (SkipNodeOperations: false)
  - API Configuration: Publish ResourceSlices with SkipNodeOperations: false.
  - Workload: Deploy a Pod.
  - Assertions:
    - The Pod reaches the Running phase.
    - Assert that the driver’s NodePrepareResources was called.
    - Delete the Pod.
    - Assert that the driver’s NodeUnprepareResources was called.
Scenario 3: Upgrade / Downgrade and Feature Gate Rollback
This scenario validates that the system behaves correctly during a rolling upgrade or downgrade/rollback of the feature gate.
- Test Case 3.1: Rolling Upgrade (N-1 to N):
  - Setup: Start with a cluster running version N-1 (feature gate disabled). Deploy a DRA driver.
  - Action 1 (Control Plane Upgrade): Upgrade the control plane to version N (feature gate enabled).
    - Assertions:
      - If we deploy a new workload using a control-plane-only driver (no node-local components):
        - The Pod remains in the Pending state (unschedulable). The scheduler’s NodeDeclaredFeatures plugin must filter out all N-1 worker nodes because they do not advertise the DraOptionalNodePreparation feature in their status.
        - The pod must not be scheduled to any N-1 node.
  - Action 2 (Kubelet Upgrade): Upgrade the kubelets to version N.
    - Assertions:
      - Once a Kubelet is upgraded to N and advertises DraOptionalNodePreparation, verify that the pending workload is automatically scheduled to that node, successfully bypasses node preparation, transitions to Running, and runs successfully.
      - Verify that deleting the control-plane-only workload completes immediately without trying to contact a node-local driver.
- Test Case 3.2: Feature Gate Rollback / Downgrade (N to N-1):
  - Setup: Start with a cluster running version N (feature gate enabled). Deploy a standard DRA driver and a workload using SkipNodeOperations: true.
  - Action 1 (Control Plane Downgrade): Downgrade the control plane to N-1 (feature gate disabled).
    - Assertions:
      - The API server’s ratcheting validation allows the existing ResourceSlice objects to remain valid and not be rejected on unrelated updates.
      - The running workload on the N kubelet continues to run without interruption.
  - Action 2 (Kubelet Downgrade / Feature Gate Rollback): Downgrade the Kubelet binary to N-1, or disable the DRAOptionalNodePreparation feature gate on Kubelet version N, and restart the Kubelet.
    - Assertions:
      - Checkpoint Recovery: The Kubelet starts up successfully and parses the checkpoint file without errors or crashes.
        - For Kubelet version N (gate disabled): The Kubelet successfully recovers the full ClaimInfo state including the saved SkipNodeOperations: true setting.
        - For Kubelet version N-1: The Kubelet successfully parses the checkpoint by ignoring the unknown fields, and recovers the rest of the state while defaulting the missing skip field to false.
      - Workload Deletion:
        - Verify that the kubelet behaves according to the Disabled Feature Gate Behavior section upon workload deletion.
- Test Case 3.3: Upgrade -> Downgrade -> Upgrade (N-1 -> N -> N-1 -> N):
  - Setup: Start with a cluster running version N-1 (feature gate disabled). Deploy a standard DRA driver.
  - Action 1 (Upgrade): Upgrade the cluster to version N (feature gate enabled).
    - Deploy a workload using a driver configured with SkipNodeOperations: true.
    - Assert that the workload runs successfully.
  - Action 2 (Downgrade): Downgrade the cluster to version N-1 (feature gate disabled).
    - Assertions:
      - Assert that the API server’s ratcheting validation allows the existing ResourceSlice (which has SkipNodeOperations: true) to remain valid and unmodified.
      - Delete the workload and verify that the kubelet behaves according to the Disabled Feature Gate Behavior section.
  - Action 3 (Upgrade Again): Upgrade the cluster back to version N (feature gate enabled).
    - Deploy a new workload using the same driver.
    - Assert that the new field is respected, and the workload runs successfully.
    - Assert that any pre-existing resource slices that survived the downgrade cycle continue to function correctly with the re-enabled feature gate.
Graduation Criteria
Alpha
- Feature implemented behind the DRAOptionalNodePreparation feature flag (off by default).
- Full unit and basic E2E test suites (Scenario 1 & 2) implemented and green.
Beta
- Enable the feature gate by default.
- E2E upgrade/downgrade and rollback test suites (Scenario 3) implemented and green.
- Gather real-world feedback from developers and vendors deploying controller-managed DRA drivers.
- Ensure no regressions or performance issues are observed in large clusters.
GA
- Feature gate locked to true.
Upgrade / Downgrade Strategy
- Upgrade:
  - When the cluster control plane and nodes are upgraded, all preexisting claims (where the new pointer field is absent/nil) automatically evaluate to false (not skipped). This guarantees no change in behavior for running workloads.
  - Newer claims can utilize drivers that publish resource slices configured with SkipNodeOperations: true to bypass node-local execution.
  - During rolling upgrades, the scheduler’s NodeDeclaredFeatures plugin will automatically restrict the scheduling of pods using these newer “no-prep” claims to upgraded nodes that advertise support for DraOptionalNodePreparation.
- Downgrade:
  - If a cluster is downgraded to a version where DRAOptionalNodePreparation is disabled/unavailable, the kubelet will ignore the skip field and default to the legacy behavior of expecting node preparation.
  - If any pods are running using a driver without node-local drivers, those pods will fail to restart or delete cleanly if the kubelet tries to invoke node-local gRPC calls that don’t exist. Operators must ensure all pods using no-prep claims are terminated before downgrading, or ensure temporary no-op drivers are running during downgrade transitions.
Version Skew Strategy
Older kubelet (N-1 and older) / Upgraded Control Plane (N):
- Automated Version Skew Protection: If the control plane is upgraded and generates allocations with SkipNodeOperations: true, the scheduler’s NodeDeclaredFeatures plugin will automatically infer that the pod requires the DraOptionalNodePreparation feature.
- Because older worker nodes running older Kubelets (N-1 and older) do not support the feature gate, they will not advertise DraOptionalNodePreparation in their node.status.declaredFeatures.
- The scheduler will automatically filter out these older nodes during the scheduling cycle, guaranteeing that the pod will only land on compatible, upgraded nodes.
Upgraded kubelet (N) / Older Control Plane (N-1 and older):
If the control plane has not been upgraded yet, any new allocations will not have SkipNodeOperations set in the status.
An upgraded kubelet (N) will read the absent fields and default to false (requiring node preparation/unpreparation).
The behavior depends on whether the driver uses optional node preparation:
- For drivers that do not use optional node preparation (i.e., require node-local setup): The fallback to false ensures backward-compatible, safe execution because the node-local driver is running and kubelet will coordinate with it as normal.
- For drivers that use optional node preparation (and do not deploy a node-local driver): The fallback to false means the upgraded kubelet will attempt to coordinate with the local driver and fail because no node-local driver is running.
  - Mitigation: The control plane must be upgraded before these optional-preparation drivers can be deployed, or a temporary, minimal “no-op” node-local daemon must be deployed to satisfy the kubelet’s gRPC calls during the transition window.
Note: This same fallback behavior occurs if the control plane is upgraded (N) but the active custom allocator or scheduler has not been upgraded to support KEP-5945 yet and fails to copy the field.
Kubelet Feature Gate Disabled / SkipNodeOperations set to true:
- If the control plane has the gate enabled and writes SkipNodeOperations: true, but the upgraded kubelet has the gate disabled:
  - Pods requesting SkipNodeOperations: true will fail PrepareResources with a clear DRAOptionalNodePreparationDisabled error.
  - For already running pods (in case the feature gate was disabled after successful PrepareResources), the kubelet will honor SkipNodeOperations: true and skip the unprepare call during UnprepareResources, allowing the pod to terminate cleanly.
Scheduler Feature Gate Disabled / SkipNodeOperations set to true:
- If the DRAOptionalNodePreparation feature gate is disabled in the scheduler/allocator, but a driver publishes ResourceSlices with SkipNodeOperations: true (e.g., due to inconsistent feature gates in a rolling upgrade, or lingering slices after downgrade), the scheduler/allocator will fail the allocation of those claims.
- This ensures that we fail allocation early in the scheduling lifecycle (which allows rescheduling/retry after correcting the configuration), rather than scheduling the pod incorrectly (where fields are not copied to the claim status and the kubelet subsequently gets stuck expecting a node-local driver).
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate
  - Feature gate name: DRAOptionalNodePreparation
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-scheduler
    - kubelet
Does enabling the feature change any default behavior?
No. By default, absent pointer fields evaluate to nil (which defaults to
false in code), meaning all resource claims continue to require node
preparation and cleanup unless explicitly set to true in the published
ResourceSlice by the driver.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Setting the feature gate to false and restarting components will disable
it. If disabled, any new allocations for standard drivers will proceed normally
(propagating false or nil). However, any allocation requests targeting drivers
that set SkipNodeOperations: true in their ResourceSlices
will fail during allocation, preventing workloads from scheduling into a state
where node preparation is incorrectly expected by the kubelet but cannot be satisfied.
What happens if we reenable the feature if it was previously rolled back?
Re-enabling the feature gate is safe. Any claims allocated while the feature was
disabled will have the skip field unset or false in their status, so they will
continue to be processed with node-local preparation. Claims allocated after
re-enablement can once again use no-prep resource pools. No state corruption or
data loss occurs.
Are there any tests for feature enablement/disablement?
Yes. Unit tests in the allocator will verify that when the feature gate is
disabled, if any ResourceSlice has SkipNodeOperations: true,
the allocator returns an error and fails allocation.
Kubelet unit tests will verify that when the feature gate is disabled on the node:
- It fails PrepareResources if any active claim has SkipNodeOperations: true.
- During UnprepareResources of an already running pod, it still skips cleanup if the claim has SkipNodeOperations: true, allowing the pod to terminate cleanly.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
- A rollback can fail if pods were deployed relying on a driver that has no node-local component. If rolled back, the kubelet will start expecting a node driver, blocking those pods’ termination or restarts.
  - Mitigation: Operators should ensure that no no-prep pods are active in the cluster before disabling the feature gate.
What specific metrics should inform a rollback?
An increase in kubelet_dra_operations_duration_seconds or in
FailedPrepareDynamicResources warnings on the kubelet, indicating that the
kubelet is attempting node preparation and blocking or failing because node
drivers are missing.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
This will be tested as part of the Beta graduation criteria using the upgrade/downgrade E2E test plan. See Scenario 3: Upgrade / Downgrade and Feature Gate Rollback in the Test Plan for the detailed test cases.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
By exposing and monitoring the kubelet-side counter metrics
kubelet_dra_node_prepare_skips_total and
kubelet_dra_node_unprepare_skips_total, or by auditing active ResourceClaim
allocations to check if .status.allocation.devices.results[*].skipNodeOperations is set to true.
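As a rough illustration of the audit approach, the sketch below scans ResourceClaim objects that have already been parsed from JSON and lists those with at least one allocation result that skips node operations. The function name is hypothetical; the field path follows the one given above, and an absent field is treated as false.

```python
from typing import Any, Dict, Iterable, List


def claims_skipping_node_ops(claims: Iterable[Dict[str, Any]]) -> List[str]:
    """Return "namespace/name" for claims with any result that skips node operations."""
    names = []
    for claim in claims:
        # Unallocated claims have no status.allocation; treat every level as optional.
        allocation = (claim.get("status") or {}).get("allocation") or {}
        results = (allocation.get("devices") or {}).get("results") or []
        # An absent or false skipNodeOperations means node preparation is required.
        if any(r.get("skipNodeOperations") is True for r in results):
            meta = claim.get("metadata") or {}
            names.append(f'{meta.get("namespace", "")}/{meta.get("name", "")}')
    return names
```

One way to feed it is the items list from kubectl get resourceclaims -A -o json, loaded with the standard json module.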
How can someone using this feature know that it is working for their instance?
- API .status
  - Other field: .Status.Allocation.Devices.Results[*].SkipNodeOperations will be true in the ResourceClaim.
  - Workloads run successfully without node-local drivers deployed.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- Bypassing the node driver lookup should reduce pod startup latency (prepareResources) for resources not requiring node preparation to near-zero.
- 0% error rate in kubelet resource preparation for no-prep claims.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- kubelet metrics: kubelet_dra_operations_duration_seconds for prepare and unprepare actions.
- Core Event rate for FailedPrepareDynamicResources.
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes, we propose introducing two new kubelet-side counter metrics:
kubelet_dra_node_prepare_skips_total and
kubelet_dra_node_unprepare_skips_total (partitioned by driver_name). These
will track the total number of preparation and cleanup operations skipped
because the claim’s resources do not require node-local setup, allowing
operators to easily monitor optional preparation usage without querying the API
server.
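To make the proposed partitioning concrete, here is a minimal in-memory sketch of two counters keyed by driver_name. It is not the kubelet's metrics code: the class and method names are hypothetical, and the rendered lines only imitate a Prometheus-style text format for readability.

```python
from collections import Counter
from typing import List


class SkipCounters:
    """Hypothetical stand-in for the two proposed kubelet counters."""

    def __init__(self) -> None:
        self._counts = {
            "kubelet_dra_node_prepare_skips_total": Counter(),
            "kubelet_dra_node_unprepare_skips_total": Counter(),
        }

    def record_prepare_skip(self, driver_name: str) -> None:
        self._counts["kubelet_dra_node_prepare_skips_total"][driver_name] += 1

    def record_unprepare_skip(self, driver_name: str) -> None:
        self._counts["kubelet_dra_node_unprepare_skips_total"][driver_name] += 1

    def render(self) -> List[str]:
        """Render one line per (metric, driver_name) pair, sorted by driver name."""
        lines = []
        for metric, per_driver in self._counts.items():
            for driver_name, value in sorted(per_driver.items()):
                lines.append(f'{metric}{{driver_name="{driver_name}"}} {value}')
        return lines
```

Keeping driver_name as the only label keeps cardinality bounded by the number of drivers rather than by the number of claims or pods.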
Dependencies
Does this feature depend on any specific services running in the cluster?
Yes. This feature depends on the Node Declared Features framework.
Scalability
Will enabling / using this feature result in any new API calls?
No. It reuses existing API objects and calls.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, slightly. An optional boolean pointer field is added to
ResourceSliceSpec and DeviceRequestAllocationResult.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. It actually reduces time taken by kubelet pod startup since it skips gRPC lookups and network calls.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No. Bypassing node-local drivers reduces total cluster-wide memory and CPU consumption by eliminating unnecessary helper daemonsets.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. In fact, it helps avoid resource exhaustion by eliminating the need to run dummy daemonsets on every node for drivers that have no node-local component, which saves PIDs, memory, and sockets.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The kubelet relies on its locally saved ClaimInfo cache. If etcd is down, new
claims cannot be created/scheduled, but existing pods can be terminated cleanly
without requiring API server calls for no-prep claims.
What are other known failure modes?
Misconfigured driver skip settings: If a driver controller misconfigures SkipNodeOperations: true for a physical device that does require node preparation, the kubelet will skip preparation, causing containers to start without necessary mounts or initialization, leading to container application crashes.
- Mitigation: Driver developers and administrators must ensure that SkipNodeOperations: true is only applied to ResourceSlices representing resources that require absolutely no node-local preparation or device plumbing on the worker nodes.
Driver requirements change in-place: If a driver’s node preparation requirements are updated in-place (e.g., changing SkipNodeOperations in new resource slices), existing claims will still use the older configuration. Specifically:
- If changing from skipping to requiring preparation, existing claims will still have node preparation skipped by the kubelet (potentially causing pod failures).
- If changing from requiring to skipping preparation and decommissioning the node-local driver, existing claims will still require node preparation, causing the kubelet to fail or hang waiting for the missing driver plugin.
- Mitigation: Administrators must perform such migrations/upgrades carefully (e.g., ensuring no active claims or pods exist for the driver before updating its configuration or decommissioning node-local driver components).
Older or custom allocator fails to copy field: If an older or custom scheduler/allocator does not support copying the skip field from the ResourceSlice to the ResourceClaim status, the kubelet will default to executing node preparation, which will fail if the driver has no node-local component deployed on the worker nodes.
- Mitigation: Ensure the custom allocator/scheduler is upgraded to support and copy the new fields before deploying optional-preparation drivers, or temporarily run a minimal “no-op” node-local daemon for the driver.
What steps should be taken if SLOs are not being met to determine the problem?
- Verify if the affected Pod has FailedPrepareDynamicResources events.
- Inspect the associated ResourceClaim status: kubectl get resourceclaim <claim-name> -o yaml.
- Check if .status.allocation.devices.results[*].skipNodeOperations is set to true. If it is nil or false but the driver is configured with SkipNodeOperations: true in its ResourceSlice, verify if the scheduler or custom allocator has been upgraded to support KEP-5945 and correctly copies this field.
- If allocation itself is failing for the pod’s claims with errors indicating that the optional node preparation feature is disabled in the scheduler, verify that the DRAOptionalNodePreparation feature gate is enabled in the scheduler/allocator components.
- If PrepareResources fails with a DRAOptionalNodePreparationDisabled error, verify that the DRAOptionalNodePreparation feature gate is enabled on the target kubelet.
- If a terminating pod was deleted and skipped cleanup, verify if it had SkipNodeOperations: true in its allocation result, which allows bypassing cleanup even when the feature gate is disabled.
- If resource preparation succeeded (skipped) but the container fails to start or run because of missing hardware access, verify that the ResourceSlice was not misconfigured. If the device actually requires node-local prep, SkipNodeOperations must be set to false (or omitted).
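The checks above can be read as a simple triage order. The sketch below encodes that order as a hypothetical helper; the function name, parameters, and returned hint strings are illustrative assumptions and do not correspond to any existing tool.

```python
from typing import Optional


def diagnose(slice_skip: bool, claim_skip: Optional[bool],
             scheduler_gate: bool, kubelet_gate: bool) -> str:
    """Map observed state to the troubleshooting step most likely to apply."""
    if slice_skip and not scheduler_gate:
        # Allocation fails early when the scheduler/allocator gate is off.
        return "enable DRAOptionalNodePreparation in the scheduler/allocator"
    if slice_skip and not claim_skip:
        # The slice requests skipping, but the claim status does not carry the field.
        return "upgrade the scheduler or custom allocator so it copies the field"
    if claim_skip and not kubelet_gate:
        # PrepareResources fails with DRAOptionalNodePreparationDisabled.
        return "enable DRAOptionalNodePreparation on the kubelet"
    if claim_skip:
        # Preparation was skipped; remaining failures point at the slice itself.
        return "check the ResourceSlice for a misconfigured SkipNodeOperations"
    return "node preparation is required: check the node-local driver"
```

The ordering mirrors the pod lifecycle (allocation, then field propagation, then kubelet preparation), so the earliest failing stage is reported first.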
Implementation History
- 2026-05-21: KEP drafted and proposed as Provisional for Alpha stage.
Drawbacks
- Adds a new boolean configuration field to the API, which increases API surface area. However, this is necessary to support controller-managed or logical resources natively without node-local drivers in a clean way.
Alternatives
Alternative 1: DeviceClass-level configuration
Configure this on the cluster-scoped DeviceClassSpec.
- Reason for Rejection: The cluster administrator shouldn’t have to specify whether a device needs node preparation. Shifting it to ResourceSlice (driver-owned) makes it fully automatic and matches the driver’s self-declared capability.
Alternative 2: Claim-level declaration
Allow users to declare SkipNodeOperations: true in their ResourceClaimSpec.
- Reason for Rejection: Users should not be concerned with, or even know about, the underlying node-level physical or logical prep requirements of the hardware. This is an operational and infrastructure concern that belongs entirely to the vendor and scheduler/kubelet.
Alternative 3: Kubelet Auto-Discovery / gRPC probe with timeout
Instead of using an API field, the kubelet could automatically probe for a local driver. If no driver is registered after a short timeout, it assumes preparation is not needed and starts the pod.
- Reason for Rejection: This is extremely risky. The kubelet cannot distinguish between “no driver is supposed to be here” and “the driver has crashed, is slow to start, or is overloaded”. Using a timeout would result in flaky pod startups, silent failures, and potential security/consistency issues where containers launch before their local devices are fully prepared. Explicit declaration via the API is deterministic and secure.
Alternative 4: Centralized catch-all no-op plugin
Deploy a generic, “no-op” DRA driver (such as dra-driver-noop) configured centrally to register under specific DRA driver names and handle the node preparation calls by immediately returning success without doing any actual work.
- Reason for Rejection: While this allows running without modifying the DRA API, it has several drawbacks:
  - Operational Overhead: It requires deploying and managing an additional daemon/driver on nodes just to satisfy the Kubelet’s handshake, increasing operational complexity.
  - Mixed-mode coordination: It is difficult to coordinate in environments with “mixed-mode” resources, where some devices of a particular driver name require actual node-local preparation (and thus need a real driver) while others do not. A static “catch-all” driver cannot easily co-exist or coordinate with a real driver registering under the same driver name on the same node to selectively handle or bypass preparation.
  - Lack of Explicit Intent: It hides the logical nature of the resource behind a dummy driver, making debugging and cluster observation more difficult compared to an explicit SkipNodeOperations field in the ResourceSlice.