KEP-5055: DRA: device taints and tolerations

KEP-5055 : DRA: device taints and tolerations

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

With Dynamic Resource Allocation (DRA), DRA drivers publish information about the devices that they manage in ResourceSlices. This information is used by the scheduler when selecting devices for user requests in ResourceClaims.

With this KEP, DRA drivers can mark devices as tainted such that they won’t be used for scheduling new pods. In addition, pods already running with access to a tainted device can be stopped automatically. Cluster administrators can do the same by creating a DeviceTaintRule which applies a taint to all devices matching certain selection criteria, like all devices of a certain driver.

Users can decide to ignore specific taints by adding tolerations to their ResourceClaim.

Motivation

Goals

Enable taking devices offline for maintenance while still allowing test pods to request and use those devices. Being able to do this one device at a time minimizes service level disruption.
Enable users to decide whether they want to keep running a workload in a degraded mode while a device is unhealthy or prefer to get pods rescheduled.
Publish information about devices (“device health”) such that control plane components or admins can decide how to react, without immediately affecting scheduling or workloads.

Non-Goals

Not part of the plan for this KEP: developing a kubectl command for managing device taints.

Proposal

User Stories

Degraded Devices

A driver itself can detect problems which may or may not be tolerable for workloads, like degraded performance due to overheating. Removing such devices from the ResourceSlice would unconditionally prevent using them for new pods. Instead, taints can be added to ResourceSlices with documented types for decision making on the indicated problems and no immediate effect on scheduling or running pods.

A control plane component or the admins react to that information. They may publish a DeviceTaintRule which prevents using the degraded device for new pods or even evict all pods using it at the moment to replace or reset the device once it is idle.

Once there is such an effect, users can decide to tolerate less critical taints in their workload, at their own risk. Admins scheduling maintenance pods need to tolerate their own taints to get the pod scheduled.

External Health Monitoring

As cluster admin, I am deploying a vendor-provided DRA driver together with a separate monitoring component for hardware aspects that are not available or not supported by that DRA driver. When that component detects problems, it can check its policy configuration and decide to take devices offline by creating a DeviceTaintRule with a taint for affected devices.

Safe Pod Eviction

Selecting the wrong set of devices in a DeviceTaintRule can have potentially disastrous consequences, including quickly evicting all workloads using any device in the cluster instead of those using a single device. To avoid this, a cluster admin can first create a DeviceTaintRule such that it has no immediate effect. The eviction controller adds a condition with information about what it would do. kubectl describe then shows that information. At this point, scheduling is not affected and pods keep running.

Then the admin can edit the DeviceTaintRule to set the desired effect to “evict all affected pods”. The DeviceTaintRule status provides information about the progress.

It can happen that a pod still gets scheduled after activating the taint because the scheduler hasn’t observed that change yet. Such a pod then gets evicted. Either way, eventually no affected pods are running and no new ones get scheduled either.

Risks and Mitigations

A device can be identified by its names (<driver name>/<pool name>/<device name>). It was a conscious decision for core DRA to not require that the name is tied to one particular hardware instance to support hot-swapping. Admins are expected to prefer the names. Health monitoring might prefer to be specific and use a vendor-defined unique ID attribute. However, supporting this consistently in a DeviceTaintRule turned out to be hard because the attributes of an allocated device are not guaranteed to be available (ResourceSlice might have been deleted) and therefore selecting devices by attributes in a DeviceTaintRule is not possible. Vendors which want to be very specific about which device incarnation is unhealthy have to use taints in ResourceSlices.

Without a kubectl extension similar to kubectl taint nodes, the user experience for admins will be a bit challenging. They need to decide how to identify the device (by name or with a CEL expression), manually create a DeviceTaintRule with a unique name, then remember to remove that DeviceTaintRule again.

Users might be tempted to tolerate taints to get their pods running. They do that at their own risk. Depending on the taint, the application then may not get the performance it needs (degraded hardware) or may fail at runtime (hardware gets turned off). Admission controllers or validating admission policies could be deployed to limit which tolerations may be used, but as taints are not defined by Kubernetes itself, none of that is part of Kubernetes itself.

A taint is not meant to be used as an access control mechanism. Users are allowed to ignore taints (at their own risk). Adding a taint in a live cluster is inherently racy because it first needs to be observed by e.g. the scheduler.

Design Details

The feature is following the approach and APIs taken for node taints and applies them to devices. A new controller watches tainted devices and deletes pods using them unless they tolerate the device taint, similar to the taint-eviction-controller . A pod which is running or has finalizers will not get removed immediately. Instead, the DeletionTimestamp gets set. That’s okay for the purpose of this KEP:

The kubelet will stop any running containers and mark the pod as completed.
The ResourceClaim controller will remove such a completed pod from the claim’s ReservedFor and deallocate the claim once it has no consumers.

The semantic of the value associated with a taint key is defined by whoever publishes taints with that key. DRA drivers should use the driver name as domain in the key to avoid conflicts. To support tolerating taints by value, values should not require parsing to extract information. It’s better to use different keys with simple values than one key with a complex value.

If some common patterns emerge, then Kubernetes could standardize the name, value and data for certain taints. Those then have a certain semantic. This is similar to standardizing device attributes. The recommended pattern for names is:

[<prefix].]<some domain owned by the vendor of a DRA driver>/<some-vendor-specific-taint>: the domain part avoids name clashes. This can be the same as a DRA driver name, but isn’t required to. Descriptive names for the different taints defined by a vendor are useful.
kubernetes.io/...: reserved for use by Kubernetes.

Effect: None can be used to publish taints which are merely informational. Such taints are ignored during scheduling and cause no eviction. Therefore it is not necessary to specify tolerations for them. Nonetheless a toleration for Effect: None is allowed. Such tolerations then have no effect either.

Taints are cumulative:

Taints defined by an admin in a DeviceTaintRule get added to the set of taints defined by the DRA driver in a ResourceSlice.
All taints that are not tolerated apply their effect.

It is valid to have the same taint key with different effects, both within a ResourceSlice and when one is in a ResourceSlice and the other in a DeviceTaintRule:

Key: “A”, Effect: NoSchedule
Key: “A”, Effect: NoExecute

What this means is that “A” implies both NoSchedule and NoExecute. The first one could be dropped, but it’s merely redundant, not in conflict with the second one. It would be possible to prevent redundant entries during validation of a ResourceSlice like it is done for node taints , but not when they are in different objects (ResourceSlice and DeviceTaintRule), so instead the consumers need to handle this for device taints at runtime by checking that all of them are tolerated without assuming uniqueness by key and effect (meaning NoExecute toleration does not imply NoSchedule toleration).

To ensure consistency among all pods sharing a ResourceClaim, the toleration for taints gets added to the request in a ResourceClaim, not the pod. This also avoids conflicts like one pod tolerating a taint for scheduling and some other pod not tolerating that.

Device and node taints are applied independently. A node taint applies to all pods on a node, whereas a device taint affects claim allocation and only those pods using the claim.

API

The Device struct inside a ResourceSlice gets extended with taint information. To prevent exceeding the size limit for objects, the number of devices is half of what it normally would be without taints.

If changes of taints are limited to the devices stored in a single ResourceSlice, then there is no need to bump the generation of the pool: a consumer will see and use either the old set of ResourceSlices or the new, updated set. This makes updating taints efficient because it avoids having to update the other ResourceSlices with a different generation or count. DRA drivers which support taints are therefore encouraged to group devices together which might have to be tainted together.

type Device struct {
    ...

    // If specified, these are the driver-defined taints.
    //
    // The maximum number of taints is 16. If taints are set for
    // any device in a ResourceSlice, then the maximum number of
    // allowed devices per ResourceSlice is 64 instead of 128.
    //
    // This is an alpha field and requires enabling the DRADeviceTaints
    // feature gate.
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
    Taints []DeviceTaint
}

// DeviceTaintsMaxLength is the maximum number of taints per Device.
const DeviceTaintsMaxLength = 16

// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
type DeviceTaint struct {
    // The taint key to be applied to a device.
    // Must be a label name.
    //
    // +required
    Key string

    // The taint value corresponding to the taint key.
    // Must be a label value.
    //
    // +optional
    Value string

    // The effect of the taint on claims that do not tolerate the taint
    // and through such claims on the pods using them.
    //
    // Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
    // nodes is not valid here. More effects may get added in the future.
    // Consumers must treat unknown effects like None.
    //
    // +required
    Effect DeviceTaintEffect

    // ^^^^
    //
    // Implementing PreferNoSchedule would depend on a scoring solution for DRA.
    // It might get added as part of that.
    //
    // A possible future new effect is NoExecuteWithPodDisruptionBudget:
    // honor the pod disruption budget instead of simply deleting pods.
    // This is currently undecided, it could also be a separate field.
    //
    // Validation must be prepared to allow unknown enums in stored objects,
    // which will enable adding new enums within a single release without
    // ratcheting.

    // TimeAdded represents the time at which the taint was added or
    // (only in a DeviceTaintRule) the effect was modified.
    // Added automatically during create or update if not set.
    // In addition, in a DeviceTaintRule a value provided during
    // an update gets replaced with the current time if the provided
    // value is the same as the old one and the new effect is different.
    //
    // +optional
    TimeAdded *metav1.Time

    // ^^^
    //
    // This field was defined as "It is only written for NoExecute taints." for node taints.
    // But in practice, Kubernetes never did anything with it (no validation, no defaulting,
    // ignored during pod eviction in pkg/controller/tainteviction).
}

// +enum
type DeviceTaintEffect string

const (
    // No effect, the taint is purely informational.
    DeviceTaintEffectNone DeviceTaintEffect = "None"

    // Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
    // but allow all pods submitted to Kubelet without going through the scheduler
    // to start, and allow all already-running pods to continue running.
    DeviceTaintEffectNoSchedule DeviceTaintEffect = "NoSchedule"

    // Evict any already-running pods that do not tolerate the device taint.
    DeviceTaintEffectNoExecute DeviceTaintEffect = "NoExecute"
)

DeviceTaint has all the fields of a v1.Taint, but the description is a bit different. In particular, PreferNoSchedule is not valid and None gets added.

Tolerations get added to a DeviceRequest:

type DeviceRequest struct {
    ...

    // If specified, the request's tolerations.
    //
    // Tolerations for NoSchedule are required to allocate a
    // device which has a taint with that effect. The same applies
    // to NoExecute.
    //
    // In addition, should any of the allocated devices get tainted
    // with NoExecute after allocation and that effect is not tolerated,
    // then all pods consuming the ResourceClaim get deleted to evict
    // them. The scheduler will not let new pods reserve the claim while
    // it has these tainted devices. Once all pods are evicted, the
    // claim will get deallocated.
    //
    // The maximum number of tolerations is 16.
    //
    // This is an alpha field and requires enabling the DRADeviceTaints
    // feature gate.
    //
    // +optional
    // +listType=atomic
    // +featureGate=DRADeviceTaints
    Tolerations []DeviceToleration
}

// DeviceTolerationsMaxLength is the maximum number of tolerations in a DeviceRequest.
const DeviceTolerationsMaxLength = 16

// The ResourceClaim this DeviceToleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type DeviceToleration struct {
    // Key is the taint key that the toleration applies to. Empty means match all taint keys.
    // If the key is empty, operator must be Exists; this combination means to match all values and all keys.
    // Must be a label name.
    //
    // +optional
    Key string

    // Operator represents a key's relationship to the value.
    // Valid operators are Exists and Equal. Defaults to Equal.
    // Exists is equivalent to wildcard for value, so that a ResourceClaim can
    // tolerate all taints of a particular category.
    //
    // +optional
    // +default="Equal"
    Operator DeviceTolerationOperator

    // Value is the taint value the toleration matches to.
    // If the operator is Exists, the value must be empty, otherwise just a regular string.
    // Must be a label value.
    //
    // +optional
    Value string

    // Effect indicates the taint effect to match. Empty means match all taint effects.
    // When specified, allowed values are NoSchedule and NoExecute.
    //
    // +optional
    Effect DeviceTaintEffect

    // TolerationSeconds represents the period of time the toleration (which must be
    // of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
    // it is not set, which means tolerate the taint forever (do not evict). Zero and
    // negative values will be treated as 0 (evict immediately) by the system.
    // If larger than zero, the time when the pod needs to be evicted is calculated as <time when
    // taint was adedd> + <toleration seconds>.
    //
    // +optional
    TolerationSeconds *int64
}

// A toleration operator is the set of operators that can be used in a toleration.
//
// +enum
type DeviceTolerationOperator string

const (
    DeviceTolerationOpExists DeviceTolerationOperator = "Exists"
    DeviceTolerationOpEqual  DeviceTolerationOperator = "Equal"
)

As with Taint, these structs get duplicated to document DRA specific behavior and to ensure that future extensions do not get inherited accidentally.

The implementation of the taint logic gets copied from node taints: resourceclaim.ToleratesTaint corresponds to v1.Toleration.ToleratesTaint .

To enable setting of taints without reconfiguring DRA drivers, the cluster-scoped DeviceTaintRule gets added. The scheduler must merge these additional taints with the ones provided by the DRA drivers on-the-fly while it gathers information about available devices. This could be done each time a scheduling cycle starts, but it would be repetitive work. Instead, a ResourceSlice tracker reacts to informer events for ResourceSlice and DeviceTaintRule and maintains a set of updated slices which also contain the taints set via a DeviceTaintRule. The tracker provides the API of an informer and thus can be used as a replacement for a ResourceSlice informer.

This approach is sufficient for the scheduler (only needs to know about which devices are unusable), but not for the controller:

It also needs to know the DeviceTaintRule so that it can update the status.
It needs to handle allocated devices where a DeviceTaintRule causes eviction when there is no ResourceSlice where the taint could be added.

Therefore since 1.35 the controller keeps track of DeviceTaintRules itself.

Supporting DeviceTaintRules depends on quite a bit of additional code (tracker in the kube-scheduler, an entire controller in kube-controller-manager). To support disabling that code in case of problems without having to disable the entire support for device taints, support for DeviceTaintRules is guarded by a separate DRADeviceTaintRule feature gate.

// DeviceTaintRule adds one taint to all devices which match the selector.
// This has the same effect as if the taint was specified directly
// in the ResourceSlice by the DRA driver.
type DeviceTaintRule struct {
    metav1.TypeMeta
    // Standard object metadata
    // +optional
    metav1.ObjectMeta

    // Spec specifies the selector and one taint.
    //
    // Changing the spec automatically increments the metadata.generation number.
    Spec DeviceTaintRuleSpec

    // Status provides information about the effect of what was requested in the spec.
    //
    // +optional
    Status DeviceTaintRuleStatus
}

// DeviceTaintRuleSpec specifies the selector and one taint.
type DeviceTaintRuleSpec struct {
    // DeviceSelector defines which device(s) the taint is applied to.
    // All selector criteria must be satisfied for a device to
    // match. The empty selector matches all devices. Without
    // a selector, no devices are matched.
    //
    // +optional
    DeviceSelector *DeviceTaintSelector

    // The taint that gets applied to matching devices.
    //
    // +required
    Taint DeviceTaint
}

// DeviceTaintSelector defines which device(s) a DeviceTaintRule applies to.
// The empty selector matches all devices. Without a selector, no devices
// are matched.
type DeviceTaintSelector struct {
    // If driver is set, only devices from that driver are selected.
    // This fields corresponds to slice.spec.driver.
    //
    // +optional
    Driver *string

    // If pool is set, only devices in that pool are selected.
    //
    // Also setting the driver name may be useful to avoid
    // ambiguity when different drivers use the same pool name,
    // but this is not required because selecting pools from
    // different drivers may also be useful, for example when
    // drivers with node-local devices use the node name as
    // their pool name.
    //
    // +optional
    Pool *string

    // If device is set, only devices with that name are selected.
    // This field corresponds to slice.spec.devices[].name.
    //
    // Setting also driver and pool may be required to avoid ambiguity,
    // but is not required.
    //
    // +optional
    Device *string
}

// DeviceTaintRuleStatus provides information about the effect of the DeviceTaintRule.
type DeviceTaintRuleStatus struct {
    // Conditions provide information about the current state of the DeviceTaintRule
    // in a machine-readable and human-readable format.
    //
    // The following condition is currently defined as part of this API, more may
    // get added:
    // - Type: EvictionInProgress
    // - Status: True if there are currently pods which need to be evicted, False otherwise
    //   (includes the effects which don't cause eviction).
    // - Reason: not specified, may change
    // - Message: includes information about number of pending pods and already evicted pods
    //   in a human-readable format, updated periodically, may change
    //
    // For `effect: None`, the condition above gets set once for each change to
    // the spec, with the message containing information about what would happen
    // if the effect was `NoExecute`. This feedback can be used to decide whether
    // changing the effect to `NoExecute` will work as intended. It only gets
    // set once to avoid having to constantly update the status.
    //
    // Must have 8 or less entries.
    //
    // +optional
    // +listType=map
    // +listMapKey=type
    Conditions []metav1.Condition
}

const DeviceTaintRuleStatusMaxConditions = 8

The condition is meant to be used to check for completion of a triggered eviction with:

kubectl wait --for=condition=EvictionInProgress=false DeviceTaintRule/my-taint-rule

Users relying on this need to be aware of the potential race between scheduler and controller observing the new taint at different times, which can lead to pods still being scheduled at a time when the controller thinks that there are none which need to be evicted and thus sets this condition to False. This race can be made less likely to go wrong by setting this condition only after a certain time in the range of 5 to 10 seconds has elapsed since the DeviceTaintRule was created. That should be long enough to propagate the new DeviceTaintRule to a running scheduler. After a restart, the kube-scheduler first waits for cache sync before scheduling pods.

Reasons are not specified as part of the API because it’s not required by the use cases and would restrict future changes unnecessarily. For example, reason: Idle can go together with evictionInProgress: False when momentarily there are no pods which need to be evicted.

Useful information for a effect: None condition message includes:

Total number of affected devices.
Total number of pods that would be evicted.
Total number of namespaces of those pods. This indicates whether the eviction is limited to a single workload or potentially spanning many different ones.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None.

Unit tests

v1.32.0:

k8s.io/dynamic-resource-allocation/structured: 91.3%
k8s.io/kubernetes/pkg/apis/resource/validation: 98.6%
k8s.io/kubernetes/pkg/controller/tainteviction: 81.8%

v1.33.0:

k8s.io/dynamic-resource-allocation/resourceslice/tracker: 65.8%
k8s.io/dynamic-resource-allocation/structured: 91.3%
k8s.io/kubernetes/pkg/apis/resource/validation: 97.8%
k8s.io/kubernetes/pkg/controller/devicetainteviction: 89.9%

v1.35.0-rc.0:

k8s.io/dynamic-resource-allocation/resourceslice/tracker: 62.1%
k8s.io/dynamic-resource-allocation/structured/internal/experimental: 93.7%
k8s.io/kubernetes/pkg/apis/resource/validation: 96.8%
k8s.io/kubernetes/pkg/controller/devicetainteviction: 87.9%

Test cases that are worth calling out:

Updating DeviceTaintRule status at a reasonable rate as eviction progresses.
Combinations of tolerations and taints.

Integration tests

An integration test for the new eviction controller is useful to ensure that permissions are correct. It also covers evicting a higher number of pods than what would be possible in an E2E test.

e2e tests

Useful E2E tests are checking that the scheduler really honors taints during scheduling. Adding a taint in a ResourceSlice must evict a running pod. Same for adding a taint through a DeviceTaintRule. A toleration for a NoExecute taint must allow a pod to run.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial e2e tests completed and enabled

Beta

Gather feedback from developers and surveys
Additional tests are in Testgrid and linked in KEP

GA

3 examples of real-world usage
- DeviceTaints usage:
  - https://github.com/intel/intel-resource-drivers-for-kubernetes/blob/main/doc/gpu/USAGE.md#health-monitoring-support (health information discovered by driver)
  - https://github.com/kubernetes-sigs/dra-driver-nvidia-gpu/pull/983 (informational taints)
- DeviceTaintRule usage:
  - https://github.com/intel/intel-resource-drivers-for-kubernetes/blob/main/doc/gaudi/USAGE.md#health-monitoring-support (health information discovered by monitoring component)
  - https://github.com/CoHDI/composable-resource-operator (in preparation for detaching a device)
Allowing time for feedback
- In beta for two release cycles.
Conformance tests
- https://github.com/kubernetes/kubernetes/blob/f15071f37174baf476617f175898beab3c9b8605/test/e2e/dra/dra.go#L114-L118 , can only be promoted to conformance after graduation. Tracked in https://github.com/kubernetes/kubernetes/issues/138814 .

Upgrade / Downgrade Strategy

Tainting gets disabled when downgrading to a release without support for it or when disabling the feature. The effect is as if the taints weren’t set.

The validation for taints in ResourceSlices changed in 1.35. Old DRA drivers which now try to publish too many devices per ResourceSlice will be notified about the validation error . The recommended reaction is to log and fail, which indicates to admins that they need to update or reconfigure the DRA driver. The other direction is fine (newer drivers work on an older Kubernetes).

effect: None is only valid for Kubernetes >= 1.35. DeviceTaintRules and ResourceSlices with that value must be removed before a downgrade.

Version Skew Strategy

During version skew where the apiserver supports the feature and the scheduler doesn’t, taints can be set without encountering errors or warnings, but they won’t have any effect.

NoExecute starts to have the desired effect once kube-controller-manager gets updated to have the feature enabled.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

It is possible to disable the feature through the feature gate while leaving the API group enabled. This enables cleanup through the API.

Re-enabling is supported because DeviceTaintRules remain in etcd even if they are inaccessible and existing taints and tolerations are preserved during updates.

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DRADeviceTaints
- Components depending on the feature gate:
  - kube-apiserver
  - kube-scheduler
  - kube-controller-manager
- Feature gate name: DRADeviceTaintRules
- Components depending on the feature gate:
  - kube-apiserver
  - kube-scheduler
  - kube-controller-manager
Other
- Describe the mechanism: resource.k8s.io/v1alpha3 API group
- Will enabling / disabling the feature require downtime of the control plane? Yes, in the apiserver.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The behavior of scheduling changes when it was in use. Running applications are not affected.

What happens if we reenable the feature if it was previously rolled back?

It takes effect again for scheduling and may evict pods.

Are there any tests for feature enablement/disablement?

This was covered through unit tests for the apiserver and scheduler:

Integration testing also verifies dropping resp. storing of fields in ResourceSlice, which covers the ResourceSlice controller running in DRA drivers:

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If components restart, then work like pod scheduling and eviction continues normally once they are back.

During a partial rollout (feature enabled on at least one API server before it is also enabled in kube-scheduler and kube-controller-manager) a user or DRA driver might taint devices before the rest of the control plane catches up. For NoExecute, the pods will get evicted eventually. For NoSchedule, pods keep running.

What specific metrics should inform a rollback?

The normal health metrics of the control plane components need to be monitored to determine if performance deteriorates.

If the scheduler_pending_pods metric in the kube-scheduler suddenly increases, then perhaps scheduling no longer works as intended.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Automated upgrade/downgrade testing verifies that:

A DeviceTaintRule created before a downgrade prevents pod scheduling after a downgrade.
A pod which gets scheduled because of a toleration is kept running after an upgrade.

See the implementation for details.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Usage of DeviceTaintRules can be seen in the apiserver’s apiserver_resource_objects metric with labels group=resource.k8s.io and resource=devicetaintrules.

Usage of taints in ResourceSlices and tolerations in ResourceClaims can only be observed by inspecting objects.

How can someone using this feature know that it is working for their instance?

API DeviceTaintRule.Status
- Condition name: EvictionInProgress

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

As for normal pod scheduling of pods using ResourceClaims there is no SLO for scheduling with taints.

Pod eviction is a best-effort deletion of pods. The goal is that it deletes all affected pods eventually, with no performance guarantees.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric names: device_taint_eviction_controller_pod_deletions_total, device_taint_eviction_controller_pod_deletion_duration_seconds, workqueue_* with name=“device-taint-eviction-controller”
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Applying taints to devices scales with number of DeviceTaintRules * number of ResourceSlices but then may still need to compare device names and of course modify selected devices.

Handling eviction scales with the number of claims and pods using those claims.

Will enabling / using this feature result in any new API calls?

A fixed, small number of clients (primarily the scheduler and controller manager) need to start watching DeviceTaintRules.

Pods are already watched in the controller manager. Evicting them adds one call per pod.

Will enabling / using this feature result in introducing new API types?

DeviceTaintRules must be created explicitly by admins or controller operated by admins. Kubernetes itself does not create them.

The number of DeviceTaintRules is expected to be orders of magnitude smaller than the number of ResourceSlices.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Enabling it doesn’t. Using tolerations increases the size of ResourceClaims and ResourceClaimTemplates.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Pod scheduling may become a bit slower because of the additional checks, but only when pods use claims. There are no SLI/SLOs for pods using claims.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

For scheduling, tracking taints should be comparable to the overhead for patching attributes.

For eviction, additional data structures will be needed to track taints and tolerations. This should not be too large.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No, because the feature is not used on nodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Work is halted while the API server or etcd are unavailable and resumes when they come back.

What are other known failure modes?

None known at this point.

What steps should be taken if SLOs are not being met to determine the problem?

There are no SLOs for this feature.

Implementation History

1.33: first KEP revision and implementation
1.35: revised alpha with effect: None and DeviceTaintRule status
1.36: graduation to beta
1.37: graduation to stable

Drawbacks

Distributing information across different objects of different types makes it harder for users to get a complete view.

Alternatives

Extending node taint controller

The existing taint-eviction-controller could be extended to cover device taints. However, cloning it lowers the risk of breaking existing stable functionality.

Tolerating taints in pods

Tolerations for device taints could also be added to individual pods. This seems less useful because if pods share the same claim, they are typically part of one larger application with identical tolerations. Experimenting with a new API in the beta ResourceClaim type is a bit easier than it would be in the GA Pod type.

Storing result of patching in ResourceSlice

A controller could read DeviceTaintRules and apply them to ResourceSlices. Then consumers like the scheduler and users would only need to look at ResourceSlices. This has several drawbacks.

We would need to duplicate the taints in the slice status. If we didn’t and directly modified the spec, this controller and the CSI driver as the owner of the slice spec would fight against each other. Also, after removing a patch the original state must be available somewhere, otherwise the controller cannot restore it.

Duplicating the taints might make a slice too large. The limits were chosen so that we have some space left for a status, but not enough for a status that is potentially as large as the spec.

Creating a single DeviceTaintRule could force the controller to update a potentially large number of ResourceSlices. When using rate limiting, updating them all will take longer than client-side patching. When not using rate limiting, this could overwhelm the apiserver.