KEP-5055: DRA: device taints and tolerations
KEP-5055 : DRA: device taints and tolerations
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as
implementable - (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
With Dynamic Resource Allocation (DRA), DRA drivers publish information about the devices that they manage in ResourceSlices. This information is used by the scheduler when selecting devices for user requests in ResourceClaims.
With this KEP, DRA drivers can mark devices as tainted such that they won’t be used for scheduling new pods. In addition, pods already running with access to a tainted device can be stopped automatically. Cluster administrators can do the same by creating a DeviceTaintRule which applies a taint to all devices matching certain selection criteria, like all devices of a certain driver.
Users can decide to ignore specific taints by adding tolerations to their ResourceClaim.
Motivation
Goals
Enable taking devices offline for maintenance while still allowing test pods to request and use those devices. Being able to do this one device at a time minimizes service level disruption.
Enable users to decide whether they want to keep running a workload in a degraded mode while a device is unhealthy or prefer to get pods rescheduled.
Publish information about devices (“device health”) such that control plane components or admins can decide how to react, without immediately affecting scheduling or workloads.
Non-Goals
- Not part of the plan for this KEP: developing a kubectl command for managing device taints.
Proposal
User Stories
Degraded Devices
A driver itself can detect problems which may or may not be tolerable for workloads, like degraded performance due to overheating. Removing such devices from the ResourceSlice would unconditionally prevent using them for new pods. Instead, taints can be added to ResourceSlices with documented types for decision making on the indicated problems and no immediate effect on scheduling or running pods.
A control plane component or the admins react to that information. They may publish a DeviceTaintRule which prevents using the degraded device for new pods or even evict all pods using it at the moment to replace or reset the device once it is idle.
Once there is such an effect, users can decide to tolerate less critical taints in their workload, at their own risk. Admins scheduling maintenance pods need to tolerate their own taints to get the pod scheduled.
External Health Monitoring
As cluster admin, I am deploying a vendor-provided DRA driver together with a separate monitoring component for hardware aspects that are not available or not supported by that DRA driver. When that component detects problems, it can check its policy configuration and decide to take devices offline by creating a DeviceTaintRule with a taint for affected devices.
Safe Pod Eviction
Selecting the wrong set of devices in a DeviceTaintRule can have potentially
disastrous consequences, including quickly evicting all workloads using any
device in the cluster instead of those using a single device. To avoid this, a
cluster admin can first create a DeviceTaintRule such that it has no immediate
effect. The eviction controller adds a condition with information about what
it would do. kubectl describe then shows that information. At this point,
scheduling is not affected and pods keep running.
Then the admin can edit the DeviceTaintRule to set the desired effect to “evict all affected pods”. The DeviceTaintRule status provides information about the progress.
It can happen that a pod still gets scheduled after activating the taint because the scheduler hasn’t observed that change yet. Such a pod then gets evicted. Either way, eventually no affected pods are running and no new ones get scheduled either.
Risks and Mitigations
A device can be identified by its names (<driver name>/<pool name>/<device name>). It was a conscious
decision for core DRA to not require that the name is tied to one particular
hardware instance to support hot-swapping. Admins are expected to prefer the names.
Health monitoring might prefer to be specific and use a vendor-defined
unique ID attribute. However, supporting this consistently in a DeviceTaintRule
turned out to be hard because the attributes of an allocated device are not
guaranteed to be available (ResourceSlice might have been deleted) and therefore
selecting devices by attributes in a DeviceTaintRule is not possible. Vendors
which want to be very specific about which device incarnation is unhealthy
have to use taints in ResourceSlices.
Without a kubectl extension similar to kubectl taint nodes, the user
experience for admins will be a bit challenging. They need to decide how to
identify the device (by name or with a CEL expression), manually create a
DeviceTaintRule with a unique name, then remember to remove that
DeviceTaintRule again.
Users might be tempted to tolerate taints to get their pods running. They do that at their own risk. Depending on the taint, the application then may not get the performance it needs (degraded hardware) or may fail at runtime (hardware gets turned off). Admission controllers or validating admission policies could be deployed to limit which tolerations may be used, but as taints are not defined by Kubernetes itself, none of that is part of Kubernetes itself.
A taint is not meant to be used as an access control mechanism. Users are allowed to ignore taints (at their own risk). Adding a taint in a live cluster is inherently racy because it first needs to be observed by e.g. the scheduler.
Design Details
The feature is following the approach and APIs taken for node taints and
applies them to devices. A new controller watches tainted devices and deletes
pods using them unless they tolerate the device taint, similar to the
taint-eviction-controller
. A pod which is running or has finalizers will not get removed
immediately. Instead, the DeletionTimestamp gets set. That’s okay for
the purpose of this KEP:
- The kubelet will stop any running containers and mark the pod as completed.
- The ResourceClaim controller will remove such a completed pod from the claim’s
ReservedForand deallocate the claim once it has no consumers.
The semantic of the value associated with a taint key is defined by whoever publishes taints with that key. DRA drivers should use the driver name as domain in the key to avoid conflicts. To support tolerating taints by value, values should not require parsing to extract information. It’s better to use different keys with simple values than one key with a complex value.
If some common patterns emerge, then Kubernetes could standardize the name, value and data for certain taints. Those then have a certain semantic. This is similar to standardizing device attributes. The recommended pattern for names is:
[<prefix].]<some domain owned by the vendor of a DRA driver>/<some-vendor-specific-taint>: the domain part avoids name clashes. This can be the same as a DRA driver name, but isn’t required to. Descriptive names for the different taints defined by a vendor are useful.kubernetes.io/...: reserved for use by Kubernetes.
Effect: None can be used to publish taints which are merely informational.
Such taints are ignored during scheduling and cause no eviction. Therefore
it is not necessary to specify tolerations for them. Nonetheless a toleration
for Effect: None is allowed. Such tolerations then have no effect either.
Taints are cumulative:
- Taints defined by an admin in a DeviceTaintRule get added to the set of taints defined by the DRA driver in a ResourceSlice.
- All taints that are not tolerated apply their effect.
It is valid to have the same taint key with different effects, both within a ResourceSlice and when one is in a ResourceSlice and the other in a DeviceTaintRule:
- Key: “A”, Effect: NoSchedule
- Key: “A”, Effect: NoExecute
What this means is that “A” implies both NoSchedule and NoExecute. The first one could be dropped, but it’s merely redundant, not in conflict with the second one. It would be possible to prevent redundant entries during validation of a ResourceSlice like it is done for node taints , but not when they are in different objects (ResourceSlice and DeviceTaintRule), so instead the consumers need to handle this for device taints at runtime by checking that all of them are tolerated without assuming uniqueness by key and effect (meaning NoExecute toleration does not imply NoSchedule toleration).
To ensure consistency among all pods sharing a ResourceClaim, the toleration for taints gets added to the request in a ResourceClaim, not the pod. This also avoids conflicts like one pod tolerating a taint for scheduling and some other pod not tolerating that.
Device and node taints are applied independently. A node taint applies to all pods on a node, whereas a device taint affects claim allocation and only those pods using the claim.
API
The Device struct inside a ResourceSlice gets extended with taint information. To prevent exceeding the size limit for objects, the number of devices is half of what it normally would be without taints.
If changes of taints are limited to the devices stored in a single ResourceSlice, then there is no need to bump the generation of the pool: a consumer will see and use either the old set of ResourceSlices or the new, updated set. This makes updating taints efficient because it avoids having to update the other ResourceSlices with a different generation or count. DRA drivers which support taints are therefore encouraged to group devices together which might have to be tainted together.
type Device struct {
...
// If specified, these are the driver-defined taints.
//
// The maximum number of taints is 16. If taints are set for
// any device in a ResourceSlice, then the maximum number of
// allowed devices per ResourceSlice is 64 instead of 128.
//
// This is an alpha field and requires enabling the DRADeviceTaints
// feature gate.
//
// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints
Taints []DeviceTaint
}
// DeviceTaintsMaxLength is the maximum number of taints per Device.
const DeviceTaintsMaxLength = 16
// The device this taint is attached to has the "effect" on
// any claim which does not tolerate the taint and, through the claim,
// to pods using the claim.
type DeviceTaint struct {
// The taint key to be applied to a device.
// Must be a label name.
//
// +required
Key string
// The taint value corresponding to the taint key.
// Must be a label value.
//
// +optional
Value string
// The effect of the taint on claims that do not tolerate the taint
// and through such claims on the pods using them.
//
// Valid effects are None, NoSchedule and NoExecute. PreferNoSchedule as used for
// nodes is not valid here. More effects may get added in the future.
// Consumers must treat unknown effects like None.
//
// +required
Effect DeviceTaintEffect
// ^^^^
//
// Implementing PreferNoSchedule would depend on a scoring solution for DRA.
// It might get added as part of that.
//
// A possible future new effect is NoExecuteWithPodDisruptionBudget:
// honor the pod disruption budget instead of simply deleting pods.
// This is currently undecided, it could also be a separate field.
//
// Validation must be prepared to allow unknown enums in stored objects,
// which will enable adding new enums within a single release without
// ratcheting.
// TimeAdded represents the time at which the taint was added or
// (only in a DeviceTaintRule) the effect was modified.
// Added automatically during create or update if not set.
// In addition, in a DeviceTaintRule a value provided during
// an update gets replaced with the current time if the provided
// value is the same as the old one and the new effect is different.
//
// +optional
TimeAdded *metav1.Time
// ^^^
//
// This field was defined as "It is only written for NoExecute taints." for node taints.
// But in practice, Kubernetes never did anything with it (no validation, no defaulting,
// ignored during pod eviction in pkg/controller/tainteviction).
}
// +enum
type DeviceTaintEffect string
const (
// No effect, the taint is purely informational.
DeviceTaintEffectNone DeviceTaintEffect = "None"
// Do not allow new pods to schedule which use a tainted device unless they tolerate the taint,
// but allow all pods submitted to Kubelet without going through the scheduler
// to start, and allow all already-running pods to continue running.
DeviceTaintEffectNoSchedule DeviceTaintEffect = "NoSchedule"
// Evict any already-running pods that do not tolerate the device taint.
DeviceTaintEffectNoExecute DeviceTaintEffect = "NoExecute"
)
DeviceTaint has all the fields of a v1.Taint, but the description is a bit different. In particular, PreferNoSchedule is not valid and None gets added.
Tolerations get added to a DeviceRequest:
type DeviceRequest struct {
...
// If specified, the request's tolerations.
//
// Tolerations for NoSchedule are required to allocate a
// device which has a taint with that effect. The same applies
// to NoExecute.
//
// In addition, should any of the allocated devices get tainted
// with NoExecute after allocation and that effect is not tolerated,
// then all pods consuming the ResourceClaim get deleted to evict
// them. The scheduler will not let new pods reserve the claim while
// it has these tainted devices. Once all pods are evicted, the
// claim will get deallocated.
//
// The maximum number of tolerations is 16.
//
// This is an alpha field and requires enabling the DRADeviceTaints
// feature gate.
//
// +optional
// +listType=atomic
// +featureGate=DRADeviceTaints
Tolerations []DeviceToleration
}
// DeviceTolerationsMaxLength is the maximum number of tolerations in a DeviceRequest.
const DeviceTolerationsMaxLength = 16
// The ResourceClaim this DeviceToleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type DeviceToleration struct {
// Key is the taint key that the toleration applies to. Empty means match all taint keys.
// If the key is empty, operator must be Exists; this combination means to match all values and all keys.
// Must be a label name.
//
// +optional
Key string
// Operator represents a key's relationship to the value.
// Valid operators are Exists and Equal. Defaults to Equal.
// Exists is equivalent to wildcard for value, so that a ResourceClaim can
// tolerate all taints of a particular category.
//
// +optional
// +default="Equal"
Operator DeviceTolerationOperator
// Value is the taint value the toleration matches to.
// If the operator is Exists, the value must be empty, otherwise just a regular string.
// Must be a label value.
//
// +optional
Value string
// Effect indicates the taint effect to match. Empty means match all taint effects.
// When specified, allowed values are NoSchedule and NoExecute.
//
// +optional
Effect DeviceTaintEffect
// TolerationSeconds represents the period of time the toleration (which must be
// of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
// it is not set, which means tolerate the taint forever (do not evict). Zero and
// negative values will be treated as 0 (evict immediately) by the system.
// If larger than zero, the time when the pod needs to be evicted is calculated as <time when
// taint was adedd> + <toleration seconds>.
//
// +optional
TolerationSeconds *int64
}
// A toleration operator is the set of operators that can be used in a toleration.
//
// +enum
type DeviceTolerationOperator string
const (
DeviceTolerationOpExists DeviceTolerationOperator = "Exists"
DeviceTolerationOpEqual DeviceTolerationOperator = "Equal"
)
As with Taint, these structs get duplicated to document DRA specific behavior and to ensure that future extensions do not get inherited accidentally.
The implementation of the taint logic gets copied from node taints: resourceclaim.ToleratesTaint corresponds to v1.Toleration.ToleratesTaint .
To enable setting of taints without reconfiguring DRA drivers, the cluster-scoped DeviceTaintRule gets added. The scheduler must merge these additional taints with the ones provided by the DRA drivers on-the-fly while it gathers information about available devices. This could be done each time a scheduling cycle starts, but it would be repetitive work. Instead, a ResourceSlice tracker reacts to informer events for ResourceSlice and DeviceTaintRule and maintains a set of updated slices which also contain the taints set via a DeviceTaintRule. The tracker provides the API of an informer and thus can be used as a replacement for a ResourceSlice informer.
This approach is sufficient for the scheduler (only needs to know about which devices are unusable), but not for the controller:
- It also needs to know the DeviceTaintRule so that it can update the status.
- It needs to handle allocated devices where a DeviceTaintRule causes eviction when there is no ResourceSlice where the taint could be added.
Therefore since 1.35 the controller keeps track of DeviceTaintRules itself.
Supporting DeviceTaintRules depends on quite a bit of additional code (tracker in the kube-scheduler, an entire controller in kube-controller-manager). To support disabling that code in case of problems without having to disable the entire support for device taints, support for DeviceTaintRules is guarded by a separate DRADeviceTaintRule feature gate.
// DeviceTaintRule adds one taint to all devices which match the selector.
// This has the same effect as if the taint was specified directly
// in the ResourceSlice by the DRA driver.
type DeviceTaintRule struct {
metav1.TypeMeta
// Standard object metadata
// +optional
metav1.ObjectMeta
// Spec specifies the selector and one taint.
//
// Changing the spec automatically increments the metadata.generation number.
Spec DeviceTaintRuleSpec
// Status provides information about the effect of what was requested in the spec.
//
// +optional
Status DeviceTaintRuleStatus
}
// DeviceTaintRuleSpec specifies the selector and one taint.
type DeviceTaintRuleSpec struct {
// DeviceSelector defines which device(s) the taint is applied to.
// All selector criteria must be satisfied for a device to
// match. The empty selector matches all devices. Without
// a selector, no devices are matched.
//
// +optional
DeviceSelector *DeviceTaintSelector
// The taint that gets applied to matching devices.
//
// +required
Taint DeviceTaint
}
// DeviceTaintSelector defines which device(s) a DeviceTaintRule applies to.
// The empty selector matches all devices. Without a selector, no devices
// are matched.
type DeviceTaintSelector struct {
// If driver is set, only devices from that driver are selected.
// This fields corresponds to slice.spec.driver.
//
// +optional
Driver *string
// If pool is set, only devices in that pool are selected.
//
// Also setting the driver name may be useful to avoid
// ambiguity when different drivers use the same pool name,
// but this is not required because selecting pools from
// different drivers may also be useful, for example when
// drivers with node-local devices use the node name as
// their pool name.
//
// +optional
Pool *string
// If device is set, only devices with that name are selected.
// This field corresponds to slice.spec.devices[].name.
//
// Setting also driver and pool may be required to avoid ambiguity,
// but is not required.
//
// +optional
Device *string
}
// DeviceTaintRuleStatus provides information about the effect of the DeviceTaintRule.
type DeviceTaintRuleStatus struct {
// Conditions provide information about the current state of the DeviceTaintRule
// in a machine-readable and human-readable format.
//
// The following condition is currently defined as part of this API, more may
// get added:
// - Type: EvictionInProgress
// - Status: True if there are currently pods which need to be evicted, False otherwise
// (includes the effects which don't cause eviction).
// - Reason: not specified, may change
// - Message: includes information about number of pending pods and already evicted pods
// in a human-readable format, updated periodically, may change
//
// For `effect: None`, the condition above gets set once for each change to
// the spec, with the message containing information about what would happen
// if the effect was `NoExecute`. This feedback can be used to decide whether
// changing the effect to `NoExecute` will work as intended. It only gets
// set once to avoid having to constantly update the status.
//
// Must have 8 or less entries.
//
// +optional
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition
}
const DeviceTaintRuleStatusMaxConditions = 8
The condition is meant to be used to check for completion of a triggered eviction with:
kubectl wait --for=condition=EvictionInProgress=false DeviceTaintRule/my-taint-rule
Users relying on this need to be aware of the potential race between scheduler
and controller observing the new taint at different times, which can lead to
pods still being scheduled at a time when the controller thinks that there are
none which need to be evicted and thus sets this condition to False. This
race can be made less likely to go wrong by setting this condition only after a
certain time in the range of 5 to 10 seconds has elapsed since the
DeviceTaintRule was created. That should be long enough to propagate the new
DeviceTaintRule to a running scheduler. After a restart, the kube-scheduler
first waits for cache sync before scheduling pods.
Reasons are not specified as part of the API because it’s not required by the
use cases and would restrict future changes unnecessarily. For example, reason: Idle
can go together with evictionInProgress: False when momentarily there are no
pods which need to be evicted.
Useful information for a effect: None condition message includes:
- Total number of affected devices.
- Total number of pods that would be evicted.
- Total number of namespaces of those pods. This indicates whether the eviction is limited to a single workload or potentially spanning many different ones.
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None.
Unit tests
v1.32.0:
k8s.io/dynamic-resource-allocation/structured: 91.3%k8s.io/kubernetes/pkg/apis/resource/validation: 98.6%k8s.io/kubernetes/pkg/controller/tainteviction: 81.8%
v1.33.0:
k8s.io/dynamic-resource-allocation/resourceslice/tracker: 65.8%k8s.io/dynamic-resource-allocation/structured: 91.3%k8s.io/kubernetes/pkg/apis/resource/validation: 97.8%k8s.io/kubernetes/pkg/controller/devicetainteviction: 89.9%
v1.35.0-rc.0:
k8s.io/dynamic-resource-allocation/resourceslice/tracker: 62.1%k8s.io/dynamic-resource-allocation/structured/internal/experimental: 93.7%k8s.io/kubernetes/pkg/apis/resource/validation: 96.8%k8s.io/kubernetes/pkg/controller/devicetainteviction: 87.9%
Test cases that are worth calling out:
- Updating DeviceTaintRule status at a reasonable rate as eviction progresses.
- Combinations of tolerations and taints.
Integration tests
An integration test for the new eviction controller is useful to ensure that permissions are correct. It also covers evicting a higher number of pods than what would be possible in an E2E test.
- source code: https://github.com/kubernetes/kubernetes/blob/v1.35.0-rc.0/test/integration/dra/device_taints_test.go
- job: https://testgrid.k8s.io/sig-release-master-blocking#integration-master&include-filter-by-regex=dra.dra
- triage: https://storage.googleapis.com/k8s-triage/index.html?text=EvictCluster&job=integration&test=dra
e2e tests
Useful E2E tests are checking that the scheduler really honors taints during scheduling. Adding a taint in a ResourceSlice must evict a running pod. Same for adding a taint through a DeviceTaintRule. A toleration for a NoExecute taint must allow a pod to run.
- source code: https://github.com/kubernetes/kubernetes/blob/496077da56dceb7a68c4715a01670e2a5fa582e8/test/e2e/dra/dra.go#L1968-L2084
- job: https://testgrid.k8s.io/sig-node-dynamic-resource-allocation#ci-kind-dra-all&include-filter-by-regex=DRADeviceTaints
- triage: https://storage.googleapis.com/k8s-triage/index.html?test=DRADeviceTaints
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- Initial e2e tests completed and enabled
Beta
- Gather feedback from developers and surveys
- Additional tests are in Testgrid and linked in KEP
GA
- 3 examples of real-world usage
- Allowing time for feedback
- Conformance tests
Upgrade / Downgrade Strategy
Tainting gets disabled when downgrading to a release without support for it or when disabling the feature. The effect is as if the taints weren’t set.
The validation for taints in ResourceSlices changed in 1.35. Old DRA drivers which now try to publish too many devices per ResourceSlice will be notified about the validation error . The recommended reaction is to log and fail, which indicates to admins that they need to update or reconfigure the DRA driver. The other direction is fine (newer drivers work on an older Kubernetes).
effect: None is only valid for Kubernetes >= 1.35. DeviceTaintRules
and ResourceSlices with that value must be removed before a downgrade.
Version Skew Strategy
During version skew where the apiserver supports the feature and the scheduler doesn’t, taints can be set without encountering errors or warnings, but they won’t have any effect.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
It is possible to disable the feature through the feature gate while leaving the API group enabled. This enables cleanup through the API.
Re-enabling is supported because DeviceTaintRules remain in etcd even if they are inaccessible and existing taints and tolerations are preserved during updates.
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in
kep.yaml)- Feature gate name: DRADeviceTaints
- Components depending on the feature gate:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- Feature gate name: DRADeviceTaintRules
- Components depending on the feature gate:
- kube-apiserver
- kube-scheduler
- kube-controller-manager
- Other
- Describe the mechanism: resource.k8s.io/v1alpha3 API group
- Will enabling / disabling the feature require downtime of the control plane? Yes, in the apiserver.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. The behavior of scheduling changes when it was in use. Running applications are not affected.
What happens if we reenable the feature if it was previously rolled back?
It takes effect again for scheduling and may evict pods.
Are there any tests for feature enablement/disablement?
This will be covered through unit tests for the apiserver and scheduler.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
If components restart, then work like pod scheduling and eviction continues normally once they are back.
During a partial rollout (feature enabled on at least one API server before it is also enabled in kube-scheduler and kube-controller-manager) a user or DRA driver might taint devices before the rest of the control plane catches up. For NoExecute, the pods will get evicted eventually. For NoSchedule, pods keep running.
What specific metrics should inform a rollback?
The normal health metrics of the control plane components need to be monitored to determine if performance deteriorates.
If the scheduler_pending_pods metric in the kube-scheduler suddenly
increases, then perhaps scheduling no longer works as intended.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Automated upgrade/downgrade testing verifies that:
- A DeviceTaintRule created before a downgrade prevents pod scheduling after a downgrade.
- A pod which gets scheduled because of a toleration is kept running after an upgrade.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
Usage of DeviceTaintRules can be seen in the apiserver’s
apiserver_resource_objects metric with labels group=resource.k8s.io and
resource=devicetaintrules.
Usage of taints in ResourceSlices and tolerations in ResourceClaims can only be observed by inspecting objects.
How can someone using this feature know that it is working for their instance?
- API DeviceTaintRule.Status
- Condition name: EvictionInProgress
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
As for normal pod scheduling of pods using ResourceClaims there is no SLO for scheduling with taints.
Pod eviction is a best-effort deletion of pods. The goal is that it deletes all affected pods eventually, with no performance guarantees.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric names:
device_taint_eviction_controller_pod_deletions_total,device_taint_eviction_controller_pod_deletion_duration_seconds,workqueue_*with name=“device-taint-eviction-controller” - Components exposing the metric: kube-controller-manager
- Metric names:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Applying taints to devices scales with number of DeviceTaintRules * number of ResourceSlices but then may still need to compare device names and of course
modify selected devices.
Handling eviction scales with the number of claims and pods using those claims.
Will enabling / using this feature result in any new API calls?
A fixed, small number of clients (primarily the scheduler and controller manager) need to start watching DeviceTaintRules.
Pods are already watched in the controller manager. Evicting them adds one call per pod.
Will enabling / using this feature result in introducing new API types?
DeviceTaintRules must be created explicitly by admins or controller operated by admins. Kubernetes itself does not create them.
The number of DeviceTaintRules is expected to be orders of magnitude smaller than the number of ResourceSlices.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Enabling it doesn’t. Using tolerations increases the size of ResourceClaims and ResourceClaimTemplates.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Pod scheduling may become a bit slower because of the additional checks, but only when pods use claims. There are no SLI/SLOs for pods using claims.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
For scheduling, tracking taints should be comparable to the overhead for patching attributes.
For eviction, additional data structures will be needed to track taints and tolerations. This should not be too large.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, because the feature is not used on nodes.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Work is halted while the API server or etcd are unavailable and resumes when they come back.
What are other known failure modes?
None known at this point.
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- 1.33: first KEP revision and implementation
- 1.35: revised alpha with
effect: Noneand DeviceTaintRule status - 1.36: graduation to beta (tentative)
Drawbacks
Distributing information across different objects of different types makes it harder for users to get a complete view.
Alternatives
Extending node taint controller
The existing taint-eviction-controller could be extended to cover device taints. However, cloning it lowers the risk of breaking existing stable functionality.
Tolerating taints in pods
Tolerations for device taints could also be added to individual pods. This seems less useful because if pods share the same claim, they are typically part of one larger application with identical tolerations. Experimenting with a new API in the beta ResourceClaim type is a bit easier than it would be in the GA Pod type.
Storing result of patching in ResourceSlice
A controller could read DeviceTaintRules and apply them to ResourceSlices. Then consumers like the scheduler and users would only need to look at ResourceSlices. This has several drawbacks.
We would need to duplicate the taints in the slice status. If we didn’t and directly modified the spec, this controller and the CSI driver as the owner of the slice spec would fight against each other. Also, after removing a patch the original state must be available somewhere, otherwise the controller cannot restore it.
Duplicating the taints might make a slice too large. The limits were chosen so that we have some space left for a status, but not enough for a status that is potentially as large as the spec.
Creating a single DeviceTaintRule could force the controller to update a potentially large number of ResourceSlices. When using rate limiting, updating them all will take longer than client-side patching. When not using rate limiting, this could overwhelm the apiserver.