KEP-5491: DRA: List Types for Attributes

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Dynamic Resource Allocation (DRA) API currently allows only scalar attribute values to describe device characteristics. However, many real-world device topologies require representing sets of relationships (e.g., multiple PCIe roots or NUMA nodes). This KEP introduces list-typed attributes in ResourceSlice and redefines the semantics of ResourceClaim’s constraints[].{matchAttribute, distinctAttribute} so that they apply both to list-typed attributes and to the scalar attributes supported previously.

Motivation

The ResourceSlice API allows users to attach scalar attributes to devices. These can be used to allocate devices that share common topology within the node. For certain types of topological relationships, scalar values are insufficient. For example, a CPU may be adjacent to multiple PCIe roots. This enhancement proposes allowing attributes to be lists. The semantics of the MatchAttribute and DistinctAttribute constraints must adapt to the possibility of lists. For example, rather than defining an attribute “match” as equality, it would be defined as a non-empty intersection, treating scalars as single-element lists. Conversely, “distinct” would be defined as pairwise disjointness: every pair of lists has an empty intersection.

Goals

Support typed lists in device attribute values.
Redefine the semantics of ResourceClaim’s constraints[].{matchAttribute,distinctAttribute} fields as follows so that they work with list-typed attribute values:
- matchAttribute: satisfied when the intersection of the values is non-empty
- distinctAttribute: satisfied when the values are pairwise disjoint
- note: scalar values are treated as single-element lists
Keep constraints monotonic.
- The Allocator’s algorithm currently assumes that all constraints are monotonic: once a constraint returns false, adding more devices never makes it return true. This bounds the computational complexity of searching for device combinations that satisfy the specified constraints. This KEP keeps the matchAttribute/distinctAttribute semantics monotonic.
Maintain backward compatibility and interoperability for scalar-only attributes.
- matchAttribute/distinctAttribute: existing constraints keep working because scalar values are treated as single-element lists.
- CEL expressions in device selectors: when an attribute’s type is changed from scalar to list, existing CEL expressions that compare it with a scalar will fail to compile. We will provide a type-agnostic helper function to make migration easier for users and DRA driver developers.

Non-Goals

Introducing generic or complex boolean logic in constraints (see KEP-5254: DRA: Constraints with CEL).
Forcing all drivers to use list attributes immediately.

Proposal

The proposal has two main parts:

Add list types to DeviceAttribute so that DRA drivers can expose attribute values as typed lists (int, string, boolean, version).
Extend the semantics of the MatchAttribute/DistinctAttribute fields in DeviceConstraint:
- For MatchAttribute:
  - Previously: it matches when the attribute values of all candidate devices are identical (i.e. ∀i,j, v_i = v_j)
  - This KEP: it matches when the intersection (as sets) of the list values of all candidate devices is non-empty (i.e. ∩ v_k ≠ ∅)
- For DistinctAttribute:
  - Previously: it matches when the attribute values of all candidate devices are distinct (i.e. ∀i,j s.t. i ≠ j, v_i ≠ v_j)
  - This KEP: it matches when the list values of all candidate devices are pairwise disjoint (i.e. ∀i,j s.t. i ≠ j, v_i ∩ v_j = ∅)
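The two list-aware definitions can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the allocator's implementation: the helper names `toSet`, `matchAttribute` and `distinctAttribute` are hypothetical, attribute values are modeled as `[]string`, and a scalar is represented as a single-element slice.

```go
package main

import "fmt"

// toSet converts a list of attribute values into a set (duplicates collapse).
func toSet(values []string) map[string]struct{} {
	set := make(map[string]struct{}, len(values))
	for _, v := range values {
		set[v] = struct{}{}
	}
	return set
}

// matchAttribute (hypothetical helper) reports whether the intersection of
// all value lists is non-empty. With no devices it is vacuously true.
func matchAttribute(devices [][]string) bool {
	if len(devices) == 0 {
		return true
	}
	common := toSet(devices[0])
	for _, values := range devices[1:] {
		next := toSet(values)
		for v := range common {
			if _, ok := next[v]; !ok {
				delete(common, v) // drop values this device does not share
			}
		}
	}
	return len(common) > 0
}

// distinctAttribute (hypothetical helper) reports whether the value lists
// are pairwise disjoint, i.e. no value appears on two different devices.
func distinctAttribute(devices [][]string) bool {
	seen := make(map[string]struct{})
	for _, values := range devices {
		own := toSet(values) // duplicates within one device do not count
		for v := range own {
			if _, dup := seen[v]; dup {
				return false
			}
		}
		for v := range own {
			seen[v] = struct{}{}
		}
	}
	return true
}

func main() {
	cpu0 := []string{"pci0000:01", "pci0000:02"}
	cpu1 := []string{"pci0000:03", "pci0000:04"}
	gpu0 := []string{"pci0000:01"} // scalar treated as a single-element list

	fmt.Println(matchAttribute([][]string{cpu0, gpu0}))    // true
	fmt.Println(matchAttribute([][]string{cpu1, gpu0}))    // false
	fmt.Println(distinctAttribute([][]string{cpu0, cpu1})) // true
	fmt.Println(distinctAttribute([][]string{cpu0, gpu0})) // false
}
```

For single-element lists both functions reduce to the previous scalar behavior: equality for match and inequality for distinct.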

API Changes

Introduce typed-`list` in `DeviceAttribute`

kind: ResourceSlice
spec:
  devices:
  - name: typed-list-attributes
    attributes:
      list-of-string:
        strings: ["pci0000:00", "pci0000:01"]
      list-of-int:
        ints: [0, 1, 2]
      list-of-bool:
        bools: [true, false, true]
      list-of-version:
        versions: ["1.0.0", "1.0.1"]

Introduce `.include` function in CEL

When an attribute’s type changes from scalar to list, existing CEL expressions no longer compile because of the type mismatch.

// This CEL won't compile if the type of attributes["foo"] is changed from 1 (scalar) to [1] (list)
attributes["foo"] == 1

To maintain backward compatibility for existing CEL expressions, it might be possible to overload comparison operators (==, etc.) to accept list types, so that attributes["foo"] == 1 is equivalent to attributes["foo"] == [1]. We do not take this approach because it would not be idiomatic, would diverge from normal CEL type-system expectations, and would confuse anyone who already understands how the CEL type system is supposed to work.

Instead, although users need to rewrite existing CEL expressions, we plan to provide a helper function, tentatively named .include, that works in a type-agnostic way to make CEL migration easier:

// assume attributes["foo"] is 1
attributes["foo"].include(1) --> true

// assume attributes["foo"] is [1]
attributes["foo"].include(1) --> true
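The intended behavior of the helper can be modeled in plain Go. This sketch only illustrates the type-agnostic semantics described above; it is not a CEL extension, and the `attribute` type and `include` function are hypothetical names introduced for the example.

```go
package main

import "fmt"

// attribute models a device attribute that holds either one scalar value or
// a list of values. It is a hypothetical stand-in, not the DRA API type.
type attribute[T comparable] struct {
	scalar *T  // set when the driver publishes a scalar
	list   []T // set when the driver publishes a list
}

// include reports whether the attribute contains v, treating a scalar as a
// single-element list so that callers need not know which form is used.
func include[T comparable](a attribute[T], v T) bool {
	if a.scalar != nil {
		return *a.scalar == v
	}
	for _, item := range a.list {
		if item == v {
			return true
		}
	}
	return false
}

func main() {
	one := int64(1)
	scalarAttr := attribute[int64]{scalar: &one}
	listAttr := attribute[int64]{list: []int64{1}}

	fmt.Println(include(scalarAttr, 1)) // true
	fmt.Println(include(listAttr, 1))   // true
	fmt.Println(include(listAttr, 2))   // false
}
```

An expression written against such a helper keeps the same result when a driver later switches the attribute from a scalar to a list.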

User Stories (Optional)

Story 1: Hardware Topological Aligned CPUs & GPUs & NICs

Assume several DRA drivers expose the device attribute resource.kubernetes.io/pcieRoot:

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: cpu
spec:
  driver: "cpu.example.com"
  pool:
    name: "cpu"
    resourceSliceCount: 1
  nodeName: node-1
  devices:
  - name: "cpu-0"
    attributes:
      resource.kubernetes.io/pcieRoot:
        strings:
        - pci0000:01
        - pci0000:02
  - name: "cpu-1"
    attributes:
      resource.kubernetes.io/pcieRoot:
        strings:
        - pci0000:03
        - pci0000:04
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: gpu
spec:
  driver: "gpu.example.com"
  pool:
    name: "gpu"
    resourceSliceCount: 1
  nodeName: node-1
  devices:
  - name: "gpu-0"
    attributes:
      # Assume this is an older driver that still exposes a scalar string for the attribute
      resource.kubernetes.io/pcieRoot:
        string: pci0000:01
---
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: nic
spec:
  driver: "nic.example.com"
  pool:
    name: "nic"
    resourceSliceCount: 1
  nodeName: node-1
  devices:
  - name: "nic-0"
    attributes:
      # Assume this is an older driver that still exposes a scalar string for the attribute
      resource.kubernetes.io/pcieRoot:
        string: pci0000:01

Then, a user can create a ResourceClaim that requests a PCIe-topology-aligned CPU, GPU and NIC triple as follows:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
spec:
  requests:
  - name: "gpu"
    exactly:
      deviceClassName: gpu.example.com
      count: 1
  - name: "nic"
    exactly:
      deviceClassName: nic.example.com
      count: 1
  - name: "cpu"
    exactly:
      deviceClassName: cpu.example.com
      count: 1
  constraints:
    # "gpu-0", "nic-0" and "cpu-0" above can match because
    # - "pci0000:01" is common to all three.
    # - a string attribute is treated as a single-element list.
  - requests: ["gpu", "nic", "cpu"]
    matchAttribute: resource.kubernetes.io/pcieRoot
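To make the matching in this story concrete, the following Go sketch computes the PCIe roots shared by a candidate set of devices, using the attribute values from the ResourceSlices above. The helper `commonRoots` is a hypothetical name used only for illustration, and scalar attributes are written as single-element slices.

```go
package main

import (
	"fmt"
	"sort"
)

// commonRoots (hypothetical helper) returns the sorted PCIe roots that every
// given device reports. An empty result means the candidate set does not
// satisfy the matchAttribute constraint.
func commonRoots(devices ...[]string) []string {
	counts := make(map[string]int)
	for _, roots := range devices {
		seen := make(map[string]bool)
		for _, r := range roots {
			if !seen[r] { // count each root at most once per device
				seen[r] = true
				counts[r]++
			}
		}
	}
	var common []string
	for r, n := range counts {
		if n == len(devices) {
			common = append(common, r)
		}
	}
	sort.Strings(common)
	return common
}

func main() {
	gpu0 := []string{"pci0000:01"} // scalar published by the GPU driver
	nic0 := []string{"pci0000:01"} // scalar published by the NIC driver
	cpu0 := []string{"pci0000:01", "pci0000:02"}
	cpu1 := []string{"pci0000:03", "pci0000:04"}

	fmt.Println(commonRoots(gpu0, nic0, cpu0)) // [pci0000:01]
	fmt.Println(commonRoots(gpu0, nic0, cpu1)) // []
}
```

Only the combination with "cpu-0" has a common root, so it is the only CPU that can be allocated together with "gpu-0" and "nic-0" under this constraint.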

Story 2

T.B.D.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Risk 1: Driver adoption lag
- Mitigation: scalar values are treated as single-element lists, so scalar-only drivers keep working with the list-aware constraints
Risk 2: Scheduler performance overhead
- Mitigation: bound the length of list-typed attribute values

Design Details

Go Type Definitions

`DeviceAttribute`

Note: The total number of individual attribute values per device (scalar fields plus all list elements combined) is limited to 48 (see ResourceSliceMaxAttributeValuesPerDevice). When any device in a ResourceSlice uses this feature or other advanced features such as taints, the ResourceSlice is limited to at most 64 devices (see ResourceSliceMaxDevicesWithAdvancedFeatures).
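The per-device counting rule in the note can be sketched as follows. This is an illustration under assumptions, not the actual API validation code: `deviceAttribute` is a simplified stand-in that models only a subset of the fields, and `countValues` and `validateDevice` are hypothetical helper names. Only the limit of 48 comes from the note above.

```go
package main

import "fmt"

// maxAttributeValuesPerDevice mirrors the limit of 48 quoted in the note.
const maxAttributeValuesPerDevice = 48

// deviceAttribute is a simplified, hypothetical stand-in for DeviceAttribute;
// only string and int fields are modeled here.
type deviceAttribute struct {
	StringValue  *string
	IntValue     *int64
	StringValues []string
	IntValues    []int64
}

// countValues counts one value per set scalar field plus one per list element.
func countValues(attrs map[string]deviceAttribute) int {
	total := 0
	for _, a := range attrs {
		if a.StringValue != nil {
			total++
		}
		if a.IntValue != nil {
			total++
		}
		total += len(a.StringValues) + len(a.IntValues)
	}
	return total
}

// validateDevice returns an error when the combined count exceeds the limit.
func validateDevice(attrs map[string]deviceAttribute) error {
	if n := countValues(attrs); n > maxAttributeValuesPerDevice {
		return fmt.Errorf("device has %d attribute values, limit is %d", n, maxAttributeValuesPerDevice)
	}
	return nil
}

func main() {
	model := "example-model" // illustrative value only
	attrs := map[string]deviceAttribute{
		"model":    {StringValue: &model},
		"pcieRoot": {StringValues: []string{"pci0000:01", "pci0000:02"}},
		"numaNode": {IntValues: []int64{0, 1}},
	}
	fmt.Println(countValues(attrs))    // 5
	fmt.Println(validateDevice(attrs)) // <nil>
}
```

Because list elements count individually, one long list can consume most of a device's budget even when the device has few attributes.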

type DeviceAttribute struct {
    ...

	// IntValues is a non-empty list of numbers.
	//
	// This is an alpha field and requires enabling the DRAListTypeAttributes feature gate.
	//
	// +optional
	// +listType=atomic
	// +k8s:listType=atomic
	// +k8s:alpha(since: "1.36")=+k8s:optional
	// +k8s:alpha(since: "1.36")=+k8s:unionMember
	// +featureGate=DRAListTypeAttributes
	IntValues []int64 `json:"ints,omitempty" protobuf:"varint,6,opt,name=ints"`

	// BoolValues is a non-empty list of true/false values.
	//
	// This is an alpha field and requires enabling the DRAListTypeAttributes feature gate.
	//
	// +optional
	// +listType=atomic
	// +k8s:listType=atomic
	// +k8s:alpha(since: "1.36")=+k8s:optional
	// +k8s:alpha(since: "1.36")=+k8s:unionMember
	// +featureGate=DRAListTypeAttributes
	BoolValues []bool `json:"bools,omitempty" protobuf:"varint,7,opt,name=bools"`

	// StringValues is a non-empty list of strings.
	// Each string must not be longer than 64 characters.
	//
	// This is an alpha field and requires enabling the DRAListTypeAttributes feature gate.
	//
	// +optional
	// +listType=atomic
	// +k8s:listType=atomic
	// +k8s:alpha(since: "1.36")=+k8s:optional
	// +k8s:alpha(since: "1.36")=+k8s:unionMember
	// +k8s:alpha(since: "1.37")=+k8s:eachVal=+k8s:maxBytes=64
	// +featureGate=DRAListTypeAttributes
	StringValues []string `json:"strings,omitempty" protobuf:"bytes,8,opt,name=strings"`

	// VersionValues is a non-empty list of semantic versions according to semver.org spec 2.0.0.
	// Each version string must not be longer than 64 characters.
	//
	// This is an alpha field and requires enabling the DRAListTypeAttributes feature gate.
	//
	// +optional
	// +listType=atomic
	// +k8s:listType=atomic
	// +k8s:alpha(since: "1.36")=+k8s:optional
	// +k8s:alpha(since: "1.36")=+k8s:unionMember
	// +featureGate=DRAListTypeAttributes
	VersionValues []string `json:"versions,omitempty" protobuf:"bytes,9,opt,name=versions"`
}

Implementation (for evaluating constraints)

Since the non-empty intersection constraint is monotonic, Allocator.Allocate() does not need a new algorithm and can keep using the existing constraint interface. We will simply extend the current matchAttributeConstraint and distinctAttributeConstraint implementations. Alternatively, we could introduce separate constraint implementations for the proposed semantics (e.g., nonEmptyIntersectionMatchAttributeConstraint).
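The monotonicity argument can be illustrated with an incremental constraint that supports adding a device and backtracking. This is a sketch under assumptions: the `constraint` interface below is a simplified, hypothetical version of the allocator's interface, and `intersectionConstraint` is an illustrative type, not the planned implementation.

```go
package main

import "fmt"

// constraint is a simplified, hypothetical version of the allocator's
// constraint interface: add tries to include one more device, remove undoes
// the most recent successful add.
type constraint interface {
	add(values []string) bool
	remove()
}

// intersectionConstraint tracks the running intersection of attribute values.
// stack[i] holds the intersection after i+1 devices have been added.
type intersectionConstraint struct {
	stack []map[string]struct{}
}

var _ constraint = (*intersectionConstraint)(nil)

func (c *intersectionConstraint) add(values []string) bool {
	next := make(map[string]struct{})
	if len(c.stack) == 0 {
		for _, v := range values {
			next[v] = struct{}{}
		}
	} else {
		current := c.stack[len(c.stack)-1]
		for _, v := range values {
			if _, ok := current[v]; ok {
				next[v] = struct{}{}
			}
		}
	}
	if len(next) == 0 {
		return false // rejected; the state is left unchanged
	}
	c.stack = append(c.stack, next)
	return true
}

func (c *intersectionConstraint) remove() {
	if len(c.stack) > 0 {
		c.stack = c.stack[:len(c.stack)-1]
	}
}

func main() {
	var c constraint = &intersectionConstraint{}
	fmt.Println(c.add([]string{"pci0000:01", "pci0000:02"})) // true: first device
	fmt.Println(c.add([]string{"pci0000:03"}))               // false: intersection would be empty
	fmt.Println(c.add([]string{"pci0000:01"}))               // true: {"pci0000:01"} remains
	c.remove()                                               // backtrack the last device
	fmt.Println(c.add([]string{"pci0000:02"}))               // true: {"pci0000:02"} remains
}
```

Each add can only shrink the running intersection, so once a combination is rejected no further device can make it acceptable, which is the property the search relies on.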

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

<package>: <date> - <test coverage>

Integration tests

test name : integration master , triage search

e2e tests

test name : SIG … , triage search

Graduation Criteria

Alpha

Feature implemented behind a feature gate (DRAListTypeAttributes). The feature gate is disabled by default.
Documentation provided.
Initial unit, integration and e2e tests completed and enabled.
All issues identified in the initial implementation (https://github.com/kubernetes/kubernetes/issues/137905) are resolved.

Beta

Feature gate is enabled by default.
No major outstanding bugs.
1 example of real-world use case.
Feedback collected from the community (developers and users) with adjustments provided, implemented and tested.

GA

2 examples of real-world use cases.
Allowing time for feedback from developers and users.

Upgrade / Downgrade Strategy

Version Skew Strategy

For upgrade, existing ResourceClaims and ResourceSlices will still work as expected, because they do not contain the new fields.

For downgrade, caution is needed when a ResourceClaim with a matchSemantics/distinctSemantics field or a ResourceSlice with list-typed attribute values exists. Already allocated claims are not affected, but on re-allocation matchSemantics/distinctSemantics will be ignored, and if the attribute specified in matchAttribute/distinctAttribute is list-typed, allocation will fail.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DRAListTypeAttributes
- Components depending on the feature gate: kube-apiserver, kube-scheduler
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

Basically, no. It only introduces new API fields in ResourceClaim and ResourceSlice, which do NOT change the default behavior as long as no device attribute type is changed.

However, note that the semantics of ResourceClaim’s matchAttribute/distinctAttribute DO change when a device attribute’s type is changed from scalar to list.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. When disabled, DeviceAttributes with list-typed values cannot be created, and existing list-typed attribute values are ignored. However, if the attribute specified in matchAttribute/distinctAttribute is list-typed, allocation will fail.

What happens if we reenable the feature if it was previously rolled back?

List-typed attribute values in DeviceAttribute and the list-aware matchAttribute/distinctAttribute semantics in ResourceClaim will be available again.

Are there any tests for feature enablement/disablement?

Yes, it will be covered by unit tests.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes and no. It adds new fields, which increases the worst-case size of ResourceSlice and ResourceClaim objects. However, the size increase is bounded in most cases:

ResourceClaim: linear in the number of constraints specified in the resource.
ResourceSlice: linear in the number of devices defined in the resource. The number of list items is also bounded.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Not expected. All constraints proposed in this KEP are monotonic, so the worst-case computational complexity of the device search is unchanged.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Just support formatted string list instead of introducing `list` type

We could add pseudo-list support only for string attributes (e.g. a comma-separated string).

Pros:
- Simple, no change in DeviceAttribute
Cons:
- String lists only (cannot support lists of int/version)
- Prone to misformatted strings
- Extra parsing cost
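The parsing burden of this alternative can be illustrated with a short Go sketch. The comma-separated format and the `parseList` helper are assumptions made for the example; the alternative does not define an exact encoding.

```go
package main

import (
	"fmt"
	"strings"
)

// parseList (hypothetical helper) splits a comma-separated attribute string
// into items, trimming whitespace and rejecting empty items.
func parseList(raw string) ([]string, error) {
	parts := strings.Split(raw, ",")
	items := make([]string, 0, len(parts))
	for _, p := range parts {
		item := strings.TrimSpace(p)
		if item == "" {
			return nil, fmt.Errorf("empty item in %q", raw)
		}
		items = append(items, item)
	}
	return items, nil
}

func main() {
	// A well-formed value parses into two items.
	fmt.Println(parseList("pci0000:01, pci0000:02"))
	// A stray comma can only be detected when the value is parsed.
	fmt.Println(parseList("pci0000:01,,pci0000:02"))
}
```

Every consumer of the attribute would need the same parsing and error handling, whereas a typed list lets API validation reject malformed values up front.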

Introduce `matchSemantics/distinctSemantics` field for flexible/declarative match

Introduce matchSemantics/distinctSemantics fields into constraints field like this:

`matchSemantics` field

kind: ResourceClaim
spec:
  constraints:
  - requests: [ "device1", "device2", "device3" ]
    matchAttribute: "resource.kubernetes.io/pcieRoot"

    # [NEW]
    # An optional field that defines customized "match" semantics over attribute values.
    # This field must not be set when "distinctAttribute" is set
    matchSemantics:
      # mode specifies the "match" semantics
      # Identical (∀i,j, v_i = v_j):
      #   All the attribute values among candidate devices are identical,
      #   supporting both list-order-sensitive and set-equivalence comparisons via `listMode`.
      # NonEmptyIntersection (|∩ v_i| >= k (>=1)): 
      #   The intersection (as a set) of list values among candidate devices is non-empty.
      #   The required intersection size could be configurable via `minSize`.
      # For future possible cases:
      # - CommonPrefix/Suffix with customizable length
      # - Identical for aggregated values of the list items (min/max/sum/length)
      mode: Identical | NonEmptyIntersection

      options:
        nonEmptyIntersection:
          # if true, implicit cast from scalar to list will be performed. The default is false.
          coerceScalarToList: true | false
          # minSize specifies the minimum size of the intersection to evaluate as true.
          # Default is 1. The value must be positive integer.
          minSize: 1
        identical:
          coerceScalarToList: true | false    # common option
          # listMode specifies whether equality is evaluated as a set (order/duplicates are ignored) or as a list (order significant). Default is List
          listMode: List | Set

Examples of match semantics mode:

| attribute values | `Identical` | `NonEmptyIntersection` (`coerceScalarToList=true`) |
| --- | --- | --- |
| `d1="a"`, `d2="b"` | `false` | `false` |
| `d1=["a", "b"]`, `d2=["b", "a"]` | `false` (`listMode: List`), `true` (`listMode: Set`) | `true` (`d1 ∩ d2 = {"a", "b"}`) |
| `d1=["a", "b"]`, `d2=["a", "c"]` | `false` | `true` (`d1 ∩ d2 = {"a"}`) |
| `d1=["a", "b"]`, `d2=["c", "d"]` | `false` | `false` (`d1 ∩ d2 = ∅`) |

`distinctSemantics` field

kind: ResourceClaim
spec:
  constraints:
  - requests: [ "device1", "device2", "device3" ]
    distinctAttribute: "resource.kubernetes.io/numaNode" # note: this is imaginary attribute.

    # [NEW]
    # An optional field that defines customized "distinct" semantics over attribute values.
    # This field must not be set when "matchAttribute" is set
    distinctSemantics:
      # mode specifies the "distinct" semantics
      # `AllDistinct`:
      #   All the values are distinct, supporting both list-order-sensitive and set-equivalence comparisons via `listMode`.
      #   (i.e. ∀i,j s.t. i ≠ j, v_i != v_j), 
      # `EmptyIntersection`:
      #   The intersection (as a set) of all the list values among candidate devices is empty. (i.e. ∩ v_k = ∅ )
      # `PairwiseDisjoint`:
      #   Every pair of the list values (as a set) of candidate devices is disjoint (i.e. completely no overlap).
      #   (i.e. ∀i,j s.t. i ≠ j, v_i ∩ v_j = ∅),
      # For future possible cases:
      # - NoCommonPrefix/Suffix, PairwiseDisjointPrefix/Suffix with customizable length
      # - AllDistinct for aggregated values of the list items (min/max/sum/length)
      mode: AllDistinct | EmptyIntersection | PairwiseDisjoint

      options:
        allDistinct:
          coerceScalarToList: true | false    # common option
          # listMode specifies whether equality is evaluated as a set (order/duplicates are ignored) or as a list (order significant). Default is List
          listMode: List | Set
        emptyIntersection:
          coerceScalarToList: true | false    # common option
        pairwiseDisjoint:
          coerceScalarToList: true | false    # common option

Examples of distinct semantics mode:

| attribute values | `AllDistinct` | `PairwiseDisjoint` (`coerceScalarToList=true`) | `EmptyIntersection` (`coerceScalarToList=true`) |
| --- | --- | --- | --- |
| `d1="a"`, `d2="b"` | `true` | `true` | `true` |
| `d1=["a", "b"]`, `d2=["b", "a"]` | `true` (`listMode: List`), `false` (`listMode: Set`) | `false` (`d1 ∩ d2 = {"a", "b"}`) | `false` (`∩ dk = {"a", "b"}`) |
| `d1=["a", "b"]`, `d2=["a", "c"]`, `d3=["a", "d"]` | `true` | `false` (`di ∩ dj = {"a"} ≠ ∅`) | `false` (`∩ dk = {"a"} ≠ ∅`) |
| `d1=["a", "b"]`, `d2=["b", "c"]`, `d3=["c", "a"]` | `true` | `false` (`di ∩ dj ≠ ∅`) | `true` (`∩ dk = ∅`) |
| `d1=["a", "b"]`, `d2=["c", "d"]`, `d3=["e", "f"]` | `true` | `true` (`di ∩ dj = ∅`) | `true` (`∩ dk = ∅`) |

Pros/Cons

Pros:
- Flexible
- Declarative
- Extensible
Cons:
- Too complex, especially since we do not yet have use cases that justify the complexity

Unified `semantics` field instead of `matchSemantics`/`distinctSemantics`

We could consider a unified semantics field for both matchAttribute and distinctAttribute, like below:

semantics:
  mode: NonEmptyIntersection | EmptyIntersection | Identical | AllDistinct | PairwiseDisjoint

Pros:
- Simple
Cons:
- Confusing as to which modes are valid for matchAttribute versus distinctAttribute
- Extra validation logic
