KEP-3633: Introduce MatchLabelKeys and MismatchLabelKeys to PodAffinity and PodAntiAffinity

Implementation History
STABLE Implemented
Created 2022-11-09
Latest v1.33
Milestones
Alpha v1.29
Beta v1.31
Stable v1.33
Ownership
Owning SIG
SIG Scheduling
Primary Authors

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes introducing the complementary fields MatchLabelKeys and MismatchLabelKeys to PodAffinityTerm. They enable users to finely control the scope in which Pods are expected to co-exist (PodAffinity) or not (PodAntiAffinity), on top of the existing LabelSelector.

Motivation

During a workload’s rolling upgrade, depending on its upgradeStrategy, old and new versions of Pods may co-exist in the cluster. As the scheduler cannot distinguish “old” from “new”, it cannot properly honor the API semantics of PodAffinity and PodAntiAffinity during the upgrade. In the worst case, the new version of a Pod cannot be scheduled if the cluster is saturated: 1, 2, 3.

On the other hand, on an idle cluster, this can make the scheduling result sub-optimal because some qualifying Nodes are filtered out incorrectly.

The same issue applies to other scheduling directives as well. For example, MatchLabelKeys was introduced in topologySpreadConstraints in KEP-3243: Respect PodTopologySpread after rolling upgrades.

Goals

  • Introduce MatchLabelKeys and MismatchLabelKeys in PodAffinityTerm to let users define the scope where Pods are evaluated in required and preferred Pod(Anti)Affinity.

Non-Goals

  • Apply additional internal labels when evaluating MatchLabelKeys or MismatchLabelKeys

Proposal

User Stories (Optional)

Story 1

Users run a rolling update with a Deployment that uses required PodAffinity, and they want only replicas from the same ReplicaSet to be evaluated.

The Deployment controller adds the pod-template-hash label to the underlying ReplicaSet, and thus every Pod created from the Deployment carries the hash string.

Therefore, users can specify pod-template-hash in matchLabelKeys to inform the scheduler to only evaluate Pods with the same pod-template-hash value.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-server
...
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: topology.kubernetes.io/zone
        matchLabelKeys: # ADDED
        - pod-template-hash

Story 2

Let’s say all Pods of each tenant get a tenant label via a controller or a manifest management tool like Helm. Although the value of the tenant label is unknown when composing the workload’s manifest, the cluster admin still wants to achieve exclusive 1:1 tenant-to-domain placement.

By applying the following affinity globally using a mutating webhook, the cluster admin can ensure that the Pods from the same tenant will land on the same domain exclusively, meaning Pods from other tenants won’t land on the same domain.

affinity:
  podAffinity:      # ensures the pods of this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - matchLabelKeys:
        - tenant
      topologyKey: node-pool
  podAntiAffinity:  # ensures only Pods from this tenant land on the same node pool
    requiredDuringSchedulingIgnoredDuringExecution:
    - mismatchLabelKeys:
        - tenant
      labelSelector:
        matchExpressions:
        - key: tenant
          operator: Exists
      topologyKey: node-pool

Notes/Constraints/Caveats (Optional)

In most scenarios, users can use the label keyed with pod-template-hash added automatically by the Deployment controller to distinguish between different revisions in a single Deployment. But for more complex scenarios (e.g., Pod(Anti)Affinity associating two deployments at the same time), users are responsible for providing common labels to identify which pods should be grouped.

Risks and Mitigations

In addition to using the pod-template-hash label added by the Deployment controller, users can also provide custom keys in matchLabelKeys to identify which Pods should be grouped. If so, users need to ensure that the label is correct and is not also used by other, unrelated workloads.

Design Details

New optional fields matchLabelKeys and mismatchLabelKeys are introduced to PodAffinityTerm.

type MatchLabelSelector struct {
  // Key is used to lookup value from the incoming pod labels, 
  // and that key-value label is merged with `LabelSelector`.
  // Key that doesn't exist in the incoming pod labels will be ignored. 
  Key  string
  // Operator defines how key-value, fetched via the above `Key`, is merged into LabelSelector.
  // Only `In` and `NotIn` are expected.
  // If Operator is `In`, `key in (value)` is merged with LabelSelector. 
  // If Operator is `NotIn`, `key notin (value)` is merged with LabelSelector. 
  //
  // +optional
  Operator       LabelSelectorOperator
}

type PodAffinityTerm struct {
  LabelSelector *metav1.LabelSelector
  Namespaces []string
  TopologyKey string
  NamespaceSelector *metav1.LabelSelector

  // MatchLabelKeys is a set of pod label keys to select which pods will
  // be taken into consideration. The keys are used to lookup values from the
  // incoming pod labels, those key-value labels are merged with `LabelSelector` as `key in (value)`
  // to select the group of existing pods which pods will be taken into consideration
  // for the incoming pod's pod (anti) affinity. Keys that don't exist in the incoming
  // pod labels will be ignored. The default value is empty.
  // +optional
  MatchLabelKeys []string

  // MismatchLabelKeys is a set of pod label keys to select which pods will
  // be taken into consideration. The keys are used to lookup values from the
  // incoming pod labels, those key-value labels are merged with `LabelSelector` as `key notin (value)`
  // to select the group of existing pods which pods will be taken into consideration
  // for the incoming pod's pod (anti) affinity. Keys that don't exist in the incoming
  // pod labels will be ignored. The default value is empty.
  // +optional
  MismatchLabelKeys []string
}

When a Pod is created, kube-apiserver looks up the values of the keys listed in matchLabelKeys or mismatchLabelKeys in the Pod’s own labels, and merges them into the LabelSelector of the PodAffinityTerm depending on the field:

  • For matchLabelKeys, key in (value) is merged with LabelSelector.
  • For mismatchLabelKeys, key notin (value) is merged with LabelSelector.

For example, when this sample Pod is created,

apiVersion: v1
kind: Pod
metadata:
  name: sample
  namespace: sample-namespace
  labels:
    tenant: tenant-a
...
  affinity:
    podAntiAffinity:  
      requiredDuringSchedulingIgnoredDuringExecution:
      - mismatchLabelKeys:
          - tenant
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: Exists
        topologyKey: node-pool

kube-apiserver modifies the labelSelector like the following:

affinity:
  podAntiAffinity:  
    requiredDuringSchedulingIgnoredDuringExecution:
      - mismatchLabelKeys:
          - tenant
        labelSelector:
          matchExpressions:
          - key: tenant
            operator: Exists
+         - key: tenant
+           operator: NotIn
+           values: 
+             - tenant-a
        topologyKey: node-pool
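
The merge step described above can be sketched in Go as follows. This is a minimal illustration, not the kube-apiserver implementation: `Requirement`, `Term`, and `mergeLabelKeys` are hypothetical names, and the types are simplified stand-ins for `metav1.LabelSelectorRequirement` and `PodAffinityTerm`.

```go
package main

// Requirement is a simplified, hypothetical stand-in for
// metav1.LabelSelectorRequirement.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists", ...
	Values   []string
}

// Term is a simplified, hypothetical stand-in for PodAffinityTerm.
// Only the fields relevant to this KEP are modeled.
type Term struct {
	MatchExpressions  []Requirement
	MatchLabelKeys    []string
	MismatchLabelKeys []string
}

// mergeLabelKeys returns a copy of term whose MatchExpressions additionally
// contain one requirement per key that is present in podLabels:
// `key In (value)` for MatchLabelKeys and `key NotIn (value)` for
// MismatchLabelKeys. Keys missing from podLabels are ignored.
func mergeLabelKeys(podLabels map[string]string, term Term) Term {
	out := term
	// Copy the slice so the caller's term is not modified.
	out.MatchExpressions = append([]Requirement(nil), term.MatchExpressions...)

	for _, key := range term.MatchLabelKeys {
		if value, ok := podLabels[key]; ok {
			out.MatchExpressions = append(out.MatchExpressions,
				Requirement{Key: key, Operator: "In", Values: []string{value}})
		}
	}
	for _, key := range term.MismatchLabelKeys {
		if value, ok := podLabels[key]; ok {
			out.MatchExpressions = append(out.MatchExpressions,
				Requirement{Key: key, Operator: "NotIn", Values: []string{value}})
		}
	}
	return out
}
```

Under these assumptions, calling `mergeLabelKeys` with the labels `{tenant: tenant-a}` and the term from the sample Pod yields the original `Exists` requirement followed by `tenant NotIn (tenant-a)`, which corresponds to the modified labelSelector shown above.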

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go: 2022-11-09 14:43 JST (The commit hash: 20a9f7786aa4ee0b6e1619c7974ea4562d2b2500) - 91.7%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go: 2022-11-09 14:43 JST (The commit hash: 20a9f7786aa4ee0b6e1619c7974ea4562d2b2500) - 87.3%
Integration tests
  • These tests will be added.
    • matchLabelKeys/mismatchLabelKeys in PodAffinity (both in Filter and Score) works as expected.
    • matchLabelKeys/mismatchLabelKeys in PodAntiAffinity (both in Filter and Score) works as expected.
    • matchLabelKeys/mismatchLabelKeys with the feature gate enabled/disabled.

Filter

Score

e2e tests

N/A

This feature doesn’t introduce any new API endpoints and doesn’t interact with other components. Therefore, e2e tests would not add value beyond the integration tests.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Unit tests and integration tests are implemented
  • No significant performance degradation is observed from the benchmark test

Beta

  • The feature gate is enabled by default.

GA

  • No negative feedback.
  • No bug issues reported.

Upgrade / Downgrade Strategy

Upgrade

The previous PodAffinity/PodAntiAffinity behavior will not be broken. Users can continue to use their Pod specs as they are.

To use this enhancement, users need to enable the feature gate (while the feature is in alpha) and add matchLabelKeys/mismatchLabelKeys to their PodAffinity/PodAntiAffinity.

Downgrade

kube-apiserver will reject Pod creation with matchLabelKeys/mismatchLabelKeys in PodAffinity/PodAntiAffinity. However, for existing Pods, matchLabelKeys/mismatchLabelKeys and the generated LabelSelector are left in place even after the downgrade.

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: MatchLabelKeysInPodAffinity
    • Components depending on the feature gate: kube-apiserver
  • Other
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver with the feature gate turned off. In Stable versions, users can opt out by not setting the matchLabelKeys/mismatchLabelKeys fields.

What happens if we reenable the feature if it was previously rolled back?

Scheduling of newly created Pods with matchLabelKeys or mismatchLabelKeys set is affected. Already existing Pods are unaffected.

Are there any tests for feature enablement/disablement?

No. However, tests that confirm the behavior when the feature gate is switched will be added; see https://github.com/kubernetes/kubernetes/issues/123156.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It shouldn’t impact already running workloads. It’s an opt-in feature, and users need to set matchLabelKeys or mismatchLabelKeys field in PodAffinity or PodAntiAffinity to use this feature.

When this feature is disabled by the feature gate, the matchLabelKeys/mismatchLabelKeys of already created Pods are kept and their labelSelector is not reverted. However, the matchLabelKeys or mismatchLabelKeys field of a newly created Pod is ignored and silently dropped.

What specific metrics should inform a rollback?
  • A spike on metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.

A bug in this feature can only occur in the Pod creation path in kube-apiserver, and it would result in unintended scheduling.

The scheduler’s latency may also increase because of the additional calculation for the label selector generated from matchLabelKeys/mismatchLabelKeys. However, the increase should be tiny: the scheduler itself is not changed by this feature, and using matchLabelKeys/mismatchLabelKeys is equivalent to adding Pods with additional label selectors to the cluster.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

We’ll test it via this scenario:

  1. start kube-apiserver with this feature disabled.
  2. create one Pod with matchLabelKeys in PodAffinity.
  3. No change should be observed in labelSelector in PodAffinity because the feature is disabled.
  4. restart kube-apiserver with this feature enabled. (First enablement)
  5. No change in Pod created at (2).
  6. create one Pod with matchLabelKeys in PodAffinity.
  7. labelSelector in PodAffinity should be changed based on matchLabelKeys and label value in the Pod because the feature is enabled.
  8. restart kube-apiserver with this feature disabled. (First disablement)
  9. No change in Pods created before.
  10. create one Pod with matchLabelKeys in PodAffinity.
  11. No change should be observed in labelSelector in PodAffinity because the feature is disabled.
  12. restart kube-apiserver with this feature enabled. (Second enablement)
  13. No change in Pods created before.
  14. create one Pod with matchLabelKeys in PodAffinity.
  15. labelSelector in PodAffinity should be changed based on matchLabelKeys and label value in the Pod because the feature is enabled.
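
The scenario above can be modeled with a small Go sketch. It assumes a hypothetical, heavily simplified `Pod` type and hypothetical helpers (`createPod`, `runScenario`), and it models the labelSelector as a plain key-to-value map. It only illustrates the expected observations and is not the actual test.

```go
package main

// Pod is a hypothetical, simplified model of a Pod with one affinity term.
type Pod struct {
	Labels         map[string]string
	MatchLabelKeys []string
	Selector       map[string]string // stand-in for labelSelector: key -> required value
}

// createPod models Pod creation. With the gate disabled, matchLabelKeys is
// dropped and the selector is stored unchanged. With the gate enabled, every
// key found in the Pod's labels is merged into the selector.
func createPod(gateEnabled bool, in Pod) Pod {
	out := Pod{Labels: in.Labels, Selector: map[string]string{}}
	for k, v := range in.Selector {
		out.Selector[k] = v
	}
	if !gateEnabled {
		return out // matchLabelKeys is silently dropped
	}
	out.MatchLabelKeys = append([]string(nil), in.MatchLabelKeys...)
	for _, key := range in.MatchLabelKeys {
		if v, ok := in.Labels[key]; ok {
			out.Selector[key] = v
		}
	}
	return out
}

// runScenario creates one Pod from template after each (re)start of the API
// server, where gateSequence holds the gate state for each start. Stored Pods
// are never rewritten, which models "No change in Pods created before".
func runScenario(gateSequence []bool, template Pod) []Pod {
	var stored []Pod
	for _, enabled := range gateSequence {
		stored = append(stored, createPod(enabled, template))
	}
	return stored
}
```

Running `runScenario` with the gate sequence disabled, enabled, disabled, enabled mirrors steps 1 to 15: only the Pods created while the gate is enabled carry the merged selector entry.
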
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The operator can query pods with matchLabelKeys or mismatchLabelKeys field set in PodAffinity or PodAntiAffinity.
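
As an illustration of such a query, the Go sketch below filters a list of Pods down to those that set either field. The `AffinityTerm` and `PodInfo` types and the `podsUsingLabelKeys` function are hypothetical simplifications; a real check would walk the required and preferred affinity terms of Pods returned by the API server.

```go
package main

// AffinityTerm is a hypothetical, simplified view of a PodAffinityTerm.
type AffinityTerm struct {
	MatchLabelKeys    []string
	MismatchLabelKeys []string
}

// PodInfo is a hypothetical summary of a Pod: its name plus all of its
// PodAffinity and PodAntiAffinity terms (required and preferred).
type PodInfo struct {
	Name  string
	Terms []AffinityTerm
}

// podsUsingLabelKeys returns the names of Pods in which at least one affinity
// term sets matchLabelKeys or mismatchLabelKeys, preserving input order.
func podsUsingLabelKeys(pods []PodInfo) []string {
	var names []string
	for _, p := range pods {
		for _, t := range p.Terms {
			if len(t.MatchLabelKeys) > 0 || len(t.MismatchLabelKeys) > 0 {
				names = append(names, p.Name)
				break // one matching term is enough for this Pod
			}
		}
	}
	return names
}
```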

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: This feature doesn’t produce any logs, events, or Pod status updates. However, users can confirm that it is being evaluated by looking at the labelSelector in the PodAffinity or PodAntiAffinity in which they set matchLabelKeys or mismatchLabelKeys. If the labelSelector is modified after the Pod’s creation, this feature is working correctly.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Metric plugin_execution_duration_seconds{plugin="InterPodAffinity"} <= 100ms at the 90th percentile.

This feature shouldn’t change the latency of InterPodAffinity plugin at all.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: schedule_attempts_total{result="error|unschedulable"}
    • Components exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes. There is additional work: kube-apiserver uses the keys in matchLabelKeys or mismatchLabelKeys to look up label values from the Pod, and changes labelSelector according to them. The impact on the latency of Pod creation requests in kube-apiserver should be negligible.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server and/or etcd is not available, this feature will not be available, because it depends on Pod creation.

What are other known failure modes?

N/A.

What steps should be taken if SLOs are not being met to determine the problem?
  • Check the metric schedule_attempts_total{result="error|unschedulable"} to determine whether the number of attempts has increased. If it has, determine the cause of the failure from the Pod’s events. If the failure is caused by the InterPodAffinity plugin, analyze the problem further by looking at the labelSelector in the PodAffinity/PodAntiAffinity that uses matchLabelKeys/mismatchLabelKeys. If the labelSelector appears to have been updated incorrectly, the problem is in this feature.

Implementation History

  • 2022-11-09: Initial KEP PR is submitted.
  • 2023-05-14 / 2023-06-08: PRs to change it from MatchLabelKeys to MatchLabelSelector are submitted. (to satisfy the user story 2)
  • 2024-01-28: The PR to update KEP for beta is submitted.
  • 2024-03-14: The PR to change the feature gate for beta is merged.
  • 2025-01-06: The PR to update KEP for GA is submitted.

Drawbacks

Alternatives

implement as a new enum in LabelSelector

Implement new enum values ExistsWithSameValue and ExistsWithDifferentValue in LabelSelector.

  • ExistsWithSameValue: look up the label value keyed with the key specified in the labelSelector, and match with Pods which have the same label value on the key.
  • ExistsWithDifferentValue: look up the label value keyed with the key specified in the labelSelector, and match with Pods which have the same label key, but with the different label value on the key.

However, this idea was rejected because:

  • it’s difficult to prepare all existing clients to handle new enums.
  • to handle these new enums, a labelSelector would need to know which Pod it belongs to, and changing all the code that handles labelSelector would be a major effort.

Example

A set of Pods A should not co-exist with other sets of Pods, but the Pods in set A should be co-located:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: pod-set
            operator: ExistsWithSameValue
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: pod-set
            operator: ExistsWithDifferentValue
        topologyKey: kubernetes.io/hostname

smooth rolling upgrade for PodAntiAffinity:

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pause
        topologyKey: kubernetes.io/hostname
      - labelSelector:
          matchExpressions:
          - key: pod-template-hash
            operator: ExistsWithSameValue
        topologyKey: kubernetes.io/hostname

Infrastructure Needed (Optional)