KEP-3094: Take taints/tolerations into consideration when calculating PodTopologySpread skew

Implementation History
STABLE Implemented
Created 2021-12-30
Latest v1.33
Milestones
Alpha v1.25
Beta v1.26
Stable v1.33
Ownership
Owning SIG
SIG Scheduling
Primary Authors

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP introduces an option for end-users to specify whether to take taints/tolerations into consideration when calculating pod topology spread skew.

Motivation

Currently, when calculating pod topology spread skew, tainted nodes are treated the same as regular nodes. This can leave pods unexpectedly Pending, because the skew constraint can only be satisfied by placing pods on the tainted nodes, which the pods do not tolerate. (See issue ).

Besides, since some node inclusion policies (nodeAffinity/nodeSelector) are already plumbed into PodTopologySpread implicitly, we’d like to take this opportunity to introduce a new API that represents these semantics explicitly.

Goals

  • Introduce two new fields to define all node inclusion policies explicitly
  • Provide an option for end-users to specify whether to respect taints or not

Non-Goals

  • Support customized taints in array

Proposal

Introduce two new fields to TopologySpreadConstraint to define all node inclusion policies including nodeAffinity and nodeTaint.

User Stories (Optional)

Story 1

When calculating pod topology spread skew, I want to exclude nodes whose taints my pod does not tolerate, so that pods do not fall into an unexpected Pending state.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

  • Checking nodeAffinity and nodeTaints at the same time may cause performance problems; we need to verify this by adding performance tests. If a performance problem does exist, we’d like to add a toleration parser and cache the parsed object during PreFilter.

Design Details

Two new fields named NodeAffinityPolicy and NodeTaintsPolicy will be introduced to TopologySpreadConstraint:

type TopologySpreadConstraint struct {
	// NodeAffinityPolicy indicates how we will treat Pod's nodeAffinity/nodeSelector
	// when calculating pod topology spread skew. Options are:
	// - Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations.
	// - Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.
	//
	// If this value is nil, the behavior is equivalent to the Honor policy.
	// This is an alpha-level feature enabled by the NodeInclusionPolicyInPodTopologySpread feature flag.
	// +optional
	NodeAffinityPolicy *NodeInclusionPolicy
	// NodeTaintsPolicy indicates how we will treat node taints when calculating
	// pod topology spread skew. Options are:
	// - Honor: nodes without taints, along with tainted nodes for which the incoming pod
	// has a toleration, are included.
	// - Ignore: node taints are ignored. All nodes are included.
	//
	// If this value is nil, the behavior is equivalent to the Ignore policy.
	// This is an alpha-level feature enabled by the NodeInclusionPolicyInPodTopologySpread feature flag.
	// +optional
	NodeTaintsPolicy *NodeInclusionPolicy
}

We will define two NodeInclusionPolicy values:

// NodeInclusionPolicy defines the type of node inclusion policy
type NodeInclusionPolicy string

const (
	// NodeInclusionPolicyIgnore means ignore this scheduling policy when calculating pod topology spread skew.
	NodeInclusionPolicyIgnore NodeInclusionPolicy = "Ignore"
	// NodeInclusionPolicyHonor means use this scheduling policy when calculating pod topology spread skew.
	NodeInclusionPolicyHonor NodeInclusionPolicy = "Honor"
)

We will check these policies at the PreFilter and PreScore extension points of the PodTopologySpread plugin. Some refactoring is also needed, but the default behavior will not change.
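
As a rough illustration of how the two policies could gate which nodes count toward the skew, here is a minimal, self-contained sketch. The `Taint`, `Node`, `Pod` types and the `includeNode`, `matchesSelector`, `toleratesAll` helpers are hypothetical simplifications, not the scheduler's actual types or functions; in particular, real toleration matching also considers operators and effects, and real node affinity is richer than a plain nodeSelector.

```go
package main

import "fmt"

// NodeInclusionPolicy mirrors the type proposed above.
type NodeInclusionPolicy string

const (
	NodeInclusionPolicyIgnore NodeInclusionPolicy = "Ignore"
	NodeInclusionPolicyHonor  NodeInclusionPolicy = "Honor"
)

// Taint, Node and Pod are hypothetical, heavily simplified stand-ins
// for the real API types.
type Taint struct{ Key, Value string }

type Node struct {
	Name   string
	Labels map[string]string
	Taints []Taint
}

type Pod struct {
	NodeSelector map[string]string
	Tolerations  []Taint // simplification: a toleration matches a taint exactly
}

// matchesSelector reports whether the node carries every label in the pod's nodeSelector.
func matchesSelector(p Pod, n Node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// toleratesAll reports whether the pod tolerates every taint on the node.
func toleratesAll(p Pod, n Node) bool {
	for _, t := range n.Taints {
		tolerated := false
		for _, tol := range p.Tolerations {
			if tol == t {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// includeNode decides whether a node takes part in the skew calculation.
func includeNode(p Pod, n Node, affinity, taints NodeInclusionPolicy) bool {
	if affinity == NodeInclusionPolicyHonor && !matchesSelector(p, n) {
		return false
	}
	if taints == NodeInclusionPolicyHonor && !toleratesAll(p, n) {
		return false
	}
	return true
}

func main() {
	tainted := Node{Name: "node1", Taints: []Taint{{Key: "foo", Value: "bar"}}}
	pod := Pod{} // no nodeSelector, no tolerations
	fmt.Println(includeNode(pod, tainted, NodeInclusionPolicyHonor, NodeInclusionPolicyIgnore)) // true
	fmt.Println(includeNode(pod, tainted, NodeInclusionPolicyHonor, NodeInclusionPolicyHonor))  // false
}
```

With the taints policy set to Ignore, the tainted node still counts toward the skew even though the pod cannot land on it; with Honor, it drops out of the calculation.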

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

No.

Unit tests
  • pkg/api/pod: 2025-02-10 - 79.1%
  • pkg/apis/core/validation: 2025-02-10 - 84.4%
  • pkg/scheduler: 2025-02-10 - 80.7%
  • pkg/scheduler/framework/plugins/defaultpreemption: 2025-02-10 - 80.4%
  • pkg/scheduler/framework/plugins/podtopologyspread: 2025-02-10 - 87.5%
Integration tests
  • tests with policies honored/ignored in filtering
  • tests with policies honored/ignored in scoring
e2e tests

None. This KEP does not introduce any new API endpoints and the feature only affects kube-scheduler, so integration tests are sufficient to verify the scheduling results.

Graduation Criteria

Alpha

  • Feature implemented behind feature gate.
  • Unit and integration tests passed as designed in the Test Plan.

Beta

  • Feature is enabled by default
  • Benchmark tests passed, and there is no performance problem.
  • Gather feedback from developers.

GA

  • No negative feedback.

Upgrade / Downgrade Strategy

  • Upgrade
    • While the feature gate is enabled, end-users are allowed to use NodeAffinityPolicy and NodeTaintsPolicy.
    • While the feature gate is enabled and these two fields are not set, the default policies apply, which maintains the previous behavior.
  • Downgrade
    • Previously configured values will be ignored.
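
The upgrade behavior above relies on unset fields falling back to the defaults named in Design Details (Honor for NodeAffinityPolicy, Ignore for NodeTaintsPolicy). A minimal sketch of that fallback follows; `effectivePolicy` is a hypothetical helper name used only for illustration, not a function from the scheduler.

```go
package main

import "fmt"

// NodeInclusionPolicy mirrors the type proposed in Design Details.
type NodeInclusionPolicy string

const (
	NodeInclusionPolicyIgnore NodeInclusionPolicy = "Ignore"
	NodeInclusionPolicyHonor  NodeInclusionPolicy = "Honor"
)

// effectivePolicy is a hypothetical helper: it returns the policy a
// constraint asks for, or the given fallback when the field is unset (nil).
func effectivePolicy(p *NodeInclusionPolicy, fallback NodeInclusionPolicy) NodeInclusionPolicy {
	if p == nil {
		return fallback
	}
	return *p
}

func main() {
	var unset *NodeInclusionPolicy
	honor := NodeInclusionPolicyHonor

	// Unset fields keep the pre-existing behavior.
	fmt.Println(effectivePolicy(unset, NodeInclusionPolicyHonor))  // NodeAffinityPolicy falls back to Honor
	fmt.Println(effectivePolicy(unset, NodeInclusionPolicyIgnore)) // NodeTaintsPolicy falls back to Ignore

	// An explicit value overrides the fallback.
	fmt.Println(effectivePolicy(&honor, NodeInclusionPolicyIgnore)) // Honor
}
```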

Version Skew Strategy

kube-scheduler generally runs the same version as kube-apiserver, so no version skew strategy is needed.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: NodeInclusionPolicyInPodTopologySpread
    • Components depending on the feature gate: kube-scheduler, kube-apiserver
Does enabling the feature change any default behavior?

No, it’s backwards compatible.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in the Alpha and Beta stages, but once it is GA there is no way to disable it. You can still opt out by leaving the two fields unset, in which case the default behavior applies.

What happens if we reenable the feature if it was previously rolled back?

The policies are respected again.

Are there any tests for feature enablement/disablement?

We have tests here:

  • pkg/registry/core/pod/strategy_test.go#TestNodeInclusionPolicyEnablementInCreating
  • pkg/registry/core/pod/strategy_test.go#TestNodeInclusionPolicyEnablementInUpdating

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It’s an opt-in feature for end-users and will maintain current behaviors if not set, so it will not impact the running workloads.

What specific metrics should inform a rollback?
  • A spike in the metric schedule_attempts_total{result="error|unschedulable"} when pods using this feature are added.
  • The metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} exceeding 100ms at the 90th percentile.
  • A spike in failure events with the keyword “failed spreadConstraint” in the scheduler log.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, it was tested manually prior to the upgrade by following the steps below, and it behaved as expected.

  1. Install a Kubernetes v1.24 cluster with two worker nodes via an installation tool like Kind.

  2. Let’s name these nodes as node1 and node2, both labelled with key kubernetes.io/hostname.

  3. Add a taint to node1 like foo=bar:NoSchedule

  4. Apply a deployment like:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      replicas: 2
      selector:
        matchLabels:
          foo: bar
      template:
        metadata:
          labels:
            foo: bar
        spec:
          restartPolicy: Always
          containers:
          - name: nginx
            image: nginx:1.14.2
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  foo: bar
    
  5. We’ll see one pod pending.

  6. Delete the deployment via kubectl delete -f.

  7. Configure the api-server with feature-gate NodeInclusionPolicyInPodTopologySpread enabled.

  8. Redeploy the deployment with NodeTaintsPolicy honored.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
    spec:
      replicas: 2
      selector:
        matchLabels:
          foo: bar
      template:
        metadata:
          labels:
            foo: bar
        spec:
          restartPolicy: Always
          containers:
          - name: nginx
            image: nginx:1.14.2
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              nodeTaintsPolicy: Honor
              labelSelector:
                matchLabels:
                  foo: bar
    
  9. All pods will be allocated successfully.

  10. Delete the deployment.

  11. Disable the feature gate with api-server restarted.

  12. Apply the deployment for the third time; we’ll see one pod pending again.
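
The outcomes in steps 5 and 9 can be reproduced with a small simulation. This is a sketch under simplifying assumptions, not the scheduler's implementation: the `node` type and `schedule` function are hypothetical, the only topology key is the hostname, and a pod fits on a node when `count(node) + 1 - min(count over included nodes) <= maxSkew`.

```go
package main

import "fmt"

// node is a hypothetical, simplified node: a hostname plus a flag saying
// whether it carries a taint that the pod does not tolerate.
type node struct {
	name             string
	untoleratedTaint bool
}

// schedule places `replicas` pods one by one under a hostname spread
// constraint with the given maxSkew and returns how many were scheduled.
// honorTaints selects between nodeTaintsPolicy Honor (true) and Ignore (false).
func schedule(nodes []node, replicas, maxSkew int, honorTaints bool) int {
	counts := map[string]int{} // matching pods per hostname
	scheduled := 0
	for i := 0; i < replicas; i++ {
		// Minimum pod count over the nodes included in the skew calculation.
		minCount, found := 0, false
		for _, n := range nodes {
			if honorTaints && n.untoleratedTaint {
				continue // excluded from the skew calculation
			}
			if c := counts[n.name]; !found || c < minCount {
				minCount, found = c, true
			}
		}
		// Try to place the pod; the taint itself always blocks placement.
		for _, n := range nodes {
			if n.untoleratedTaint {
				continue
			}
			if counts[n.name]+1-minCount <= maxSkew {
				counts[n.name]++
				scheduled++
				break
			}
		}
		// If no node fits, the pod stays Pending and is not counted.
	}
	return scheduled
}

func main() {
	nodes := []node{
		{name: "node1", untoleratedTaint: true}, // foo=bar:NoSchedule
		{name: "node2"},
	}
	fmt.Println(schedule(nodes, 2, 1, false)) // 1: one pod stays Pending (steps 5 and 12)
	fmt.Println(schedule(nodes, 2, 1, true))  // 2: both pods land on node2 (step 9)
}
```

With the taints policy ignored, the empty tainted node keeps the minimum count at 0, so a second pod on node2 would push the skew to 2. With the policy honored, node1 is left out of the calculation and the skew stays within maxSkew.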

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Operators can query pod.spec.topologySpreadConstraints[].nodeAffinityPolicy and pod.spec.topologySpreadConstraints[].nodeTaintsPolicy to identify whether they are set to non-default values.
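
One way to run such a query offline is to scan a JSON pod list (for example, the output of `kubectl get pods -o json`) for constraints that set either field to a non-default value. The sketch below uses only the standard library; the `podList` struct and `podsUsingPolicies` function are hypothetical names, and the struct keeps only the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList is a hypothetical, minimal view of a JSON pod list.
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			TopologySpreadConstraints []struct {
				NodeAffinityPolicy *string `json:"nodeAffinityPolicy"`
				NodeTaintsPolicy   *string `json:"nodeTaintsPolicy"`
			} `json:"topologySpreadConstraints"`
		} `json:"spec"`
	} `json:"items"`
}

// podsUsingPolicies returns the names of pods with at least one constraint
// that sets a non-default policy: nodeAffinityPolicy=Ignore or nodeTaintsPolicy=Honor.
func podsUsingPolicies(data []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, p := range list.Items {
		for _, c := range p.Spec.TopologySpreadConstraints {
			affinityIgnored := c.NodeAffinityPolicy != nil && *c.NodeAffinityPolicy == "Ignore"
			taintsHonored := c.NodeTaintsPolicy != nil && *c.NodeTaintsPolicy == "Honor"
			if affinityIgnored || taintsHonored {
				names = append(names, p.Metadata.Name)
				break // one match per pod is enough
			}
		}
	}
	return names, nil
}

func main() {
	data := []byte(`{"items":[
	  {"metadata":{"name":"a"},"spec":{"topologySpreadConstraints":[{"nodeTaintsPolicy":"Honor"}]}},
	  {"metadata":{"name":"b"},"spec":{"topologySpreadConstraints":[{"maxSkew":1}]}}
	]}`)
	names, err := podsUsingPolicies(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [a]
}
```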

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: We can only observe the behaviors based on pod scheduling results.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
  • Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms at the 90th percentile.
  • Frequency of critical error keywords <= 2 times per minute.
  • Frequency of regular scheduling failures < 10 times per minute.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"}, which indicates the scheduling latency for a pod using this feature.
  • Frequency of critical error keywords in scheduler log:
    • “PreFilterPodTopologySpread”
    • “convert to podtopologyspread.preFilterState error”
    • “hard topology spread constraints”
    • “internal error: get paths from key”
  • Frequency of regular scheduling failures (with keyword “failed spreadConstraint”) in scheduler log.
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No

Scalability

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

No

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature only takes effect during pod scheduling; if the API server or etcd is down, pods cannot be scheduled at all.

What are other known failure modes?

Configuration errors are logged to stderr.

What steps should be taken if SLOs are not being met to determine the problem?

If we see obvious performance degradation or a rising error rate with this feature gate enabled, we should disable it as soon as possible and restart the apiserver. If only a few workloads use the feature, we can instead disable the policies in their PodTopologySpread constraints one by one as an emergency measure.

Implementation History

  • 2021.01.12: KEP proposed for review, including motivation, proposal, risks, test plan and graduation criteria.
  • 2022.09.22: Graduate to Beta in v1.26.
  • 2025.02.14: Graduate to GA in v1.33.

Drawbacks

None. It’s a backward-compatible feature; users who don’t want it need not configure anything.

Alternatives

  • The community discussed changing the current behavior implicitly, but since that would introduce a breaking user-facing change, we decided, for backwards compatibility, to add the feature as a switch for end-users.
  • We also discussed whether to support specific taints, but since there is no strong demand from end-users, we will defer this until it is needed.

Infrastructure Needed (Optional)

No