KEP-3243: Respect PodTopologySpread after rolling upgrades
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The pod topology spread feature allows users to define the group of pods over which spreading is applied using a LabelSelector. This means the user should know the exact label key and value when defining the pod spec.
This KEP proposes a complementary field to LabelSelector named MatchLabelKeys in
TopologySpreadConstraint which represents a set of label keys only.
At pod creation, kube-apiserver will use those keys to look up label values from the incoming pod
and those key-value labels will be merged with existing LabelSelector to identify the group of existing pods over
which the spreading skew will be calculated.
Note that in case MatchLabelKeys is supported in the cluster-level default constraints
(see https://github.com/kubernetes/kubernetes/issues/129198), kube-scheduler will also handle it separately.
The main case that this new way for identifying pods will enable is constraining skew spreading calculation to happen at the revision level in Deployments during rolling upgrades.
Motivation
PodTopologySpread is widely used in production environments, especially in service type workloads which employ Deployments. However, currently it has a limitation that manifests during rolling updates which causes the deployment to end up out of balance (98215, 105661, k8s-pod-topology spread is not respected after rollout).
The root cause is that PodTopologySpread constraints allow defining a key-value label selector, which applies to all pods in a Deployment irrespective of their owning ReplicaSet. As a result, when a new revision is rolled out, spreading will apply across pods from both the old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading we are left with may not match expectations because the deleted pods from the older ReplicaSet will cause skewed distribution for the remaining pods.
Currently, users are given two solutions to this problem. The first is to add a revision label to Deployment and update it manually at each rolling upgrade (both the label on the podTemplate and the selector in the podTopologySpread constraint), while the second is to deploy a descheduler to re-balance the pod distribution. The former solution isn’t user friendly and requires manual tuning, which is error prone; while the latter requires installing and maintaining an extra controller. In this proposal, we propose a native way to maintain pod balance after a rolling upgrade in Deployments that use PodTopologySpread.
Goals
- Allow users to define PodTopologySpread constraints such that they apply only within the boundaries of a Deployment revision during rolling upgrades.
Non-Goals
Proposal
User Stories (Optional)
Story 1
When users apply a rolling update to a deployment that uses PodTopologySpread, the spread should be respected only within the new revision, not across all revisions of the deployment.
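The difference between spreading across all revisions and spreading within one revision can be illustrated with a small, self-contained sketch. Everything here is hypothetical (the `pod` type, the helper names, the node names, and the label values `web`, `old`, and `new`); it only counts matching pods per node, which is a simplification of what the scheduler does.

```go
package main

import "fmt"

// pod is a simplified stand-in for a scheduled Pod (hypothetical type).
type pod struct {
	node   string
	labels map[string]string
}

// countPerNode counts, for each node, the pods whose labels contain every
// key-value pair in selector.
func countPerNode(pods []pod, nodes []string, selector map[string]string) map[string]int {
	counts := make(map[string]int, len(nodes))
	for _, n := range nodes {
		counts[n] = 0
	}
	for _, p := range pods {
		matches := true
		for k, v := range selector {
			if p.labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			counts[p.node]++
		}
	}
	return counts
}

// skew returns the difference between the largest and smallest per-node count.
func skew(counts map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, c := range counts {
		if first {
			lo, hi, first = c, c, false
			continue
		}
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	return hi - lo
}

// midRollout builds a hypothetical cluster state during a rolling update.
func midRollout() ([]pod, []string) {
	nodes := []string{"node-a", "node-b", "node-c"}
	mk := func(node, rev string) pod {
		return pod{node: node, labels: map[string]string{"app": "web", "pod-template-hash": rev}}
	}
	pods := []pod{
		mk("node-a", "old"), mk("node-a", "old"), mk("node-a", "new"),
		mk("node-b", "old"), mk("node-b", "new"), mk("node-b", "new"),
		mk("node-c", "new"), mk("node-c", "new"), mk("node-c", "new"),
	}
	return pods, nodes
}

func main() {
	pods, nodes := midRollout()
	all := countPerNode(pods, nodes, map[string]string{"app": "web"})
	rev := countPerNode(pods, nodes, map[string]string{"app": "web", "pod-template-hash": "new"})
	fmt.Println(skew(all)) // 0: balanced when both revisions are counted
	fmt.Println(skew(rev)) // 2: what remains once the old pods are deleted
}
```

In this sketch the selector on `app` alone sees a balanced 3/3/3 layout, while the new revision by itself is spread 1/2/3, which is the imbalance left behind after the old ReplicaSet is removed.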
Notes/Constraints/Caveats (Optional)
In most scenarios, users can use the label keyed with pod-template-hash added
automatically by the Deployment controller to distinguish between different
revisions in a single Deployment. But for more complex scenarios
(e.g., topology spread associating two deployments at the same time), users are
responsible for providing common labels to identify which pods should be grouped.
Risks and Mitigations
Possible misuse
In addition to using pod-template-hash added by the Deployment controller,
users can also provide the customized key in MatchLabelKeys to identify
which pods should be grouped. If so, the user needs to ensure that it is
correct and not duplicated with other unrelated workloads.
The update to labels specified at matchLabelKeys isn’t supported
MatchLabelKeys is handled and merged into LabelSelector at a pod’s creation.
This means the feature doesn’t support label updates, even though a user
could update a label that is specified at matchLabelKeys after a pod’s creation.
In such cases, the update of the label isn’t reflected onto the merged LabelSelector,
even though users might expect it to be.
In the documentation, we’ll declare it’s not recommended to use matchLabelKeys with labels that might be updated.
Also, we assume the risk is acceptably low because:
- It’s a fairly low probability to happen because pods are usually managed by another resource (e.g., deployment),
and the update to pod template’s labels on a deployment recreates pods, instead of directly updating the labels on existing pods.
Also, even if users somehow use bare pods (which is not recommended in the first place),
there’s usually only a tiny moment between the pod creation and the pod getting scheduled, which makes this risk even rarer,
unless many pods are often getting stuck being unschedulable for a long time in the cluster (which is not recommended)
or the labels specified at matchLabelKeys are frequently updated (which we’ll declare as not recommended).
- If it happens, selfMatchNum will be 0 and both matchNum and minMatchNum will be retained. Consequently, depending on the current number of matching pods in the domain, matchNum - minMatchNum might be bigger than maxSkew, and the pod(s) could be unschedulable. But, it does not mean that the unfortunate pods would be unschedulable forever.
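The arithmetic in the last bullet can be sketched as follows. The helper `violatesMaxSkew` is hypothetical, and the formula `matchNum + selfMatchNum - minMatchNum > maxSkew` is an assumption based on the quantities named above rather than a copy of the plugin code.

```go
package main

import "fmt"

// violatesMaxSkew reports whether placing the incoming pod in a domain would
// exceed maxSkew. Hypothetical helper; the formula is an assumption built
// from the quantities named in the text.
func violatesMaxSkew(matchNum, minMatchNum, selfMatchNum, maxSkew int) bool {
	skew := matchNum + selfMatchNum - minMatchNum
	return skew > maxSkew
}

func main() {
	// Labels unchanged: the pod matches its own selector (selfMatchNum = 1).
	fmt.Println(violatesMaxSkew(1, 1, 1, 1)) // false: skew is 1

	// Label updated after creation: selfMatchNum = 0, and the domain already
	// holds 3 matching pods while the emptiest domain holds 1.
	fmt.Println(violatesMaxSkew(3, 1, 0, 1)) // true: skew is 2

	// Once the emptiest domain catches up, the same pod becomes schedulable.
	fmt.Println(violatesMaxSkew(3, 2, 0, 1)) // false: skew is 1
}
```

The third call shows why such pods are not stuck forever: the result depends on the current counts in each domain, which change as other pods come and go.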
Design Details
A new optional field named MatchLabelKeys will be introduced to TopologySpreadConstraint.
Currently, when scheduling a pod, the LabelSelector defined in the pod is used
to identify the group of pods over which spreading will be calculated.
MatchLabelKeys adds another constraint to how this group of pods is identified.
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable UnsatisfiableConstraintAction
	LabelSelector     *metav1.LabelSelector
	// MatchLabelKeys is a set of pod label keys to select the pods over which
	// spreading will be calculated. The keys are used to lookup values from the
	// incoming pod labels, those key-value labels are ANDed with `LabelSelector`
	// to select the group of existing pods over which spreading will be calculated
	// for the incoming pod. Keys that don't exist in the incoming pod labels will
	// be ignored.
	MatchLabelKeys []string
}
When a Pod is created, kube-apiserver will obtain the labels from the pod
by the keys in matchLabelKeys and the key-value labels are merged to LabelSelector
of TopologySpreadConstraint.
For example, when this sample Pod is created,
apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys: # ADDED
    - app
kube-apiserver modifies the labelSelector like the following:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
+     matchExpressions:
+     - key: app
+       operator: In
+       values:
+       - sample
    matchLabelKeys:
    - app
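A minimal sketch of the merge shown above, assuming simplified local stand-ins for the `metav1.LabelSelector` types so the example has no external dependencies. The function name `mergeMatchLabelKeys` is hypothetical and is not taken from the kube-apiserver code.

```go
package main

import "fmt"

// Requirement and LabelSelector are simplified local stand-ins for
// metav1.LabelSelectorRequirement and metav1.LabelSelector.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

type LabelSelector struct {
	MatchLabels      map[string]string
	MatchExpressions []Requirement
}

// mergeMatchLabelKeys returns a copy of sel extended with one "In"
// requirement per key in matchLabelKeys that exists in podLabels.
// Keys missing from podLabels are ignored, as the field comment describes.
func mergeMatchLabelKeys(sel LabelSelector, matchLabelKeys []string, podLabels map[string]string) LabelSelector {
	out := LabelSelector{
		MatchLabels:      sel.MatchLabels,
		MatchExpressions: append([]Requirement(nil), sel.MatchExpressions...),
	}
	for _, key := range matchLabelKeys {
		value, ok := podLabels[key]
		if !ok {
			continue
		}
		out.MatchExpressions = append(out.MatchExpressions, Requirement{
			Key:      key,
			Operator: "In",
			Values:   []string{value},
		})
	}
	return out
}

func main() {
	podLabels := map[string]string{"app": "sample"}
	merged := mergeMatchLabelKeys(LabelSelector{}, []string{"app", "missing"}, podLabels)
	fmt.Printf("%+v\n", merged.MatchExpressions)
	// prints: [{Key:app Operator:In Values:[sample]}]
}
```

Because the added requirements are ANDed with whatever the selector already contains, an empty `labelSelector: {}` combined with `matchLabelKeys: [app]` ends up selecting only pods labelled `app=sample`.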
In addition, kube-scheduler will handle matchLabelKeys within the cluster-level default constraints
in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).
Finally, the feature will be guarded by a new feature flag MatchLabelKeysInPodTopologySpread. If the feature is
disabled, the field matchLabelKeys and the corresponding labelSelector are preserved
if they were already set in the persisted Pod object; otherwise, creation of a new Pod with the field
will be rejected by kube-apiserver.
Also, kube-scheduler will ignore matchLabelKeys in the cluster-level default constraints configuration.
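The gate behavior in kube-apiserver described above amounts to a small decision rule. The sketch below is only an illustration of that rule; `admitPod` and its parameters are hypothetical names, not part of the actual implementation.

```go
package main

import "fmt"

// admitPod sketches the rule from the text (hypothetical helper):
//   - gate enabled, or field not used: accept
//   - gate disabled and field used: accept only if the persisted object
//     already carried the field, so existing data is preserved
func admitPod(gateEnabled, incomingUsesField, persistedUsesField bool) bool {
	if gateEnabled || !incomingUsesField {
		return true
	}
	return persistedUsesField
}

func main() {
	fmt.Println(admitPod(true, true, false))  // true: feature enabled
	fmt.Println(admitPod(false, true, true))  // true: field already persisted
	fmt.Println(admitPod(false, true, false)) // false: new pod with the field
}
```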
[v1.34] design change and a safe upgrade path
Previously, kube-scheduler just internally handled matchLabelKeys before the calculation of scheduling results.
But, we changed the implementation design to the current form to make the design align with PodAffinity’s matchLabelKeys.
(See the detailed discussion in the alternative section.)
However, this implementation change could break matchLabelKeys of unscheduled pods created before the upgrade
because kube-apiserver only handles matchLabelKeys at pod creation; that is,
it doesn’t handle matchLabelKeys of existing unscheduled pods.
So, for a safe upgrade path from v1.33 to v1.34, kube-scheduler handles matchLabelKeys not only
from the default constraints, but also of all incoming pods during v1.34.
We’re going to change kube-scheduler to handle only matchLabelKeys from the default constraints at v1.35 for efficiency,
assuming kube-apiserver handles matchLabelKeys of all incoming pods.
Also, in case of bugs in this new design, users can disable this feature through a new feature flag,
MatchLabelKeysInPodTopologySpreadSelectorMerge (enabled by default).
(See more details in Feature Enablement and Rollback.)
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 87.5%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 84.8%
- k8s.io/kubernetes/pkg/registry/core/pod/strategy.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 65%
Integration tests
These cases will be added in the existing integration tests:
- Feature gate enable/disable tests
- MatchLabelKeys in TopologySpreadConstraint works as expected
- Verify no significant performance degradation
- k8s.io/kubernetes/test/integration/scheduler/filters/filters_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadFilter
- k8s.io/kubernetes/test/integration/scheduler/scoring/priorities_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadScoring
- k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
e2e tests
These cases will be added in the existing e2e tests:
- Feature gate enable/disable tests
- MatchLabelKeys in TopologySpreadConstraint works as expected
- k8s.io/kubernetes/test/e2e/scheduling/predicates.go: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- k8s.io/kubernetes/test/e2e/scheduling/priorities.go: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
Graduation Criteria
Alpha
- Feature implemented behind feature gate.
- Unit and integration tests passed as designed in Test Plan.
Beta
- Feature is enabled by default
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
GA
- No negative feedback.
- Update documents to reflect the changes.
Upgrade / Downgrade Strategy
In the event of an upgrade, kube-apiserver will start to accept and store the field MatchLabelKeys.
In the event of a downgrade, kube-apiserver will reject pod creation with matchLabelKeys in TopologySpreadConstraint.
But, regarding existing pods, we leave matchLabelKeys and the generated LabelSelector in place even after the downgrade.
kube-scheduler will ignore MatchLabelKeys if it was set in the cluster-level default constraints configuration.
Version Skew Strategy
There’s no version skew issue.
We changed the implementation design between v1.33 and v1.34, but we designed the change not to involve any version skew issue, as described at [v1.34] design change and a safe upgrade path.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
- MatchLabelKeysInPodTopologySpread feature flag enables the MatchLabelKeys feature in TopologySpreadConstraint.
- MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag enables the new design described at [v1.34] design change and a safe upgrade path.
  - If MatchLabelKeysInPodTopologySpreadSelectorMerge is disabled while MatchLabelKeysInPodTopologySpread is enabled, Kubernetes handles MatchLabelKeys with the classic design, in which kube-scheduler handles it. However, that’s basically not recommended unless you encounter a bug in the new design behavior.
  - This flag cannot be enabled on its own, and has to be enabled together with MatchLabelKeysInPodTopologySpread. Enabling MatchLabelKeysInPodTopologySpreadSelectorMerge alone has no effect, and matchLabelKeys will be ignored.
The MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag was added in v1.34 and is enabled by default.
This flag can be disabled to revert the implementation design change in v1.34
and go back to the previous behavior in case of a bug.
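The interaction between the two feature gates can be summarized as a small decision function. This is a sketch of the rules stated above; the `handler` type, its constant names, and `matchLabelKeysHandler` are hypothetical, and the extra v1.34 compatibility handling in kube-scheduler is deliberately left out.

```go
package main

import "fmt"

// handler names which component processes matchLabelKeys on pods.
type handler string

const (
	notHandled       handler = "ignored"
	apiServerMerge   handler = "kube-apiserver merges into labelSelector at pod creation"
	schedulerClassic handler = "kube-scheduler merges internally (classic design)"
)

// matchLabelKeysHandler maps the two gate values to the component that
// handles matchLabelKeys. Hypothetical helper for illustration.
func matchLabelKeysHandler(matchLabelKeysGate, selectorMergeGate bool) handler {
	switch {
	case !matchLabelKeysGate:
		// SelectorMerge alone has no effect.
		return notHandled
	case selectorMergeGate:
		return apiServerMerge
	default:
		return schedulerClassic
	}
}

func main() {
	for _, gates := range [][2]bool{{true, true}, {true, false}, {false, true}, {false, false}} {
		fmt.Printf("MatchLabelKeysInPodTopologySpread=%v SelectorMerge=%v -> %s\n",
			gates[0], gates[1], matchLabelKeysHandler(gates[0], gates[1]))
	}
}
```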
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MatchLabelKeysInPodTopologySpread
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MatchLabelKeysInPodTopologySpreadSelectorMerge
  - Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with feature-gate off. One caveat is that pods that used the feature will continue to have the MatchLabelKeys field set and the corresponding LabelSelector even after disabling the feature gate. In terms of Stable versions, users can choose to opt-out by not setting the matchLabelKeys field.
What happens if we reenable the feature if it was previously rolled back?
Newly created pods need to follow this policy when scheduling. Old pods will not be affected.
Are there any tests for feature enablement/disablement?
No. The unit tests that are exercising the switch of feature gate itself will be added.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
It won’t impact already running workloads because it is an opt-in feature in kube-apiserver and kube-scheduler. But during a rolling upgrade, if some apiservers have not enabled the feature, they will not be able to accept and store the field “MatchLabelKeys” and the pods associated with these apiservers will not be able to use this feature. As a result, pods belonging to the same deployment may have different scheduling outcomes.
What specific metrics should inform a rollback?
- If the metric schedule_attempts_total{result="error|unschedulable"} increased significantly after pods using this feature are added.
- If the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} increased to higher than 100ms at the 90th percentile after pods using this feature are added.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Yes, it was tested manually by following the steps below, and it was working as intended.
- create a kubernetes cluster v1.26 with 3 nodes where MatchLabelKeysInPodTopologySpread feature is disabled.
- deploy a deployment with this yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 12
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            foo: bar
        matchLabelKeys:
        - pod-template-hash
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 5/4/3
- delete deployment nginx
- upgrade kubernetes cluster to v1.27 (at master branch) while MatchLabelKeysInPodTopologySpread is enabled.
- deploy a deployment nginx like step2
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 4/4/4
- delete deployment nginx
- downgrade kubernetes cluster to v1.26 where MatchLabelKeysInPodTopologySpread feature is enabled.
- deploy a deployment nginx like step2
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 4/4/4
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
An operator can query pods that have the pod.spec.topologySpreadConstraints.matchLabelKeys field set to determine if the feature is in use by workloads.
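One way to run such a query offline is to scan a JSON pod list, for example the output of `kubectl get pods --all-namespaces -o json`. The sketch below assumes that input shape and decodes only the fields it needs; the type `podList` and the function `podsUsingMatchLabelKeys` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList decodes only the parts of a Pod list needed for this check.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			TopologySpreadConstraints []struct {
				MatchLabelKeys []string `json:"matchLabelKeys"`
			} `json:"topologySpreadConstraints"`
		} `json:"spec"`
	} `json:"items"`
}

// podsUsingMatchLabelKeys returns "namespace/name" for every pod that has at
// least one topology spread constraint with a non-empty matchLabelKeys.
func podsUsingMatchLabelKeys(data []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		for _, c := range item.Spec.TopologySpreadConstraints {
			if len(c.MatchLabelKeys) > 0 {
				names = append(names, item.Metadata.Namespace+"/"+item.Metadata.Name)
				break
			}
		}
	}
	return names, nil
}

func main() {
	// Hand-written sample input standing in for real cluster output.
	sample := []byte(`{"items":[
		{"metadata":{"namespace":"default","name":"a"},
		 "spec":{"topologySpreadConstraints":[{"matchLabelKeys":["pod-template-hash"]}]}},
		{"metadata":{"namespace":"default","name":"b"},"spec":{}}
	]}`)
	names, err := podsUsingMatchLabelKeys(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [default/a]
}
```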
How can someone using this feature know that it is working for their instance?
- Other (treat as last resort)
  - Details: We can determine if this feature is being used by checking pods that have only MatchLabelKeys set in TopologySpreadConstraint.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Component exposing the metric: kube-scheduler
    - Metric name: plugin_execution_duration_seconds{plugin="PodTopologySpread"}
    - Metric name: schedule_attempts_total{result="error|unschedulable"}
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes, there were, and they have been implemented in #115082 and #118025.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes. There is additional work:
kube-apiserver uses the keys in matchLabelKeys to look up label values from the pod,
and changes LabelSelector according to them.
kube-scheduler also handles matchLabelKeys if the cluster-level default constraints have it.
The impact on the latency of pod creation requests in kube-apiserver and on the scheduling latency
should be negligible.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server and/or etcd is not available, this feature will not be available. This is because the kube-scheduler needs to update the scheduling results to the pod via the API server/etcd.
What are other known failure modes?
N/A
What steps should be taken if SLOs are not being met to determine the problem?
- Check the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} to determine if the latency increased. If it increased, this feature may have increased scheduling latency. You can disable the feature MatchLabelKeysInPodTopologySpread to see if it’s the cause of the increased latency.
- Check the metric schedule_attempts_total{result="error|unschedulable"} to determine if the number of attempts increased. If it increased, you need to determine the cause of the failure from the events of the pod. If it’s caused by the plugin PodTopologySpread, you can further analyze this problem by looking at the kube-scheduler log.
Implementation History
- 2022-03-17: Initial KEP
- 2022-06-08: KEP merged
- 2023-01-16: Graduate to Beta
- 2025-01-23: Change the implementation design to be aligned with PodAffinity’s matchLabelKeys
- 2025-04-07: Add a new feature flag MatchLabelKeysInPodTopologySpreadSelectorMerge and update milestone
Drawbacks
Alternatives
use pod generateName
Use pod.generateName to distinguish new/old pods that belong to the
revisions of the same workload in the scheduler plugin. We decided not to
support it for the following reason: the scheduler needs to remain universal,
and a scheduler plugin shouldn’t give special treatment to any labels/fields.
implement MatchLabelKeys in only either the scheduler plugin or kube-apiserver
Technically, we can implement this feature within the PodTopologySpread plugin only;
merging the key-value labels corresponding to MatchLabelKeys into LabelSelector internally
within the plugin before calculating the scheduling results.
This is the actual implementation up to 1.33.
But, it may confuse users because this behavior would be different from PodAffinity’s MatchLabelKeys.
Also, we cannot implement this feature only within kube-apiserver because it’d make it
impossible to handle MatchLabelKeys within the cluster-level default constraints
in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).
So we decided to go with the design that implements this feature within both
the PodTopologySpread plugin and kube-apiserver.
Although the final design has a downside requiring us to maintain two implementations handling MatchLabelKeys,
each implementation is simple and we regard the risk of increased maintenance overhead as fairly low.