KEP-3280: Guarantee PodDisruptionBudget When Preemption Happens
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This design proposal suggests adding a field AllowDisruptionByPriorityGreaterThanOrEqual to the PriorityClass API.
The field explicitly indicates that, during scheduler preemption, the PodDisruptionBudget of the pods associated with
this priority class can only be violated by pods whose priority value is greater than or equal to the value of
AllowDisruptionByPriorityGreaterThanOrEqual. This allows cluster administrators to define PriorityClasses that restrict
PDB violations during preemption, for example to keep services highly available
(https://github.com/kubernetes/kubernetes/issues/91492#issuecomment-1029484252).
Motivation
PodDisruptionBudget (PDB) is used to limit the number of concurrent disruptions that an application experiences,
allowing for high availability. Users can set the field .spec.maxUnavailable or .spec.minAvailable to declare
the maximum unavailability or minimum availability to be maintained after eviction.
However, there is currently an issue where the kube-scheduler does not strictly guarantee PDBs during the preemption process. The scheduler supports PDBs when preempting pods, but the adherence to PDBs is best effort. The scheduler attempts to select victims whose PDBs are not violated during preemption, but if no such victims are found, preemption will still take place, resulting in the removal of lower-priority pods despite their PDBs being violated.
PodDisruptionBudgets (PDBs) are frequently used for stability and the possibility of violating a PDB during the preemption process is not acceptable for certain users. As such, it is beneficial to provide users with the option to choose if they want the PDB to be guaranteed during preemption or not.
Goals
- Provide an option for cluster administrators to configure whether the scheduler must guarantee
PodDisruptionBudgets when preemption happens.
Non-Goals
- Let the application developers influence preemption behavior directly.
Proposal
User Stories (Optional)
Story 1
A user deployed a service in a cluster and, to ensure its high availability,
created a PDB for the deployment with .spec.minAvailable set to 3. The user wanted the PDB to be
guaranteed even in case of scheduler preemption.
Story 2
A user created a TensorFlow distributed job that requires a minimum of 5 running workers.
The user created a PDB for the job with .spec.minAvailable set to 5. The user wanted the PDB to be
guaranteed even in case of scheduler preemption to ensure the stability of the whole job.
Notes/Constraints/Caveats (Optional)
Risks and Mitigations
If a user sets a PodDisruptionBudget (PDB) for some low-priority pods and their PriorityClass sets
AllowDisruptionByPriorityGreaterThanOrEqual, higher-priority pods whose priority is still less than the value of
allowDisruptionByPriorityGreaterThanOrEqual will not be able to violate the PDB and preempt these pods during scheduling.
Such pods may remain unschedulable while the low-priority pods continue to run normally.
Although this situation may arise, PriorityClasses are created and managed by cluster administrators, and application owners have no permission to change them, so administrators can configure them uniformly according to their requirements. Additionally, the implementation will add logging or event descriptions that clearly inform the user why preemption did not occur.
Design Details
In order to address the issue mentioned above, a new field AllowDisruptionByPriorityGreaterThanOrEqual will
be added to PriorityClass. Users will be able to set this field to indicate that the PodDisruptionBudget of
the pods associated with this priority class can only be violated during scheduler preemption by other pods
with a priority value greater than or equal to AllowDisruptionByPriorityGreaterThanOrEqual. At the same time, to
prevent situations where core components or necessary add-ons cannot be scheduled due to the inability to violate
PDBs, the value of AllowDisruptionByPriorityGreaterThanOrEqual cannot be greater than the priority value of
system-cluster-critical or system-node-critical.
type PriorityClass struct {
metav1.TypeMeta
metav1.ObjectMeta
Value int32
GlobalDefault bool
Description string
PreemptionPolicy *core.PreemptionPolicy
// AllowDisruptionByPriorityGreaterThanOrEqual indicates that a PodDisruptionBudget set for pods associated
// with this priority class can only be violated by pods with a priority value greater than or equal to
// AllowDisruptionByPriorityGreaterThanOrEqual during a preemption process. The value of AllowDisruptionByPriorityGreaterThanOrEqual
// cannot be greater than the priority value of system-cluster-critical or system-node-critical.
// A nil value indicates that the PodDisruptionBudget is allowed to be disrupted by any other pod with a higher priority.
// +optional
AllowDisruptionByPriorityGreaterThanOrEqual *int32
}
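The upper bound described above could be enforced at validation time. The following is only a minimal sketch: validateThreshold is a hypothetical helper that does not appear in this KEP, and it assumes that the bound is the lower of the two system priorities. The constants are the priority values of the built-in system-cluster-critical and system-node-critical classes.

```go
package main

import "fmt"

// Priority values of the built-in system priority classes.
const (
	systemClusterCriticalPriority int32 = 2000000000
	systemNodeCriticalPriority    int32 = 2000001000
)

// validateThreshold is a hypothetical helper (not part of the KEP). It accepts
// nil (field unset) and any value that does not exceed the lower of the two
// system priorities, so that system pods can always violate the PDB.
func validateThreshold(threshold *int32) error {
	if threshold == nil {
		return nil
	}
	limit := systemClusterCriticalPriority
	if systemNodeCriticalPriority < limit {
		limit = systemNodeCriticalPriority
	}
	if *threshold > limit {
		return fmt.Errorf("allowDisruptionByPriorityGreaterThanOrEqual %d exceeds the maximum of %d",
			*threshold, limit)
	}
	return nil
}

func main() {
	ok := int32(1000)
	tooHigh := int32(2000000001)
	fmt.Println(validateThreshold(nil))      // unset field is accepted
	fmt.Println(validateThreshold(&ok))      // within the bound
	fmt.Println(validateThreshold(&tooHigh)) // rejected with an error
}
```

Bounding by the lower value keeps the check correct for both system classes with a single comparison.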
The AllowDisruptionByPriorityGreaterThanOrEqual field in PodSpec will be populated during pod admission,
similarly to how the PriorityClass Value is populated. Storing the AllowDisruptionByPriorityGreaterThanOrEqual
field in the pod spec has several benefits:
- The scheduler does not need to be aware of PriorityClasses, as all relevant information is in the pod.
- Mutating PriorityClass objects does not impact existing pods.
// PodSpec is a description of a pod.
type PodSpec struct {
PriorityClassName string
Priority *int32
PreemptionPolicy *PreemptionPolicy
+ // AllowDisruptionByPriorityGreaterThanOrEqual indicates that a PodDisruptionBudget set for pods associated
+ // with this priority class can only be violated by pods with a priority value greater than or equal to
+ // AllowDisruptionByPriorityGreaterThanOrEqual during a preemption process. The value of AllowDisruptionByPriorityGreaterThanOrEqual
+ // cannot be greater than the priority value of system-cluster-critical or system-node-critical. When Priority Admission Controller is
+ // enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName.
+ // A nil value indicates that the PodDisruptionBudget is allowed to be disrupted by any other pod with a higher priority.
+ // +optional
+ AllowDisruptionByPriorityGreaterThanOrEqual *int32
}
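To illustrate how the field might be copied into the pod at admission time, here is a simplified sketch. The types priorityClass and podSpec and the function populateFromPriorityClass are hypothetical stand-ins for the real API types and the Priority admission controller. The sketch always overwrites the pod fields and does not model rejecting user-supplied values.

```go
package main

import "fmt"

// priorityClass and podSpec are simplified, hypothetical stand-ins for the
// real API types; only the fields relevant to this KEP are kept.
type priorityClass struct {
	Value                                       int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
}

type podSpec struct {
	PriorityClassName                           string
	Priority                                    *int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
}

// populateFromPriorityClass copies the priority value and the threshold from
// the named class into the pod spec. Values are copied rather than shared, so
// later changes to the class do not affect pods that were already admitted.
func populateFromPriorityClass(spec *podSpec, classes map[string]*priorityClass) error {
	pc, found := classes[spec.PriorityClassName]
	if !found {
		return fmt.Errorf("no PriorityClass named %q", spec.PriorityClassName)
	}
	value := pc.Value
	spec.Priority = &value
	spec.AllowDisruptionByPriorityGreaterThanOrEqual = nil
	if pc.AllowDisruptionByPriorityGreaterThanOrEqual != nil {
		threshold := *pc.AllowDisruptionByPriorityGreaterThanOrEqual
		spec.AllowDisruptionByPriorityGreaterThanOrEqual = &threshold
	}
	return nil
}

func main() {
	threshold := int32(1000)
	classes := map[string]*priorityClass{
		"low-priority": {Value: 100, AllowDisruptionByPriorityGreaterThanOrEqual: &threshold},
	}
	pod := podSpec{PriorityClassName: "low-priority"}
	if err := populateFromPriorityClass(&pod, classes); err != nil {
		fmt.Println("admission failed:", err)
		return
	}
	// Mutating the class afterwards leaves the admitted pod untouched.
	*classes["low-priority"].AllowDisruptionByPriorityGreaterThanOrEqual = 5000
	fmt.Println(*pod.Priority, *pod.AllowDisruptionByPriorityGreaterThanOrEqual)
}
```

Copying the value into the pod is what makes the second benefit above hold: the pod keeps the threshold it was admitted with.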
The following is an example of a PriorityClass where the user sets allowDisruptionByPriorityGreaterThanOrEqual to 1000,
indicating that the PDBs of the pods associated with this priority class can only be violated by pods with a priority
value greater than or equal to 1000 during the scheduler preemption process.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
allowDisruptionByPriorityGreaterThanOrEqual: 1000
The scheduler plugin defaultpreemption needs to check the value set in the AllowDisruptionByPriorityGreaterThanOrEqual field
when selecting victims.
- If the priority of the preemptor is greater than or equal to the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victim pod, the implementation remains consistent with the existing behavior: the scheduler tries to select victims whose PDBs are not violated by preemption, but if no such victims are found, preemption still happens and lower-priority pods are preempted via the /eviction endpoint despite their PDBs being violated. If the /eviction endpoint returns 429 Too Many Requests, the scheduler falls back to deletion as an alternative.
- If the priority of the preemptor is less than the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victim pod, the scheduler checks whether the victim's PDBs would be violated when selecting victims:
  - If preemption would violate the victim's PDBs, the victim is not selected as a candidate.
  - If preemption would not violate the victim's PDBs, the scheduler preempts the pod via the /eviction endpoint. If the endpoint responds with 200 OK, the eviction is allowed and the victim is deleted, similar to sending a DELETE request to the Pod URL. If it responds with 429 Too Many Requests, the scheduler logs an error and chooses another victim among the candidates until preemption succeeds or no candidates remain.
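The eligibility rule in the list above can be sketched as a small filter. This is an illustration under stated assumptions, not the defaultpreemption implementation: the victim type, the WouldViolatePDB flag, and the functions mayViolatePDB and eligibleVictims are hypothetical names. The sketch only decides which pods may be considered as victims. It does not model the scheduler's preference for non-violating victims or the eviction calls.

```go
package main

import "fmt"

// victim is a hypothetical, simplified view of a candidate pod.
type victim struct {
	Name                                        string
	Priority                                    int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
	WouldViolatePDB                             bool // assumed to be computed elsewhere
}

// mayViolatePDB reports whether a preemptor with the given priority is allowed
// to violate the victim's PDB. A nil threshold keeps the existing behavior.
func mayViolatePDB(preemptorPriority int32, v victim) bool {
	threshold := v.AllowDisruptionByPriorityGreaterThanOrEqual
	return threshold == nil || preemptorPriority >= *threshold
}

// eligibleVictims keeps lower-priority pods and drops those whose PDB would
// be violated by a preemptor that is below their threshold.
func eligibleVictims(preemptorPriority int32, pods []victim) []victim {
	var out []victim
	for _, p := range pods {
		if p.Priority >= preemptorPriority {
			continue // only lower-priority pods can be preempted
		}
		if p.WouldViolatePDB && !mayViolatePDB(preemptorPriority, p) {
			continue // PDB is guaranteed against this preemptor
		}
		out = append(out, p)
	}
	return out
}

func main() {
	threshold := int32(1000)
	pods := []victim{
		{Name: "guarded", Priority: 100, AllowDisruptionByPriorityGreaterThanOrEqual: &threshold, WouldViolatePDB: true},
		{Name: "unguarded", Priority: 100, WouldViolatePDB: true},
	}
	for _, preemptor := range []int32{500, 1000} {
		fmt.Printf("preemptor priority %d:", preemptor)
		for _, v := range eligibleVictims(preemptor, pods) {
			fmt.Printf(" %s", v.Name)
		}
		fmt.Println()
	}
}
```

With the low-priority class from the example above (value 100, threshold 1000), a preemptor at priority 500 cannot pick the guarded pod when its PDB would be violated, while a preemptor at priority 1000 can.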
PreemptionPolicy vs AllowDisruptionByPriorityGreaterThanOrEqual
The PreemptionPolicy is used to describe the behavior of the pods associated with the PriorityClass during preemption as the preemptor.
The AllowDisruptionByPriorityGreaterThanOrEqual is used to describe the policy of the pods associated with the PriorityClass during
preemption as the victim. During a preemption process, if the preemptor is configured with the PreemptionPolicy and the victim is
configured with the AllowDisruptionByPriorityGreaterThanOrEqual, the AllowDisruptionByPriorityGreaterThanOrEqual takes priority over
the PreemptionPolicy.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- pkg/scheduler/apis/config/v1: 2023-01-19 - 83.9%
- pkg/scheduler: 2023-01-19 - 77.1%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption: 2023-01-19 - 85.4%
Integration tests
These cases will be added to the existing integration tests:
- Feature gate enable/disable tests
- During scheduling, AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass works as expected
- Verify no significant performance degradation
k8s.io/kubernetes/test/integration/scheduler/preemption/preemption_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPreemption
k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
e2e tests
- These cases will be added to the existing e2e tests in k8s.io/kubernetes/test/e2e/scheduling/preemption.go:
  - Feature gate enable/disable tests
  - During scheduling, AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass works as expected
Graduation Criteria
Alpha
- Feature implemented behind feature gate.
- Unit and integration tests passed as designed in the Test Plan.
Beta
- Feature is enabled by default
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
GA
- No negative feedback.
- Update documents to reflect the changes.
Upgrade / Downgrade Strategy
In the event of an upgrade, kube-apiserver will start to accept and store the field AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass and Pod.
In the event of a downgrade, kube-scheduler will ignore AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass and Pod even if it was set.
Version Skew Strategy
N/A
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: DisruptionPolicyInPriorityClass
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The feature can be disabled in Alpha and Beta versions by restarting
kube-apiserver and kube-scheduler with the feature gate turned off.
One caveat is that PriorityClasses and Pods that used the feature will continue to have the
field AllowDisruptionByPriorityGreaterThanOrEqual set even after the feature gate is disabled;
however, kube-scheduler will not take the field into account.
What happens if we reenable the feature if it was previously rolled back?
- Newly created PriorityClasses and Pods will contain the field AllowDisruptionByPriorityGreaterThanOrEqual.
- The scheduler will check the value of the field AllowDisruptionByPriorityGreaterThanOrEqual in the Pod if preemption occurs during scheduling.
Are there any tests for feature enablement/disablement?
No. Unit tests that exercise switching the feature gate itself will be added.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- 2023-01-19: Initial KEP
- 2023-01-28: Move the responsibilities from AllowDisruptionByPriorityGreaterThanOrEqual to PriorityClass
Drawbacks
Alternatives
- Add PreemptLowerPriorityWithoutViolatePDB as an option in the PreemptionPolicy of the preemptor. When the preemptor is set to PreemptLowerPriorityWithoutViolatePDB, pods whose PDBs would be violated are excluded when selecting victims. However, to guarantee the PDBs of some pods, the PreemptionPolicy of all other pods must be changed to PreemptLowerPriorityWithoutViolatePDB, which may be relatively costly.
- Add a field PreemptionPolicy to the PodDisruptionBudget (PDB) API to indicate whether to guarantee the PodDisruptionBudget during scheduler preemption. Two simple policies are provided: PreferNotPreempted, which indicates that the scheduler tries to avoid violating the PDB during preemption but cannot guarantee it, and RequiredNotPreempted, which indicates that the PDB will not be violated during scheduler preemption. If the preemptor has a priority class name of system-cluster-critical or system-node-critical, it may still violate the victim's PDB. However, there is a potential conflict between the creators of PDBs and PriorityClasses, who may have different priorities and scopes (namespace scope for PDBs, cluster scope for PriorityClasses), which may result in high-priority pods being blocked from preemption in the cluster.
- Add a new field to the args of the preemption plugin to identify whether all preemptions, or those filtered by a Selector, must not violate PDBs. The setting is cluster-scoped or profile-scoped. To ensure the security and stability of the cluster, a list called PriorityClassesAllowViolatePDB can also be added to the configuration to identify that, when PreemptLowerPriorityWithoutViolatePDB is set to true, pods with these priority classes (such as system-cluster-critical, system-node-critical, or other user-created priority classes) can still preempt other pods in violation of their PDBs. However, a cluster-scoped setting is too rigid and inflexible, and schedulers need to be restarted whenever the args are updated.