KEP-3280: Guarantee PodDisruptionBudget When Preemption Happens

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposal adds a field, AllowDisruptionByPriorityGreaterThanOrEqual, to the PriorityClass API. The field explicitly indicates that, during scheduler preemption, the PodDisruptionBudget of pods belonging to this priority class can only be violated by pods whose priority value is greater than or equal to the value of AllowDisruptionByPriorityGreaterThanOrEqual. This lets cluster administrators define PriorityClasses that restrict PDB violations during preemption, which serves users who need their services to stay highly available even when preemption occurs (https://github.com/kubernetes/kubernetes/issues/91492#issuecomment-1029484252).

Motivation

PodDisruptionBudget (PDB) is used to limit the number of concurrent disruptions that an application experiences, allowing for high availability. Users can set the field .spec.maxUnavailable or .spec.minAvailable to declare the maximum unavailability or minimum availability to be maintained after eviction.

However, there is currently an issue where the kube-scheduler does not strictly guarantee PDBs during the preemption process. The scheduler supports PDBs when preempting pods, but the adherence to PDBs is best effort. The scheduler attempts to select victims whose PDBs are not violated during preemption, but if no such victims are found, preemption will still take place, resulting in the removal of lower-priority pods despite their PDBs being violated.

PodDisruptionBudgets (PDBs) are frequently used for stability and the possibility of violating a PDB during the preemption process is not acceptable for certain users. As such, it is beneficial to provide users with the option to choose if they want the PDB to be guaranteed during preemption or not.

Goals

Provide an option to the cluster administrators to configure whether the scheduler needs to make PodDisruptionBudget guaranteed when preemption happens.

Non-Goals

Let the application developers influence preemption behavior directly.

Proposal

User Stories (Optional)

Story 1

A user deploys a service in a cluster. To ensure its high availability, the user creates a PDB for the deployment and sets .spec.minAvailable to 3. The user wants the PDB to be guaranteed even when scheduler preemption happens.

Story 2

A user creates a TensorFlow distributed job that requires a minimum of 5 running workers. The user creates a PDB for the job and sets .spec.minAvailable to 5. The user wants the PDB to be guaranteed even when scheduler preemption happens, to ensure the stability of the whole job.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

If a user sets a PodDisruptionBudget (PDB) for some low-priority pods and sets AllowDisruptionByPriorityGreaterThanOrEqual in their PriorityClass, higher-priority pods whose priority is still less than the value of allowDisruptionByPriorityGreaterThanOrEqual will not be able to violate the PDB and preempt these pods during scheduling. As a result, such higher-priority pods may remain unschedulable while the low-priority pods continue to run normally.

Although this situation may arise, PriorityClasses are created and managed by cluster administrators, and application owners have no permission to modify them, so administrators can configure them uniformly according to their requirements. Additionally, the implementation will add logging or event descriptions that clearly inform the user why preemption did not occur.

Design Details

To address the issue described above, a new field AllowDisruptionByPriorityGreaterThanOrEqual will be added to PriorityClass. Users will be able to set this field to indicate that the PodDisruptionBudget of the pods associated with this priority class can only be violated during scheduler preemption by other pods with a priority value greater than or equal to AllowDisruptionByPriorityGreaterThanOrEqual. At the same time, to prevent situations where core components or necessary add-ons cannot be scheduled because they cannot violate PDBs, the value of AllowDisruptionByPriorityGreaterThanOrEqual cannot be greater than the priority value of system-cluster-critical or system-node-critical.

type PriorityClass struct {
  metav1.TypeMeta
  metav1.ObjectMeta
  Value int32
  GlobalDefault bool
  Description string
  PreemptionPolicy *core.PreemptionPolicy

  // AllowDisruptionByPriorityGreaterThanOrEqual indicates that a PodDisruptionBudget set for pods associated 
  // with this priority class can only be violated by pods with a priority value greater than or equal to 
  // AllowDisruptionByPriorityGreaterThanOrEqual during a preemption process. The value of AllowDisruptionByPriorityGreaterThanOrEqual 
  // cannot be greater than the priority value of system-cluster-critical or system-node-critical.
  // A null value indicates that the PodDisruptionBudget is allowed to be disrupted by any other pods with a higher priority.
  // +optional
  AllowDisruptionByPriorityGreaterThanOrEqual *int32
}
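The upper bound described above could be enforced during PriorityClass validation. The following is a minimal sketch, not the proposed implementation: the priorityClass type is a trimmed-down stand-in for the API type, validateDisruptionThreshold and int32Ptr are hypothetical helpers, and the sketch assumes that the lower of the two system priorities (system-cluster-critical, 2000000000) is the effective limit.

```go
package main

import (
	"errors"
	"fmt"
)

// systemClusterCriticalPriority is the priority value of the built-in
// system-cluster-critical class. system-node-critical is higher
// (2000001000), so this sketch treats the lower value as the limit.
const systemClusterCriticalPriority int32 = 2000000000

// priorityClass is a hypothetical, trimmed-down stand-in for the API type.
type priorityClass struct {
	Name                                        string
	Value                                       int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
}

var errThresholdTooHigh = errors.New("allowDisruptionByPriorityGreaterThanOrEqual exceeds system-critical priority")

// validateDisruptionThreshold rejects thresholds that would stop
// system-critical pods from violating PDBs. A nil threshold is always
// valid and keeps the existing best-effort behavior.
func validateDisruptionThreshold(pc priorityClass) error {
	threshold := pc.AllowDisruptionByPriorityGreaterThanOrEqual
	if threshold == nil {
		return nil
	}
	if *threshold > systemClusterCriticalPriority {
		return fmt.Errorf("%w: %d > %d", errThresholdTooHigh, *threshold, systemClusterCriticalPriority)
	}
	return nil
}

func int32Ptr(v int32) *int32 { return &v }

func main() {
	ok := priorityClass{Name: "low-priority", Value: 100, AllowDisruptionByPriorityGreaterThanOrEqual: int32Ptr(1000)}
	tooHigh := priorityClass{Name: "too-high", Value: 100, AllowDisruptionByPriorityGreaterThanOrEqual: int32Ptr(2000000001)}
	fmt.Println("low-priority valid:", validateDisruptionThreshold(ok) == nil)
	fmt.Println("too-high valid:", validateDisruptionThreshold(tooHigh) == nil)
}
```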

The AllowDisruptionByPriorityGreaterThanOrEqual field in PodSpec will be populated during pod admission, similarly to how the PriorityClass Value is populated. Storing the AllowDisruptionByPriorityGreaterThanOrEqual field in the pod spec has several benefits:

The scheduler does not need to be aware of PriorityClasses, as all relevant information is in the pod.
Mutating PriorityClass objects does not impact existing pods.

// PodSpec is a description of a pod.
type PodSpec struct {
  PriorityClassName string
  Priority *int32
  PreemptionPolicy *PreemptionPolicy

+ // AllowDisruptionByPriorityGreaterThanOrEqual indicates that a PodDisruptionBudget set for pods associated 
+ // with this priority class can only be violated by pods with a priority value greater than or equal to 
+ // AllowDisruptionByPriorityGreaterThanOrEqual during a preemption process. The value of AllowDisruptionByPriorityGreaterThanOrEqual 
+ // cannot be greater than the priority value of system-cluster-critical or system-node-critical. When Priority Admission Controller is 
+ // enabled, it prevents users from setting this field. The admission controller populates this field from PriorityClassName.
+ // A null value indicates that the PodDisruptionBudget is allowed to be disrupted by any other pods with a higher priority.
+ // +optional
+ AllowDisruptionByPriorityGreaterThanOrEqual *int32
}
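The admission-time population described above can be sketched roughly as follows. This is an illustration under assumptions, not the Priority admission controller's actual code: priorityClass and podSpec are trimmed-down stand-ins for the API types, and populateFromPriorityClass is a hypothetical helper. The threshold is copied by value so that later changes to the PriorityClass do not affect pods that were already admitted.

```go
package main

import (
	"errors"
	"fmt"
)

// priorityClass and podSpec are hypothetical, trimmed-down stand-ins
// for the API types shown above.
type priorityClass struct {
	Value                                       int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
}

type podSpec struct {
	PriorityClassName                           string
	Priority                                    *int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
}

// populateFromPriorityClass rejects pods that set the field themselves
// and otherwise copies Priority and the threshold from the PriorityClass.
func populateFromPriorityClass(spec *podSpec, classes map[string]priorityClass) error {
	if spec.AllowDisruptionByPriorityGreaterThanOrEqual != nil {
		return errors.New("allowDisruptionByPriorityGreaterThanOrEqual must not be set by the pod author")
	}
	pc, found := classes[spec.PriorityClassName]
	if !found {
		return fmt.Errorf("PriorityClass %q not found", spec.PriorityClassName)
	}
	value := pc.Value
	spec.Priority = &value
	if pc.AllowDisruptionByPriorityGreaterThanOrEqual != nil {
		// Copy the value, not the pointer, so the pod is decoupled
		// from future mutations of the PriorityClass.
		threshold := *pc.AllowDisruptionByPriorityGreaterThanOrEqual
		spec.AllowDisruptionByPriorityGreaterThanOrEqual = &threshold
	}
	return nil
}

func main() {
	threshold := int32(1000)
	classes := map[string]priorityClass{
		"low-priority": {Value: 100, AllowDisruptionByPriorityGreaterThanOrEqual: &threshold},
	}
	spec := podSpec{PriorityClassName: "low-priority"}
	if err := populateFromPriorityClass(&spec, classes); err != nil {
		fmt.Println("admission failed:", err)
		return
	}
	threshold = 5000 // simulate a later change to the PriorityClass
	fmt.Println(*spec.Priority, *spec.AllowDisruptionByPriorityGreaterThanOrEqual)
}
```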

The following is an example of a PriorityClass where the user sets allowDisruptionByPriorityGreaterThanOrEqual to 1000, indicating that the PDBs of pods belonging to this priority class can only be violated by pods with a priority value greater than or equal to 1000 during scheduler preemption.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
allowDisruptionByPriorityGreaterThanOrEqual: 1000

The scheduler plugin defaultpreemption needs to check the value set in the AllowDisruptionByPriorityGreaterThanOrEqual field when selecting victims.

If the priority of the preemptor is greater than or equal to the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victim pod, the existing behavior is kept: the scheduler tries to select victims whose PDBs are not violated by preemption, but if no such victims are found, preemption still happens and lower-priority pods are preempted via the /eviction endpoint despite their PDBs being violated. If the /eviction endpoint returns 429 Too Many Requests, the scheduler falls back to deletion as an alternative.
If the priority of the preemptor is less than the value of AllowDisruptionByPriorityGreaterThanOrEqual in the victim pod, the scheduler checks whether the victim's PDBs would be violated when selecting victims:
- If preempting the pod would violate its PDBs, the pod is not selected as a candidate victim.
- If preempting the pod would not violate its PDBs, the scheduler preempts it via the /eviction endpoint. A 200 OK response means the eviction is allowed and the victim is deleted, similar to sending a DELETE request to the Pod URL. On a 429 Too Many Requests response, the scheduler logs an error and chooses another victim among the candidates, until it succeeds or there are no more candidates.
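The victim-selection rules above can be condensed into a single predicate. The sketch below is illustrative only: candidate, mayPreempt, and selectCandidates are hypothetical names, WouldViolatePDB stands in for the scheduler's real PDB accounting, and the preference for non-violating victims and the /eviction retry handling are deliberately left out.

```go
package main

import "fmt"

// candidate is a hypothetical view of a potential victim pod.
type candidate struct {
	Name                                        string
	Priority                                    int32
	AllowDisruptionByPriorityGreaterThanOrEqual *int32
	// WouldViolatePDB stands in for the scheduler's PDB accounting.
	WouldViolatePDB bool
}

// mayPreempt reports whether a preemptor with the given priority may
// select c as a victim under the rules described above.
func mayPreempt(preemptorPriority int32, c candidate) bool {
	if c.Priority >= preemptorPriority {
		return false // only lower-priority pods can be victims
	}
	if !c.WouldViolatePDB {
		return true // no PDB is violated, so the threshold is irrelevant
	}
	threshold := c.AllowDisruptionByPriorityGreaterThanOrEqual
	if threshold == nil {
		return true // field unset: existing best-effort behavior
	}
	return preemptorPriority >= *threshold
}

// selectCandidates returns the names of the pods that remain eligible.
func selectCandidates(preemptorPriority int32, pods []candidate) []string {
	var names []string
	for _, p := range pods {
		if mayPreempt(preemptorPriority, p) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	threshold := int32(1000)
	pods := []candidate{
		{Name: "guarded-violating", Priority: 100, AllowDisruptionByPriorityGreaterThanOrEqual: &threshold, WouldViolatePDB: true},
		{Name: "guarded-safe", Priority: 100, AllowDisruptionByPriorityGreaterThanOrEqual: &threshold},
		{Name: "unguarded-violating", Priority: 100, WouldViolatePDB: true},
	}
	fmt.Println(selectCandidates(500, pods))  // preemptor below the threshold
	fmt.Println(selectCandidates(1000, pods)) // preemptor at the threshold
}
```

With a threshold of 1000, a preemptor of priority 500 cannot select guarded-violating, while a preemptor of priority 1000 can select all three pods.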

PreemptionPolicy vs AllowDisruptionByPriorityGreaterThanOrEqual

The PreemptionPolicy is used to describe the behavior of the pods associated with the PriorityClass during preemption as the preemptor. The AllowDisruptionByPriorityGreaterThanOrEqual is used to describe the policy of the pods associated with the PriorityClass during preemption as the victim. During a preemption process, if the preemptor is configured with the PreemptionPolicy and the victim is configured with the AllowDisruptionByPriorityGreaterThanOrEqual, the AllowDisruptionByPriorityGreaterThanOrEqual takes priority over the PreemptionPolicy.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/scheduler/apis/config/v1: 2023-01-19 - 83.9%
pkg/scheduler: 2023-01-19 - 77.1%
k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption: 2023-01-19 - 85.4%

Integration tests

These cases will be added to the existing integration tests:
- Feature gate enable/disable tests
- During scheduling, AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass works as expected
- Verify no significant performance degradation
k8s.io/kubernetes/kubernetes/test/integration/scheduler/preemption/preemption_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPreemption
k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling

e2e tests

These cases will be added to the existing e2e tests in k8s.io/kubernetes/kubernetes/test/e2e/scheduling/preemption.go:
- Feature gate enable/disable tests
- During scheduling, AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass works as expected

Graduation Criteria

Alpha

Feature implemented behind feature gate.
Unit and integration tests pass as designed in the Test Plan.

Beta

Feature is enabled by default.
Benchmark tests pass with no performance degradation.
Update documents to reflect the changes.

GA

No negative feedback.
Update documents to reflect the changes.

Upgrade / Downgrade Strategy

In the event of an upgrade, kube-apiserver will start to accept and store the field AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass and Pod. In the event of a downgrade, kube-scheduler will ignore AllowDisruptionByPriorityGreaterThanOrEqual in PriorityClass and Pod even if it was set.

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DisruptionPolicyInPriorityClass
- Components depending on the feature gate: kube-scheduler, kube-apiserver

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate turned off. One caveat is that PriorityClasses and Pods that used the feature will continue to have the field AllowDisruptionByPriorityGreaterThanOrEqual set even after the feature gate is disabled; however, kube-scheduler will not take the field into account.
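The rollback behavior amounts to the scheduler treating a stored value as unset while the gate is off. A minimal sketch, assuming a hypothetical helper effectiveThreshold and a plain boolean in place of the real feature-gate lookup:

```go
package main

import "fmt"

// effectiveThreshold returns the threshold the scheduler should honor.
// gateEnabled stands in for the DisruptionPolicyInPriorityClass gate;
// when it is off, a stored field value is ignored, not cleared.
func effectiveThreshold(gateEnabled bool, stored *int32) *int32 {
	if !gateEnabled {
		return nil
	}
	return stored
}

func main() {
	stored := int32(1000)
	fmt.Println("gate off, threshold ignored:", effectiveThreshold(false, &stored) == nil)
	fmt.Println("gate on, threshold honored:", effectiveThreshold(true, &stored) != nil)
}
```

Because the stored value is left untouched, re-enabling the gate makes the scheduler honor it again without any change to existing objects.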

What happens if we reenable the feature if it was previously rolled back?

The newly created PriorityClasses and Pods will contain the field AllowDisruptionByPriorityGreaterThanOrEqual.
The scheduler will check the value of the field AllowDisruptionByPriorityGreaterThanOrEqual in the Pod if preemption occurs during scheduling.

Are there any tests for feature enablement/disablement?

No. Unit tests that exercise switching the feature gate on and off will be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2023-01-19: Initial KEP
2023-01-28: Move the responsibilities from AllowDisruptionByPriorityGreaterThanOrEqual to PriorityClass

Drawbacks

Alternatives

Add PreemptLowerPriorityWithoutViolatePDB as an option in the PreemptionPolicy of the preemptor. When the preemptor's policy is set to PreemptLowerPriorityWithoutViolatePDB, pods whose PDBs would be violated are excluded when selecting victims. However, to guarantee the PDBs of some pods, the PreemptionPolicy of all other pods would have to be changed to PreemptLowerPriorityWithoutViolatePDB, which may be relatively costly.
Add a field PreemptionPolicy to the PodDisruptionBudget (PDB) API to indicate whether the PodDisruptionBudget is guaranteed during scheduler preemption. Two simple policies are provided: PreferNotPreempted, which indicates that the scheduler tries to avoid violating the PDB during preemption but cannot guarantee it, and RequiredNotPreempted, which indicates that the PDB will not be violated during scheduler preemption. If the preemptor has a PriorityClassName of system-cluster-critical or system-node-critical, it may still violate the victim's PDB. However, the creators of PDBs (namespace scope) and of PriorityClasses (cluster scope) may have conflicting priorities, which may result in high-priority pods being blocked from preemption in the cluster.
Add a new field to the args of the preemption plugin to indicate whether all preemptions, or only those filtered by a Selector, are forbidden from violating PDBs. The setting is cluster-scoped or profile-scoped. To ensure the security and stability of the cluster, a list called PriorityClassesAllowViolatePDB can also be added to the configuration to identify priority classes, such as system-cluster-critical, system-node-critical, or other user-created priority classes, whose pods may still preempt other pods in violation of PDBs when PreemptLowerPriorityWithoutViolatePDB is set to true. However, a cluster-scoped setting is too rigid and inflexible, and schedulers need to be restarted whenever the args are updated.
