KEP-5142: Pop pod from backoffQ when activeQ is empty

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
- Low priority pod could be chosen to pop, even if high priority pod has a slightly later backoff expiration
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Move pods in flushBackoffQCompleted when activeQ is empty

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes improving scheduling queue behavior by popping pods from the backoffQ when the activeQ is empty. This would allow to process potentially schedulable pods ASAP, eliminating a penalty effect of the backoff queue.

Motivation

There are three queues in the scheduler:

activeQ contains pods ready for scheduling,
unschedulableQ holds pods that were unschedulable in their scheduling cycle and are waiting for cluster state to change. These pods are then moved to backoffQ to apply a penalty.
backoffQ stores pods that failed scheduling attempts (either due to being unschedulable or errors) and could be schedulable again, but applying a backoff penalty, scaled with the number of attempts.

When the activeQ is not empty, the scheduler pops the highest priority pod from the activeQ. However, when the activeQ is empty, the kube-scheduler idles, even if pods are in the backoffQ waiting for their backoff period to expire. To avoid delaying assessment of potentially schedulable pods, kube-scheduler should consider those pods for scheduling, even if the backoff time hasn’t expired yet. However, pods that are in the backoffQ due to errors, should not bypass the backoff time, since it plays also rate limiting role, avoiding system overload due to too frequent retries.

Goals

Improve scheduling throughput and kube-scheduler utilization when the activeQ is empty, but pods are waiting in the backoffQ.
Run PreEnqueue plugins when putting a pod into the backoffQ.

Non-Goals

Refactor the scheduling queue by changing backoff logic or merging the activeQ with the backoffQ.

Proposal

At the beginning of the scheduling cycle, a pod is popped from the activeQ. Currently, when activeQ is empty, it waits until some pod is placed into the queue. This KEP proposes to pop the pod from the backoffQ when the activeQ is empty, however the current mechanism of moving pods from the backoffQ to activeQ (aka flushing) will still work as before to avoid the problem of pods starvation, which was the original reason of introducing the backoffQ.

To ensure the PreEnqueue is called for each pod taken into the scheduling cycle, PreEnqueue plugins would be called before putting pods into the backoffQ. It won’t be done again when moving pods from the backoffQ to the activeQ.

Risks and Mitigations

A tiny delay on the first scheduling attempts for newly created pods

While the scheduler handles a pod directly popping from the backoffQ, another pod that should be scheduled before the pod being scheduled now, may appear in the activeQ. However, in the real world, if the scheduling latency is short enough, there won’t be a visible downgrade in throughput. This will only happen if there are no pods in the activeQ, so this can be mitigated by an appropriate rate of pod creation.

Backoff won’t be working as a natural rate limiter in case of errors

In case of API calls errors (e.g., network issues), the backoffQ allows to limit the number of retries in a short term. This proposal will take those pods earlier, leading to losing this mechanism.

After merging kubernetes#128748 , it will be possible to distinguish pods backing off because of errors from those backing off because of an unschedulable attempt. To preserve the efficiency of the pop() function, it will be necessary to divide the backoffQ into two queues: one for pods that were unschedulable, and another for those rejected due to an error. Then popping will be performed only from the former, keeping the error backoff intact.

This has to be resolved before the beta is released, which means before the release of the feature.

One pod in the backoffQ could starve the others

The head of the BackoffQ is the pod with the closest backoff expiration, and the backoff time is calculated based on the number of scheduling failures that the pod has experienced. If one pod has a smaller attempt counter than others, could the scheduler keep popping this pod ahead of other pods because the pod’s backoff expires faster than others? Actually, that wouldn’t happen because the scheduler would increment the attempt counter of pods from the backoffQ as well, which would make the backoff time larger after each scheduling attempt, and the pod that had a smaller attempt number eventually won’t be popped out.

Low priority pod could be chosen to pop, even if high priority pod has a slightly later backoff expiration

The current mechanism of flushing from backoffQ to activeQ is done each second, taking all pods with backoff expired. It means that, when they come to activeQ, they are sorted by priority there and taken in this order from activeQ. It is important, because preemption of a lower priority pod could happen if a higher priority pod is scheduled later.

To mitigate this, lessFn function of backoffQ’s heap will be changed, splitting the time to make one second windows (by ignoring milliseconds) in which pods will be sorted by priority. Those whole windows will be eventually flushed to activeQ, making no change in current behavior.

Design Details

Popping from the backoffQ in activeQ’s pop()

To achieve the goal, activeQ’s pop() method needs to be changed:

If the activeQ is empty, then instead of waiting for a pod to arrive at the activeQ, popping from the backoffQ is tried.
If the backoffQ is empty, then pop() is waiting for a pod as previously.
If the backoffQ is not empty, then the pod is processed like the pod would be taken from the activeQ, including increasing attempts number. It is popping from a heap data structure, so it should be fast enough not to cause any performance troubles.

To support monitoring, when popping from the backoffQ, the scheduler_queue_incoming_pods_total metric with an activeQ queue and a new PopFromBackoffQ event label will be incremented.

Notifying activeQ condition when a new pod appears in the backoffQ

Pods might appear in the backoffQ while pop() is hanging on point 2. That’s why it will be required to call broadcast() on the condition after adding a pod to the backoffQ. It shouldn’t cause any performance issues.

We could eventually want to move the backoffQ under activeQ’s lock, but it’s out of scope of this KEP.

Calling PreEnqueue for the backoffQ

Currently, we call PreEnqueue at a single place, every time pods are being moved to the activeQ. But, with this proposal, PreEnqueue will be called before moving a pod to the backoffQ, not when popping pods directly from the backoffQ. Otherwise, a direct popping would be inefficient: it has to take the top backoffQ pod, check if it goes through PreEnqueue plugins, if not check the next backoffQ pod, until it finds the pod that goes through all PreEnqueue plugins. Also, it means we’d have two paths that PreEnqueue plugins are invoked: when new pods are created and entering the scheduling queue, and when pods are pushed into the backoffQ. At the moveToActiveQ level, these two paths could be distinguished by checking if the event is equal to BackoffComplete.

Change backoffQ less function

As mentioned in risks, backoffQ’s heap lessFn function has to be changed to apply priority within 1 second windows. The actual implementation takes backoff expiration times of two pods and compares which is lower. The new version will ignore the milliseconds and use priorities to compare pods within those windows. To make ordering predictable, in case of equal priorities within the same window, the whole backoff time expiration will be eventually compared. See the pseudocode:

func podsCompareBackoffCompleted(pInfo1, pInfo2 *framework.QueuedPodInfo) bool {
	if pInfo1.BackoffTime.InSeconds() != pInfo2.BackoffTime.InSeconds() {
		return pInfo1.BackoffTime.Before(pInfo2.BackoffTime)
	}
	if pInfo1.Priority != pInfo2.Priority {
		return pInfo1.Priority > pInfo2.Priority
	}
	return pInfo1.BackoffTime.Before(pInfo2.BackoffTime)
}

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/scheduler/backend/queue: 2025-02-06 - 91.4

Integration tests

k8s.io/kubernetes/test/integration/scheduler/queueing - add test cases covering the scenario.
scheduler_perf - add test cases measuring performance in this scenario.

e2e tests

The feature is scoped within the kube-scheduler internally, so there is no interaction between other components. The whole feature should be already covered by integration tests.

Graduation Criteria

The feature will start from beta and be enabled by default, because it is an internal kube-scheduler feature and guarded by a flag.

Alpha

N/A

Beta

Feature implemented behind a feature flag and enabled by default.
All tests from Test Plan implemented.
Make sure backoff in case of error is not skipped.

GA

Gather feedback from users and fix reported bugs.

Upgrade / Downgrade Strategy

Upgrade

During the beta period, the feature gate SchedulerPopFromBackoffQ is enabled by default, so users don’t need to opt in. This is a purely in-memory feature for the kube-scheduler, so no special actions are required outside the scheduler.

Downgrade

Users need to disable the feature gate.

Version Skew Strategy

This is a purely in-memory feature for the kube-scheduler, and hence no version skew strategy.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: SchedulerPopFromBackoffQ
- Components depending on the feature gate: kube-scheduler

Does enabling the feature change any default behavior?

Pods that are backing off might be scheduled earlier when the activeQ is empty.

Can the feature be disabled once it has been enabled (i.e., can we roll back the enablement)?

Yes. The feature can be disabled in Beta version by restarting the kube-scheduler with the feature-gate off.

What happens if we re-enable the feature if it was previously rolled back?

The scheduler again starts to pop pods from the backoffQ when the activeQ is empty.

Are there any tests for feature enablement/disablement?

Given it’s a purely in-memory feature and enablement/disablement requires restarting the component (to change the value of the feature flag), having feature tests is enough.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The partial failure in the rollout isn’t there because the scheduler is the only component to roll out this feature. But, if upgrading the scheduler itself fails somehow, new Pods won’t be scheduled anymore, while Pods, which are already scheduled, won’t be affected in any case.

What specific metrics should inform a rollback?

Abnormal values of metrics related to the scheduling queue, meaning pods are stuck in the activeQ:

The scheduler_schedule_attempts_total metric with the scheduled label is almost constant, while there are pending pods that should be schedulable. This could mean that pods from the backoffQ are taken instead of those from the activeQ.
The scheduler_pending_pods metric with the active label is not decreasing, while with the backoff is almost constant.
The scheduler_queue_incoming_pods_total metric with the PopFromBackoffQ label is increasing when there are pods in the activeQ. If this metric with this specific label is always higher than for other labels, it could also mean that this feature should be rolled back.
The scheduler_pod_scheduling_sli_duration_seconds metric is visibly higher for schedulable pods.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No. This feature is an in-memory feature of the scheduler and thus calculations start from the beginning every time the scheduler is restarted. So, just upgrading it and upgrade->downgrade->upgrade are both the same.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

They can check scheduler_queue_incoming_pods_total with the PopFromBackoffQ event label.

How can someone using this feature know that it is working for their instance?

N/A

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

In the default scheduler, we should see the throughput around 100-150 pods/s (ref ), and this feature shouldn’t bring any regression there.

Based on that schedule_attempts_total shouldn’t be less than 100 in a second, if there are enough unscheduled pods within the cluster.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - schedule_attempts_total
  - scheduler_schedule_attempts_total with scheduled label
  - scheduler_pending_pods with active and backoff labels
  - scheduler_pod_scheduling_sli_duration_seconds
- Components exposing the metric: kube-scheduler

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in a non-negligible increase of resource usage (CPU, RAM, disk, IO,…) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A

What are other known failure modes?

Unknown

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

6th Feb 2025: The initial KEP is submitted.

Drawbacks

Alternatives

Move pods in flushBackoffQCompleted when activeQ is empty

Moving the pod popping from the backoffQ to the existing flushBackoffQCompleted function (which already periodically moves pods to the activeQ) avoids changing PreEnqueue behavior, but it has some downsides. Because flushing runs every second, it would be needed to pop more pods when the activeQ is empty. This requires figuring out how many pods to pop, either by making it configurable or calculating it. Also, if schedulable pods show up in the activeQ between flushes, a bunch of pods from the backoffQ might break activeQ priorities and slow down scheduling for the pods that are ready to go.