KEP-5278: Nominated node name for an expected pod placement

Implementation History
BETA Implementable
Created 2025-05-07
Latest v1.35
Milestones
Alpha v1.34
Beta v1.35
Stable v1.37
Ownership
Owning SIG
SIG Scheduling
Participating SIGs

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Use NominatedNodeName to express the pod placement expected by the scheduler.

Besides using NominatedNodeName to indicate ongoing preemption, the scheduler can set it at the beginning of a binding cycle to show the expected pod placement to other components.

Motivation

External components need to know where the pod is going to be bound

The scheduler reserves the place for the pod when the pod is entering the binding cycle. This reservation is internally implemented in the scheduler’s cache, and is not visible to other components.

The specific problem, as shown in #125491, is that if the binding cycle takes a long time before binding the pod to a node (e.g., PreBind takes time to handle volumes), the cluster autoscaler cannot tell that the pod is going to be bound there. It concludes that the node is under-utilized (because the scheduler keeps the pod’s place free, and that reservation is invisible to it), and deletes the node.

We can expose those internal reservations with NominatedNodeName so that external components can take a more appropriate action based on the expected pod placement.

Please note that NominatedNodeName can express the reservation of node resources only, while some resources can be managed by the DRA plugin and expressed using ResourceClaim allocation. In order to correctly account for all the resources needed by a pod, both the nomination and the ResourceClaim status update need to be reflected in the api-server.

Retain the scheduling decision

At the binding cycle (e.g., PreBind), some plugins could handle something (e.g., volumes, devices) based on the pod’s scheduling result.

If the scheduler restarts while it’s handling some pods in their binding cycles, kube-scheduler could decide to schedule a pod to a different node. If we keep where the pod was going to go in NominatedNodeName, the scheduler will likely pick the same node, and the PreBind plugins can resume their work from where they were before the restart.

Goals

  • The scheduler will use NominatedNodeName to express where the pod is going to go before actually binding it.

Non-Goals

  • External components can suggest a specific node to kube-scheduler using NominatedNodeName.
    • This is not in scope of this feature for the time being. See the alternatives section for more details.

Proposal

User Stories (Optional)

Here are all the use cases of NominatedNodeName that we’re taking into consideration:

  • The scheduler puts it after the preemption (already implemented)
  • The scheduler puts it at the beginning of binding cycles (only if the binding cycles involve PreBind or WaitOnPermit phase)

(Possibly, our future initiative around the workload scheduling (including gang scheduling) can also utilize it, but we don’t discuss it here because it’s not yet concrete.)

Story 1: Prevent inappropriate scale downs by Cluster Autoscaler

Pod binding may take a significant amount of time (even on the order of minutes, e.g. due to volume binding). During that time, components other than the scheduler don’t know that such a placement decision has already been made and is already being executed. Without this information, other components may take conflicting actions (e.g. ClusterAutoscaler or Karpenter may decide to delete that particular node).

We need a way to share the information about already made scheduling decisions with those components to prevent that.
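A minimal sketch of how a consumer could take the nomination into account before removing a node. The `Pod` type and the `safeToRemove` helper are hypothetical and heavily simplified; this is not Cluster Autoscaler or Karpenter code, and a real implementation would also look at utilization and other constraints.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for the few v1.Pod fields used here.
type Pod struct {
	Name              string
	NodeName          string // spec.nodeName, set once the pod is bound
	NominatedNodeName string // status.nominatedNodeName
}

// expectedOn reports whether the pod is bound to the node or, if it is
// not bound yet, nominated to it.
func expectedOn(p Pod, node string) bool {
	if p.NodeName != "" {
		return p.NodeName == node
	}
	return p.NominatedNodeName == node
}

// safeToRemove returns false if any pod is bound to, or expected on, the node.
func safeToRemove(node string, pods []Pod) bool {
	for _, p := range pods {
		if expectedOn(p, node) {
			return false
		}
	}
	return true
}

func main() {
	pods := []Pod{
		{Name: "bound", NodeName: "node-a"},
		{Name: "binding", NominatedNodeName: "node-b"}, // still in PreBind
	}
	for _, n := range []string{"node-a", "node-b", "node-c"} {
		fmt.Printf("%s removable: %v\n", n, safeToRemove(n, pods))
	}
}
```

In this sketch only node-c is reported as removable, because node-b already has a pod on its way even though nothing is bound to it yet.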

Story 2: Scheduler can resume its work after restart

Pod binding may take a significant amount of time (even on the order of minutes, e.g. due to volume binding). During that time, the scheduler may be restarted, lose its leader lock, etc. Given that the placement decision was only stored in the scheduler’s memory, the new incarnation of the scheduler has no visibility into it and can decide to put the pod on a different node. This would waste the work that has already been done and increase the end-to-end pod startup latency.

We need a mechanism to resume the already started work in the majority of such situations.

Risks and Mitigations

Confusing semantics of NominatedNodeName

Up until now, NominatedNodeName was expressing the decision made by scheduler to put a given pod on a given node, while waiting for the preemption. The decision could be changed later so it didn’t have to be a final decision, but it was describing the “current plan of record”.

If we add the case of delayed binding, we effectively get a state machine with the following states:

  1. pending pod
  2. pod nominated to node and waiting for preemption
  3. pod allocated to node and waiting for binding
  4. pod bound

The important part is that if we decide to use NominatedNodeName to store information for both (2) and (3), we’re effectively losing the ability to distinguish between those states.

We may argue that as long as the decision was made by the scheduler, the exact reason and state probably isn’t that important - the content of NominatedNodeName can be interpreted as “current plan of record for this pod from scheduler perspective”.

If we look from consumption point of view - these are effectively the same. We want to expose the information, that as of now a given node is considered as a potential placement for a given pod. It may change, but for now that’s what is being considered.
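The loss of information described above can be illustrated with a small sketch. The `PodState` enum and the `observe` function are illustrative names only, and the mapping assumes that kube-apiserver clears NominatedNodeName on binding, as proposed later in this KEP.

```go
package main

import "fmt"

// PodState enumerates the four states listed above (illustrative only).
type PodState int

const (
	Pending PodState = iota
	NominatedWaitingForPreemption
	AllocatedWaitingForBinding
	Bound
)

// Observed is what an external component can read from the Pod object.
type Observed struct {
	NominatedNodeSet bool // status.nominatedNodeName != ""
	NodeNameSet      bool // spec.nodeName != ""
}

// observe maps an internal scheduler state to the externally visible fields.
func observe(s PodState) Observed {
	switch s {
	case NominatedWaitingForPreemption, AllocatedWaitingForBinding:
		return Observed{NominatedNodeSet: true}
	case Bound:
		return Observed{NodeNameSet: true}
	default:
		return Observed{}
	}
}

func main() {
	for s := Pending; s <= Bound; s++ {
		fmt.Printf("state %d -> %+v\n", s, observe(s))
	}
}
```

States (2) and (3) produce the same observation, which is exactly the ambiguity this section accepts.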

External components may set NominatedNodeName

Currently the NominatedNodeName field is intended to be read-only for components other than kube-scheduler. However, there are no measures preventing other actors from overwriting the field. This is not considered a substantial risk to scheduling.

Scheduler interprets NominatedNodeName as a suggestion for optimal placement for a pod. If at the beginning of a scheduling cycle NNN is set (e.g. to N1), the scheduler will start the scheduling attempt with trying to place the pod on N1. This could go two ways:

A. Pod fits on N1. Pod is bound, and after binding NNN gets cleared in api-server. The only risk here is that N1 might not be the optimal placement for the pod.

B. Pod does not fit on N1 (or N1 is invalid). Scheduler restarts the scheduling cycle, ignoring the NNN value. Filtering, Scoring and other phases get executed, and the standard scheduling procedure continues. If the pod is deemed unschedulable, scheduler clears the NNN field before moving the pod to the unschedulable / backoff queue. The risk in this case is that the scheduler spends time trying to fit the pod on N1 in the beginning - which is not a huge overhead compared to the entire scheduling cycle.

If NominatedNodeName gets overwritten further into the scheduling cycle, or when the pod is waiting in a scheduling queue, it does not impact kube-scheduler’s work.

Note that this logic is not newly introduced by this KEP; it has been present in kube-scheduler since v1.22 and KEP-1923.
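The two cases (A and B) can be sketched as follows. `pickNode` and `fitsFunc` are hypothetical names, `fits` stands in for running the filter plugins, and the fallback simply returns the first fitting node instead of scoring candidates, so this is a simplification rather than the scheduler's actual algorithm.

```go
package main

import "fmt"

// fitsFunc reports whether the pod fits on the node
// (a stand-in for running the filter plugins).
type fitsFunc func(node string) bool

// pickNode evaluates the nominated node first and falls back to the
// full node list. It returns the chosen node and whether one was found.
func pickNode(nominated string, nodes []string, fits fitsFunc) (string, bool) {
	if nominated != "" && fits(nominated) {
		return nominated, true // case A: the nomination is honored
	}
	// Case B: the nomination is ignored and the regular search runs.
	for _, n := range nodes {
		if fits(n) {
			return n, true
		}
	}
	return "", false // unschedulable; the caller would clear NNN
}

func main() {
	nodes := []string{"n1", "n2", "n3"}
	fits := func(n string) bool { return n != "n1" }

	node, ok := pickNode("n3", nodes, fits)
	fmt.Println(node, ok)
	node, ok = pickNode("n1", nodes, fits)
	fmt.Println(node, ok)
}
```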

Node nominations need to be considered together with reserving DRA resources

The semantics of node nomination are in fact resource reservation, either in scheduler memory or in external components (after the nomination got persisted to the api-server). Since pods consume both node resources and DRA resources, it’s important to persist both at the same (or almost the same) point in time.

This is consistent with the current implementation: ResourceClaim allocation is stored in status in the PreBinding phase, so in conjunction with node nomination it effectively allows reserving a complete set of resources (both node and DRA) and enables their correct accounting.

Note that node nomination is set before WaitOnPermit phase, but ResourceClaim status gets published in PreBinding, therefore pods waiting on WaitOnPermit would have only nominations published, and not ResourceClaim statuses. This however is not considered an issue, as long as there are no in-tree plugins supporting WaitOnPermit, and the Gang Scheduling feature is starting in alpha. This means that the fix to this issue will block Gang Scheduling promotion to beta.

Increasing the load to kube-apiserver

Setting NominatedNodeName is an additional API call that multiple components in the system then need to process. In the extreme case where it is always set before binding the pod, this would double the number of API calls from the scheduler, which isn’t acceptable for scalability and performance reasons.

To mitigate this problem, we:

  • skip setting NNN when all Permit and PreBind plugins have no work to do for this pod. (We’ll discuss how-to in the later section.)

For cases with delayed binding, we make an argument that the additional calls are acceptable, as there are other calls related to those operations (e.g. PV creation, PVC binding, etc.) - so the overhead of setting NNN is a smaller percentage of the whole e2e pod startup flow.

Design Details

The scheduler puts NominatedNodeName

The scheduler needs to update NominatedNodeName with the node that it determines the pod is going to at the beginning of binding cycles.

As discussed at Increasing the load to kube-apiserver, we should set NominatedNodeName only when some Permit plugins (at WaitOnPermit) or PreBind plugins have work to do.

We can tell whether any Permit plugin will do work at WaitOnPermit from the status returned by the Permit() functions. If one or more Permit() calls return the Wait status, we have to set NominatedNodeName at the beginning of the binding cycle, before actually starting to wait at WaitOnPermit.

For PreBind plugins, we need to add a new function to PreBindPlugin.

type PreBindPlugin interface {
	Plugin
	// **New Function** (or we can have a separate Plugin interface for this, if we're concerned about a breaking change for custom plugins)
	// It's called before PreBind, and the plugin is supposed to return Success, Skip, or Error status.
	// If it returns Skip, it means this PreBind plugin has nothing to do with the pod.
	// This function should be lightweight, and shouldn't do any actual operation, e.g., creating a volume etc
	PreBindPreFlight(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status

	PreBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}

The scheduler would run the new PreBindPreFlight() function before the PreBind() functions, and if all PreBind plugins return the Skip status from it, we can skip setting NominatedNodeName.
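Put together with the Permit rule above, the decision could look roughly like the sketch below. `Code` is a simplified stand-in for the framework's status codes and `needsNomination` is a hypothetical helper; error handling for failing PreBindPreFlight calls is left out.

```go
package main

import "fmt"

// Code is a simplified stand-in for the scheduler framework's status codes.
type Code int

const (
	Success Code = iota
	Skip
	Wait
)

// needsNomination decides whether NominatedNodeName should be set at the
// start of the binding cycle: true if any Permit plugin returned Wait, or
// if any PreBindPreFlight call returned something other than Skip.
func needsNomination(permit, preFlight []Code) bool {
	for _, c := range permit {
		if c == Wait {
			return true
		}
	}
	for _, c := range preFlight {
		if c != Skip {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsNomination([]Code{Success}, []Code{Skip, Skip}))    // false
	fmt.Println(needsNomination([]Code{Success}, []Code{Skip, Success})) // true
	fmt.Println(needsNomination([]Code{Wait}, []Code{Skip}))             // true
}
```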

This is similar to the approach we take with PreFilter/PreScore -> Filter/Score. We determine whether each plugin is relevant to the pod from the Skip status returned by PreFilter/PreScore, and then decide whether to run its Filter/Score function accordingly.

In this way, even if users have some PreBind custom plugins, they can implement PreBindPreFlight() appropriately so that the scheduler can wisely skip setting NominatedNodeName, taking their custom logic into consideration.
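As an illustration, a custom plugin might implement the pre-flight check along these lines. Everything here is hypothetical: `volumePreBinder`, the trimmed-down `Pod` type and the simplified signature (no context or CycleState) are stand-ins so that the example is self-contained, not the real framework types.

```go
package main

import "fmt"

// Code is a simplified stand-in for the framework's status codes.
type Code int

const (
	Success Code = iota
	Skip
)

// Pod is a hypothetical, trimmed-down pod with only the field this plugin inspects.
type Pod struct {
	Name      string
	PVCClaims []string // names of PersistentVolumeClaims referenced by the pod
}

// volumePreBinder is a hypothetical custom PreBind plugin.
type volumePreBinder struct{}

// PreBindPreFlight only inspects the pod. It performs no API calls and no
// provisioning, which keeps it lightweight as the interface comment requires.
func (volumePreBinder) PreBindPreFlight(p Pod, nodeName string) Code {
	if len(p.PVCClaims) == 0 {
		return Skip // PreBind has nothing to do for this pod
	}
	return Success
}

func main() {
	pl := volumePreBinder{}
	fmt.Println(pl.PreBindPreFlight(Pod{Name: "stateless"}, "node-1") == Skip)
	fmt.Println(pl.PreBindPreFlight(Pod{Name: "db", PVCClaims: []string{"data"}}, "node-1") == Success)
}
```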

The scheduler’s cache for NominatedNodeName

Here, we’ll ensure that the cache works for non-existing nodes too, and that it won’t leak memory if those nodes never appear.

The scheduler stores NominatedNodeName data in nominator. This nominator holds NominatedNodeName data even if the node doesn’t exist. So, this caching mechanism should work correctly for the non-existing NNN node scenario.

Also, this cached info is cleared in deletePodFromSchedulingQueue, which is called when unscheduled pods are removed, or when pods are assigned to nodes (EventHandler calls the DeleteFunc handler when the condition is no longer met).

So, in conclusion, there should be nothing new to implement here. We’ll ensure this scenario works correctly via tests.
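The property being relied on can be sketched with a simplified cache. This is an illustrative model, not the scheduler's actual nominator; the type and method names below are assumptions. The point is that entries are keyed by node name only, and removing the pod removes every trace of the nomination.

```go
package main

import "fmt"

// nominator is a simplified sketch of an in-memory index of nominated pods.
// It is keyed by node name only, so the node does not need to exist.
type nominator struct {
	podsByNode map[string]map[string]struct{} // node name -> set of pod UIDs
	nodeByPod  map[string]string              // pod UID -> node name
}

func newNominator() *nominator {
	return &nominator{
		podsByNode: map[string]map[string]struct{}{},
		nodeByPod:  map[string]string{},
	}
}

// add records (or moves) the nomination of a pod.
func (n *nominator) add(podUID, node string) {
	n.remove(podUID)
	if n.podsByNode[node] == nil {
		n.podsByNode[node] = map[string]struct{}{}
	}
	n.podsByNode[node][podUID] = struct{}{}
	n.nodeByPod[podUID] = node
}

// remove drops the pod's nomination and any node entry that became empty,
// so entries for nodes that never appear do not accumulate.
func (n *nominator) remove(podUID string) {
	node, ok := n.nodeByPod[podUID]
	if !ok {
		return
	}
	delete(n.nodeByPod, podUID)
	delete(n.podsByNode[node], podUID)
	if len(n.podsByNode[node]) == 0 {
		delete(n.podsByNode, node)
	}
}

func main() {
	n := newNominator()
	n.add("pod-1", "node-that-does-not-exist")
	fmt.Println(len(n.podsByNode), len(n.nodeByPod))
	n.remove("pod-1") // e.g. the pod was deleted or bound elsewhere
	fmt.Println(len(n.podsByNode), len(n.nodeByPod))
}
```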

The scheduler clears NominatedNodeName after scheduling failure

As of now the scheduler clears the NominatedNodeName field at the end of a failed scheduling cycle if it found the nominated node unschedulable for the pod. This logic remains unchanged.

NOTE: The previous version of this KEP, that allowed external components to set NominatedNodeName, deliberately left the NominatedNodeName field unchanged after scheduling failure. With the KEP update for v1.35 this logic is being reverted, and scheduler goes back to clearing the field after scheduling failure.

Kube-apiserver clears NominatedNodeName when receiving binding requests

We update kube-apiserver so that it clears NominatedNodeName when receiving binding requests.
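The intended effect on the stored pod can be sketched as below. `applyBinding` and the two-field `Pod` type are hypothetical, and the `clearNNN` parameter stands in for the ClearingNominatedNodeNameAfterBinding feature gate; the real change lives in the pod storage code of kube-apiserver.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for the two v1.Pod fields involved.
type Pod struct {
	NodeName          string // spec.nodeName
	NominatedNodeName string // status.nominatedNodeName
}

// applyBinding sketches the effect of a binding request on the stored pod.
// clearNNN stands in for the ClearingNominatedNodeNameAfterBinding gate.
func applyBinding(p Pod, target string, clearNNN bool) Pod {
	p.NodeName = target
	if clearNNN {
		p.NominatedNodeName = ""
	}
	return p
}

func main() {
	pending := Pod{NominatedNodeName: "node-1"}
	fmt.Printf("%+v\n", applyBinding(pending, "node-1", true))
	fmt.Printf("%+v\n", applyBinding(pending, "node-1", false))
}
```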

Handling ResourceClaim status updates

Since ResourceClaim status update is complementary to node nomination (reserves resources in a similar way), it’s desired that both will be set at the beginning of the PreBinding phase (before the pod starts waiting for resources to be ready for binding). The order of actions in the device management plugin is correct, however the scheduler performs the prebinding actions of different plugins sequentially. As a result it may happen that e.g. a long lasting PVC provisioning may delay exporting ResourceClaim allocation status. This is not desired, as it allows a gap in time when DRA resources are not reserved - causing problems similar to the ones originally fixed by this KEP - kubernetes/kubernetes#125491

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/scheduler: 2025-10-15 - 70.8
  • k8s.io/kubernetes/pkg/registry/core/pod/storage: 2025-10-15 - 78.8
  • k8s.io/kubernetes/pkg/apis/core/validation: 2025-10-15 - 85.2
Integration tests

Tests: test/integration/scheduler/nominated_node_name : integration master , triage search

Covering scenarios:

  • scheduler sets NNN before PreBind and WaitOnPermit, and does not set NNN when PreBind and Permit phases are skipped for the pod
  • The scheduler prefers picking nodes based on NominatedNodeName on pods, if the nodes are available.
  • The scheduler ignores NominatedNodeName reservations on pods when it’s scheduling higher priority pods.
  • The scheduler overwrites NominatedNodeName when it performs preemption, or when it finds a spot on another node and proceeds to the binding cycle (assuming there’s a PreBind plugin).
  • And, the scheduler (actually kube-apiserver, when receiving a binding request) clears NominatedNodeName when the pod is actually bound.

Also scheduler-perf was used to verify that the change did not impact scheduling throughput.

e2e tests

We won’t implement any e2e tests because we can test everything with integration tests described above, and an e2e test wouldn’t add any additional value.

Graduation Criteria

Beta

  • The feature is implemented behind the feature gate.
  • The tests are implemented.

GA

  • There are several official components starting to use this:
    • The cluster autoscaler starts to use this feature.
    • Kueue starts to use this feature.
  • No negative feedback or bug.

Upgrade / Downgrade Strategy

Upgrade

During the beta period, the feature gates NominatedNodeNameForExpectation and ClearingNominatedNodeNameAfterBinding are enabled by default, so no action is needed.

Downgrade

Users need to disable the feature gates, and restart kube-scheduler and kube-apiserver.

On downgrade to a version that doesn’t have this feature, there isn’t any action that users need to take. For pods that have NominatedNodeName set, the scheduler will try to honor it, but:

  • if the pod is still not schedulable, it will clear the field
  • if the pod is schedulable, but to a different node - it will also clear it (and potentially set it to a different value if preemption is needed)

Version Skew Strategy

If kube-apiserver’s version is older than kube-scheduler’s and doesn’t include the change from this KEP, NominatedNodeName won’t be cleared by the binding API call. Ideally, users should run the same version of kube-scheduler and kube-apiserver, but this skew is fine: clearing the field is not critical for correctness in the scheduling flow, it is only done to reduce potential user confusion, as discussed in Confusion if NominatedNodeName is different from NodeName after all.

So, we can say the risk caused by this version difference is fairly low.

On the other hand, if kube-scheduler’s version is older than kube-apiserver’s and doesn’t include the change from this KEP, nothing goes wrong, because kube-apiserver just clears NominatedNodeName from the pods at the binding API, which is fine with today’s scheduler implementation as well.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: NominatedNodeNameForExpectation
    • Components depending on the feature gate: kube-scheduler
  • Feature gate
    • Feature gate name: ClearingNominatedNodeNameAfterBinding
    • Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?

Pods that are processed by Permit or PreBind plugins get NominatedNodeName during binding cycles.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The feature can be disabled in Beta version by restarting the kube-scheduler and kube-apiserver with the feature-gates off.

What happens if we reenable the feature if it was previously rolled back?

The scheduler simply starts setting NominatedNodeName at the beginning of binding cycles again (if applicable).

Are there any tests for feature enablement/disablement?

No. This feature is only changing when a NominatedNodeName field will be set - it doesn’t introduce a new API. However reacting to it is purely in-memory, so enablement/disablement tests wouldn’t really differ from regular feature tests.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The scheduler and kube-apiserver are involved in this feature.

If upgrading the scheduler fails somehow, new Pods won’t be scheduled anymore until rolling back, while Pods, which are already scheduled, won’t be affected. If upgrading kube-apiserver fails somehow, the whole Kubernetes will not be able to function properly until rolling back.

Even if one of them cannot be upgraded properly somehow, and gets rolled back, there’ll be nothing behaving wrong in the scheduling flow, see Version Skew Strategy .

What specific metrics should inform a rollback?
  • The schedule_attempts_total metric with the error label is increasing abnormally.
  • The scheduler_pod_scheduling_sli_duration_seconds or scheduling_attempt_duration_seconds gets too long.
    • Although, for pods that have to go through Permit/PreBind plugins, it’s expected that their scheduling+binding latency would get higher because of an additional API call for NominatedNodeName.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The following manual test has been executed after implementing the feature.

  1. upgrade
  2. request scheduling of a pod that will need a long preBinding phase (e.g. uses volumes)
  3. check that NNN gets set for that pod
  4. before binding completes, restart the scheduler with nominatedNodeNameForExpectationEnabled = false
  5. check that the pod gets scheduled and bound successfully to the same node
  6. request scheduling another pod with expected long preBinding phase
  7. check that NNN does not get set in PreBind
  8. restart the scheduler with nominatedNodeNameForExpectationEnabled = true
  9. check that the pod gets scheduled and bound on any node
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

They can check that pods with delayed binding get NominatedNodeName while waiting for the scheduler to provision their resources at PreBind.

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: NominatedNodeName on the pods during its scheduling period.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We need to make sure the scheduling throughput doesn’t regress significantly due to this enhancement, especially for pods that go through PreBind or Permit.

The scheduling throughput depends on what types of pods are in your cluster, and also what types of scheduler customization you add.

So, here we just give a hint of a reasonable SLO; you need to adjust it based on your cluster’s usual behavior.

In the default scheduler, we should see a throughput of around 100-150 pods/s (ref). This feature does not bring any regression there.

Based on that:

  • schedule_attempts_total shouldn’t be less than 100 in a second.
  • the average of scheduling_algorithm_duration_seconds shouldn’t be above 10 ms.
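A minimal sketch of evaluating these two hints from metric samples. The `sloMet` helper is hypothetical; it assumes you already have the increase of schedule_attempts_total over a window, plus the increases of the histogram's _sum and _count series for scheduling_algorithm_duration_seconds over the same window. The thresholds are the hints above and should be tuned per cluster.

```go
package main

import "fmt"

// Thresholds taken from the hints above; adjust them to your cluster.
const (
	minAttemptsPerSecond   = 100.0
	maxAvgAlgorithmSeconds = 0.010
)

// sloMet evaluates both hints over one observation window.
// attempts is the increase of schedule_attempts_total during the window;
// durSum and durCount are the increases of the _sum and _count series of
// scheduling_algorithm_duration_seconds during the same window.
func sloMet(attempts, windowSeconds, durSum, durCount float64) bool {
	if windowSeconds <= 0 || durCount <= 0 {
		return false // not enough data to evaluate
	}
	rate := attempts / windowSeconds
	avg := durSum / durCount
	return rate >= minAttemptsPerSecond && avg <= maxAvgAlgorithmSeconds
}

func main() {
	// 7200 attempts in 60s (120/s), 36s of total algorithm time (5 ms average).
	fmt.Println(sloMet(7200, 60, 36, 7200))
}
```

Note that the throughput hint is only meaningful while there are enough pending pods to schedule; an idle cluster would trivially fall below 100 attempts per second.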
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • schedule_attempts_total with scheduled label.
    • scheduler_pod_scheduling_sli_duration_seconds with scheduled label.
Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

No.

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

Yes.

  • API call type: PATCH pods.
  • estimated throughput: Each pod that goes through Permit or PreBind plugins triggers one additional API call. In the default scheduler, pods with DRA or delayed binding PVC would be those.
  • originating component: Kube-scheduler
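For a sense of the size of this extra call, the sketch below builds a request body that sets the field. It assumes a JSON merge patch against the pod's status; the KEP does not specify the patch type or exact body the scheduler sends, and `nominationPatch` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nominationPatch builds a JSON merge patch body that sets
// status.nominatedNodeName on a pod.
func nominationPatch(node string) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{"nominatedNodeName": node},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := nominationPatch("node-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```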
Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes - but it should be negligible impact. The memory usage in kube-scheduler is supposed to increase because when NominatedNodeName is added on the pods, the scheduler’s internal component called nominator has to record them so that scheduling cycles can refer to them as necessary.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The scheduler itself doesn’t work anymore in that case.

What are other known failure modes?

Unknown.

What steps should be taken if SLOs are not being met to determine the problem?

Since SLOs can be impacted by multiple components and mechanisms in kubernetes, there is no straightforward algorithm to determine the problem. The general approach to investigating issues is described below.

If kube-scheduler SLOs are not being met, we should first check if other components of kubernetes (e.g. kube-apiserver) are experiencing slowdown or increased error rates as well. If that is the case, we should find out whether there is a global issue with an already-determined cause. A longer turnaround in kube-apiserver handling API requests may result in rising values of scheduling_algorithm_duration_seconds and lower values of schedule_attempts_total.

If we suspect that there is an ongoing problem inside kube-scheduler and that it is triggered by handling nominated node names, we should check kube-scheduler logs for failed scheduling of pods that had been waiting for preemption of victims, or for failed binding of pods that have nominated node name set - and investigate further.

Implementation History

  • 7th May 2025: The initial KEP is submitted.
  • 31st Jul 2025: The enhancement was demoted to alpha, because it hadn’t met all beta requirements for v1.34.
  • 9th Oct 2025: The enhancement was promoted to beta, with the scope narrowed down to allow setting NominatedNodeName only in the kube-scheduler, having other components (e.g. Cluster Autoscaler or Karpenter) use the field as read-only.

Drawbacks

Alternatives

Introduce a new field

Instead of using NominatedNodeName to let external components hint the scheduler, we considered introducing a dedicated field for that purpose. However, as discussed above, we don’t have any clear use cases where distinguishing the source of the setting really matters, and with multiple external components it doesn’t eliminate the potential races either. If in the future we realize that distinguishing that is needed, we believe that we can model such a state machine with an additional field in a purely additive way.

Allow NominatedNodeName to be set by other components

In v1.35 this feature is being narrowed down to one-way communication: only kube-scheduler is allowed to set NominatedNodeName, while for other components this field should be read-only.

The alternative to consider for future releases is that other components can set NominatedNodeName in pending pods to indicate the pod is preferred to be scheduled on a specific node.

Motivation: External components want to specify a preferred pod placement

The ClusterAutoscaler or Karpenter internally calculate the pod placement, and create new nodes or un-gate pods based on the calculation result. The shape and count of newly added nodes assumes some particular pod placement and the pods may not fit or satisfy scheduling constraints if placed differently.

By specifying their expectation on NominatedNodeName, the scheduler can first check whether the pod can go to the nominated node, reducing end-to-end scheduling time.

Goals
  • Make sure external components can use NominatedNodeName to express where they prefer the pod is going to.
    • Probably, you can do this with today’s scheduler as well. This proposal wants to discuss and confirm whether it actually works, and then add tests etc.
Non-Goals
  • External components can enforce the scheduler to pick up a specific node via NominatedNodeName.
    • NominatedNodeName is just a hint for scheduler and doesn’t represent a hard requirement
User stories

The use case supported by this feature is:

  • The ClusterAutoscaler or Karpenter sets NominatedNodeName after creating a new node for pending pod(s), so that the scheduler can utilize the result of scheduling simulations already calculated by those components
Story 1: ClusterAutoscaler or Karpenter can influence scheduling decisions

ClusterAutoscaler or Karpenter perform scheduling simulations to decide what nodes should be added to make pending pods schedulable. Their decisions assume a certain placement - if pending pods are placed differently, they may not fit on the newly added nodes or may not satisfy their scheduling constraints.

In order to improve the end-to-end pod startup latency when cluster scale-up is needed, we need a mechanism to communicate the results of scheduling simulations from ClusterAutoscaler or Karpenter to scheduler.

Story 2: Kueue specifies NominatedNodeName to indicate where it prefers pods being scheduled to

Kueue supports scheduling features that are not (yet) supported in core scheduling, such as topology-aware scheduling. When it determines the optimal placement, it needs a mechanism to pass that information to the scheduler. Currently it is using NodeSelector to enforce placement of pods and only then ungates the pods. Scheduler doesn’t take that information into account until pods are ungated and can schedule other pods in those places in the meantime. It would be beneficial to pass that information to scheduler sooner, as well as allow scheduler to change the decision if the topology constraints are just the soft ones.

Risks and Mitigations
NominatedNodeName can be set by other components now.

There aren’t any guardrails preventing other components from setting NominatedNodeName now. In such cases, the semantic is not well defined now and the outcome of it may not match user expectations.

This section is a step towards clarifying this semantic instead of maintaining status-quo.

Confusing semantics of NominatedNodeName

Up until now, NominatedNodeName was expressing the decision made by scheduler to put a given pod on a given node, while waiting for the preemption. The decision could be changed later so it didn’t have to be a final decision, but it was describing the “current plan of record”.

If we put more components into the picture (e.g. ClusterAutoscaler and Karpenter), we effectively get a more complex state machine, with the following states:

  1. pending pod
  2. pod proposed to node (by external component) [not approved by scheduler]
  3. pod nominated to node (based on external proposal) and waiting for node (e.g. being created & ready)
  4. pod nominated to node and waiting for preemption
  5. pod allocated to node and waiting for binding
  6. pod bound

The important part is that if we decide to use NominatedNodeName to store all that information, we’re effectively losing the ability to distinguish between those states.

We may argue that as long as the decision was made by the scheduler, the exact reason and state probably isn’t that important - the content of NominatedNodeName can be interpreted as “current plan of record for this pod from scheduler perspective”.

But the pod proposed to node state is visibly different. In particular external components may overallocate the pods on the node, those pods may not match scheduling constraints etc. We can’t claim that it’s a current plan of record of the scheduler. It’s a hint that we want scheduler to take into account.

In other words, from state machine perspective, there is visible difference in who sets the NominatedNodeName. If it was scheduler, it may mean that there is already ongoing preemption. If it was an external component, it’s just a hint that may even be ignored. However, if we look from consumption point of view - these are effectively the same. We want to expose the information, that as of now a given node is considered as a potential placement for a given pod. It may change, but for now that’s what considered.

Eventually, we may introduce some state machine, where external components could also approve the scheduler’s decisions, by exposing these states more concretely via the API. But we will be able to achieve it in an additive way by exposing the information about the state.

However, we don’t need this state machine now, so we just introduce the following rules:

  • Any component can set NominatedNodeName if it is currently unset.
  • Scheduler is allowed to overwrite NominatedNodeName at any time in case of preemption or the beginning of the binding cycle.
  • No external components can overwrite NominatedNodeName set by a different component.
  • If NominatedNodeName is set, the component that set it is responsible for updating or clearing it when its plans change, so that the field reflects the new hint (using PUT or APPLY to ensure the write won’t conflict with a potential update from the scheduler).
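The write rules above can be sketched as a small predicate. This is only an illustration: the Pod API has no dedicated field recording who wrote NominatedNodeName, so the `Writer` field, the `nomination` type, and the component names below are assumptions of this sketch. It also simplifies the scheduler rule by not modelling the preemption or binding-cycle condition.

```go
package main

import "fmt"

// scheduler is the writer name used for kube-scheduler in this sketch.
const scheduler = "kube-scheduler"

// nomination records the current value and who wrote it. Tracking the
// writer is an assumption of this sketch, not part of the Pod API.
type nomination struct {
	Node   string
	Writer string
}

// mayWrite reports whether writer is allowed to change cur under the
// rules listed above.
func mayWrite(cur nomination, writer string) bool {
	switch {
	case cur.Node == "":
		// Any component can set the field if it is currently unset.
		return true
	case writer == scheduler:
		// The scheduler may overwrite (simplified: the sketch does not
		// check that this happens for preemption or at binding start).
		return true
	default:
		// An external component may only touch its own nomination.
		return cur.Writer == writer
	}
}

func main() {
	cur := nomination{Node: "node-a", Writer: "cluster-autoscaler"}
	for _, w := range []string{"cluster-autoscaler", "karpenter", scheduler} {
		fmt.Printf("%s may write: %v\n", w, mayWrite(cur, w))
	}
}
```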

Moreover:

  • Regardless of who set NominatedNodeName, its readers should always take that into consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
  • In case of faulty components (e.g. overallocating the nodes), these decisions will simply be rejected by the scheduler (although the NominatedNodeName will remain set for the unschedulability period).

Race condition

If an external component adds NominatedNodeName to the pod that is going through a scheduling cycle, NominatedNodeName isn’t taken into account (of course), and the pod could be scheduled onto a different node.

But, this should be fine because, either way, we’re not saying NominatedNodeName is something forcing the scheduler to pick up the node, rather it’s just a preference.

What if there are multiple components that could set NominatedNodeName on the same pod

This is not newly introduced by this KEP, because anyone can set NominatedNodeName today, but we discuss it here to form our recommendation.

Multiple controllers might keep overwriting a NominatedNodeName set by the others. We could regard that as the user’s fault, but it would still be an undesirable situation.

There are several ideas to mitigate this, or even solve it completely by adding a new API. But we don’t want to introduce any complexity right now, because we’re not sure how many users will start using this and hit the problem.

So, for now, we’ll just document it as a risk and a discouraged configuration. In the future, we’ll reconsider if we actually observe the problem growing as more people start using the feature.

Invalid NominatedNodeName prevents the pod from scheduling

Currently, the NominatedNodeName field is cleared at the end of a failed scheduling cycle if the scheduler found the nominated node unschedulable for the pod. However, in order to make it work for ClusterAutoscaler and Karpenter, we will remove this logic, so NominatedNodeName could stay on the pod forever, despite no longer being a valid suggestion. As an example, imagine a scenario where ClusterAutoscaler created a new node and nominated a pod to it, but before that pod was scheduled, a new higher-priority pod appeared and used the space on that newly created node. In such a case, everything worked as expected, but we ended up with NominatedNodeName set incorrectly.

As a mitigation:

  • an external component that originally set the NominatedNodeName is responsible for clearing or updating the field to reflect the state
  • if that doesn’t happen, given that NominatedNodeName is just a hint for the scheduler, the scheduler will continue processing the pod with only a minor performance hit (it tries the node set via NNN first, but falls back to all nodes anyway). We claim that the additional cost of checking NominatedNodeName first is acceptable, even for big clusters where performance is critical, because it’s just one extra iteration of Filter plugins. For example, with 1000 nodes and a parallelism of 16 (the default value), the scheduler needs approximately 62 iterations of Filter plugins, so adding one iteration on top of that is negligible.
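As a rough illustration of the fallback and of the cost estimate, the following Go sketch tries the nominated node first and evaluates all nodes only if that fails. `filterFn`, `findFeasible`, and `filterRounds` are hypothetical names for this example and do not correspond to functions in kube-scheduler. The round count is a back-of-the-envelope figure that assumes every node is evaluated.

```go
package main

import "fmt"

// filterFn stands in for one run of all Filter plugins against a node.
type filterFn func(node string) bool

// findFeasible tries the nominated node first and falls back to all
// nodes. It returns the feasible nodes and whether the nomination was used.
func findFeasible(nominated string, nodes []string, fits filterFn) ([]string, bool) {
	if nominated != "" && fits(nominated) {
		return []string{nominated}, true
	}
	var feasible []string
	for _, n := range nodes {
		if fits(n) {
			feasible = append(feasible, n)
		}
	}
	return feasible, false
}

// filterRounds estimates the number of sequential Filter rounds when
// `parallelism` nodes are evaluated at a time.
func filterRounds(nodes, parallelism int) float64 {
	return float64(nodes) / float64(parallelism)
}

func main() {
	nodes := []string{"n1", "n2", "n3"}
	fits := func(n string) bool { return n == "n2" }

	// The stale nomination "n1" costs one extra check, then falls back.
	feasible, usedNomination := findFeasible("n1", nodes, fits)
	fmt.Println(feasible, usedNomination)

	// Estimate for 1000 nodes with the default parallelism of 16.
	fmt.Printf("%.1f rounds\n", filterRounds(1000, 16))
}
```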

Confusion if NominatedNodeName is different from NodeName after all

If an external component adds NominatedNodeName, but the scheduler picks up a different node, NominatedNodeName is just overwritten by a final decision of the scheduler.

But, if an external component updates NominatedNodeName that is set by the scheduler, the pod could end up having different NominatedNodeName and NodeName.

We will update the logic so that the NominatedNodeName field is cleared during the binding call.

We believe that ensuring NominatedNodeName can’t be set after the pod is already bound is a niche enough feature that it doesn’t justify strengthening the validation.

Design Details

If we take into account external components setting NominatedNodeName, the design needs to be extended as follows:

External components put NominatedNodeName

There aren’t any restrictions preventing other components from setting NominatedNodeName as of now. However, we don’t have any validation of how that currently works. To support the use cases mentioned above, we will adjust the scheduler to do the following:

  • If NominatedNodeName is set, but the corresponding Node doesn’t exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon].
  • We will rely on the fact that a pod with NominatedNodeName set results in an in-memory reservation for the requested resources. Higher-priority pods can ignore it, but pods with equal or lower priority don’t have access to these resources. This allows us to prioritize nominated pods when the nomination was done by an external component. We just need to ensure that when NominatedNodeName is assigned by an external component, the nomination is reflected in the scheduler’s memory.
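The priority rule for the in-memory reservation can be sketched as follows. This is a simplified model for illustration, not the scheduler’s actual code: the `pod` type and `freeCPUFor` function are hypothetical, and only CPU is considered.

```go
package main

import "fmt"

// pod is a minimal stand-in for the fields this sketch needs.
type pod struct {
	Name          string
	Priority      int32
	CPUMilli      int64
	NominatedNode string
}

// freeCPUFor returns the CPU (in millicores) on node that incoming may
// count on. Nominated pods with equal or higher priority are treated as
// if they were already running there; lower-priority ones are ignored.
func freeCPUFor(incoming pod, node string, allocatableMilli int64, nominated []pod) int64 {
	free := allocatableMilli
	for _, p := range nominated {
		if p.NominatedNode != node || p.Name == incoming.Name {
			continue
		}
		if p.Priority >= incoming.Priority {
			free -= p.CPUMilli
		}
	}
	return free
}

func main() {
	nominated := []pod{{Name: "a", Priority: 10, CPUMilli: 3000, NominatedNode: "n1"}}

	equal := pod{Name: "b", Priority: 10, CPUMilli: 2000}
	higher := pod{Name: "c", Priority: 20, CPUMilli: 2000}

	fmt.Println(freeCPUFor(equal, "n1", 4000, nominated))
	fmt.Println(freeCPUFor(higher, "n1", 4000, nominated))
}
```

In this model, the equal-priority pod sees the space reserved for pod "a" as taken, while the higher-priority pod sees the whole node as available.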

We will implement integration tests simulating the above behavior of external components.

The scheduler only modifies NominatedNodeName, does not clear it in any case

As of now, the scheduler clears the NominatedNodeName field at the end of a failed scheduling cycle if it found the nominated node unschedulable for the pod. However, this won’t work if ClusterAutoscaler or Karpenter sets it during scale-up.

In the most basic case, the node may not yet exist, so clearly it would be unschedulable for the pod. However, the potential mitigation of ignoring only non-existent nodes wouldn’t work either, as the following case shows:

  1. Pods are unschedulable. For simplicity, let’s say all of them are rejected by the NodeResourcesFit plugin (i.e., no node has enough CPU/memory for the pods’ requests).
  2. CA finds them and calculates the nodes that need to be created.
  3. CA puts NominatedNodeName on each pod.
  4. The scheduler keeps trying to schedule those pending pods; let’s say they remain unschedulable (no cluster event happens that could make them schedulable) until the nodes are created.
  5. The nodes are created and registered with kube-apiserver. Let’s say that, at this point, the nodes have un-ready taints.
  6. The scheduler observes the Node/Create event, the NodeResourcesFit plugin’s QHint returns Queue, and those pending pods are requeued to activeQ.
  7. The scheduling cycle starts handling those pending pods.
  8. However, because the nodes have un-ready taints, the pods are rejected by the TaintToleration plugin.
  9. The scheduler clears NominatedNodeName because it finds the nominated node (= the new node) unschedulable.

In order to avoid the above scenarios, we simply remove the clearing logic. This means that the scheduler will never clear NominatedNodeName. It may still update it if, based on its scheduling algorithm, it decides to ignore the current value of NominatedNodeName and place the pod on a different node (either to signal preemption, or to record the decision before binding, as described in the sections above).
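The difference between the two behaviors can be replayed with a deliberately simplified Go model of steps 5 to 9. The `cycle` and `run` functions and the `node` and `pod` types are hypothetical. The model only tracks whether the nomination survives a failed cycle and only tries the nominated node, ignoring the fallback to other nodes, since in the scenario above no other node fits.

```go
package main

import "fmt"

// node models only what the scenario needs.
type node struct {
	Name     string
	NotReady bool // stands in for the un-ready taint in step 5
}

// pod carries the two fields the scenario is about.
type pod struct {
	NominatedNodeName string
	NodeName          string
}

// cycle runs one simplified scheduling cycle for p against its nominated
// node only. clearOnFailure=true models the clearing logic being removed.
func cycle(p pod, nodes map[string]node, clearOnFailure bool) pod {
	n, exists := nodes[p.NominatedNodeName]
	if p.NominatedNodeName != "" && exists && !n.NotReady {
		p.NodeName = n.Name
		p.NominatedNodeName = "" // cleared during the binding call
		return p
	}
	if clearOnFailure {
		p.NominatedNodeName = ""
	}
	return p
}

// run replays steps 5 to 9, then lets the node become ready and retries.
func run(clearOnFailure bool) pod {
	p := pod{NominatedNodeName: "new-node"}
	nodes := map[string]node{"new-node": {Name: "new-node", NotReady: true}}

	p = cycle(p, nodes, clearOnFailure) // rejected because of the taint

	nodes["new-node"] = node{Name: "new-node"} // the taint is removed
	return cycle(p, nodes, clearOnFailure)
}

func main() {
	fmt.Printf("with clearing:    %+v\n", run(true))
	fmt.Printf("without clearing: %+v\n", run(false))
}
```

With clearing enabled, the nomination is lost in the first cycle, so the pod no longer has any tie to the node that was created for it. Without clearing, the nomination survives until the node becomes usable.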

Test plan: Integration tests

We’re going to add these integration tests:

  • The scheduler doesn’t clear NominatedNodeName when the nominated node isn’t available and the pod is unschedulable.
    • And, once the node appears, the pod with NNN set is scheduled there (even if there are other equal-priority pending pods).

Also, with scheduler-perf, we’ll make sure the scheduling throughput for pods that go through Permit or PreBind doesn’t regress too much. We need to accept a small regression, since there’ll be a new API call to set NominatedNodeName. But, as discussed, assuming PreBind already makes some API calls for the pods, the regression there should be small.

Infrastructure Needed (Optional)