KEP-5366: Graceful Leader Transition

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Future Work (Stories)
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposal outlines a plan to modify the leader election mechanism for key Kubernetes components (kube-scheduler, kube-controller-manager, cloud-controller-manager). The goal is to enable these components to gracefully release the leader lock and transition back to a follower state without a full process restart. This change will be introduced behind a new feature gate.

Motivation

Many high-availability (HA) Kubernetes components, including kube-scheduler and kube-controller-manager, rely on the leader election library in client-go. The current implementation mandates that when a component loses its leader lock, it must shut down immediately. This is typically handled by calling klog.FlushAndExit() in the OnStoppedLeading callback.

When the leadership is lost, the component shuts down and waits for the kubelet to detect that the component is unhealthy and restart it. This has several significant drawbacks:

High Overhead: Restarting a component process incurs unnecessary computational overhead and increases latency during a leadership transition.
No Graceful Shutdown: klog.FlushAndExit() flushes the logs and then calls os.Exit(), which prevents any graceful shutdown or cleanup operations.

Goals

Allow leader-elected components to transition to a follower state without restarting the process
Enable graceful handling of leader lock loss

Non-Goals

This KEP enables work towards graceful shutdown of controllers, but the actual mechanisms for performing graceful shutdown are outside the scope of this KEP.

Proposal

We propose a phased approach. Phase 1 ensures controllers can safely shut down, and Phase 2 builds on that to enable faster failovers by actively releasing the lease on shutdown. Phases 1 and 2 are the committed scope of this KEP. A future Phase 3 would remove the need for process restarts entirely (graceful leader transition). It is out of scope here and captured as future work.

Phase 1: Controller Sanitation (Pre-Alpha)

This phase ensures that kube-controller-manager controllers gracefully terminate without leaking goroutines by strictly enforcing context cancellation within their control loops.

Note: This phase is largely addressed by PR #134910 and PR #134945, which standardize Run termination.

Objective: Ensure that kcm controller goroutines properly terminate when context is cancelled.
Mechanism: Refactor controller management to track all spawned goroutines via wg.Go() and wg.Wait().
Feature Gate: None (Technical debt cleanup).

Phase 2: Fast Lease Release

Once we are confident that controllers shut down gracefully (Phase 1), we can optimize the leadership transition. Instead of waiting for the lease TTL to expire, the leader will actively release the lock upon shutdown. Note: kube-scheduler already implements this behavior, so this phase only targets kube-controller-manager.

Objective: Reduce failover latency.
Mechanism: Modify client-go/tools/leaderelection to perform an active release of the Lease object (removing the holder identity) when the context is cancelled.
Feature Gate: ControllerManagerReleaseLeaderElectionLockOnExit

Risks and Mitigations

Kubernetes leader elected controllers and the scheduler have been running without graceful shutdown for years.

Risk 1: Resource exhaustion: Memory leaks may exist in the processes that were previously masked by doing a full shutdown and restart loop.

Severity: Medium-high
Controllers will continue to function (potentially in a degraded state due to lack of resources), and may be restarted frequently. However, the cluster should continue to function.

Risk 2: Wedged KCM: There is a risk that controllers and the scheduler are not properly respecting context shutdowns. This can either result in multiple instances of controllers running or no instances running despite the lock being held.

Severity: High
Breaking mutual exclusion guarantees can put the cluster into a non-desirable state. A manual user intervention is possible but if the problem is triggered due to a problematic component, the issue will resurface and the best path for mitigation is to turn off the feature.

Risk 3: Futureproofing: An additional risk is that even if all the current code is safe and respects shutting down gracefully, new controllers/modifications to kcm or scheduler could create subtle problems in shutdown and transition.

Severity: High
Leads to either risk 1 or 2.

Mitigations:

Audit and add tests for the existing controllers and the scheduler to ensure proper handling of context shutdowns. See test plan section for more details.
Graceful shutdown modifications will not be guarded by a feature gate, but the code change to remove the os.Exit() line will be guarded by a feature gate.
Document the new development best practices for graceful shutdown requirements for modified components that are leader elected.

Design Details

Phase 1 Implementation

All controllers must standardize their startup sequences. By the time a controller returns and the leader lock is released, all goroutines associated with that controller must have terminated.

Phase 2 Implementation

The leader lock will be proactively released when the context is cancelled and the leader prepares to step down. This release must occur only after all controller goroutines have returned. This behavior will be guarded by the ControllerManagerReleaseLeaderElectionLockOnExit feature gate.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

To properly test that components can respect context cancellation shutdowns and not leak memory, we will run both manual tests with profiling, and automated testing that transitions the leader lock multiple times.

We will also test that things like caches, reflectors, and go funcs are properly cleared/stopped when the leader lock is lost.

Scenarios:

Component acquires leader lock should start its control loop
Component releasing leader lock should shut down its control loop
Component acquiring and releasing the leader lock multiple times should not leak memory
Component acquiring and releasing the leader lock multiple times should have ONLY ONE control loop running
If kube-apiserver takes a long time to release client calls, the graceful release will properly wait until the controller has returned
Releasing a lock will stop the control loop
Only one control loop should be running at all times
Components in follower mode should not start control loop or otherwise allocate unnecessary memory
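One way to check the "only one control loop" scenarios is to count loops as they start and stop. The sketch below is a hypothetical harness, not the planned integration test: controlLoop, loopStats, and cycleLeadership are invented for illustration, and leadership changes are simulated with context cancellation rather than a real lock.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// loopStats tracks how many control loops are running right now and the
// highest number ever observed at once.
type loopStats struct {
	active atomic.Int32
	peak   atomic.Int32
}

// controlLoop is a hypothetical control loop that runs until ctx is cancelled.
func controlLoop(ctx context.Context, s *loopStats) {
	n := s.active.Add(1)
	defer s.active.Add(-1)
	for { // raise peak to n if n is a new maximum
		p := s.peak.Load()
		if n <= p || s.peak.CompareAndSwap(p, n) {
			break
		}
	}
	<-ctx.Done()
}

// cycleLeadership simulates acquiring and releasing leadership `terms` times.
// Each term's loop must return before the next term begins.
func cycleLeadership(terms int, s *loopStats) {
	for i := 0; i < terms; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			controlLoop(ctx, s)
		}()
		time.Sleep(time.Millisecond) // hold leadership briefly
		cancel()                     // leadership lost
		wg.Wait()                    // loop has fully stopped
	}
}

func main() {
	var s loopStats
	cycleLeadership(5, &s)
	fmt.Println("peak:", s.peak.Load(), "active:", s.active.Load()) // peak: 1 active: 0
}
```

A peak above one would indicate overlapping control loops, and a non-zero active count at the end would indicate a loop that survived the loss of leadership.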

Prerequisite testing updates

Unit tests

This is primarily testing for leader election’s interaction with components, and will be tested via integration and e2e tests.

Integration tests

See the above scenarios for test plan. kube-scheduler and kcm in particular will be integration tested to confirm that they shut down properly.

e2e tests

See the above scenarios for test plan.

Graduation Criteria

Alpha

ControllerManagerReleaseLeaderElectionLockOnExit feature gate implemented.
Phase 1 implemented and controller startup and shutdown logic is handled gracefully.
Runtime detection of leaked goroutines.
Test that controller-manager and scheduler do not leak memory on leadership transitions.

Beta

ControllerManagerReleaseLeaderElectionLockOnExit graduates to beta, enabled by default in v1.37.
e2e tests
Address how to minimize risks of putting KCM or scheduler in a “wedged” state

GA

Upgrade / Downgrade Strategy

No changes.

Version Skew Strategy

This is a control plane change. Skew should not affect the feature.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ControllerManagerReleaseLeaderElectionLockOnExit
- Components depending on the feature gate: kube-controller-manager
Will enabling / disabling the feature require downtime of the control plane? Yes, components need to be restarted.
Will enabling / disabling the feature require downtime or reprovisioning of a node? No.

Does enabling the feature change any default behavior?

Yes. With the ControllerManagerReleaseLeaderElectionLockOnExit feature gate enabled, kube-controller-manager actively releases its leader lease on shutdown (clearing the holder identity) instead of leaving it to expire by TTL, so a standby instance can acquire leadership without waiting out the lease duration. The component still exits when it loses leadership.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by setting the ControllerManagerReleaseLeaderElectionLockOnExit feature gate to false and restarting kube-controller-manager. This reverts to the previous behavior where the leader lease is left to expire by TTL on shutdown. This should not break existing workloads as it restores the prior, well-understood behavior.

What happens if we reenable the feature if it was previously rolled back?

If the feature is re-enabled after being rolled back, kube-controller-manager will once again actively release its lease on shutdown. There are no special considerations for re-enabling.

Are there any tests for feature enablement/disablement?

No.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout could fail if:

kube-controller-manager does not correctly handle context cancellation when losing leadership, leading to incomplete shutdown of internal controllers.
The lease is released before controllers have fully stopped, briefly allowing a standby instance to start while the old leader is still finishing work.

Impact on workloads:

If a leader component becomes unstable (e.g., due to memory leaks or improper shutdown), its ability to perform its duties (scheduling, controller management) could be impaired. In an HA setup, another instance should take over leadership, but frequent transitions or instability could degrade overall cluster performance or reliability.
If a component fails to release resources correctly upon losing leadership, it might lead to resource contention or incorrect behavior. Rollback (disabling the feature gate and restarting components) should revert to the previous stable behavior.

What specific metrics should inform a rollback?

Some things to look at:

Component restart counts: An increase in restarts for kube-controller-manager or kube-scheduler.
Log messages indicating errors during leader transition or resource cleanup.
General cluster health indicators like API server latency, pod scheduling latency, or controller reconciliation errors.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

n/a

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No. This feature introduces new behavior behind a feature gate and does not deprecate or remove any existing features, APIs, fields, or flags.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

An operator can determine if the feature is active by inspecting the command-line flags of kube-controller-manager to verify that the ControllerManagerReleaseLeaderElectionLockOnExit feature gate is enabled.
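As a rough illustration of the flag inspection described above, the sketch below scans a process's arguments for an explicit --feature-gates setting. gateSetting is a hypothetical helper, not part of any Kubernetes library, and it only detects explicit settings: when the gate is not mentioned, the component's built-in default for that release applies.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gateSetting scans command-line arguments for --feature-gates and returns
// the explicit value of gate. found is false when the gate is not mentioned,
// in which case the component's built-in default applies.
func gateSetting(args []string, gate string) (enabled, found bool) {
	const flagName = "--feature-gates"
	for i, arg := range args {
		var value string
		switch {
		case strings.HasPrefix(arg, flagName+"="):
			value = strings.TrimPrefix(arg, flagName+"=")
		case arg == flagName && i+1 < len(args):
			value = args[i+1]
		default:
			continue
		}
		// The flag value is a comma-separated list of Name=bool pairs.
		for _, pair := range strings.Split(value, ",") {
			k, v, ok := strings.Cut(pair, "=")
			if !ok || strings.TrimSpace(k) != gate {
				continue
			}
			if b, err := strconv.ParseBool(strings.TrimSpace(v)); err == nil {
				enabled, found = b, true // a later occurrence overrides an earlier one
			}
		}
	}
	return enabled, found
}

func main() {
	args := []string{
		"kube-controller-manager",
		"--leader-elect=true",
		"--feature-gates=ControllerManagerReleaseLeaderElectionLockOnExit=true",
	}
	fmt.Println(gateSetting(args, "ControllerManagerReleaseLeaderElectionLockOnExit")) // true true
}
```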

Observing component logs for messages indicating an active lease release on shutdown (rather than waiting for the lease to expire) would also confirm its use.

How can someone using this feature know that it is working for their instance?

This feature is primarily for cluster operators. Operators can verify its operation by:

Observing failover latency: after a graceful kube-controller-manager shutdown (e.g. during a rolling upgrade), a standby instance should acquire leadership promptly rather than waiting out the lease duration.
Inspecting the Lease object: the holder identity is cleared when the leader shuts down, instead of remaining set until the lease expires.
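The Lease inspection can also be scripted. The sketch below assumes the Lease has already been fetched as JSON (for example with kubectl using -o json) and checks spec.holderIdentity; leaseDoc and leaseReleased are hypothetical names, and the sample documents in main are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leaseDoc mirrors only the fields of a coordination.k8s.io/v1 Lease that
// this check needs.
type leaseDoc struct {
	Spec struct {
		HolderIdentity *string `json:"holderIdentity"`
	} `json:"spec"`
}

// leaseReleased reports whether the Lease JSON has no holder, treating a
// missing holderIdentity and an empty one alike.
func leaseReleased(raw []byte) (bool, error) {
	var l leaseDoc
	if err := json.Unmarshal(raw, &l); err != nil {
		return false, err
	}
	return l.Spec.HolderIdentity == nil || *l.Spec.HolderIdentity == "", nil
}

func main() {
	// Invented sample documents; real Lease objects carry more fields.
	held := []byte(`{"spec":{"holderIdentity":"kcm-a_1234","leaseDurationSeconds":15}}`)
	freed := []byte(`{"spec":{"leaseDurationSeconds":15}}`)

	fmt.Println(leaseReleased(held))  // false <nil>
	fmt.Println(leaseReleased(freed)) // true <nil>
}
```

Checking the object shortly after a graceful shutdown, before a standby has acquired it, is the window in which a cleared holder identity would be visible.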

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

No increase to leader transition time with the feature enabled.
No increase to memory usage with the feature enabled.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Component logs can be used to verify graceful shutdown steps are being executed.

Are there any missing metrics that would be useful to have to improve observability of this feature?

n/a

Dependencies

Does this feature depend on any specific services running in the cluster?

This is a control plane feature and requires the components that the feature runs on (kube-scheduler, kube-controller-manager) to be active, as well as the kube-apiserver and etcd for leader election.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Leader election cannot function without apiserver or etcd.

What are other known failure modes?

Premature lease release
- Detection: briefly more than one active kube-controller-manager after a leadership change (duplicate controller activity or conflicting writes).
- Mitigations: the lease is released only after controller goroutines return. Turn off the feature to fall back to TTL expiry.
- Diagnostics: inspect the Lease object and controller logs around handoff.
- Testing: integration tests for shutdown ordering.

What steps should be taken if SLOs are not being met to determine the problem?

Disable the feature.

Implementation History

2025-06-03 - KEP was marked as implementable

Drawbacks

Adds a small risk of briefly running two leaders if the lease is released before controllers have fully stopped.

Alternatives

n/a

Future Work (Stories)

This feature enables the user stories below, but they require additional modifications to the kcm and scheduler code that are outside the scope of this KEP.

Graceful Leader Transition

A possible future direction is for leader-elected components to keep running and return to a follower state on lost leadership instead of exiting the process, decoupling “stop leading” from “process exit”. This would be gated separately (e.g. a GracefulLeaderTransition gate) and requires resolving metric re-registration conflicts, health-check deregistration, and resource cleanup on transition.

Story 1

In an HA configuration, cloud provider A wants to balance controllers over multiple control plane instances. With graceful transitions, multiple locks can be used by a single KCM instance such that a subset of components run under each lock.

Story 2

An extension developer has a controller manager that should dynamically start and shut down controllers based on cluster state (such as the registration of resources that declare how a CRD should be reconciled). The extension developer requires that controllers shut down gracefully so that only the controller loops that SHOULD be running continue to run, and that no resources are leaked over time as controllers are started and stopped.
