KEP-5366: Graceful Leader Transition
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Future Work (Stories)
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This proposal outlines a plan to modify the leader election mechanism for key Kubernetes components (kube-scheduler, kube-controller-manager, cloud-controller-manager). The goal is to enable these components to gracefully release the leader lock and transition back to a follower state without a full process restart. This change will be introduced behind a new feature gate.
Motivation
Many high-availability (HA) Kubernetes components, including kube-scheduler and
kube-controller-manager, rely on the leader election library in client-go. The
current implementation mandates that when a component loses its leader lock, it
must shut down immediately. This is typically handled by calling
klog.FlushAndExit() in the OnStoppedLeading callback.
When the leadership is lost, the component shuts down and waits for the kubelet to detect that the component is unhealthy and restart it. This has several significant drawbacks:
- High Overhead: Restarting a component process incurs unnecessary computational overhead and increases latency during a leadership transition
- No Graceful Shutdown: The immediate call to os.Exit() prevents any graceful shutdown or cleanup operations.
Goals
- Allow leader-elected components to transition to a follower state without restarting the process
- Enable graceful handling of leader lock loss
Non-Goals
- This KEP enables work towards graceful shutdown of controllers, but the actual mechanisms for performing graceful shutdown are outside the scope of this KEP.
Proposal
We propose a phased approach to implementing graceful leader transitions. This allows us to incrementally de-risk the change by first ensuring that controllers can safely shut down (Phase 1), then enabling faster failovers (Phase 2), and finally removing the need for process restarts (Phase 3).
Phase 1: Controller Sanitation (Pre-Alpha)
This phase ensures that kube-controller-manager controllers gracefully terminate without leaking
goroutines by strictly enforcing context cancellation within their control loops.
Note: This phase is largely addressed by PR #134910 and PR #134945, which standardize Run termination.
- Objective: Ensure that kcm controller goroutines properly terminate when context is cancelled.
- Mechanism: Refactor controller management to track all spawned goroutines via wg.Go() and wg.Wait().
- Feature Gate: None (Technical debt cleanup).
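The tracking mechanism described above can be sketched with the standard library alone. This is a minimal illustration, not the kube-controller-manager code: `controllerGroup`, `runWorker`, and `allWorkersStop` are hypothetical names, and the sketch uses `wg.Add`/`wg.Done` so that it builds on older toolchains (on Go 1.25 and later the same tracking can be written with `wg.Go()`).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// controllerGroup is a hypothetical helper that tracks every goroutine
// started on behalf of a set of controllers.
type controllerGroup struct {
	wg sync.WaitGroup
}

// Go runs fn in a tracked goroutine. On Go 1.25+ the body could be
// replaced by g.wg.Go(func() { fn(ctx) }).
func (g *controllerGroup) Go(ctx context.Context, fn func(context.Context)) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		fn(ctx)
	}()
}

// Wait blocks until every tracked goroutine has returned.
func (g *controllerGroup) Wait() { g.wg.Wait() }

// runWorker is a stand-in control loop that returns once ctx is cancelled.
func runWorker(ctx context.Context) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// A real controller would reconcile here.
		}
	}
}

// allWorkersStop starts n workers, cancels their context, and reports
// whether all of them returned within two seconds.
func allWorkersStop(n int) bool {
	ctx, cancel := context.WithCancel(context.Background())
	var g controllerGroup
	for i := 0; i < n; i++ {
		g.Go(ctx, runWorker)
	}
	cancel()

	done := make(chan struct{})
	go func() {
		g.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false // at least one goroutine ignored cancellation
	}
}

func main() {
	fmt.Println("all workers stopped:", allWorkersStop(3))
}
```

A worker that never checks `ctx.Done()` would make `allWorkersStop` return false, which is the kind of leak Phase 1 is meant to surface.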
Phase 2: Fast Lease Release
Once we are confident that controllers shut down gracefully (Phase 1), we can optimize the leadership transition. Instead of waiting for the lease TTL to expire, the leader will actively release the lock upon shutdown. Note: kube-scheduler already implements this behavior, so this phase only targets kube-controller-manager.
- Objective: Reduce failover latency.
- Mechanism: Modify client-go/tools/leaderelection to perform an active release of the Lease object (removing the holder identity) when the context is cancelled.
- Feature Gate: ControllerManagerReleaseLeaderElectionLockOnExit
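To illustrate why an active release shortens failover, the sketch below models a lease in memory and compares how long a follower waits with and without the holder identity being cleared. It is a simplified model rather than the client-go implementation: the `lease` type, the holder name, and the 15 second duration and 2 second retry period are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a simplified, in-memory stand-in for a coordination Lease.
type lease struct {
	holder   string
	renewed  time.Time
	duration time.Duration
}

// canAcquire reports whether a candidate may take the lease at time now:
// either nobody holds it (it was released) or it has expired.
func (l *lease) canAcquire(now time.Time) bool {
	return l.holder == "" || now.After(l.renewed.Add(l.duration))
}

// release clears the holder identity if id currently holds the lease.
func (l *lease) release(id string) bool {
	if l.holder != id {
		return false
	}
	l.holder = ""
	return true
}

// failoverDelay returns how long a follower, retrying every 2 seconds,
// waits before it can acquire a lease whose leader stopped at time zero.
func failoverDelay(activeRelease bool) time.Duration {
	start := time.Unix(0, 0)
	l := &lease{holder: "kcm-a", renewed: start, duration: 15 * time.Second}
	if activeRelease {
		l.release("kcm-a")
	}
	for elapsed := time.Duration(0); ; elapsed += 2 * time.Second {
		if l.canAcquire(start.Add(elapsed)) {
			return elapsed
		}
	}
}

func main() {
	fmt.Println("with active release:", failoverDelay(true))
	fmt.Println("waiting for expiry: ", failoverDelay(false))
}
```

In this model the follower acquires the lease on its first retry after a release, while without a release it has to wait until the first retry after the lease duration has elapsed.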
Phase 3: Graceful Transition
The final state where the process does not exit upon losing leadership.
- Objective: Decouple “Stop Leading” from “Process Exit”.
- Mechanism: Refactor the main entrypoint to loop Run() instead of exiting. Identify and handle metric registration conflicts (Prometheus panic on re-registration) and liveness probe interactions.
- Feature Gate: GracefulLeaderTransition
Risks and Mitigations
Kubernetes leader elected controllers and the scheduler have been running without graceful shutdown for years.
Risk 1: Resource exhaustion: Memory leaks may exist in the processes that were previously masked by doing a full shutdown and restart loop.
- Severity: Medium high
- Controllers will continue to function (potentially in degraded state due to lack of resources), and may be restarted frequently. However, cluster should continue to function.
Risk 2: Wedged KCM: There is a risk that controllers and the scheduler are not properly respecting context shutdowns. This can either result in multiple instances of controllers running or no instances running despite the lock being held.
- Severity: High
- Breaking mutual exclusion guarantees can put the cluster into a non-desirable state. A manual user intervention is possible but if the problem is triggered due to a problematic component, the issue will resurface and the best path for mitigation is to turn off the feature.
Risk 3: Futureproofing: An additional risk is that even if all the current code is safe and respects shutting down gracefully, new controllers/modifications to kcm or scheduler could create subtle problems in shutdown and transition.
- Severity: High
- Leads to either risk 1 or 2.
Mitigations:
- Audit and add tests for the existing controllers and the scheduler to ensure proper handling of context shutdowns. See test plan section for more details.
- Graceful shutdown modifications will not be guarded by a feature gate, but the code change to remove the os.Exit() line will be guarded by a feature gate.
- Document the new development best practices for graceful shutdown requirements for modified components that are leader elected.
Design Details
Phase 1 Implementation
All controllers must standardize their startup sequences. When the controller returns and the leader lock is released, all associated goroutines generally must be cancelled.
Phase 2 Implementation
The leader lock will be proactively released when the context is cancelled and the leader prepares to
step down. This release must occur only after all controller goroutines have returned. This behavior will
be guarded by the ControllerManagerReleaseLeaderElectionLockOnExit feature gate.
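The ordering constraint above (cancel, wait for every controller goroutine, then release) can be sketched as follows. This is an illustrative model under stated assumptions: `stepDown` is a hypothetical function, and the "lease released" event stands in for the real update to the Lease object.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// stepDown cancels the controllers' context, waits for all of them to
// return, and only then records the lease release. It returns the
// events in the order they happened.
func stepDown(numControllers int) []string {
	var (
		mu     sync.Mutex
		events []string
		wg     sync.WaitGroup
	)
	record := func(e string) {
		mu.Lock()
		defer mu.Unlock()
		events = append(events, e)
	}

	ctx, cancel := context.WithCancel(context.Background())
	for i := 0; i < numControllers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ctx.Done() // stand-in for a control loop
			record("controller stopped")
		}()
	}

	cancel()  // 1. signal every controller to stop
	wg.Wait() // 2. block until all controller goroutines have returned
	record("lease released") // 3. only now give up the lock
	return events
}

func main() {
	for _, e := range stepDown(3) {
		fmt.Println(e)
	}
}
```

Releasing before `wg.Wait()` returns would let a new leader start while old control loops are still running, which is the mutual exclusion violation described under Risk 2.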
Phase 3 Implementation
The core change involves modifying the OnStoppedLeading callback to prevent forceful exits. We will
wrap the leader election in a wait.Until() loop to retry election upon loss. This mirrors the pattern
used by the Coordinated Leader Election controller (code).
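The retry loop can be sketched without Kubernetes dependencies. The code below is a standard-library stand-in for wrapping the election in wait.Until(), not the actual entrypoint: the `elector` interface, `fakeElector`, and `countTerms` are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// elector is a hypothetical stand-in for a leader elector: Run blocks
// while campaigning or leading and returns when leadership is lost or
// ctx is cancelled.
type elector interface {
	Run(ctx context.Context)
}

// runUntilCancelled re-enters the election each time Run returns,
// instead of exiting the process, until ctx is cancelled.
func runUntilCancelled(ctx context.Context, e elector, retry time.Duration) {
	for {
		e.Run(ctx)
		select {
		case <-ctx.Done():
			return // process shutdown was requested
		case <-time.After(retry):
			// Leadership was lost; campaign again as a follower.
		}
	}
}

// fakeElector loses leadership immediately and requests shutdown after
// maxTerms terms.
type fakeElector struct {
	terms    int
	maxTerms int
	cancel   context.CancelFunc
}

func (f *fakeElector) Run(ctx context.Context) {
	if ctx.Err() != nil {
		return
	}
	f.terms++
	if f.terms >= f.maxTerms {
		f.cancel()
	}
}

// countTerms runs the loop and returns how many terms were served.
func countTerms(maxTerms int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	f := &fakeElector{maxTerms: maxTerms, cancel: cancel}
	runUntilCancelled(ctx, f, time.Millisecond)
	return f.terms
}

func main() {
	fmt.Println("terms served in one process:", countTerms(3))
}
```

The point of the sketch is that a single process serves several leadership terms; the only exit path is cancellation of the outer context.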
The controller-manager sets up controller level health checks in
non-reversible ways and will need to be modified so that handlers can be
deregistered from the mux when leadership is lost. All resources created after a
KCM becomes leader must be released when it loses leadership. This will be done
through context cancellation and cleanup logic. Some additional refactoring may
be needed to clean up processes gracefully when a leader lock is released. To
verify that individual controllers relinquish the control loop, we can add a
ValidatingAdmissionPolicy that warns when a controller that is not the leader
sends a write request to the apiserver, and fails the test. This will help us
identify locations where context cancellations are not respected.
Similarly, the scheduler currently assumes that the process will be terminated when it loses the leader lock. Many scheduler resources are created before the leader election process starts. These will be modified to either defer resource creation or add a reset mechanism that runs when leadership is lost.
Prometheus clients panic on re-registration. We need to see if the metrics can be unregistered or reset on subsequent attempts at initializing the metrics.
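The re-registration problem, and the unregister-on-stop option mentioned above, can be illustrated with a toy registry. This is a model only: `registry` is a hypothetical type that mimics the panic-on-duplicate behavior, the metric name is arbitrary, and whether the real components can unregister their metrics this way is exactly the open question in this section.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a toy stand-in for a metrics registry that panics when
// the same name is registered twice.
type registry struct {
	mu    sync.Mutex
	names map[string]bool
}

func newRegistry() *registry { return &registry{names: map[string]bool{}} }

// MustRegister panics on a duplicate name.
func (r *registry) MustRegister(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names[name] {
		panic("duplicate metric: " + name)
	}
	r.names[name] = true
}

// Unregister removes name and reports whether it was registered.
func (r *registry) Unregister(name string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	existed := r.names[name]
	delete(r.names, name)
	return existed
}

// leadTerm models one leadership term: register on start and, if
// cleanup is set, unregister when the term ends.
func leadTerm(r *registry, cleanup bool) {
	r.MustRegister("example_metric")
	if cleanup {
		defer r.Unregister("example_metric")
	}
	// Controllers would run here until leadership is lost.
}

// survivesTerms reports whether n consecutive terms finish without a
// registration panic.
func survivesTerms(n int, cleanup bool) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false
		}
	}()
	r := newRegistry()
	for i := 0; i < n; i++ {
		leadTerm(r, cleanup)
	}
	return true
}

func main() {
	fmt.Println("with cleanup:   ", survivesTerms(3, true))
	fmt.Println("without cleanup:", survivesTerms(3, false))
}
```

An alternative with the same effect is to register metrics once per process, outside the leadership term, so that regaining leadership never registers again.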
Finally, during the leader-to-follower transition, the /healthz endpoint must correctly reflect the
follower state (healthy but not leading) to prevent Kubelet restarts.
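One way to make health checks reversible is to keep a single /healthz handler registered for the life of the process and swap the leader-only check in and out. The sketch below assumes this approach; the `healthz` type, `SetLeaderCheck`, and the response strings are hypothetical and not the controller-manager's existing health check API.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// healthz stays registered on the mux permanently; only the
// leader-specific check is attached and detached.
type healthz struct {
	mu          sync.RWMutex
	leaderCheck func() error // nil while the process is a follower
}

// SetLeaderCheck installs the check on gaining leadership; passing nil
// removes it on losing leadership.
func (h *healthz) SetLeaderCheck(check func() error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.leaderCheck = check
}

func (h *healthz) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	h.mu.RLock()
	check := h.leaderCheck
	h.mu.RUnlock()

	if check == nil {
		fmt.Fprint(w, "ok: follower") // healthy but not leading
		return
	}
	if err := check(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprint(w, "ok: leader")
}

// probe issues one in-memory request and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

// statusSequence walks through follower, healthy leader, unhealthy
// leader, and follower again, collecting the status codes.
func statusSequence() []int {
	h := &healthz{}
	codes := []int{probe(h)}
	h.SetLeaderCheck(func() error { return nil })
	codes = append(codes, probe(h))
	h.SetLeaderCheck(func() error { return errors.New("controller wedged") })
	codes = append(codes, probe(h))
	h.SetLeaderCheck(nil)
	codes = append(codes, probe(h))
	return codes
}

func main() {
	fmt.Println(statusSequence())
}
```

Because the follower path never consults controller state, a stale check from a previous term cannot cause the kubelet to restart a healthy follower.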
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
To properly test that components can respect context cancellation shutdowns and not leak memory, we will run both manual tests with profiling, and automated testing that transitions the leader lock multiple times.
We will also test that things like caches, reflectors, and go funcs are properly cleared/stopped when the leader lock is lost.
Scenarios:
- Component acquiring the leader lock should start its control loop
- Component releasing leader lock should shut down its control loop
- Component acquiring and releasing the leader lock multiple times should not leak memory
- Component acquiring and releasing the leader lock multiple times should have ONLY ONE control loop running
- If kube-apiserver takes a long time to release client calls, the graceful release will properly wait until the controller has returned
- Releasing a lock will stop the control loop
- Only one control loop should be running at all times
- Components in follower mode should not start control loop or otherwise allocate unnecessary memory
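Two of the scenarios above (a single control loop at any time, and no leaks across repeated transitions) lend themselves to an automated check. The following is a minimal sketch of such a check using a simulated control loop; `cycleLeadership` is a hypothetical helper, and a real test would drive the actual component and its leader lock instead.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
	"time"
)

// cycleLeadership simulates acquiring and releasing leadership `cycles`
// times. It returns the maximum number of control loops observed
// running at once and the number of goroutines left over at the end.
func cycleLeadership(cycles int) (maxLoops, leaked int) {
	before := runtime.NumGoroutine()

	var mu sync.Mutex
	running := 0

	for i := 0; i < cycles; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		started := make(chan struct{})
		var wg sync.WaitGroup

		wg.Add(1)
		go func() { // the control loop for this term
			defer wg.Done()
			mu.Lock()
			running++
			if running > maxLoops {
				maxLoops = running
			}
			mu.Unlock()
			close(started)

			<-ctx.Done() // run until leadership is lost

			mu.Lock()
			running--
			mu.Unlock()
		}()

		<-started // leadership acquired, loop is running
		cancel()  // leadership lost
		wg.Wait() // the loop must return before the next term starts
	}

	// Give exiting goroutines a moment to be reaped before counting.
	for i := 0; i < 100 && runtime.NumGoroutine() > before; i++ {
		time.Sleep(time.Millisecond)
	}
	if n := runtime.NumGoroutine() - before; n > 0 {
		leaked = n
	}
	return maxLoops, leaked
}

func main() {
	maxLoops, leaked := cycleLeadership(50)
	fmt.Println("max concurrent loops:", maxLoops, "leaked goroutines:", leaked)
}
```

Comparing goroutine counts before and after is a coarse signal; memory profiling, as described in the test plan, is still needed to catch leaks that do not involve goroutines.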
Prerequisite testing updates
Unit tests
This is primarily testing for leader election’s interaction with components, and will be tested via integration and e2e tests.
Integration tests
See the above scenarios for test plan. In particular, integration tests will verify that kube-scheduler and kcm shut down properly.
e2e tests
See the above scenarios for test plan.
Graduation Criteria
Alpha 1
- ControllerManagerReleaseLeaderElectionLockOnExit feature gate implemented.
- Phase 1 implemented and controller startup and shutdown logic is handled gracefully.
Alpha 2
- GracefulLeaderTransition feature gate implemented.
- Runtime detection of leaked goroutines.
- Test that controller-manager and scheduler do not leak memory on leadership transitions.
Beta
- e2e tests
- Address how to minimize risks of putting KCM or scheduler in a “wedged” state
GA
- TBD
Upgrade / Downgrade Strategy
No changes.
Version Skew Strategy
This is a control plane change. Skew should not affect the feature.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: GracefulLeaderTransition
- Components depending on the feature gate: kube-scheduler, kube-controller-manager, cloud-controller-manager
- Will enabling / disabling the feature require downtime of the control plane? Yes, components need to be restarted.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No.
Does enabling the feature change any default behavior?
Yes. When the GracefulLeaderTransition feature gate is enabled, leader-elected
components (kube-scheduler, kube-controller-manager, cloud-controller-manager)
will attempt to gracefully release the leader lock and transition to a follower
state without a full process restart. Previously, these components would shut
down immediately upon losing leadership.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature can be disabled by setting the GracefulLeaderTransition
feature gate to false and restarting the affected components (kube-scheduler,
kube-controller-manager, cloud-controller-manager). This will revert to the
previous behavior where components shut down immediately upon losing leadership.
This should not break existing workloads as it restores the prior,
well-understood behavior.
What happens if we reenable the feature if it was previously rolled back?
If the feature is re-enabled after being rolled back, the components will once again use the graceful leader transition mechanism. There are no special considerations for re-enabling.
Are there any tests for feature enablement/disablement?
No.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A rollout could fail if:
- Components (kube-scheduler, kube-controller-manager, cloud-controller-manager) do not correctly handle context cancellation when losing leadership, leading to incomplete shutdown of internal controllers.
- Memory leaks occur in the components because they no longer fully restart on leader transition, which previously masked such leaks.
Impact on workloads:
- If a leader component becomes unstable (e.g., due to memory leaks or improper shutdown), its ability to perform its duties (scheduling, controller management) could be impaired. In an HA setup, another instance should take over leadership, but frequent transitions or instability could degrade overall cluster performance or reliability.
- If a component fails to release resources correctly upon losing leadership, it might lead to resource contention or incorrect behavior. Rollback (disabling the feature gate and restarting components) should revert to the previous stable behavior.
What specific metrics should inform a rollback?
Some things to look at:
- Component restart counts: An increase in restarts for kube-controller-manager or kube-scheduler.
- Log messages indicating errors during leader transition or resource cleanup.
- General cluster health indicators like API server latency, pod scheduling latency, or controller reconciliation errors.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
n/a
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. This feature introduces new behavior behind a feature gate and does not deprecate or remove any existing features, APIs, fields, or flags.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
An operator can determine if the feature is active by inspecting the
command-line flags of the relevant components (kube-scheduler,
kube-controller-manager, cloud-controller-manager) to verify that the
GracefulLeaderTransition feature gate is enabled.
Observing component logs for messages indicating graceful leader release (as opposed to immediate shutdown) would also confirm its use.
How can someone using this feature know that it is working for their instance?
This feature is primarily for cluster operators. Operators can verify its operation by:
- Observing component logs: Logs for kube-scheduler, kube-controller-manager, and cloud-controller-manager should indicate that upon losing leadership, the component attempts a graceful shutdown of its internal loops and returns to a follower state to re-attempt leader election, rather than exiting.
- Monitoring component behavior: Affected components should not restart (i.e., no new PIDs) immediately after losing leadership if the graceful transition is successful. They should continue running and attempt to reacquire leadership.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- No increase to leader transition time with the feature enabled.
- No increase to memory usage with the feature enabled.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Component logs can be used to verify graceful shutdown steps are being executed.
Are there any missing metrics that would be useful to have to improve observability of this feature?
n/a
Dependencies
Does this feature depend on any specific services running in the cluster?
This is a control plane feature and requires the components that the feature runs on (kube-scheduler, kube-controller-manager) to be active, as well as the kube-apiserver and etcd for leader election.
Scalability
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Leader election cannot function without apiserver or etcd.
What are other known failure modes?
- Memory Leak
- Detection: kcm or kube-scheduler memory constantly increasing after leader changes.
- Mitigations: Restart the container or turn off the feature.
- Diagnostics: Looking at memory consumption of KCM and kube-scheduler.
- Testing: Tests will be done manually.
What steps should be taken if SLOs are not being met to determine the problem?
Disable the feature.
Implementation History
- 2025-06-03 - KEP was marked as implementable
Drawbacks
Introduces additional risk of memory leak.
Alternatives
n/a
Future Work (Stories)
This feature enables the user stories below, but they require additional modifications to the kcm and scheduler code that are outside the scope of this KEP.
Story 1
In an HA configuration, cloud provider A wants to balance controllers over multiple control plane instances. With graceful transitions, multiple locks can be used by a single KCM instance such that a subset of components run under each lock.
Story 2
An extension developer has a controller manager that should dynamically start and shut down controllers based on cluster state (such as the registration of resources that declare how a CRD should be reconciled). The extension developer requires that controllers shut down gracefully so that only the controller loops that SHOULD be running continue to run, and that no resources are leaked over time as controllers are started and stopped.