KEP-4355: Coordinated Leader Election

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Future Work
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposes a component leader election mechanism that is safer for upgrades and rollbacks.

This leader election approach continues to use leases, but with two key modifications:

Instead of a race by component instances to claim the lease, component instances declare candidacy for a lease and an election coordinator claims the lease for the best available candidate. This allows the election coordinator to pick the candidate with the lowest version, ensuring that skew rules are not violated.
The election coordinator can mark a lease as “end of term” to signal to the current leader to stop renewing the lease. This allows the election coordinator to preempt the current leader and replace it with a better one.

Motivation

The most common upgrade approach used for Kubernetes control plane components is a node-by-node approach where all the components of a control plane node are terminated together and then restarted at the new version. This process is performed node-by-node across a high availability configuration.

Systems using node-by-node upgrades:

Cluster API
kubeadm
KIND

To respect the Kubernetes skew policy:

Upgrades should keep controller managers and schedulers at the old version until all apiservers are upgraded.
Rollbacks should roll back controller managers and schedulers to the old version before any apiservers are rolled back.

But a node-by-node upgrade or rollback does not achieve this today.

For a 3-node control plane upgrade, there is about a 25% chance of a new version of the controller running while old versions of the apiserver are active, resulting in a skew violation. (Consider the case where the second node to be upgraded holds the lease.)
For rollback, it is almost a certainty that skew will be violated.

There is also the possibility that the lease will be lost by a leader during an upgrade or rollback, resulting in the version of the controller flip-flopping between old and new.

Goals

During HA upgrades/rollbacks/downgrades,

Leader elected components:

Change versions at predictable times
Do not violate version skew, even during node-by-node rollbacks

The control plane:

Can safely canary components and nodes at the new version for an extended period of time, or pause an upgrade at any step. This enhancement, combined with UVIP, helps achieve this.

Non-Goals

Change the default leader election for components.

Proposal

Offer an opt-in leader election mechanism to:
- Elect the candidate with the oldest version available.
- Provide a way to preempt the current leader on the upcoming expiry of the term.
- Reuse the existing lease mechanism as much as possible.
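
The first goal above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical `candidate` type with a single dotted numeric version string, and it breaks ties by name purely to make the result deterministic; none of these names come from the actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// candidate is a hypothetical, simplified stand-in for a lease candidate.
type candidate struct {
	Name    string
	Version string // dotted numeric version such as "1.29" (assumed format)
}

// compareVersions compares two dotted numeric versions part by part.
// It returns -1, 0, or 1. Missing or non-numeric parts count as 0.
func compareVersions(a, b string) int {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// pickOldest returns the candidate with the oldest version.
// Ties are broken by name (an assumption) so the result is deterministic.
func pickOldest(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		cmp := compareVersions(c.Version, best.Version)
		if cmp < 0 || (cmp == 0 && c.Name < best.Name) {
			best = c
		}
	}
	return best, true
}

func main() {
	cands := []candidate{
		{Name: "controller-a", Version: "1.30"},
		{Name: "controller-b", Version: "1.29"},
		{Name: "controller-c", Version: "1.30"},
	}
	if best, ok := pickOldest(cands); ok {
		fmt.Println("elected:", best.Name)
	}
}
```

In a mixed-version control plane, this picks `controller-b`, the only instance still at the older version.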

Component Lease Candidates

Components will create lease candidates similar to those used by apiserver identity. A key difference is that certain fields, such as LeaseTransitions and HolderIdentity, are removed. See the API section for the full API.

e.g.:

apiVersion: coordination.k8s.io/v1
kind: LeaseCandidate
metadata:
  name: some-custom-controller-0001A
  namespace: kube-system
spec:
  leaseName: some-custom-controller
  binaryVersion: "1.29"
  emulationVersion: "1.29"
  leaseDurationSeconds: 300
  renewTime: "2023-12-05T02:33:08.685777Z"

A component “lease candidate” announces candidacy for leadership by specifying spec.leaseName in its lease candidate lease. If the LeaseCandidate object expires, the component is considered unavailable for leader election purposes. “Expires” is defined more clearly in the Renewal Interval section.

Coordinated Election Controller

A new Coordinated Election Controller will reconcile component leader Leases (primary resource) and Lease Candidate Leases (secondary resource, changes trigger reconciliation of related leader leases).

Coordinated Election Controller reconciliation loop:

If no leader lease exists for a component:
- Elect leader from candidates by preparing a freshly renewed Lease with:
  - spec.holderIdentity set to the identity of the elected leader
If there is a better candidate than current leader:
- Sets preferredHolder on the leader Lease to the name of the next leader, signaling that the leader should stop renewing the lease and yield leadership
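
The reconciliation loop above reduces to a small decision function. The sketch below is an illustration under stated assumptions: `leaseState`, `action`, and `decide` are hypothetical names, and the caller is assumed to have already computed whether the lease is expired and which candidate is currently best.

```go
package main

import "fmt"

// leaseState is a hypothetical summary of the leader Lease as seen by the controller.
type leaseState struct {
	Exists          bool
	Expired         bool
	HolderIdentity  string // empty means the lease has no holder
	PreferredHolder string // empty means no preemption has been signaled
}

// action is the step the reconcile loop would take next.
type action string

const (
	actionNone    action = "None"
	actionElect   action = "ElectLeader"
	actionEndTerm action = "EndLeaseTerm"
)

// decide maps the lease state and the best available candidate to an action.
func decide(lease leaseState, best string) action {
	if best == "" {
		return actionNone // no candidate to elect or to prefer
	}
	if !lease.Exists || lease.Expired || lease.HolderIdentity == "" {
		return actionElect // write a freshly renewed Lease for the best candidate
	}
	if best != lease.HolderIdentity && lease.PreferredHolder != best {
		return actionEndTerm // set preferredHolder so the leader yields
	}
	return actionNone
}

func main() {
	held := leaseState{Exists: true, HolderIdentity: "controller-a"}
	fmt.Println(decide(leaseState{}, "controller-a")) // missing lease
	fmt.Println(decide(held, "controller-a"))         // leader is already the best
	fmt.Println(decide(held, "controller-b"))         // better candidate exists
}
```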

flowchart TD
   A[Reconcile] --> |Process Leader Lease| B
   B{Lease Status?} --> |Better Leader Exists| D
   B --> |Expired/Missing| E
   D[End Lease Term]
   E[Elect Leader]

Example of a lease created by Coordinated Election Controller:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  annotations:
  name: some-custom-controller
  namespace: kube-system
spec:
  holderIdentity: controller-a
  leaseDurationSeconds: 10
  leaseTransitions: 0
  renewTime: "2023-12-05T18:58:31.295467Z"

The Coordinated Election Controller will run in the kube-apiserver.

In an HA configuration, the Coordinated Leader Election Controller will have its own lease, similar to how other leader elected controllers behave today. It will be responsible for renewing its own lease and for shutting down gracefully if the lease expires. Only one instance of the coordinated leader election controller will be active at a time, which prevents instances of the coordinated leader election controller from interfering with each other. Unlike in KCM, the coordinated leader election controller must shut down and restart gracefully, because it runs in the kube-apiserver and calling os.Exit() is not an option.

Coordinated Lease Lock

A new controller tools/leaderelection/leasecandidate will be added to client-go that:

Creates LeaseCandidate Lease when ready to be Leader
Renews LeaseCandidate lease infrequently (once every 300 seconds)
Watches its LeaseCandidate lease for the updates to the pingTime field. If the pingTime field is later than renewTime, it signals that the LeaseCandidate should be renewed and the renewTime is subsequently updated.
Watches Leader Lease, waiting to be elected leader by the Coordinated Election Controller
When it becomes leader:
- Perform role of active component instance
- Renew leader lease periodically
- Stop renewing if lease field spec.preferredHolder is non-nil
If leader lease expires:
- Yield leadership and return to acting as a candidate component instance. For certain components, this may involve shutting down and restarting.
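
Two of the checks in the list above can be sketched as small predicates on the candidate side. This is a simplified illustration: `needsRenew`, `shouldKeepRenewing`, and the `ts` helper are hypothetical names, plain `*time.Time` stands in for the API's time type, and treating a missing renewTime as "needs renewal" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// ts is a small helper returning a pointer to the time at the given Unix second.
func ts(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}

// needsRenew reports whether the LeaseCandidate should be renewed because
// pingTime is later than renewTime.
func needsRenew(pingTime, renewTime *time.Time) bool {
	if pingTime == nil {
		return false // the controller has not pinged this candidate
	}
	if renewTime == nil {
		return true // assumption: a candidate that never renewed must respond
	}
	return pingTime.After(*renewTime)
}

// shouldKeepRenewing reports whether the current leader should keep renewing
// the leader lease. A non-nil preferredHolder means it should stop.
func shouldKeepRenewing(preferredHolder *string) bool {
	return preferredHolder == nil
}

func main() {
	fmt.Println(needsRenew(ts(20), ts(10))) // ping is newer than the last renew
	next := "controller-b"
	fmt.Println(shouldKeepRenewing(&next)) // a better holder was announced
}
```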

flowchart TD
    A[Started] -->|Create LeaseCandidate Lease| B
    B[Candidate] --> |Elected| C[Leader]
    C --> |Renew Leader Lease| C
    C -->|Better Candidate Available / Leader Lease Expired| D[Yield Leadership]
    D[Yield Leadership] -.-> |Shutdown/Restart if necessary| A

Renewal Interval and Performance

The leader lease will have a renewal interval of 2s and a duration of 15s. This is similar to the renewal interval of the current leader lease.

For component leases, keeping a short renewal interval will add many unnecessary writes to the apiserver. The component leases renewal interval will default to 5 mins.

When the leader lease is marked as end of term or available, the coordinated leader election controller will update the pingTime field of all component lease candidate objects and wait up to 5 seconds. During that time, components will update their component lease renewTime. The leader election controller will then pick the leader based on its criteria from the set of component leases that have ack’d the request.
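
The ping-and-acknowledge step can be sketched from the controller's side as a filter over candidates. This is an illustrative sketch only: `candidate`, `acked`, `at`, and `ackWindow` are hypothetical names, and it assumes an acknowledgement is a renewTime that falls between pingTime and the end of the 5-second wait.

```go
package main

import (
	"fmt"
	"time"
)

// ackWindow mirrors the "wait up to 5 seconds" described in the text.
const ackWindow = 5 * time.Second

// candidate is a hypothetical, simplified view of a LeaseCandidate.
type candidate struct {
	Name      string
	RenewTime time.Time
}

// at is a small helper that builds a time from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// acked returns the names of candidates that renewed at or after pingTime
// and no later than pingTime+ackWindow.
func acked(cands []candidate, pingTime time.Time) []string {
	deadline := pingTime.Add(ackWindow)
	var names []string
	for _, c := range cands {
		if !c.RenewTime.Before(pingTime) && !c.RenewTime.After(deadline) {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	ping := at(100)
	cands := []candidate{
		{Name: "controller-a", RenewTime: at(102)}, // acked in time
		{Name: "controller-b", RenewTime: at(40)},  // never responded
		{Name: "controller-c", RenewTime: at(110)}, // responded too late
	}
	fmt.Println(acked(cands, ping))
}
```

Only the candidates returned by `acked` would then be passed to the election strategy.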

Strategy

There are cases where a user may want to change the leader election algorithm and this can be done via the spec.Strategy field in a Lease.

The Strategy field signals to the coordinated leader election controller the appropriate algorithm to use when selecting leaders.

We will allow for the existence of a lease without a holder. This will allow Strategy to be injected and preserved for leases that may not want to use the default selected by CLE. If there are no candidate objects, the Strategy field will remain empty to indicate that the Lease is not managed by the CLE controller. Otherwise the strategy will always default to MinimumCompatibilityVersion. The Lease may also be created or updated by a third party to the desired spec.Strategy if an alternate strategy is preferred. This may be done either by the candidates, users, or additional controllers.

Releasing a Lease will involve resetting the holderIdentity to nil instead of deletion. This will preserve Strategy when a Lease object is released and reacquired by another candidate.
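
As a minimal sketch of this release behavior, the snippet below uses a hypothetical, trimmed-down `leaseSpec` with only the two relevant fields. It is meant to show that clearing the holder leaves the strategy intact, not to mirror the real client code.

```go
package main

import "fmt"

// leaseSpec is a hypothetical, trimmed-down version of LeaseSpec.
type leaseSpec struct {
	HolderIdentity *string
	Strategy       *string
}

// release gives up the lease by clearing the holder. The Lease object and
// its Strategy are kept so the next holder inherits the same strategy.
func release(spec *leaseSpec) {
	spec.HolderIdentity = nil
}

func main() {
	holder, strategy := "controller-a", "OldestEmulationVersion"
	spec := leaseSpec{HolderIdentity: &holder, Strategy: &strategy}
	release(&spec)
	fmt.Println("holder cleared:", spec.HolderIdentity == nil)
	fmt.Println("strategy kept:", *spec.Strategy)
}
```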

Alternative for Strategy

Creating a new LeaseConfiguration resource

We can create a new resource LeaseConfiguration to set up the defaults for Strategy and other configurations extensible in the future. This is a very clean approach that allows users to change the strategy at will without needing to recompile/restart anything. The main drawback is the introduction of a new resource and more complexity in leader election logic and watching.

kind: LeaseConfiguration
spec:
  targetLease: "kube-system/kube-controller-manager"
  strategy: "MinimumCompatibilityVersion"

YAML/CLI configuration on the kube-apiserver

We can also populate the default by directly setting up the CLE controller to ingest the proper defaults. For instance, ingesting a YAML configuration in the form of a list of lease:strategy key-value pairs will allow the CLE controller to directly determine the Strategy used for each component. This has the added benefit of requiring no API changes, as it is optional whether to include the strategy in the Lease object.

The drawback of this method is that elevated permissions are needed to configure the kube-apiserver. In addition, an apiserver restart may be needed when the Strategy needs to be changed.

Strategy propagated from LeaseCandidate

One other alternative is that Strategy could be an option specified by a LeaseCandidate object, in most cases by the controller responsible for renewing the LeaseCandidate lease. The value for the strategy should be the same across the LeaseCandidate objects contending for the same Lease, but during mixed version states there is a possibility that they may differ. We will use a consensus protocol that favors the algorithm with the highest priority. The priority is a fixed list that is predetermined. For now, this is NoCoordination > MinimumCompatibilityVersion. For example, if three LeaseCandidate objects exist and two objects select MinimumCompatibilityVersion while the third selects NoCoordination, NoCoordination will take precedence and the coordinated leader election controller will use NoCoordination as the election strategy. The final strategy used will be written to the Lease object when the CLE controller creates the Lease for a suitable leader. This has the benefit of providing better debugging information and allows short-circuiting of an election if the set of candidates and selected strategy is the same as before.

The obvious drawback is the need for a consensus protocol and extra information in the LeaseCandidate object that may be unnecessary.
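
For illustration, the priority rule in this alternative could look like the sketch below. It assumes each candidate contributes exactly one strategy name, and `strategyPriority` and `resolveStrategy` are hypothetical names rather than part of any proposed API.

```go
package main

import "fmt"

// strategyPriority lists strategies from highest to lowest priority,
// following the fixed order NoCoordination > MinimumCompatibilityVersion.
var strategyPriority = []string{"NoCoordination", "MinimumCompatibilityVersion"}

// resolveStrategy returns the highest-priority strategy requested by any
// candidate. It returns false when no known strategy was requested.
func resolveStrategy(requested []string) (string, bool) {
	seen := make(map[string]bool, len(requested))
	for _, s := range requested {
		seen[s] = true
	}
	for _, s := range strategyPriority {
		if seen[s] {
			return s, true
		}
	}
	return "", false
}

func main() {
	// Two candidates pick MinimumCompatibilityVersion, one picks NoCoordination.
	requested := []string{
		"MinimumCompatibilityVersion",
		"MinimumCompatibilityVersion",
		"NoCoordination",
	}
	if s, ok := resolveStrategy(requested); ok {
		fmt.Println("strategy:", s)
	}
}
```

With the three candidates from the example in the text, the sketch resolves to `NoCoordination` even though it is requested by only one of them.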

Priority-based Coordinated Leader Election

To enhance control over leader assignment beyond existing CLE strategies like OldestEmulationVersion, we propose adding an optional Priority field (unset by default, higher value = higher priority) to LeaseCandidateSpec.

This field allows operators to explicitly designate a preferred leader. The CLE system will select the candidate with the highest non-zero Priority. If multiple candidates share the same highest priority, the existing v1.CoordinatedLeaseStrategy will act as a tie-breaker. If no candidates have a priority set, the system defaults to the existing v1.CoordinatedLeaseStrategy.

This provides granular, temporary control without replacing the primary CLE mechanism.

LeaseCandidateSpec Update

A new field called Priority is added to LeaseCandidateSpec:

// LeaseCandidateSpec is a specification of a Lease.
type LeaseCandidateSpec struct {
	// ...
	// Priority is a new field. A higher value means higher priority.
	// The value must be > 0.
	Priority int32 `json:"priority,omitempty" protobuf:"varint,7,opt,name=priority"`
}

Behavior of the Priority Field

Priority Value: The Priority field is an int32. A higher numerical value indicates a higher priority. This field must be greater than 0.
Selection Logic:
- If one or more candidates have a Priority > 0: The candidate with the numerically highest Priority value will be selected as the leader.
- Tie-Breaking for Equal Highest Priority: If multiple candidates share the same highest non-zero Priority value, the selection among these equally prioritized candidates will be resolved using their existing v1.CoordinatedLeaseStrategy (e.g., OldestEmulationVersion).
- If no candidates have a Priority, the leader selection will proceed based purely on the existing v1.CoordinatedLeaseStrategy.
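
The selection logic above can be sketched as follows. This is a simplified illustration: the `candidate` type, the integer `EmulationMinor` field, and the `oldestEmulation` function are hypothetical stand-ins for the real LeaseCandidate and for whatever v1.CoordinatedLeaseStrategy is configured.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified LeaseCandidate.
type candidate struct {
	Name           string
	Priority       int32 // 0 means unset
	EmulationMinor int   // hypothetical stand-in for the emulation version
}

// oldestEmulation stands in for the configured strategy. It picks the lowest
// emulation version and breaks ties by name. It expects a non-empty slice.
func oldestEmulation(cands []candidate) candidate {
	best := cands[0]
	for _, c := range cands[1:] {
		if c.EmulationMinor < best.EmulationMinor ||
			(c.EmulationMinor == best.EmulationMinor && c.Name < best.Name) {
			best = c
		}
	}
	return best
}

// selectLeader applies priority first and uses the strategy as a tie-breaker.
func selectLeader(cands []candidate, strategy func([]candidate) candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	var top int32
	for _, c := range cands {
		if c.Priority > top {
			top = c.Priority
		}
	}
	if top == 0 {
		return strategy(cands), true // no priority set: strategy decides alone
	}
	var tied []candidate
	for _, c := range cands {
		if c.Priority == top {
			tied = append(tied, c)
		}
	}
	return strategy(tied), true
}

func main() {
	cands := []candidate{
		{Name: "C1", Priority: 100, EmulationMinor: 2},
		{Name: "C2", EmulationMinor: 1},
		{Name: "C3", EmulationMinor: 1},
	}
	leader, _ := selectLeader(cands, oldestEmulation)
	fmt.Println("leader:", leader.Name)
}
```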

Scenario Breakdown for Priority-based Coordinated Leader Election

Here is a step-by-step breakdown of the scenarios for better understanding the priority-based leader election during upgrades.

1. Initial State

At the beginning, all components (C1, C2, and C3) are running Binary Version 1 and are emulating Version 1.

Component	Binary Version	Emulation Version	Leader
C1	V1	V1	Y
C2	V1	V1
C3	V1	V1

2. During Upgrade

During the upgrade, C1 and C2 are updated to Binary Version 2, but C3 remains on an earlier version. C2 is momentarily elected as the leader.

Component	Binary Version	Emulation Version	Leader
C1	V2	V2
C2	V2	V1	Y
C3	V2	V1

3. Priority Setting

The cluster administrator chooses C1 to be the leader by setting its priority to 100.

Component	Binary Version	Emulation Version	Priority	Leader
C1	V2	V2	100	Y
C2	V2	V1
C3	V2	V1

4.1. Upgrade Completion

After the upgrade is finished, all components are running Binary Version 2 and are emulating Version 2. C1 remains the leader due to its set priority.

Component	Binary Version	Emulation Version	Priority	Leader
C1	V2	V2	100	Y
C2	V2	V2
C3	V2	V2

4.2. Upgrade Rollback

Should an issue arise with C1 requiring a rollback, we can unset its priority. This will enable CLE to select C2, which contains the oldest emulated version.

Component	Binary Version	Emulation Version	Leader
C1	V2 -> V1	V2 -> V1
C2	V2	V1	Y
C3	V2	V1

5. Priority Persistence

Unless the cluster administrator resets the priority, C1 will always remain the leader. When a component gets upgraded or downgraded, it may create a new lease candidate, causing the priority to reset.

Consideration for Stale Priorities

A concern with the priority field is the potential for “stale priorities” – a priority set temporarily and not subsequently cleared. This could prevent the Coordinated Leader Election (CLE) system from selecting a more appropriate leader. We considered exposing a Time-To-Live (TTL) for priority in the LeaseCandidateSpec, where the CLE system would ignore a priority once its TTL expired. While this directly addresses the “temporary” nature of many priority assignments, we’ve decided not to include it in this initial phase due to several complexities:

Implementation and Semantics: Defining the precise data type and behavior for a TTL (e.g., time.Duration vs. time.Time, resetting logic) adds significant complexity.
User Rationalization: Adding a third field (ttl) to an already multi-faceted leader election logic (strategy + priority) greatly increases the cognitive load for users to understand and manage leader selection effectively.

Therefore, in this initial iteration, managing priority lifecycles will be an operational responsibility, requiring manual clearance or updates. We may revisit TTL or similar automated mechanisms in future iterations after gaining more experience with the priority field.

Enabling on a component

Components with a --leader-elect-resource-lock flag (kube-controller-manager, kube-scheduler) will accept coordinatedleases as a resource lock type.

Migrations

So long as the API server is running a coordinated election controller, it is safe to directly migrate a component from Lease Based Leader Election to Coordinated Leader Election (or vice versa).

During the upgrade, a mix of components will be running both election approaches. When the leader lease expires, there are a couple of possibilities:

A controller instance using Lease-based leader election claims the leader lease
The coordinated election controller picks a leader, from the components that have written LeaseCandidate leases, and claims the lease on the leader’s behalf

Both possibilities have acceptable outcomes during the migration: a component is elected leader, and once elected, remains leader so long as it keeps the lease renewed. The elected leader might not be the leader that Coordinated Leader Election would pick, but this is no worse than how leader election works before the upgrade, and once the upgrade is complete, Coordinated Leader Election works as intended.

There is one thing that could make migrations slightly cleaner: Coordinated Leader Election could add a coordination.k8s.io/elected-by: leader-election-controller annotation to any leases that it claims. It could then check for this annotation and only mark leases as “end-of-term” if that annotation is present. Lease Based Leader Election would ignore “end-of-term” annotations anyway, so this isn’t strictly needed, but it would reduce writes from the coordinated election controller to leases that were claimed by component instances not using Coordinated Leader Election.

API

The lease lock API will be extended with a new field for election preference, denoted as an enum for strategies for Coordinated Leader Election.


// CoordinatedLeaseStrategy defines the strategy for picking the leader for coordinated leader election.
type CoordinatedLeaseStrategy string

const (
  OldestEmulationVersion CoordinatedLeaseStrategy = "OldestEmulationVersion"
)

// LeaseSpec is a specification of a Lease.
type LeaseSpec struct {
	// holderIdentity contains the identity of the holder of a current lease.
	// If Coordinated Leader Election is used, the holder identity must be
	// equal to the elected LeaseCandidate.metadata.name field.
	// +optional
	HolderIdentity *string `json:"holderIdentity,omitempty" protobuf:"bytes,1,opt,name=holderIdentity"`
	// leaseDurationSeconds is a duration that candidates for a lease need
	// to wait to force acquire it. This is measured against the time of last
	// observed renewTime.
	// +optional
	LeaseDurationSeconds *int32 `json:"leaseDurationSeconds,omitempty" protobuf:"varint,2,opt,name=leaseDurationSeconds"`
	// acquireTime is a time when the current lease was acquired.
	// +optional
	AcquireTime *metav1.MicroTime `json:"acquireTime,omitempty" protobuf:"bytes,3,opt,name=acquireTime"`
	// renewTime is a time when the current holder of a lease has last
	// updated the lease.
	// +optional
	RenewTime *metav1.MicroTime `json:"renewTime,omitempty" protobuf:"bytes,4,opt,name=renewTime"`
	// leaseTransitions is the number of transitions of a lease between
	// holders.
	// +optional
	LeaseTransitions *int32 `json:"leaseTransitions,omitempty" protobuf:"varint,5,opt,name=leaseTransitions"`
	// Strategy indicates the strategy for picking the leader for coordinated leader election.
	// If the field is not specified, there is no active coordination for this lease.
	// (Alpha) Using this field requires the CoordinatedLeaderElection feature gate to be enabled.
	// +featureGate=CoordinatedLeaderElection
	// +optional
	Strategy *CoordinatedLeaseStrategy `json:"strategy,omitempty" protobuf:"bytes,6,opt,name=strategy"`
	// PreferredHolder signals to a lease holder that the lease has a
	// more optimal holder and should be given up.
	// This field can only be set if Strategy is also set.
	// +featureGate=CoordinatedLeaderElection
	// +optional
	PreferredHolder *string `json:"preferredHolder,omitempty" protobuf:"bytes,7,opt,name=preferredHolder"`
}

For the LeaseCandidate leases, a new lease type will be created:

// LeaseCandidateSpec is a specification of a Lease.
type LeaseCandidateSpec struct {
	// LeaseName is the name of the lease for which this candidate is contending.
	// This field is immutable.
	// +required
	LeaseName string `json:"leaseName" protobuf:"bytes,1,name=leaseName"`
	// PingTime is the last time that the server has requested the LeaseCandidate
	// to renew. It is only done during leader election to check if any
	// LeaseCandidates have become ineligible. When PingTime is updated, the
	// LeaseCandidate will respond by updating RenewTime.
	// +optional
	PingTime *metav1.MicroTime `json:"pingTime,omitempty" protobuf:"bytes,2,opt,name=pingTime"`
	// RenewTime is the time that the LeaseCandidate was last updated.
	// Any time a Lease needs to do leader election, the PingTime field
	// is updated to signal to the LeaseCandidate that they should update
	// the RenewTime.
	// Old LeaseCandidate objects are also garbage collected if it has been hours
	// since the last renew. The PingTime field is updated regularly to prevent
	// garbage collection for still active LeaseCandidates.
	// +optional
	RenewTime *metav1.MicroTime `json:"renewTime,omitempty" protobuf:"bytes,3,opt,name=renewTime"`
	// BinaryVersion is the binary version. It must be in a semver format without leading `v`.
	// This field is required when strategy is "OldestEmulationVersion"
	// +optional
	BinaryVersion string `json:"binaryVersion,omitempty" protobuf:"bytes,4,opt,name=binaryVersion"`
	// EmulationVersion is the emulation version. It must be in a semver format without leading `v`.
	// EmulationVersion must be less than or equal to BinaryVersion.
	// This field is required when strategy is "OldestEmulationVersion"
	// +optional
	EmulationVersion string `json:"emulationVersion,omitempty" protobuf:"bytes,5,opt,name=emulationVersion"`
	// PreferredStrategies indicates the list of strategies for picking the leader for coordinated leader election.
	// The list is ordered, and the first strategy supersedes all other strategies. The list is used by coordinated
	// leader election to make a decision about the final election strategy. This follows as
	// - If all clients have strategy X as the first element in this list, strategy X will be used.
	// - If a candidate has strategy [X] and another candidate has strategy [Y, X], Y supersedes X and strategy Y
	//   will be used.
	// - If a candidate has strategy [X, Y] and another candidate has strategy [Y, X], this is a user error and leader
	//   election will not operate the Lease until resolved.
	// (Alpha) Using this field requires the CoordinatedLeaderElection feature gate to be enabled.
	// +featureGate=CoordinatedLeaderElection
	// +listType=atomic
	// +required
	PreferredStrategies []v1.CoordinatedLeaseStrategy `json:"preferredStrategies,omitempty" protobuf:"bytes,6,opt,name=preferredStrategies"`
}

Each LeaseCandidate lease may only lead one lock. If the same component wishes to lead many leases, a separate LeaseCandidate lease will be required for each lock.
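
The PreferredStrategies rules described in the LeaseCandidateSpec comment can be sketched as below. This is one possible reading, not the implemented algorithm: `resolve` and its helpers are hypothetical, and the sketch assumes that a strategy wins only if some candidate ranks it ahead of every other first choice and no candidate ranks it behind one. Unrelated first choices (for example [X] and [Y]) are treated as an error here, which the source does not specify.

```go
package main

import (
	"errors"
	"fmt"
)

// indexOf returns the position of s in list, or -1 if it is absent.
func indexOf(list []string, s string) int {
	for i, v := range list {
		if v == s {
			return i
		}
	}
	return -1
}

// before reports whether a appears ahead of b in list (both must be present).
func before(list []string, a, b string) bool {
	ia, ib := indexOf(list, a), indexOf(list, b)
	return ia >= 0 && ib >= 0 && ia < ib
}

// supersedesAll reports whether w outranks every other first choice:
// some list puts w ahead of it, and no list puts it ahead of w.
func supersedesAll(w string, firsts []string, lists [][]string) bool {
	for _, o := range firsts {
		if o == w {
			continue
		}
		wAhead, oAhead := false, false
		for _, l := range lists {
			wAhead = wAhead || before(l, w, o)
			oAhead = oAhead || before(l, o, w)
		}
		if !wAhead || oAhead {
			return false
		}
	}
	return true
}

// resolve picks the election strategy from each candidate's ordered list.
func resolve(lists [][]string) (string, error) {
	var firsts []string
	seen := map[string]bool{}
	for _, l := range lists {
		if len(l) > 0 && !seen[l[0]] {
			seen[l[0]] = true
			firsts = append(firsts, l[0])
		}
	}
	if len(firsts) == 0 {
		return "", errors.New("no preferred strategies")
	}
	if len(firsts) == 1 {
		return firsts[0], nil // every candidate agrees on the first element
	}
	for _, w := range firsts {
		if supersedesAll(w, firsts, lists) {
			return w, nil
		}
	}
	return "", errors.New("conflicting preferred strategies")
}

func main() {
	fmt.Println(resolve([][]string{{"X"}, {"Y", "X"}}))      // Y supersedes X
	fmt.Println(resolve([][]string{{"X", "Y"}, {"Y", "X"}})) // conflict
}
```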

Comparison of leader election

	Lease Based Leader Election	Coordinated Leader Election
Lock Type	Lease	Lease
Claimed by	Component instance	Election Coordinator. (Lease is claimed on behalf of the elected component instance)
Renewed by	Component instance	Component instance
Leader Criteria	First component to claim lease	Best leader from available candidates at time of election
Preemptable	No	Yes, Collaboratively. (Coordinator marks lease’s next `preferredHolder`. Component instance voluntarily stops renewing)

User Stories (Optional)

Story 1

A cluster administrator upgrades a cluster’s control plane node-by-node, expecting version skew to be respected.

When the first and second nodes are upgraded, any components that were leaders will typically lose the lease during the node downtime
- If one happens to retain its lease, it will be preempted by the coordinated election controller after it updates its LeaseCandidate lease with new version information
When the third node is upgraded, all components will be at the new version and one will be elected

Story 2

A cluster administrator rolls back a cluster’s control plane node-by-node, expecting version skew to be respected.

When the first node is rolled back, any components that were leaders will typically lose the lease during the node downtime
Once one of the components updates its LeaseCandidate lease with new version information, the coordinated election controller will preempt the current leader so that this lower version component becomes leader.
When the remaining two nodes are rolled back, the first node will typically remain leader, but if a new election occurs, the available older version components will be elected.

Story 3

A cluster administrator may want more fine-grained control over a control plane’s upgrade.

When one node is upgraded they may wish to canary the components on that node and switch the leader to the new compatibility version immediately.
This can be accomplished by changing the Strategy field in a lease object.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Risk: Amount of writes performed by leader election increases substantially

This enhancement introduces a LeaseCandidate lease for each instance of each component.

Example:

HA cluster with 3 control plane nodes
3 elected components (kube-controller-manager, scheduler, cloud-controller-manager) per control plane node
9 LeaseCandidate leases are created and renewed by the components

Introducing this feature is roughly equivalent to adding the lease load of 9 additional nodes to a Kubernetes cluster.

The API Server Identity enhancement also introduces similar leases. For comparison, in a HA cluster with 3 control plane nodes, API Server Identity adds 3 leases.

This risk can be mitigated by scale testing and, if needed, extending the lease duration and renewal times to reduce writes/s.

Risk: lease candidate watches increase apiserver load substantially

The Unknown Version Interoperability Proxy (UVIP) enhancement also adds lease watches on API Server Identity leases in the kube-system namespace. This enhancement does not touch the number of lease resources being watched, but adds 3 resources being watched for LeaseCandidate per component.

Risk: We have to “start over” and build confidence in a new leader election algorithm

We’ve built confidence in the existing leasing algorithm, through an investment of engineering effort, and in core hours testing it and running it in production.

Changing the algorithm “resets the clock” and forces us to rebuild confidence on the new algorithm.

The goal of this proposal is to minimize this risk by reusing as much of the existing lease algorithm as possible:

Renew leases in exactly the same way as before
Leases can never be claimed by another leader until a lease expires

Risk: How is the election controller elected?

The leader election controller will be selected by the first apiserver that claims the leader election lease lock. This is the same as how kube controller manager and other components are elected today. The leader selected is not deterministic during an update, but we do not see many changes to the leader election controller.

Risk: What if the election controller fails to elect a leader?

Fall back to letting component instances claim the lease directly, but only after a longer delay, so that the coordinated election controller has an opportunity to elect a leader before the fallback is used.

Design Details

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

staging/src/k8s.io/client-go/tools/leaderelection: 76.8
pkg/controller/leaderelection: 77.8

Integration tests

test/integration/apiserver/coordinatedleaderelection/*: New directory

e2e tests

test/e2e/apimachinery/coordinatedleaderelection.go: New file

Graduation Criteria

Alpha

Feature implemented behind a feature flag
The strategy OldestEmulationVersion is implemented

Beta

e2e & integration tests for coordinated leader election on various scenarios
- single leasecandidate
- multiple leasecandidates
- lease is preempted when another more suitable candidate is found
- Components that don’t know about coordination mixed with those who do
- Downgrade to components that do not know about coordination
- Custom third party strategy controller
Lease pings are parallelized
Tests are included for third party strategies
Tests for disablement of the feature gate

GA

Load test Coordinated Leader Election
Feature is enabled by default
A tested solution for stale priorities is implemented, working through either improved user validation to prevent them, or an automated system to correct them.

Upgrade / Downgrade Strategy

Upgrading requires enabling the feature gate CoordinatedLeaderElection and the group version coordination.k8s.io/v1alpha2. Downgrading will revert to the old leader election mechanism, but may have extra data in etcd for LeaseCandidate objects under the coordination.k8s.io/v1alpha2 group version.

Version Skew Strategy

The feature uses leases in a standard way, so if some component instances are configured to use the old direct leases and others are configured to use this enhancement’s coordinated leases, the component instances may still safely share the same lease, and leaders will be safely elected.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: CoordinatedLeaderElection
- Components depending on the feature gate:
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler

Does enabling the feature change any default behavior?

Yes, kube-scheduler and kube-controller-manager will use coordinated leader election instead of the default leader election mechanism if the feature is enabled.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature uses leases in a standard way, so if some components are configured to use direct leases and others are configured to use coordinated leases, elections will still happen. Also, coordinated leader election falls back to direct leasing if the election coordinator does not elect a leader within a reasonable period of time, making it safe to disable this feature in HA clusters.

What happens if we reenable the feature if it was previously rolled back?

This is safe. Leader elections would transition back to coordinated leader elections. Any elected leaders would continue to renew their leases.

Are there any tests for feature enablement/disablement?

Yes, this will be tested, including tests where there is a mix of components with the feature enabled and disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollouts and rollbacks can fail in several ways. During the first rollout of the feature, there will be a mixed state of control planes using and not using coordinated leader election (CLE). Components not using CLE will race to claim the lease, while components using CLE will defer to the CLE controller to assign a leader. We cannot guarantee that the best leader is elected during mixed version states, but a leader will still be elected.

If the CLE controller has bugs, it may fail to select a leader or may select the wrong one, either of which could lead to disruptions.

If LeaseCandidate objects have incorrect version information, the CLE controller may select the wrong leader, potentially leading to version skew violations.

What specific metrics should inform a rollback?

Leases for leader elected components failing to renew would be a signal to roll back.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Integration tests include testing for skew scenarios.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The LeaseCandidate resource and the feature gate CoordinatedLeaderElection will both be enabled. On the Lease object, a new Strategy field will be populated, indicating the strategy used by coordinated leader election to select the most suitable leader.

How can someone using this feature know that it is working for their instance?

LeaseCandidate objects will exist for leader elected components, and the RenewTime and PingTime fields will be recent (within 30 minutes).
Lease objects for leader elected components will be assigned and actively renewing.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

When leader elected components are in the cluster, a leader must be selected in a timely manner and propagated via the Lease object. The lease must be actively renewed.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: apiserver_coordinated_leader_election_leader_changes{name="<component>"}
- Metric name: apiserver_coordinated_leader_election_leader_preemptions{name="<component>"}
- Metric name: apiserver_coordinated_leader_election_failures_total
- Metric name: apiserver_coordinated_leader_election_skew_preventions_total
- Components exposing the metrics: kube-apiserver

Are there any missing metrics that would be useful to have to improve observability of this feature?

n/a.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

Yes.

- API call type: PUT
- Estimated throughput: Steady state is 3 requests per leader elected component every 30 minutes to renew the LeaseCandidate. If there is churn in the control plane, an extra 2N requests are performed on every change per leader elected component, N representing the number of available control planes. The number is 2N because N requests will be sent by the apiserver to ping all candidates, and every request should be acknowledged by the client.
- Watch on LeaseCandidate resources

Will enabling / using this feature result in introducing new API types?

coordination.k8s.io/LeaseCandidate
One candidate will exist for each leader elected component on each control plane instance. The total number of objects is (number of leader elected components) * (number of control plane instances).
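
The scalability estimates in this section reduce to simple arithmetic. The Go sketch below works through them for an example cluster. The function names are hypothetical, and the steady-state figure of 3 requests per component per 30 minutes is taken as stated in the text rather than derived.

```go
package main

import "fmt"

// leaseCandidateCount is the number of LeaseCandidate objects:
// one per leader elected component per control plane instance.
func leaseCandidateCount(components, controlPlanes int) int {
	return components * controlPlanes
}

// churnRequests is the number of extra requests for one control plane change:
// 2N per leader elected component (N pings plus N acknowledgements).
func churnRequests(components, controlPlanes int) int {
	return components * 2 * controlPlanes
}

// steadyStateRequests is the number of renewal requests per 30 minute window,
// using the figure of 3 requests per component quoted in the text.
func steadyStateRequests(components int) int {
	return 3 * components
}

func main() {
	// Example: kube-controller-manager and kube-scheduler on 3 control planes.
	components, controlPlanes := 2, 3
	fmt.Println("LeaseCandidate objects:", leaseCandidateCount(components, controlPlanes)) // 6
	fmt.Println("requests per change:", churnRequests(components, controlPlanes))          // 12
	fmt.Println("requests per 30 min:", steadyStateRequests(components))                   // 6
}
```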

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

An additional Strategy field will be populated on all leases elected by CLE. This is a string enum.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

This is a control plane feature and does not affect nodes.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server becomes unavailable, the CLE controller cannot function, as it is built on top of the API server. It cannot monitor LeaseCandidates, update Leases, or elect new leaders. Existing leaders will continue to function until their Leases expire, but no new leaders will be elected until the API server recovers.

If etcd is unavailable, similar issues arise. The underlying lease mechanism relies on etcd for storage and coordination. Without etcd, Leases cannot be created, renewed, or monitored.

What are other known failure modes?

Leader election controller fails to elect a leader
- Detection: Via metrics apiserver_coordinated_leader_election_failures_total increasing and absence of leader in lease object
- Mitigations: Operators can disable feature gate.
- Diagnostics: Check kube-apiserver logs for messages on failing to elect the leader. Look at the lease object renewal times and holder, along with leasecandidate objects for the particular component.
- Testing: Integration test exists that prevents write access for the CLE controller and ensures that another controller takes over.

What steps should be taken if SLOs are not being met to determine the problem?

Check whether the CLE controller is operating properly, check whether the API server is overloaded, and in the worst case disable the feature by explicitly setting the feature gate to false. This information can be found in the controller and API server logs (kube-apiserver.log). Additionally, looking through the lease and leasecandidate objects will provide insight into whether the leases and candidates are renewing properly.

Implementation History

Drawbacks

Alternatives

When evaluating alternatives, note that if we decide in the future to improve the algorithm, fix a bug in the algorithm, or change the criteria for how leaders are elected, our decision on where to put the code has a huge impact on how the change is rolled out.

For example, it will be much easier to change a controller in the kube-apiserver than client-go library code distributed to elected controllers, because once the code is distributed into controllers, especially 3rd party controllers, any change requires updating client-go and then updating all controllers to that version of client-go.

Similar approaches involving the leader election controller

Running the leader election controller in HA on every apiserver

The apiserver runs very few controllers, and they are not elected, but instead all run concurrently in HA configurations.
This requires the election controller to make careful use of concurrency control primitives to ensure multiple instances collaborate, not fight.

When the Coordinated Leader Election controller runs in the apiserver, it is possible that two instances of the controller will have different views of the candidate list. This happens when one controller has fallen behind on a watch (which can happen for many underlying reasons).

When two controllers have different candidate lists, they might “fight”. One likely way they would fight is:

controller A thinks X is the best leader
controller B thinks Y is the best leader (because it has stale data from a point in time when this was true)
controller A elects X
controller B marks the leader lease as “End of term” since it believes Y should be leader
controller B elects Y as leader
controller A marks the leader lease as “End of term” since it believes X should be leader
…

This can be avoided by tracking, in the lease being reconciled, the resourceVersion or generation numbers of the resources used to make a decision, and authoring the controllers not to write to a lease when their data is stale compared to the already tracked resourceVersion or generation numbers.
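
A minimal Go sketch of this staleness guard follows. It is an illustration under stated assumptions, not the controller's implementation: the leaseRecord and decision types are hypothetical, and a plain monotonically increasing counter stands in for the generation number (real resourceVersion values are opaque strings and should not be compared this way by clients).

```go
package main

import "fmt"

// leaseRecord is a hypothetical stand-in for a Lease plus the bookkeeping
// described above: the newest candidate-list version used for any write.
type leaseRecord struct {
	Holder          string
	ObservedVersion uint64
}

// decision is what one controller instance wants to write, together with
// the version of the candidate list the decision was computed from.
type decision struct {
	Leader         string
	BasedOnVersion uint64
}

// apply writes d to the lease only if d is based on data at least as new
// as the data already recorded on the lease. It reports whether it wrote.
func apply(l *leaseRecord, d decision) bool {
	if d.BasedOnVersion < l.ObservedVersion {
		return false // stale view: do not fight the newer decision
	}
	l.Holder = d.Leader
	l.ObservedVersion = d.BasedOnVersion
	return true
}

func main() {
	l := &leaseRecord{}
	// Controller A has an up-to-date view and elects X.
	fmt.Println(apply(l, decision{Leader: "X", BasedOnVersion: 7})) // true
	// Controller B has fallen behind on its watch and still prefers Y.
	fmt.Println(apply(l, decision{Leader: "Y", BasedOnVersion: 5})) // false
	fmt.Println(l.Holder)                                           // X
}
```

In a real controller the comparison and the write would also need to be atomic, for example through an optimistic-concurrency update of the Lease, so that two instances cannot interleave between the check and the write.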

One drawback to this approach is that updating the leader election controller can cause undefined behavior when multiple instances of the leader election controller are “collaborating”. It is difficult to test and prove edge cases when an update to the leader election controller code is necessary, since the new version could fight with the previous version during a mixed version state.

Running the coordinated leader election controller in KCM

Since the coordinated leader election controller is a controller that is elected, it would also make sense to run in KCM. However, a major drawback is that KCM forcefully shuts down when it loses the leader lock and it is possible that the leader election controller on the same KCM instance is the leader at that time. This causes the coordinated leader election controller to change leaders which could cause disruptions.

Two ways to solve this are to gracefully shut down the KCM, or to fork the process such that the coordinated leader election controller is unaffected. Gracefully shutting down the KCM is difficult as controllers are used to the KCM forcefully shutting them down, and we have no guarantee that third party controllers do not rely on this “feature”. Forking the process causes additional overhead that we’d like to avoid.

Running the coordinated leader election controller in a new container

Instead of running in KCM, the coordinated leader election controller could be run in a new container (e.g., kube-coordinated-leader-election). There will be a slightly larger memory footprint with this approach, and adding a new component to the control plane changes our Kubernetes control plane topology in an undesirable way.

Component instances pick a leader without a coordinator

A candidate is picked at random to be an election coordinator, and the coordinator picks the leader:
- Components race to claim the lease
- If a component claims the lease, the first thing it does is check the lease candidates to see if there is a better leader
- If it finds a better leader, it assigns the lease to that component instead of itself

Pros:

No coordinated election controller

Cons:

All leader elected components must have the code to decide which component is the best leader

Component instances pick a leader without lease candidates or a coordinator

The candidates communicate through the lease to agree on the leader
- Leases have “Election” and “Term” states
- Leases are first created in the “election” state.
- While in the “election” state, candidates self-nominate by updating the lease with their identity and version information. Candidates only need to self nominate if they are a better candidate than candidate information already written to the lease.
- When “Election” timeout expires, the best candidate becomes the leader
- The leader sets the state to “term” and starts renewing the lease
- If the lease expires, it goes back to the “election” state
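
The lease lifecycle in this alternative can be sketched as a small state machine. The Go code below is only an illustration: the lease type and its methods are hypothetical, an integer minor version stands in for the version information, and "better" is assumed to mean lower version and then lower identity, matching the ordering used elsewhere in this KEP.

```go
package main

import "fmt"

type leaseState string

const (
	stateElection leaseState = "election"
	stateTerm     leaseState = "term"
)

// lease is a hypothetical model of the lease described in this alternative.
type lease struct {
	State        leaseState
	Nominee      string
	NomineeMinor int
	Holder       string
}

// newLease creates a lease in the "election" state.
func newLease() *lease { return &lease{State: stateElection} }

// nominate records id as the nominee only while an election is open and
// only if id beats the nominee already written to the lease.
func (l *lease) nominate(id string, minor int) bool {
	if l.State != stateElection {
		return false
	}
	if l.Nominee != "" && (minor > l.NomineeMinor ||
		(minor == l.NomineeMinor && id >= l.Nominee)) {
		return false
	}
	l.Nominee, l.NomineeMinor = id, minor
	return true
}

// closeElection models the election timeout expiring: the best nominee
// becomes the leader and the lease moves to the "term" state.
func (l *lease) closeElection() {
	if l.State == stateElection && l.Nominee != "" {
		l.Holder = l.Nominee
		l.State = stateTerm
	}
}

// expire models the lease lapsing, which reopens the election.
func (l *lease) expire() { *l = lease{State: stateElection} }

func main() {
	l := newLease()
	l.nominate("b", 30)
	l.nominate("a", 29) // better: lower version
	l.nominate("c", 30) // ignored: not better than "a"
	l.closeElection()
	fmt.Println(l.Holder, l.State) // a term
	l.expire()
	fmt.Println(l.State) // election
}
```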

Pros:

No coordinated election controller
No lease candidates

Cons:

Complex election algorithm is distributed as a client-go library. A bug in the algorithm cannot be fixed by only upgrading Kubernetes; all controllers in the ecosystem with the bug must upgrade client-go and release to be fixed.
More difficult to change/customize the criteria for which candidate is best.

Algorithm configurability

We’ve opted for a static fixed algorithm that compares three things in order, continuing down the list of comparisons only if there is a tie.

min(binary version)
min(compatibility version)
min(lease candidate name)
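
The three comparisons above can be sketched in Go as follows. This is an illustrative sketch rather than the controller's code: the candidate and version types are hypothetical simplifications, and versions are modeled as major.minor pairs compared numerically so that 1.9 sorts below 1.10.

```go
package main

import "fmt"

// version is a simplified major.minor pair; full semantic versions
// are not parsed in this sketch.
type version struct {
	Major, Minor int
}

// compare returns -1, 0, or 1 when v is lower than, equal to, or higher than o.
func (v version) compare(o version) int {
	switch {
	case v.Major != o.Major:
		if v.Major < o.Major {
			return -1
		}
		return 1
	case v.Minor != o.Minor:
		if v.Minor < o.Minor {
			return -1
		}
		return 1
	}
	return 0
}

// candidate is a hypothetical, trimmed-down view of a LeaseCandidate.
type candidate struct {
	Name          string
	Binary        version
	Compatibility version
}

// better reports whether a should be preferred over b: lowest binary
// version, then lowest compatibility version, then lowest name.
func better(a, b candidate) bool {
	if c := a.Binary.compare(b.Binary); c != 0 {
		return c < 0
	}
	if c := a.Compatibility.compare(b.Compatibility); c != 0 {
		return c < 0
	}
	return a.Name < b.Name
}

// pickLeader returns the preferred candidate, or false if there are none.
func pickLeader(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	best := cands[0]
	for _, c := range cands[1:] {
		if better(c, best) {
			best = c
		}
	}
	return best, true
}

func main() {
	cands := []candidate{
		{"kcm-c", version{1, 30}, version{1, 30}},
		{"kcm-b", version{1, 29}, version{1, 29}},
		{"kcm-a", version{1, 29}, version{1, 29}},
	}
	if leader, ok := pickLeader(cands); ok {
		fmt.Println(leader.Name) // prints "kcm-a"
	}
}
```

Because the name comparison is a total order over distinct candidate names, the result is deterministic regardless of the order in which candidates are listed.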

The goal of the KEP is to make the leader predictable during a cluster upgrade where leader elected components and apiservers may have mixed versions. This will make all states of a Kubernetes control plane upgrade adhere to the version skew policy.

An alternative is to make the leader election algorithm configurable either via flags or a configuration file.

Future Work

Controller sharding could leverage coordinated leader election to load balance controllers against apiservers.
Optimizations for graceful and performant failover can be built on this enhancement.