KEP-4355: Coordinated Leader Election
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Component Lease Candidates
- Coordinated Election Controller
- Coordinated Lease Lock
- Renewal Interval and Performance
- Strategy
- Priority-based Coordinated Leader Election
- Enabling on a component
- Migrations
- API
- Comparison of leader election
- User Stories (Optional)
- Notes/Constraints/Caveats (Optional)
- Risks and Mitigations
- Risk: Amount of writes performed by leader election increases substantially
- Risk: lease candidate watches increase apiserver load substantially
- Risk: We have to "start over" and build confidence in a new leader election algorithm
- Risk: How is the election controller elected?
- Risk: What if the election controller fails to elect a leader?
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Future Work
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and
SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This proposes a component leader election mechanism that is safer for upgrades and rollbacks.
This leader election approach continues to use leases, but with two key modifications:
- Instead of a race by component instances to claim the lease, component instances declare candidacy for a lease and an election coordinator claims the lease for the best available candidate. This allows the election coordinator to pick the candidate with the lowest version to ensure that skew rules are not violated.
- The election coordinator can mark a lease as “end of term” to signal to the current leader to stop renewing the lease. This allows the election coordinator to preempt the current leader and replace it with a better one.
Motivation
The most common upgrade approach used for Kubernetes control plane components is a node-by-node approach where all the components of a control plane node are terminated together and then restarted at the new version. This process is performed node-by-node across a high availability configuration.
Systems using node-by-node upgrades:
- Cluster API
- kubeadm
- KIND
To respect the Kubernetes skew policy:
- Upgrades should keep controller managers and schedulers at the old version until all apiservers are upgraded.
- Rollbacks should roll back controller managers and schedulers to the old version before any apiservers are rolled back.
But a node-by-node upgrade or rollback does not achieve this today.
- For a 3-node control plane upgrade, there is about a 25% chance of a new version of the controller running while old versions of the apiserver are active, resulting in a skew violation. (Consider the case where the 2nd node upgraded has the lease.)
- For rollback, it is almost a certainty that skew will be violated.
There is also the possibility that the lease will be lost by a leader during an upgrade or rollback, resulting in the version of the controller flip-flopping between old and new.
Goals
During HA upgrades/rollbacks/downgrades,
Leader elected components:
- Change versions at predictable times
- Do not violate version skew, even during node-by-node rollbacks
The control plane:
- Can safely canary components and nodes at the new version for an extended period of time, or pause an upgrade at any step. This enhancement, combined with UVIP, helps achieve this.
Non-Goals
- Change the default leader election for components.
Proposal
- Offer an opt-in leader election mechanism to:
- Elect the candidate with the oldest version available.
- Provide a way to preempt the current leader on the upcoming expiry of the term.
- Reuse the existing lease mechanism as much as possible.
Component Lease Candidates
Components will create lease candidates similar to those used by apiserver
identity. The key difference is that certain fields, such as LeaseTransitions and HolderIdentity, are removed.
See the API section for the full API.
e.g.:
apiVersion: coordination.k8s.io/v1
kind: LeaseCandidate
metadata:
  name: some-custom-controller-0001A
  namespace: kube-system
spec:
  leaseName: some-custom-controller
  binary-version: "1.29"
  compatibility-version: "1.29"
  leaseDurationSeconds: 300
  renewTime: "2023-12-05T02:33:08.685777Z"
A component “lease candidate” announces candidacy for leadership by specifying
spec.leaseName in its lease candidate lease. If the LeaseCandidate object expires, the
component is considered unavailable for leader election purposes. “Expires” is defined more clearly in the Renewal Interval section.
Coordinated Election Controller
A new Coordinated Election Controller will reconcile component leader Leases
(primary resource) and Lease Candidate Leases (secondary resource, changes trigger
reconciliation of related leader leases).
Coordinated Election Controller reconciliation loop:
- If no leader lease exists for a component:
- Elect a leader from the candidates by preparing a freshly renewed Lease with spec.holderIdentity set to the identity of the elected leader
- If there is a better candidate than the current leader:
- Set preferredHolder on the leader Lease to the name of the next leader, signaling that the leader should stop renewing the lease and yield leadership
flowchart TD
A[Reconcile] --> |Process Leader Lease| B
B{Lease Status?} --> |Better Leader Exists| D
B --> |Expired/Missing| E
D[End Lease Term]
E[Elect Leader]
Example of a lease created by the Coordinated Election Controller:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  annotations:
  name: some-custom-controller
  namespace: kube-system
spec:
  holderIdentity: controller-a
  leaseDurationSeconds: 10
  leaseTransitions: 0
  renewTime: "2023-12-05T18:58:31.295467Z"
The Coordinated Election Controller will run in the kube-apiserver.
In an HA configuration, the Coordinated Leader Election Controller will have its
own lease, similar to how other leader elected controllers behave today. It will
be responsible for renewing its own lease and gracefully shutting down if the lease
expires. Only one instance of the coordinated leader election controller will
be active at a time, which prevents instances of the coordinated leader
election controller from interfering with each other. Unlike in KCM, the
coordinated leader election controller must gracefully shut down and restart, as
it will be running in the kube-apiserver and calling os.Exit() is not an
option.
Coordinated Lease Lock
A new controller tools/leaderelection/leasecandidate will be added to client-go that:
- Creates LeaseCandidate Lease when ready to be Leader
- Renews LeaseCandidate lease infrequently (once every 300 seconds)
- Watches its LeaseCandidate lease for updates to the pingTime field. If the pingTime field is later than renewTime, it signals that the LeaseCandidate should be renewed, and renewTime is subsequently updated.
- Watches Leader Lease, waiting to be elected leader by the Coordinated Election Controller
- When it becomes leader:
- Perform role of active component instance
- Renew leader lease periodically
- Stop renewing if lease field spec.preferredHolder is non nil
- If leader lease expires:
- Yield leadership and return to acting as a candidate component instance. For certain components, this may involve shutting down and restarting.
flowchart TD
A[Started] -->|Create LeaseCandidate Lease| B
B[Candidate] --> |Elected| C[Leader]
C --> |Renew Leader Lease| C
C -->|Better Candidate Available / Leader Lease Expired| D[Yield Leadership]
D[Yield Leadership] -.-> |Shutdown/Restart if necessary| A
Renewal Interval and Performance
The leader lease will have a renewal interval of 2s and a duration of 15s. This is similar to the renewal interval of the current leader lease.
For component leases, keeping a short renewal interval will add many unnecessary writes to the apiserver. The component leases renewal interval will default to 5 mins.
When the leader lease is marked as end of term or available, the coordinated
leader election controller will update the pingTime field of all component
lease candidate objects and wait up to 5 seconds. During that time, components
will update their component lease renewTime. The leader election controller
will then pick the leader based on its criteria from the set of component leases
that have ack’d the request.
Strategy
There are cases where a user may want to change the leader election algorithm
and this can be done via the spec.Strategy field in a Lease.
The Strategy field signals to the coordinated leader election controller the
appropriate algorithm to use when selecting leaders.
We will allow for the existence of a lease without a holder. This will allow
Strategy to be injected and preserved for leases that may not want to use the
default selected by CLE. If there are no candidate objects, the Strategy field
will remain empty to indicate that the Lease is not managed by the CLE
controller. Otherwise the strategy will always default to
MinimumCompatibilityVersion. The Lease may also be created or updated by a
third party to the desired spec.Strategy if an alternate strategy is
preferred. This may be done either by the candidates, users, or additional
controllers.
Releasing a Lease will involve resetting the holderIdentity to nil instead
of deletion. This will preserve Strategy when a Lease object is released and
reacquired by another candidate.
Alternative for Strategy
Creating a new LeaseConfiguration resource
We can create a new resource LeaseConfiguration to set up the defaults for
Strategy and other configurations extensible in the future. This is a very
clean approach that allows users to change the strategy at will without needing
to recompile/restart anything. The main drawback is the introduction of a new
resource and more complexity in leader election logic and watching.
kind: LeaseConfiguration
spec:
  targetLease: "kube-system/kube-controller-manager"
  strategy: "MinimumCompatibilityVersion"
YAML/CLI configuration on the kube-apiserver
We can also populate the default by directly setting up the CLE controller to ingest the proper defaults.
For instance, ingesting a YAML configuration in the form of a list of KV pairs of lease:strategy pairs will allow the CLE controller to directly determine the Strategy used for each component. This has the added benefit of requiring no API changes as it is optional whether to include the strategy in the Lease object.
The drawback of this method is that elevated permissions are needed to configure the kube-apiserver. In addition, an apiserver restart may be needed when the Strategy needs to be changed.
Strategy propagated from LeaseCandidate
One other alternative is that Strategy could be an option specified by a
LeaseCandidate object, in most cases by the controller responsible for renewing the
LeaseCandidate lease. The value for the strategy between different
LeaseCandidate objects leading the same Lease should be the same, but during
mixed version states, there is a possibility that they may differ. We will use a
consensus protocol that favors the algorithm with the highest priority. The
priority is a fixed list that is predetermined. For now, this is
NoCoordination > MinimumCompatibilityVersion. For example, if three
LeaseCandidate objects exist and two objects select
MinimumCompatibilityVersion while the third selects NoCoordination,
NoCoordination will take precedence and the coordinated leader election
controller will use NoCoordination as the election strategy. The final
strategy used will be written to the Lease object when the CLE controller
creates the Lease for a suitable leader. This has the benefit of providing
better debugging information and allows short circuiting of an election if the
set of candidates and selected strategy is the same as before.
The obvious drawback is the need for a consensus protocol and extra information
in the LeaseCandidate object that may be unnecessary.
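The priority-list consensus described in this alternative can be sketched as follows. This is only an illustration: the `resolveStrategy` function is hypothetical, and the list contains just the two strategy names the text mentions.

```go
package main

import "fmt"

// strategyPriority is the fixed, predetermined order from the text:
// NoCoordination > MinimumCompatibilityVersion. Earlier entries win.
var strategyPriority = []string{"NoCoordination", "MinimumCompatibilityVersion"}

// resolveStrategy returns the highest-priority strategy selected by any
// candidate. It returns false if no candidate selected a known strategy.
func resolveStrategy(selected []string) (string, bool) {
	for _, s := range strategyPriority {
		for _, sel := range selected {
			if sel == s {
				return s, true
			}
		}
	}
	return "", false
}

func main() {
	// Two candidates select MinimumCompatibilityVersion, one selects NoCoordination.
	selected := []string{
		"MinimumCompatibilityVersion",
		"MinimumCompatibilityVersion",
		"NoCoordination",
	}
	strategy, _ := resolveStrategy(selected)
	fmt.Println(strategy) // NoCoordination
}
```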
Priority-based Coordinated Leader Election
To enhance control over leader assignment beyond existing CLE strategies like OldestEmulationVersion, we propose adding an optional Priority field (unset by default; higher value = higher priority) to LeaseCandidateSpec.
This field allows operators to explicitly designate a preferred leader. The CLE system will select the candidate with the highest non-zero Priority. If multiple candidates share the same highest priority, the existing v1.CoordinatedLeaseStrategy will act as a tie-breaker. If no candidates have a priority set, the system defaults to the existing v1.CoordinatedLeaseStrategy.
This provides granular, temporary control without replacing the primary CLE mechanism.
LeaseCandidateSpec Update
A new field called Priority is added to LeaseCandidateSpec:
// LeaseCandidateSpec is a specification of a LeaseCandidate.
type LeaseCandidateSpec struct {
	// ...

	// Priority is a new field. A higher value means a higher priority.
	// The value must be > 0.
	Priority int32 `json:"priority,omitempty" protobuf:"varint,7,opt,name=priority"`
}
Behavior of the Priority Field
- Priority Value: The Priority field is an int32. A higher numerical value indicates a higher priority. When set, this field must be greater than 0.
- Selection Logic:
- If one or more candidates have a Priority > 0: The candidate with the numerically highest Priority value will be selected as the leader.
- Tie-Breaking for Equal Highest Priority: If multiple candidates share the same highest non-zero Priority value, the selection among these equally prioritized candidates will be resolved using their existing v1.CoordinatedLeaseStrategy (e.g., OldestEmulationVersion).
- If no candidates have a Priority, the leader selection will proceed based purely on the existing v1.CoordinatedLeaseStrategy.
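The selection logic above can be sketched as follows. This is a minimal sketch, not the proposed implementation: the `Candidate` type and `pickLeader` function are hypothetical, the emulation version is reduced to an integer, the strategy is assumed to be OldestEmulationVersion, and the final tie-break by name is an assumption added for determinism.

```go
package main

import "fmt"

// Candidate is a hypothetical, simplified view of a LeaseCandidate.
type Candidate struct {
	Name             string
	Priority         int32 // 0 means unset
	EmulationVersion int   // simplified to a single integer
}

// older reports whether a should be preferred over b under the
// OldestEmulationVersion strategy. The name tie-break is an assumption.
func older(a, b Candidate) bool {
	if a.EmulationVersion != b.EmulationVersion {
		return a.EmulationVersion < b.EmulationVersion
	}
	return a.Name < b.Name
}

// pickLeader selects the highest-priority candidate and falls back to the
// strategy when priorities tie or when no candidate has a priority set.
func pickLeader(cands []Candidate) (Candidate, bool) {
	// Highest priority among candidates; stays 0 when none is set.
	var maxPriority int32
	for _, c := range cands {
		if c.Priority > maxPriority {
			maxPriority = c.Priority
		}
	}
	var leader Candidate
	found := false
	for _, c := range cands {
		// When any priority is set, only top-priority candidates compete.
		if maxPriority > 0 && c.Priority != maxPriority {
			continue
		}
		if !found || older(c, leader) {
			leader, found = c, true
		}
	}
	return leader, found
}

func main() {
	cands := []Candidate{
		{Name: "C1", Priority: 100, EmulationVersion: 2},
		{Name: "C2", EmulationVersion: 1},
		{Name: "C3", EmulationVersion: 1},
	}
	leader, _ := pickLeader(cands)
	fmt.Println(leader.Name) // C1
}
```

With the priority on C1 cleared, the same function returns C2, since only the strategy remains to decide between the candidates.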
Scenario Breakdown for priority-based coordinated leader election
Here is a step-by-step breakdown of the scenarios for better understanding the priority-based leader election during upgrades.
1. Initial State
At the beginning, all components (C1, C2, and C3) are running Binary Version 1 and are emulating Version 1.
| Component | Binary Version | Emulation Version | Leader |
|---|---|---|---|
| C1 | V1 | V1 | Y |
| C2 | V1 | V1 | |
| C3 | V1 | V1 | |
2. During Upgrade
During the upgrade, C1 and C2 are updated to Binary Version 2, but C3 remains on an earlier version. C2 is momentarily elected as the leader.
| Component | Binary Version | Emulation Version | Leader |
|---|---|---|---|
| C1 | V2 | V2 | |
| C2 | V2 | V1 | Y |
| C3 | V2 | V1 | |
3. Priority Setting
The cluster administrator chooses C1 to be the leader by setting its priority to 100.
| Component | Binary Version | Emulation Version | Priority | Leader |
|---|---|---|---|---|
| C1 | V2 | V2 | 100 | Y |
| C2 | V2 | V1 | | |
| C3 | V2 | V1 | | |
4.1. Upgrade Completion
After the upgrade is finished, all components are running Binary Version 2 and are emulating Version 2. C1 remains the leader due to its set priority.
| Component | Binary Version | Emulation Version | Priority | Leader |
|---|---|---|---|---|
| C1 | V2 | V2 | 100 | Y |
| C2 | V2 | V2 | | |
| C3 | V2 | V2 | | |
4.2 Update rollback
Should an issue arise with C1 requiring a rollback, we can unset its priority. This will enable CLE to select C2, which has the oldest emulation version.
| Component | Binary Version | Emulation Version | Priority | Leader |
|---|---|---|---|---|
| C1 | V2 -> V1 | V2 -> V1 | | |
| C2 | V2 | V1 | | Y |
| C3 | V2 | V1 | | |
5. Priority Persistence
Unless the cluster administrator resets the priority, C1 will always remain the leader. When a component gets upgraded or downgraded, it may create a new lease candidate, causing the priority to reset.
Consideration for Stale Priorities
A concern with the priority field is the potential for “stale priorities” – a priority set temporarily and not subsequently cleared. This could prevent the Coordinated Leader Election (CLE) system from selecting a more appropriate leader.
We considered exposing a Time-To-Live (TTL) for priority in the LeaseCandidateSpec, where the CLE system would ignore a priority once its TTL expired. While this directly addresses the “temporary” nature of many priority assignments, we’ve decided not to include it in this initial phase due to several complexities:
- Implementation and Semantics: Defining the precise data type and behavior for a TTL (e.g., time.Duration vs. time.Time, resetting logic) adds significant complexity.
- User Rationalization: Adding a third field (ttl) to an already multi-faceted leader election logic (strategy + priority) greatly increases the cognitive load for users to understand and manage leader selection effectively.
Therefore, in this initial iteration, managing priority lifecycles will be an operational responsibility, requiring manual clearance or updates. We may revisit TTL or similar automated mechanisms in future iterations after gaining more experience with the priority field.
Enabling on a component
Components with a --leader-elect-resource-lock flag (kube-controller-manager,
kube-scheduler) will accept coordinatedleases as a resource lock type.
Migrations
So long as the API server is running a coordinated election controller, it is safe to directly migrate a component from Lease Based Leader Election to Coordinated Leader Election (or vice versa).
During the upgrade, a mix of components will be running both election approaches. When the leader lease expires, there are a couple of possibilities:
- A controller instance using Lease-based leader election claims the leader lease
- The coordinated election controller picks a leader, from the components that have written LeaseCandidate leases, and claims the lease on the leader’s behalf
Both possibilities have acceptable outcomes during the migration: a component is elected leader, and once elected, remains leader so long as it keeps the lease renewed. The elected leader might not be the leader that Coordinated Leader Election would pick, but this is no worse than how leader election works before the upgrade, and once the upgrade is complete, Coordinated Leader Election works as intended.
There is one thing that could make migrations slightly cleaner: Coordinated
Leader Election could add a coordination.k8s.io/elected-by: leader-election-controller annotation to any leases that it claims. It could then
check for this annotation and only mark leases as “end-of-term” if that
annotation is present. Lease Based Leader Election would ignore “end-of-term”
annotations anyway, so this isn’t strictly needed, but it would reduce writes
from the coordinated election controller to leases that were claimed by
component instances not using Coordinated Leader Election.
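The annotation check suggested above could look roughly like this. It is a sketch of the idea only: the helper names `withElectedBy` and `mayMarkEndOfTerm` are hypothetical, and only the annotation key and value come from the text.

```go
package main

import "fmt"

const (
	electedByKey   = "coordination.k8s.io/elected-by"
	electedByValue = "leader-election-controller"
)

// withElectedBy returns a copy of the annotations with the elected-by
// marker added, as the election controller would do when claiming a lease.
func withElectedBy(annotations map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+1)
	for k, v := range annotations {
		out[k] = v
	}
	out[electedByKey] = electedByValue
	return out
}

// mayMarkEndOfTerm reports whether the lease was claimed by the election
// controller, so that marking it end-of-term is worth a write.
func mayMarkEndOfTerm(annotations map[string]string) bool {
	return annotations[electedByKey] == electedByValue
}

func main() {
	claimedDirectly := map[string]string{}
	claimedByController := withElectedBy(claimedDirectly)
	fmt.Println(mayMarkEndOfTerm(claimedDirectly))     // false
	fmt.Println(mayMarkEndOfTerm(claimedByController)) // true
}
```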
API
The lease lock API will be extended with a new field for election preference, denoted as an enum for strategies for Coordinated Leader Election.
// CoordinatedLeaseStrategy defines the strategy for picking the leader for coordinated leader election.
type CoordinatedLeaseStrategy string

const (
	OldestEmulationVersion CoordinatedLeaseStrategy = "OldestEmulationVersion"
)
// LeaseSpec is a specification of a Lease.
type LeaseSpec struct {
	// holderIdentity contains the identity of the holder of a current lease.
	// If Coordinated Leader Election is used, the holder identity must be
	// equal to the elected LeaseCandidate.metadata.name field.
	// +optional
	HolderIdentity *string `json:"holderIdentity,omitempty" protobuf:"bytes,1,opt,name=holderIdentity"`
	// leaseDurationSeconds is a duration that candidates for a lease need
	// to wait to force acquire it. This is measured against the time of last
	// observed renewTime.
	// +optional
	LeaseDurationSeconds *int32 `json:"leaseDurationSeconds,omitempty" protobuf:"varint,2,opt,name=leaseDurationSeconds"`
	// acquireTime is a time when the current lease was acquired.
	// +optional
	AcquireTime *metav1.MicroTime `json:"acquireTime,omitempty" protobuf:"bytes,3,opt,name=acquireTime"`
	// renewTime is a time when the current holder of a lease has last
	// updated the lease.
	// +optional
	RenewTime *metav1.MicroTime `json:"renewTime,omitempty" protobuf:"bytes,4,opt,name=renewTime"`
	// leaseTransitions is the number of transitions of a lease between
	// holders.
	// +optional
	LeaseTransitions *int32 `json:"leaseTransitions,omitempty" protobuf:"varint,5,opt,name=leaseTransitions"`
	// Strategy indicates the strategy for picking the leader for coordinated leader election.
	// If the field is not specified, there is no active coordination for this lease.
	// (Alpha) Using this field requires the CoordinatedLeaderElection feature gate to be enabled.
	// +featureGate=CoordinatedLeaderElection
	// +optional
	Strategy *CoordinatedLeaseStrategy `json:"strategy,omitempty" protobuf:"bytes,6,opt,name=strategy"`
	// PreferredHolder signals to a lease holder that the lease has a
	// more optimal holder and should be given up.
	// This field can only be set if Strategy is also set.
	// +featureGate=CoordinatedLeaderElection
	// +optional
	PreferredHolder *string `json:"preferredHolder,omitempty" protobuf:"bytes,7,opt,name=preferredHolder"`
}
For the LeaseCandidate leases, a new lease type will be created:
// LeaseCandidateSpec is a specification of a LeaseCandidate.
type LeaseCandidateSpec struct {
	// LeaseName is the name of the lease for which this candidate is contending.
	// This field is immutable.
	// +required
	LeaseName string `json:"leaseName" protobuf:"bytes,1,name=leaseName"`
	// PingTime is the last time that the server has requested the LeaseCandidate
	// to renew. It is only done during leader election to check if any
	// LeaseCandidates have become ineligible. When PingTime is updated, the
	// LeaseCandidate will respond by updating RenewTime.
	// +optional
	PingTime *metav1.MicroTime `json:"pingTime,omitempty" protobuf:"bytes,2,opt,name=pingTime"`
	// RenewTime is the time that the LeaseCandidate was last updated.
	// Any time a Lease needs to do leader election, the PingTime field
	// is updated to signal to the LeaseCandidate that they should update
	// the RenewTime.
	// Old LeaseCandidate objects are also garbage collected if it has been hours
	// since the last renew. The PingTime field is updated regularly to prevent
	// garbage collection for still active LeaseCandidates.
	// +optional
	RenewTime *metav1.MicroTime `json:"renewTime,omitempty" protobuf:"bytes,3,opt,name=renewTime"`
	// BinaryVersion is the binary version. It must be in a semver format without leading `v`.
	// This field is required when strategy is "OldestEmulationVersion"
	// +optional
	BinaryVersion string `json:"binaryVersion,omitempty" protobuf:"bytes,4,opt,name=binaryVersion"`
	// EmulationVersion is the emulation version. It must be in a semver format without leading `v`.
	// EmulationVersion must be less than or equal to BinaryVersion.
	// This field is required when strategy is "OldestEmulationVersion"
	// +optional
	EmulationVersion string `json:"emulationVersion,omitempty" protobuf:"bytes,5,opt,name=emulationVersion"`
	// PreferredStrategies indicates the list of strategies for picking the leader for coordinated leader election.
	// The list is ordered, and the first strategy supersedes all other strategies. The list is used by coordinated
	// leader election to make a decision about the final election strategy. This follows as
	// - If all clients have strategy X as the first element in this list, strategy X will be used.
	// - If a candidate has strategy [X] and another candidate has strategy [Y, X], Y supersedes X and strategy Y
	//   will be used.
	// - If a candidate has strategy [X, Y] and another candidate has strategy [Y, X], this is a user error and leader
	//   election will not operate the Lease until resolved.
	// (Alpha) Using this field requires the CoordinatedLeaderElection feature gate to be enabled.
	// +featureGate=CoordinatedLeaderElection
	// +listType=atomic
	// +required
	PreferredStrategies []v1.CoordinatedLeaseStrategy `json:"preferredStrategies,omitempty" protobuf:"bytes,6,opt,name=preferredStrategies"`
}
Each LeaseCandidate lease may only lead one lock. If the same component wishes to lead many leases, a separate LeaseCandidate lease will be required for each lock.
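The three PreferredStrategies rules in the field comment above can be sketched as a small resolver. This is one possible reading, not the implementation: `resolvePreferred` is a hypothetical name, `X` and `Y` are the placeholder strategy names from the comment, and treating two unrelated first choices (for example [X] and [Y]) as an error is an assumption, since the comment does not cover that case.

```go
package main

import "fmt"

// resolvePreferred picks the strategy that is first in at least one list
// and is not superseded (listed after another strategy) in any list.
func resolvePreferred(lists [][]string) (string, error) {
	// A strategy at index > 0 is superseded by the entries before it.
	superseded := map[string]bool{}
	for _, l := range lists {
		for i := 1; i < len(l); i++ {
			superseded[l[i]] = true
		}
	}
	winner, count := "", 0
	seen := map[string]bool{}
	for _, l := range lists {
		if len(l) == 0 || superseded[l[0]] || seen[l[0]] {
			continue
		}
		seen[l[0]] = true
		winner = l[0]
		count++
	}
	if count != 1 {
		// count == 0 covers conflicts such as [X, Y] versus [Y, X].
		// count > 1 covers unrelated first choices (assumed to be an error).
		return "", fmt.Errorf("cannot resolve strategy: %d unsuperseded first choices", count)
	}
	return winner, nil
}

func main() {
	strategy, err := resolvePreferred([][]string{{"X"}, {"Y", "X"}})
	fmt.Println(strategy, err) // Y <nil>

	_, err = resolvePreferred([][]string{{"X", "Y"}, {"Y", "X"}})
	fmt.Println(err != nil) // true
}
```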
Comparison of leader election
| | Lease Based Leader Election | Coordinated Leader Election |
|---|---|---|
| Lock Type | Lease | Lease |
| Claimed by | Component instance | Election Coordinator. (Lease is claimed on behalf of the elected component instance) |
| Renewed by | Component instance | Component instance |
| Leader Criteria | First component to claim lease | Best leader from available candidates at time of election |
| Preemptable | No | Yes, collaboratively. (Coordinator marks lease’s next preferredHolder. Component instance voluntarily stops renewing) |
User Stories (Optional)
Story 1
A cluster administrator upgrades a cluster’s control plane node-by-node, expecting version skew to be respected.
- When the first and second nodes are upgraded, any components that were leaders
will typically lose the lease during the node downtime
- If one happens to retain its lease, it will be preempted by the coordinated election controller after it updates its LeaseCandidate lease with new version information
- When the third node is upgraded, all components will be at the new version and one will be elected
Story 2
A cluster administrator rolls back a cluster’s control plane node-by-node, expecting version skew to be respected.
- When the first node is rolled back, any components that were leaders will typically lose the lease during the node downtime
- Once one of the components updates its LeaseCandidate lease with new version information, the coordinated election controller will preempt the current leader so that this lower version component becomes leader.
- When the remaining two nodes are rolled back, the first node will typically remain leader, but if a new election occurs, the available older version components will be elected.
Story 3
A cluster administrator may want more fine-grained control over a control plane’s upgrade.
- When one node is upgraded, they may wish to canary the components on that node and switch the leader to the new compatibility version immediately.
- This can be accomplished by changing the Strategy field in a lease object.
Notes/Constraints/Caveats (Optional)
Risks and Mitigations
Risk: Amount of writes performed by leader election increases substantially
This enhancement introduces a LeaseCandidate lease for each instance of each component.
Example:
- HA cluster with 3 control plane nodes
- 3 elected components (kube-controller-manager, scheduler, cloud-controller-manager) per control plane node
- 9 LeaseCandidate leases are created and renewed by the components
Introducing this feature is roughly equivalent to adding the same lease load as adding 9 nodes to a Kubernetes cluster.
The API Server Identity enhancement also introduces similar leases. For comparison, in a HA cluster with 3 control plane nodes, API Server Identity adds 3 leases.
This risk can be mitigated by scale testing and, if needed, extending the lease duration and renewal times to reduce writes/s.
Risk: lease candidate watches increase apiserver load substantially
The Unknown Version Interoperability Proxy (UVIP) enhancement also adds lease watches on API Server Identity leases in the kube-system namespace. This enhancement does not touch the number of lease resources being watched, but adds 3 resources being watched for LeaseCandidate per component.
Risk: We have to “start over” and build confidence in a new leader election algorithm
We’ve built confidence in the existing leasing algorithm, through an investment of engineering effort, and in core hours testing it and running it in production.
Changing the algorithm “resets the clock” and forces us to rebuild confidence on the new algorithm.
The goal of this proposal is to minimize this risk by reusing as much of the existing lease algorithm as possible:
- Renew leases in exactly the same way as before
- Leases can never be claimed by another leader until a lease expires
Risk: How is the election controller elected?
The leader election controller will run in the first apiserver that claims the leader election lease lock. This is the same as how kube controller manager and other components are elected today. The leader selected is not deterministic during an update, but we do not expect many changes to the leader election controller.
Risk: What if the election controller fails to elect a leader?
Fall back to letting component instances claim the lease directly, after a longer delay, to give the coordinated election controller an opportunity to elect before resorting to the fallback.
Design Details
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
staging/src/k8s.io/client-go/tools/leaderelection: 76.8
pkg/controller/leaderelection: 77.8
Integration tests
test/integration/apiserver/coordinatedleaderelection/*: New directory
e2e tests
test/e2e/apimachinery/coordinatedleaderelection.go: New file
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- The strategy OldestEmulationVersion is implemented
Beta
- e2e & integration tests for coordinated leader election on various scenarios
- single leasecandidate
- multiple leasecandidates
- lease is preempted when another more suitable candidate is found
- Components that don’t know about coordination mixed with those who do
- Downgrade to components that do not know about coordination
- Custom third party strategy controller
- Lease pings are parallelized
- Tests are included for third party strategies
- Tests for disablement of the feature gate
GA
- Load test Coordinated Leader Election
- Feature is enabled by default
- A tested solution for stale priorities is implemented, working through either improved user validation to prevent them, or an automated system to correct them.
Upgrade / Downgrade Strategy
Upgrading requires enabling the feature gate CoordinatedLeaderElection and the group version coordination.k8s.io/v1alpha2. Downgrading will revert to the old leader election mechanism, but may have extra data in etcd for LeaseCandidate objects under the coordination.k8s.io/v1alpha2 group version.
Version Skew Strategy
The feature uses leases in a standard way, so if some component instances are configured to use the old direct leases and others are configured to use this enhancement’s coordinated leases, the component instances may still safely share the same lease, and leaders will be safely elected.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: CoordinatedLeaderElection
- Components depending on the feature gate:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
Does enabling the feature change any default behavior?
Yes, kube-scheduler and kube-controller-manager will use coordinated leader election instead of the default leader election mechanism if the feature is enabled.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature uses leases in a standard way, so if some components are configured to use direct leases and others are configured to use coordinated leases, elections will still happen. Also, coordinated leader election falls back to direct leasing if the election coordinator does not elect a leader within a reasonable period of time, making it safe to disable this feature in HA clusters.
What happens if we reenable the feature if it was previously rolled back?
This is safe. Leader elections would transition back to coordinated leader elections. Any elected leaders would continue to renew their leases.
Are there any tests for feature enablement/disablement?
Yes, this will be tested, including tests where there is a mix of components with the feature enabled and disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollouts and rollbacks can fail in many ways. During the first rollout of the feature, there will be a mixed state of control planes using and not using coordinated leader election. Components not using CLE will race to obtain the lease, while the ones using CLE will defer to the CLE controller to assign a leader. We cannot guarantee the best leader is elected during mixed version states, but leader election will still be done.
If the CLE controller has bugs, it may fail to select a leader or select one incorrectly, which could lead to disruptions.
If LeaseCandidate objects have incorrect version information, the CLE controller may make an incorrect leader selection, potentially leading to version skew violations.
What specific metrics should inform a rollback?
Leases failing to renew would be a signal to roll back.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Integration tests include testing for skew scenarios.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The LeaseCandidate resource and the CoordinatedLeaderElection feature gate
will be enabled. On the Lease object, a new field
Strategy will be populated indicating the strategy used by coordinated leader
election for selecting the most suitable leader.
How can someone using this feature know that it is working for their instance?
- LeaseCandidate objects will exist for leader elected components, and the RenewTime and PingTime fields will be recent (within 30 minutes).
- Lease objects for leader elected components will be assigned and actively renewing.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
When leader elected components are in the cluster, a leader must be selected in a timely manner and propagated via the Lease object. The lease must be actively renewed.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: apiserver_coordinated_leader_election_leader_changes{name="<component>"}
- Metric name: apiserver_coordinated_leader_election_leader_preemptions{name="<component>"}
- Metric name: apiserver_coordinated_leader_election_failures_total
- Metric name: apiserver_coordinated_leader_election_skew_preventions_total
- Components exposing the metrics: kube-apiserver
Are there any missing metrics that would be useful to have to improve observability of this feature?
n/a.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
Yes.
- API call type: PUT
- estimated throughput: Steady state is 3 requests per leader elected component every 30 minutes to renew the LeaseCandidate. If there is churn in the control plane, an extra 2N requests are performed on every change per leader elected component, N representing the number of available control planes. The number is 2N because N requests will be sent by the apiserver to ping all candidates, and every request should be ack’d by the client.
- watch on LeaseCandidate resources
Will enabling / using this feature result in introducing new API types?
- coordination.k8s.io/LeaseCandidate
- One candidate will exist for each leader elected component for each control plane. Total amount is (# leader elected components) * (# control plane instances)
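As a rough worked example of the estimates above, the following sketch computes the expected number of LeaseCandidate objects and the extra requests caused by a single control plane change. The function names are hypothetical, and the input numbers are illustrative only.

```go
package main

import "fmt"

// candidateCount returns the expected number of LeaseCandidate objects:
// one per leader elected component per control plane instance.
func candidateCount(components, controlPlanes int) int {
	return components * controlPlanes
}

// churnRequests returns the extra requests triggered by one change in the
// control plane: for each component, N pings sent by the apiserver plus
// N acks from the candidates, where N is the number of control planes.
func churnRequests(components, controlPlanes int) int {
	return components * 2 * controlPlanes
}

func main() {
	// Illustrative cluster: two leader elected components
	// (kube-controller-manager and kube-scheduler) on three control planes.
	fmt.Println(candidateCount(2, 3)) // 6
	fmt.Println(churnRequests(2, 3))  // 12
}
```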
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
An additional Strategy field will be populated on all leases elected by CLE. This is a string enum.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
This is a control plane feature and does not affect nodes.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server becomes unavailable, the CLE cannot function as it is built on top of the API server. It cannot monitor LeaseCandidates, update Leases, or elect new leaders. Existing leaders will continue to function until their Leases expire, but no new leaders will be elected until the API server recovers.
If etcd is unavailable, similar issues arise. The underlying lease mechanism relies on etcd for storage and coordination. Without etcd, Leases cannot be created, renewed, or monitored.
What are other known failure modes?
- Leader election controller fails to elect a leader
- Detection: Via metrics apiserver_coordinated_leader_election_failures_total increasing and absence of leader in lease object
- Mitigations: Operators can disable feature gate.
- Diagnostics: Check kube-apiserver logs for messages on failing to elect the leader. Look at the lease object renewal times and holder, along with leasecandidate objects for the particular component.
- Testing: Integration test exists that prevents write access for the CLE controller and ensures that another controller takes over.
What steps should be taken if SLOs are not being met to determine the problem?
Check whether the CLE controller is operating properly and whether the API server
is overloaded. In the worst case, disable the feature by explicitly setting
the feature gate to false. Relevant information can be found in the controller and
API server logs (kube-apiserver.log). Additionally, looking through the lease
and leasecandidate objects will provide insight on whether the leases and
candidates are renewing properly.
Implementation History
Drawbacks
Alternatives
When evaluating alternatives, note that if we decide in the future to improve the algorithm, fix a bug in the algorithm, or change the criteria for how leaders are elected, our decision on where to put the code has a huge impact on how the change is rolled out.
For example, it will be much easier to make a change in a controller in the kube-apiserver than in client-go library code distributed to elected controllers, because once it is distributed into controllers, especially 3rd party controllers, any change requires updating client-go and then updating all controllers to that version of client-go.
Similar approaches involving the leader election controller
Running the leader election controller in HA on every apiserver
The apiserver runs very few controllers, and they are not elected, but instead
all run concurrently in HA configurations.
This requires the election controller to make careful use of concurrency control
primitives to ensure multiple instances collaborate, not fight.
When the Coordinated Leader Election controller runs in the apiserver, it is possible that two instances of the controller will have different views of the candidate list. This happens when one controller has fallen behind on a watch (which can happen for many underlying reasons).
When two controllers have different candidate lists, they might “fight”. One likely way they would fight is:
- controller A thinks X is the best leader
- controller B thinks Y is the best leader (because it has stale data from a point in time when this was true)
- controller A elects X
- controller B marks the leader lease as “End of term” since it believes Y should be leader
- controller B elects Y as leader
- controller A marks the leader lease as “End of term” since it believes X should be leader
- …
This can be avoided by tracking, in the lease being reconciled, the resourceVersion or generation numbers of the resources used to make a decision, and authoring the controllers not to write to a lease when their data is stale compared to the already tracked resourceVersion or generation numbers.
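A minimal sketch of such a stale-write guard follows. It is an illustration under simplifying assumptions, not the controller's implementation: the `lease` type, its `observedRevision` field, and `tryElect` are hypothetical, and a plain integer stands in for the resourceVersion or generation being tracked.

```go
package main

import "fmt"

// lease is a simplified stand-in for the leader Lease. observedRevision is a
// hypothetical field recording how fresh the candidate data was when the
// lease was last written.
type lease struct {
	holder           string
	observedRevision int64
}

// tryElect assigns holder only if the caller's view of the candidate list
// (revision) is at least as new as the view used for the previous write.
// It reports whether the write happened.
func tryElect(l *lease, holder string, revision int64) bool {
	if revision < l.observedRevision {
		return false // stale view: do not override the newer decision
	}
	l.holder = holder
	l.observedRevision = revision
	return true
}

func main() {
	l := &lease{}

	// Controller A has an up-to-date view and elects X.
	okA := tryElect(l, "X", 10)
	fmt.Println(okA, l.holder) // true X

	// Controller B has fallen behind on its watch and still prefers Y.
	okB := tryElect(l, "Y", 7)
	fmt.Println(okB, l.holder) // false X
}
```

With this guard, controller B's stale decision is dropped instead of triggering the back-and-forth described above.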
One drawback to this approach is that updating the leader election controller can cause undefined behavior when multiple instances of the leader election controller are “collaborating”. It is difficult to test and prove edge cases when an update to the leader election controller code is necessary and could fight with the previous version during a mixed version state.
Running the coordinated leader election controller in KCM
Since the coordinated leader election controller is a controller that is elected, it would also make sense to run in KCM. However, a major drawback is that KCM forcefully shuts down when it loses the leader lock and it is possible that the leader election controller on the same KCM instance is the leader at that time. This causes the coordinated leader election controller to change leaders which could cause disruptions.
Two ways to solve this are to gracefully shut down the KCM or to fork the process such that the coordinated leader election controller is unaffected. Gracefully shutting down the KCM is difficult as controllers are used to the KCM forcefully shutting them down, and we have no guarantee that third party controllers do not rely on this “feature”. Forking the process causes additional overhead that we’d like to avoid.
Running the coordinated leader election controller in a new container
Instead of running in KCM, the coordinated leader election controller could be
run in a new container (e.g. kube-coordinated-leader-election). There will be a
slightly larger memory footprint with this approach and adding a new component to the
control plane changes our Kubernetes control plane topology in an undesirable way.
Component instances pick a leader without a coordinator
- A candidate is picked at random to be an election coordinator, and the
coordinator picks the leader:
- Components race to claim the lease
- If a component claims the lease, the first thing it does is check the lease candidates to see if there is a better leader
- If it finds a better leader, it assigns the lease to that component instead of itself
Pros:
- No coordinated election controller
Cons:
- All leader elected components must have the code to decide which component is the best leader
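The claim-then-delegate flow of this alternative could look roughly like the sketch below. All names are hypothetical, and the preference rule (lowest `minor`, then lowest name) is only a placeholder for whatever criteria the components would share.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified lease candidate. minor stands in
// for version information; lower is preferred in this sketch.
type candidate struct {
	name  string
	minor int
}

// best returns the preferred candidate: lowest minor, then lowest name.
// cands must be non-empty.
func best(cands []candidate) candidate {
	b := cands[0]
	for _, c := range cands[1:] {
		if c.minor < b.minor || (c.minor == b.minor && c.name < b.name) {
			b = c
		}
	}
	return b
}

// claimAndDelegate models the instance that won the race for the lease. It
// checks the candidates and returns the name the lease should be assigned
// to: a better candidate if one exists, otherwise the winner itself.
func claimAndDelegate(winner candidate, cands []candidate) string {
	if b := best(cands); b != winner {
		return b.name
	}
	return winner.name
}

func main() {
	cands := []candidate{{"kcm-a", 30}, {"kcm-b", 29}, {"kcm-c", 30}}
	// kcm-c won the race but hands the lease to the preferred candidate.
	fmt.Println(claimAndDelegate(cands[2], cands)) // kcm-b
}
```

The sketch also makes the listed con visible: every component has to carry `best`, so changing the criteria means updating every component.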
Component instances pick a leader without lease candidates or a coordinator
- The candidates communicate through the lease to agree on the leader
- Leases have “Election” and “Term” states
- Leases are first created in the “election” state.
- While in the “election” state, candidates self-nominate by updating the lease with their identity and version information. Candidates only need to self nominate if they are a better candidate than candidate information already written to the lease.
- When “Election” timeout expires, the best candidate becomes the leader
- The leader sets the state to “term” and starts renewing the lease
- If the lease expires, it goes back to the “election” state
Pros:
- No coordinated election controller
- No lease candidates
Cons:
- Complex election algorithm is distributed as a client-go library. A bug in the algorithm cannot be fixed by only upgrading Kubernetes; all controllers in the ecosystem with the bug must upgrade client-go and release a new version to be fixed.
- More difficult to change/customize the criteria for which candidate is best.
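The “Election” and “Term” states described above can be modeled as a small state machine. The sketch below is hypothetical throughout: the type and method names are invented for illustration, timeouts are triggered by explicit method calls rather than timers, and a single integer (`minor`, lower preferred) stands in for version information.

```go
package main

import "fmt"

type state string

const (
	election state = "election"
	term     state = "term"
)

// lease is a simplified stand-in for a Lease carrying election state.
type lease struct {
	state   state
	nominee string
	minor   int
}

// nominate records a self-nomination. It writes only while the lease is in
// the election state and only if the caller is a better candidate than the
// one already recorded. It reports whether the lease was updated.
func (l *lease) nominate(name string, minor int) bool {
	if l.state != election {
		return false
	}
	if l.nominee != "" && minor >= l.minor {
		return false
	}
	l.nominee, l.minor = name, minor
	return true
}

// electionTimeout ends the election: if a nominee exists it becomes the
// leader and the lease moves to the term state. It returns the nominee.
func (l *lease) electionTimeout() string {
	if l.state == election && l.nominee != "" {
		l.state = term
	}
	return l.nominee
}

// expire models lease expiry: the lease returns to the election state.
func (l *lease) expire() {
	*l = lease{state: election}
}

func main() {
	l := &lease{state: election}
	l.nominate("kcm-b", 30)
	l.nominate("kcm-a", 29) // better candidate, overwrites kcm-b
	l.nominate("kcm-c", 31) // worse candidate, no write
	leader := l.electionTimeout()
	fmt.Println(leader, l.state) // kcm-a term
}
```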
Algorithm configurability
We’ve opted for a static fixed algorithm that looks at three things, continuing down the list of comparisons if there is a tie.
- min(binary version)
- min(compatibility version)
- min(lease candidate name)
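The three comparisons above amount to a lexicographic ordering over candidates. The following sketch illustrates that ordering only; the `version` and `candidate` types and `pickLeader` are hypothetical simplifications, and versions are reduced to major and minor numbers.

```go
package main

import (
	"fmt"
	"sort"
)

// version is a simplified major.minor version.
type version struct{ major, minor int }

func (v version) less(o version) bool {
	if v.major != o.major {
		return v.major < o.major
	}
	return v.minor < o.minor
}

// candidate is a hypothetical, simplified lease candidate.
type candidate struct {
	name          string
	binary        version
	compatibility version
}

// pickLeader returns the candidate with the lowest binary version, breaking
// ties by lowest compatibility version and then by lowest name. The second
// result is false when there are no candidates.
func pickLeader(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	sorted := append([]candidate(nil), cands...) // leave the input untouched
	sort.Slice(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.binary != b.binary {
			return a.binary.less(b.binary)
		}
		if a.compatibility != b.compatibility {
			return a.compatibility.less(b.compatibility)
		}
		return a.name < b.name
	})
	return sorted[0], true
}

func main() {
	leader, _ := pickLeader([]candidate{
		{"kcm-c", version{1, 30}, version{1, 30}},
		{"kcm-b", version{1, 29}, version{1, 29}},
		{"kcm-a", version{1, 30}, version{1, 29}},
	})
	fmt.Println(leader.name) // kcm-b
}
```

Versions are compared numerically rather than as strings, so that 1.9 sorts before 1.10.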
The goal of the KEP is to make the leader predictable during a cluster upgrade where leader elected components and apiservers may have mixed versions. This will make all states of a Kubernetes control plane upgrade adhere to the version skew policy.
An alternative is to make the leader election algorithm configurable either via flags or a configuration file.
Future Work
- Controller sharding could leverage coordinated leader election to load balance controllers against apiservers.
- Optimizations for graceful and performant failover can be built on this enhancement.