KEP-5647: Stale Controller Handling

KEP-5647: Stale Controller Detection and Mitigation

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- User Stories (Optional)
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

We propose a way to mitigate staleness within Kubernetes controllers, particularly the ones located in kube-controller-manager (KCM). When operating at a large scale, controllers can fall behind the state of the apiserver, leading to decisions based on outdated information. This proposal introduces a generalized mechanism for controllers to detect how stale their local cache is and prevent reconciliation until the cache catches up to provide consistency.

Motivation

Every controller operates on a local cache that is populated by watching for changes from the apiserver. By its nature, this watch stream is eventually consistent and provides no guarantee of how far behind the “live” state of the apiserver it is. A change event might arrive within milliseconds, or under other circumstances, could be delayed by seconds or even minutes. This means every controller, at all times, is operating on a potentially outdated view of the cluster’s state.

The issue this KEP addresses is that operators currently have no visibility into this lag. It’s impossible to distinguish between normal delays and a controller that has fallen dangerously out of sync. A controller may continue trying to enact its desired state based on an outdated view of the world, causing spurious reconciles that do not reflect the desired intentions of the user.

This proposal aims to provide solutions to mitigate issues that may arise out of the controller falling behind. With the ability to compare resource version we now have the ability to have the idea of the “age” of a resource and run operations with that in mind.

Goals

The goals of this KEP is to document the stale controller issue and propose solutions to help detect and mitigate it. We will propose the ideas of enforcing stricter cache semantics at certain points in the reconcile cycle of a controller.

Non-Goals

This is not intended to enforce consistency guarantees on all controllers. The changes described here will be “opt-in” for every controller and will require some degree of changes to the controller logic. It is not the goal to require controllers to all view a consistent state of the world per reconcile.

Proposal

We propose the ability for controllers to be able to view whether their own writes have made it into the apiserver. This will allow for the controller to skip reconciling an object until it knows whether or not the previous write has been propogated to its cache. We will do this by adding plumbing to the underlying cached informer and adding the ability for controllers to know the resource version of objects they subscribe to.

Once this is added, we will onboard certain controllers to be able to use the newly exposed resource versions and skip reconciling and requeue until certain objects they write to are updated in their cache.

User Stories (Optional)

Story 1

I am a K8s cluster administrator. I want to be sure that my controllers do not write too frequently. I enable this feature so that I know my controllers only write when they are up to date.

Story 2

I am a controller author. I want certain objects to be ensured to be in my cache after I write on a previous reconcile so I can be assured that my reads are up to date. I use the newly provided frameworks to ensure those objects are up to date, mitigating the risks that I am operating on stale data.

Story 3

I am a controller author. I want to ensure that my controller doesn’t perform time sensitive operations on stale reads. I perform live gets on certain disruptive operations and if my cache is out of sync I “break the circuit” and wait for my cache to catch up prior to processing, preventing any disruptive operations on stale reads.

Risks and Mitigations

There is the risk of skipping reconciles, if this is not correct and we don’t unpause properly then that would lead to scenarios where a controller may stop reconciling an object entirely. We will feature gate this and add a set of rigorous tests prior to Beta/GA to ensure that the feature is well tested and not missing edge cases. Any time we try and optimize the reconcile loop there are issues like this that may arise.

There are also some edge cases that need to be accounted for such as controller restarts causing the cache to have to be resynced. We will need to account for scenarios like this and ensure that the controller gets a consistent view of the world after events like this. While these changes will not make existing behavior worse, anyone implementing and depending on the ability to read their writes will need to ensure that situations like that are consistent. Likely, we will need to have some solutions to https://github.com/kubernetes/kubernetes/issues/59848#issuecomment-2842495398 before controllers can fully rely on their watch cache on restart. The same ideas discussed here can likely be applied to have a consistent cache.

Design Details

Our proposal consists of two parts, one is the change that will enable the controllers to be fully informed on the current resource versions of objects in their cache. Second is the use of that new ability to actually ensure the read after write guarantees on objects we care about in the controller itself.

Informer and Cache Update

We will update the ResourceEventHandlerFuncs in staging/src/k8s.io/client-go/tools/cache/controller.go to have one more function.

type ResourceEventHandlerFuncs struct {
	AddFunc      func(obj interface{})
	UpdateFunc   func(oldObj, newObj interface{})
	DeleteFunc   func(obj interface{})
	BookmarkFunc func(resourceVersion string) <--- NEW
}

The Add, Update and Delete functions are already nearly enough to be able to track the lifecycle of objects but certain edge cases in tracking make it necessary to add the Bookmark function. Controllers will use all the functions to be able to track when they are able to reconcile after their prior writes.

This Bookmark functions only purpose is to inform any listeners that an update has occurred. This is necessary to be able to fully track the lifecycle of resources since otherwise a controller may not be able to tell whether an object has not yet been added to the cache or whether the object has been added and deleted without the cache tracking it.

Staleness Mitigation in Controllers

With the changes in the informers controllers now need mechanisms to actively prevent issues caused by stale data. The most critical failures often occur when a controller makes a significant decision—like deleting a Pod or scaling a resource—based on outdated information from its local cache. To address this, we will introduce read after write guarantees at these critical decision points.

We will begin by implementing targeted fixes directly within the controllers that are most sensitive to staleness. To solve this, we will modify the core processing loop of key controllers. The new approach involves tracking the resourceVersion of key resources after a successful write operation. This last-written resourceVersion is stored in memory. We will effectively store the mapping of the object that is reconciled on, to a tuple of resourceVersions of any resources that we wish to track. Once the reads of the cache have processed past the latest write for all the tracked resources we will allow the reconcile to proceed for that object. This will only skip and requeue reconciles for the specific objects that have written in a previous reconcile, and won’t be a global lock on all reconciles. By doing so, we ensure that we can progress as much as possible until we need an updated cache.

We can provide an example with the Daemonset controller. Any time the daemonset controller writes to pods, we will store the mapping from Daemonset -> Pod Resource Version. At the same time, we will also add a tracker to the controller which will update the latest seen RV of the pod cache. On any informer event, we will update the seen RV if it is greater than what is currently stored. We will provide some helper function isReady(dsKey string) that will query the tracker and only return true if the resource version of the latest write for pods by that daemonset is older than the latest read. We can also optimize other parts in similar ways, such as writes to daemonset status, but they all take the same pattern.

Lastly, inside of the controllers processNextWorkItem function, we will check whether the daemonset is ready using the helper function and requeue without running the reconciliation routine if the daemonset is not yet ready to be worked on due to the cache still needing to catch up. In the case of the object not being ready, we will requeue the object the same way as if an error occurred. This will have the same exponential backoff semantics so after a few reconciles of being unable to catch up the requeue will take longer and longer until the cache has enough time to actually catch up to the writes.

Circuit Breaking Pattern in Controllers

With the same pattern, we can implement a circuit breaking approach to certain controllers, such as the node-lifecycle controller. There are scenarios, such as the obtaining of node lease objects, where controllers may pull stale information from the cache, causing the controller to wrongly believe that a lease is expired.

We can perform live gets on these resources to ensure that we don’t overreact to staleness in the cache. Simultaneously if it is determined that the cache is stale, we can use the same pattern where we mark the cache as not ready until the cache has at least caught up to our prior live get. This will prevent too many requests hitting the api server.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

All the new code will be unit tested and have reasonable levels of coverage prior to merging.

We will be adding changes in the pkg/controller and package mostly. The controller package will contain the bulk of the logic with additions to the pod informer. We will add unit tests to ensure that if the cache is out of sync, we skip reconciliation and requeue ourselves.

For the Informer and Cache Updates we will be making those changes all in staging/src/k8s.io/client-go/tools/cache folder for the most part. We will add tests to the newly exposed function and ensure that it works as we expect it to.

Integration tests

test name : integration master , triage search

e2e tests

test name : SIG … , triage search

Graduation Criteria

Beta

Addition of new bookmark function for informers
Feature implemented behind a feature flag for 1 or more controllers
Unit tests for controllers/bookmark function
Addition of e2e tests for the controllers with feature gate enabled

GA

Analysis of onboarded controllers and addition of others that may have the same staleness issues
Addition of additional E2E tests and stress tests to ensure edge cases are fully tested
Implementation of circuit breaking to prevent disruptive behavior during staleness

Upgrade / Downgrade Strategy

Version Skew Strategy

This is all internal behavior in the sync loop for controllers. Version skew should not affect it.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate for stale controller mitigation
- Feature gate name: StaleControllerConsistency
- Components depending on the feature gate: KCM Controllers that are determined to be high scale
Feature gate for stale controller monitoring
- Feature gate name: MonitorInformerStaleness
- Components depending on the feature gate: None, but requires a direct get to the apiserver to determine staleness every 5 seconds.
Feature gate for underlying client-go changes
- Feature gate names: AtomicFIFO, UnlockWhileProcessingFIFO
- Components depending on the feature gate: None, but required by the stale controller handling feature.

Does enabling the feature change any default behavior?

There is no change to the default behavior of reconcilers, however the reconcilers may skip reconciling for some time until their caches catch up. This should be invisible but may look like the controllers are stuck when they really are just waiting for stale caches to catch up.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature flag does not have any changes that are irreversible. The only change is in how frequently the reconciliation of key controllers occurs but no actual effect on any api objects should occur that a rollback would break.

What happens if we reenable the feature if it was previously rolled back?

It should start skipping reconciles for objects on enabled controllers until the cache catches up again.

Are there any tests for feature enablement/disablement?

There are no APIs to enable, we will test for whether the feature gate properly enables/disables the consistency guarantees.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Controller not updating objects as expected while it seems like informer metrics are fine. We are planning on adding staleness metrics which can help inform on this.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

It will automatically go in use once the gate is enabled

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

The most significant drawback is the possibility of breaking the controllers by causing them to not reconcile when they in fact should. To mitigate this as much as possible, we will only add these semantics to controllers where we have observed issues due to stale reads and only implement those, especially for the beginning. If we see success on these controllers, we will look into creating a framework around this so it is easier to take advantage, but it would already be a success just for these controllers with documented staleness issues to be fixed.

Alternatives

There is not much of an alternative, we want to ensure that reconciles don’t occur on stale caches, so we have to skip reconciliation somehow. We can query the cache directly and try and figure out the current version that way, but that can lead to much more racy side effects than implementing a function in the informer.