KEP-4988: Snapshottable API server cache

KEP-4988 Snapshottable API server cache

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Risks and Mitigations
- Snapshot memory overhead
- Consistency checking overhead
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The kube-apiserver’s caching mechanism (watchcache) efficiently serves requests for the latest observed state. However, LIST requests for previous states (e.g., via pagination or by specifying a resourceVersion) often bypass this cache and are served directly from etcd. This direct etcd access significantly increases performance costs and can lead to stability issues, particularly with large resources, due to memory pressure from transferring large data blobs.

This KEP proposes an enhancement to the kube-apiserver’s watch cache to generate B-tree snapshots, allowing it to serve LIST requests for previous states directly from the cache. This change aims to improve API server performance and stability. To support this snapshotting mechanism, this proposal also details changes to the watch cache’s compaction behavior to maintain Kubernetes Conformance and introduces an automatic cache inconsistency detection mechanism.

Motivation

When the API server serves a LIST requests directly from etcd, it introduces significant stability and reliability concerns:

Unpredictable Memory Pressure: Retrieving data from etcd and constructing responses involves significant memory allocations on the API server. The volume of data retrieved from etcd can vary drastically depending on object sizes. This results in unpredictable memory pressure, making it difficult to provision resources effectively and increasing the risk of Out-of-Memory (OOM) errors.
Ineffective API Priority and Fairness (APF) Throttling: The API server’s overload protection mechanism, API Priority and Fairness (APF), primarily throttles based on the predicted cost of a request, which is derived from factors like latency and object count. While these factors provide some indication of computational cost, they do not accurately reflect the memory footprint. Crucially, we lack visibility into the per-request memory allocations. Therefore, APF cannot effectively throttle requests based on actual memory usage, leaving the API server vulnerable to memory exhaustion.

These issues with serving data directly from etcd lead to unpredictable and volatile API server memory usage.

Remarkably, the API server already maintains all the necessary data in the watchcache. By enabling all LIST requests to be served from the watchcache, we can significantly reduce memory pressure and improve the effectiveness of APF throttling, leading to a more stable and reliable API server.

Goals

Reduce memory allocations by serving historical LIST requests from cache
Maintain Kubernetes conformance with regards to compaction
Prevent inconsistent responses returned by cache due to bugs in caching logic

Non-Goals

Change semantics of the LIST request
Support indexing when serving for all types of requests.
Enforce that no client requests are served from etcd
Support etcd server side compaction for watch cache
Detection of watch cache memory corruption

Proposal

We propose that the watch cache generate B-tree snapshots, allowing it to serve LIST requests for previous states. These snapshots will be stored for the same duration as watch history and compacted using the same mechanisms. This improves API server performance and stability by minimizing direct etcd access for historical data retrieval. It also aligns with the future extensions outlined in KEP-365: Paginated Lists .

Compaction is an important behavior, covered by Kubernetes Conformance tests. Supporting compaction is required to ensure consistent behavior regardless of whether the watch cache is enabled or disabled. Storing historical data in the watch cache, as this KEP proposes, breaks conformance. Currently, watch cache is only compacted when it becomes full. For resources with infrequent changes, this means data could be retained indefinitely, far beyond etcd’s compaction point, as highlighted in #131011 . Therefore, to maintain conformance and ensure predictable behavior, we propose that the existing etcd compaction mechanism also be responsible for compacting the snapshots in cache.

This proposal increases reliance on the watchcache, significantly elevating the impact of bugs in watch or caching logic. Triggering a bug would no longer impact a single client but affect the cache read by all clients connecting to a particular API server. As the proposed changes will result in all requests being served from the cache, it would be exceptionally difficult to debug errors, as comparing responses to etcd would no longer be an option. Consequently, we propose an automatic cache inconsistency detection mechanism that can run in production and replace manual debugging. It will automate checking consistency against etcd, protecting against bugs in the watch cache or etcd watch implementation. It is important to note that we do not plan to implement protection from memory corruption like bitflips.

Serving list from snapshots

The snapshotting mechanism utilizes ability of B-tree to create lazy copies of itself. This allows us to create snapshot on each watch event. Those snapshots capture the state of cache at historical resourceVersion, and can be used to serve LIST requests, by finding aprioripate snapshot and just reading from it.

Watch cache compaction

We will expand the existing mechanism for compacting etcd to also compact the watch cache. Kubernetes supports periodic configuring compaction by default executed every 5 minutes. In the current algorithm each API Server executes a optimistic write on compact_rev_key key to store revision to be compacted. The one that is first to write successfully, executes the compaction request against etcd. We will expand it by opening a watch on compact_rev_key key, and informing watch cache about succesfull compactions done by any API server. When watch cache is informed about compaction, it will truncate snapshot history up to that revision. To avoid changes of existing behavior, we will not compact watch history; this should be considered in the future.

Cache Inconsistency Detection Mechanism

The mechanism periodically calculates and compares a hash of the data for each resource in both the etcd and the watch cache.

It will be developed across multiple phases:

Alpha: In this phase, the detection will enabled only in the test environment. Enabled via KUBE_WATCHCACHE_CONSISTANCY_CHECKER environment variable, we will run in Kubernetes e2e tests to ensure that the mechanism works as expected. On mismatch the apiserver will panic making it easy to detect in tests.
Beta: The detection will be enabled by default. If an inconsistency is detected, snapshots stored in cache will be purged and the system will automatically fall back to serving LIST requests from etcd for the affected resource. This mechanism will only impact LIST requests that would be served from watch cache snapshots, effectively reverting to the behavior prior to this proposal, while other requests will continue to be served from the cache. Fallback will not be permanent, but will last until the next successful consistency check.

To monitor consistency failures we will expose storage_consistency_checks_total metric.

Risks and Mitigations

Snapshot memory overhead

B-tree snapshots are designed to minimize memory overhead by storing pointers to the actual objects, rather than the objects themselves. Since the objects are already cached to serve watch events, the primary memory impact comes from the B-tree structure itself. To quantify the memory overhead, we run 5k scalability tests. They should represent the worst case scenario, as they utilize large number of small objects. The results are promising:

Object Allocations: Allocation profile collected during the test test has shown an increase of 7GB in object allocations, which translates to a negligible 0.2% of total allocations.
Memory Usage: Memory in use profile collected during the test has shown Btree memory usage of 300MB, representing a 1.3% of total memory used.

Consistency checking overhead

Periodic execution of consistency checking will introduce additional overhead. This load is not negligible, as it requires downloading and decoding data from etcd. For saftly we still think it’s important that feature is enabled by default, however we want to leave an option to disable it. For that we will introduce DetectCacheInconsistency feature gate in Beta.

For future we plan to improve etcd API to support cheap consistency checks. At that point disabling inconsistency checks will no longer be needed.

Design Details

Snapshotting algorithm

Snapshot Creation: When a watch event is received, the cacher creates a snapshot of the B-tree based cache using the efficient Clone() method. This method creates a lazy copy of the tree structure, minimizing overhead. Since the watch cache already stores the history of watch events, the B-tree maintains just pointers to the in-use memory, storing only minimal necessary data.
Snapshot Storage: Snapshots are stored in a separate tree data structure, keyed by resourceVersion. This tree structure facilitates efficient lookup of the “nextSmaller” element, as resourceVersions are not necessarily sequential.
Serving: When a request requiring response based on previous snapshot arrives, the API server performs the following steps:

Extract the resourceVersion from request.
Looks up the “nextSmaller” snapshot based on the resourceVersion.
Constructs the response using data from the retrieved snapshot. Edge cases:
- Requested resourceVersion is smaller than any available snapshot: This indicates that the requested data has been cleaned up. In this scenario, the API server falls back to serving the request from etcd.
- Requested resourceVersion is larger than the latest snapshot: This could indicate a future resourceVersion or a situation where the watch cache is lagging behind. The API server performs a consistent read from etcd to confirm the existence of the future resourceVersion or waits for the watch cache to catch up.

Hasing algorithm

Every 5 minutes, for each resource, we calculate hash for each resource. A non-consistent LIST request (RV=0) is sent to the watch cache to retrieve its latest available RV. This revision is then used to make a consistent LIST request (RV=X, where X is the revision from the cache) to etcd. This ensures comparison of the cache’s latest state with the corresponding state in etcd, without explicit handling of potential cache staleness.

The 64-bit FNV algorithm (as implemented in hash/fnv ) is used to calculate the hash of object’s namespace, name, and resourceVersion joined by a ‘/’ byte. This should allow us to detect inconsistencies caused by bugs in applying watch events or bugs in etcd watch stream.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Add tests for LIST with pagination and providing exact RV.

Unit tests

k8s/apiserver/pkg/storage/cache: 2024-12-12 - <test coverage>

Integration tests

We should add a test to validate fallback to serving from etcd.

e2e tests

We should add a tests that validates metrics exposed for inconsistency detection. Test should cover couple of resources including resources with conversion.

Graduation Criteria

Alpha

Snapshotting implemented behind a feature gate disabled by default.
Inconsistency detection is behind environment variable
Inconsistency detection run in e2e tests

Beta

Inconsistency detection mechanism is qualified and no mismatch detected.
Inconsistency detection moved behind a feature gate DetectCacheInconsistency enabled by default.
Automatic fallback to etcd is implemented
Pass Kubernetes conformance tests for compaction

GA

TODO

Upgrade / Downgrade Strategy

The feature is purely in-memory so update/downgrade doesn’t require any specific considerations.

Version Skew Strategy

Feature touches only kube-apiserver and coordination between individual instances is not needed.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

feature-gates:
  - name: DetectCacheInconsistency
    components:
      - kube-apiserver
  - name: ListFromCacheSnapshot
    components:
      - kube-apiserver

Does enabling the feature change any default behavior?

Yes, kube-apiserver paginating LIST requests will no longer require request to etcd.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, via disabling the feature-gate in kube-apiserver.

What happens if we reenable the feature if it was previously rolled back?

The feature is purely in-memory so it will just work as enabled for the first time.

Are there any tests for feature enablement/disablement?

The feature is purely in-memory so feature enablement/disablement will not provide additional value on top of feature tests themselves.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Snapshotting should automatically fallback to serving from etcd if inconsistency is detected. Rollback should be consider if there is a high number of inconsistencies detected by storage_consistency_checks_total metric.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No need for tests, this feature doesn’t cause any persistent side effects.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This is control-plane feature, not a workload feature.

How can someone using this feature know that it is working for their instance?

This is control-plane feature, not a workload feature.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

API call latency SLO

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

API call latency SLI

Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes, we are adding storage_consistency_checks_total to count the number of consistency checks performed and their outcomes.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No, we expect the API call latency SLI to improve.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Overall we expect that cost of serving pagination will go down, however caching might increase RAM usage, if the client reads the first page, but never paginates. We expect that most controllers will read all pages.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature is kube-apiserver feature - it just doesn’t work if kube-apiserver is unavailable.

What are other known failure modes?

Inconsistency of watch cache, should be addressed by the consistency checking mechanism. For the first iteration we will enable users to define an alert on a metric and detect if cache becomes inconsistent with etcd.