KEP-5866: Server-side Sharded List and Watch
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This proposal introduces server-side sharded LIST and WATCH to the Kubernetes API Server. By allowing clients to specify a sharding strategy and range in their LIST and WATCH requests, the API Server can filter events at the source, ensuring that horizontally scalable controllers (like kube-state-metrics) only receive the traffic and data they are responsible for.
Motivation
As Kubernetes clusters grow, the volume of events for core resources like Pods increases
significantly due to high churn. Many controllers need to scale to handle this load.
Historically, most controllers choose to scale vertically (e.g., kube-controller-manager), as
there is no native support for sharding or partitioning the watch stream. Some specialized controllers (e.g.,
kube-state-metrics) have implemented their own client-side horizontal sharding to distribute
work.
However, client-side sharding has a critical limitation: it does not reduce the incoming event volume per replica. Every replica still receives the full stream of events, paying the CPU and network cost to deserialize everything, only to discard items not belonging to their shard. Functionally, this makes horizontal scaling of the watch stream impossible. This results in:
- Wasted network bandwidth and overhead (N replicas * Full Stream).
- Wasted CPU and Memory on clients for processing irrelevant events.
The goal of this proposal is to address this bottleneck. By moving filtering to the API server, we provide the primitive needed to:
- Allow vertically scaled controllers to eventually adopt sharded architectures.
- Enable existing horizontally scalable controllers to scale efficiently without the “full stream” penalty.
We propose moving the filtering logic “upstream” to the Kubernetes API Server. By filtering events at the source, we ensure that each controller replica receives only the data it is responsible for.
Goals
- Reduce Network Traffic: Clients only receive events for their assigned shard.
- Reduce Client Resource Usage: Clients do not need to deserialize or process irrelevant events.
- Extensible Framework: Support future sharding keys beyond the initial implementation.
Non-Goals
- Coordination: This proposal does not implement the coordination logic for clients. Clients are still responsible for determining their shard ranges.
- Resharding: The API server does not manage shard rebalancing strategies. Because it exposes raw hash ranges, clients can implement their own consistent hashing strategies if needed. This is future work that we are interested in, but it is out of scope for this KEP.
- Sharding KCM: Sharding KCM is a complex topic that is outside the scope of this KEP. We are working on the fundamental building blocks that can enable sharding for KCM in the future. We want to eventually move towards a full sharding system, and this KEP acts as the initial step.
Proposal
We propose enhancements to the Kubernetes API to support Server-Side Watch Sharding.
This allows clients to request a filtered stream of events based on a consistent hashing strategy
determined by the client.
Clients specify a new shardSelector parameter that uses a specific grammar (e.g., shardSelector=shardRange(object.metadata.uid, start, end)) in
their ListOptions.
The API server computes the hash of the target field for each object and only dispatches events
that fall within the requested range.
This enables efficient horizontal scaling of controllers by ensuring each replica only processes
the data it owns, reducing network ingress and deserialization overhead.
User Stories (Optional)
Story 1: Horizontal Scaling of Controllers
A user wants to deploy a sharded controller (e.g., kube-state-metrics) that monitors Pods
across a large cluster. Instead of each replica watching all Pods and filtering client-side—which
consumes excessive bandwidth and CPU—the user configures each replica to watch a specific range of
UIDs (e.g., Replica 0 watches 00-7f, Replica 1 watches 80-ff). The API server only sends
events matching these ranges, allowing the monitoring system to scale linearly with the number of
shards.
Story 2 (Optional)
Notes/Constraints/Caveats (Optional)
Risks and Mitigations
Design Details
API Extensibility: Sharding Parameters
Clients will request shards via query parameters in their LIST and WATCH requests.
A dedicated shardSelector query parameter maps to the ShardSelector field in
meta/v1.ListOptions. This parameter accepts a lightweight CEL-based functional
grammar, specifically utilizing a shardRange() function.
Syntax Details:
- shardRange(fieldPath, hexStart, hexEnd)
- Bounds are defined as 64-bit hex strings with a '0x' prefix (e.g. '0x0000000000000000', '0x8000000000000000').
- Supported field paths currently include object.metadata.uid and object.metadata.namespace.
The parameters are strongly typed internally on both the client side
(for syntactic correctness in client-go) and the server side (for validation and execution).
Shard Key
While several attributes (UID, Namespace, and OwnerReference) are viable candidates for sharding keys, the initial implementation will support UID and Namespace.
kube-state-metrics already uses UID-based partitioning, which makes it a natural candidate for this feature because it currently filters on the client side. Based on feedback from users, we will then expand the framework to support configurable sharding across other fields (e.g., Namespace, NodeName).
Consistent Hashing (Key Range)
To support seamless scaling, we avoid binding objects directly to specific shards. Instead of using fixed virtual buckets, we partition the keyspace directly using prefix-based ranges.
The total keyspace is treated as a continuous ring or line. We assign ownership by defining start and end prefixes for the hash output. Each shard is configured to watch a specific lexicographical range of the hash output.
In a 2 shard example:
- Shard 1 covers range 0 to 7.
- Shard 2 covers range 8 to f.
This has the benefit that only a small fraction of the keyspace moves between replicas when a reshard occurs.
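One way a client could derive its range is to split the 64-bit keyspace into equal contiguous slices. The sketch below is only an illustration under that assumption; shardBounds is a hypothetical helper, and real clients are free to choose uneven or non-contiguous ranges. It uses math/big because the exclusive end of the last shard is 2^64, which does not fit in a uint64.

```go
package main

import (
	"fmt"
	"math/big"
)

// shardBounds returns the inclusive start and exclusive end, as '0x'-prefixed
// hex strings, of slice i when the 64-bit keyspace is cut into n equal parts.
func shardBounds(i, n int64) (string, string) {
	space := new(big.Int).Lsh(big.NewInt(1), 64) // 2^64

	start := new(big.Int).Mul(space, big.NewInt(i))
	start.Div(start, big.NewInt(n))

	end := new(big.Int).Mul(space, big.NewInt(i+1))
	end.Div(end, big.NewInt(n))

	return fmt.Sprintf("0x%016x", start), fmt.Sprintf("0x%016x", end)
}

func main() {
	const shards = 2
	for i := int64(0); i < shards; i++ {
		start, end := shardBounds(i, shards)
		fmt.Printf("shard %d: [%s, %s)\n", i, start, end)
	}
}
```

With two shards the split point is 0x8000000000000000, which corresponds to the 0 to 7 and 8 to f prefixes in the example above.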
Client Request
Clients will append the new parameter to their LIST and WATCH requests to subscribe to a specific slice of the stream.
Example Request:
GET /api/v1/pods?watch=true&shardSelector=shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')
Clients can also specify multiple hash ranges simultaneously using the logical OR
operator (||) implemented by the CEL evaluator.
Multiple Range Example:
GET /api/v1/pods?watch=true&shardSelector=shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000') || shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')
- shardSelector: Introduces a new query parameter specifically for shard selection logic.
- shardRange(...): A CEL function that specifies the field to hash and the start (inclusive) / end (exclusive) hex bounds (hexStart <= x < hexEnd) of hash values.
Server Design
Currently, the Cacher broadcasts events to all watchers that match a simple Label/Field selector. We will extend this pipeline to support Hash-Based Filtering.
We enhance the SelectionPredicate to carry a new sharding configuration, and introduce a new
filter function.
flowchart TD
Event["Inbox Event"] --> Extract["Extraction: Get metadata.UID"]
Extract --> Hash["Hash: Compute FNV-1a(UID)"]
Hash --> Check{"Range Check"}
Check -- "Matches [Start, End)" --> Keep["Keep Event"]
Check -- "Outside Range" --> Drop["Discard Event"]
Keep --> Dispatch["Dispatch to Watcher"]
Hashing Implementation
For speed and efficiency, we will use the FNV-1a hash algorithm.
- Input: Selected field value (string).
- Output: 64-bit integer (represented as hex/string for range comparison).
- Note on Randomness: Kubernetes UUIDs (uuidv4) are already uniformly distributed, but explicit hashing ensures distribution uniformity even if inputs become sequential (e.g., uuidv7 in the future).
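The hash and range check can be sketched with Go's standard hash/fnv package. This is an illustration of the technique only, not the proposed apiserver code: keyRange, hashKey, and keep are hypothetical names, and the ToMax flag is an assumption used here to express a range that runs to the end of the keyspace, since the exclusive bound 2^64 does not fit in a uint64.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// keyRange is a half-open interval [Start, End) over 64-bit hash values.
// ToMax marks a range that extends to the end of the keyspace.
type keyRange struct {
	Start uint64
	End   uint64
	ToMax bool
}

// contains reports whether h falls inside the half-open interval.
func (r keyRange) contains(h uint64) bool {
	return h >= r.Start && (r.ToMax || h < r.End)
}

// hashKey returns the 64-bit FNV-1a hash of the selected field value.
func hashKey(value string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(value)) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// keep reports whether an event for an object with this UID should be
// dispatched to a watcher that requested the given ranges.
func keep(uid string, ranges []keyRange) bool {
	h := hashKey(uid)
	for _, r := range ranges {
		if r.contains(h) {
			return true
		}
	}
	return false
}

func main() {
	lower := []keyRange{{Start: 0, End: 1 << 63}}
	upper := []keyRange{{Start: 1 << 63, ToMax: true}}
	uid := "5f1c2b9e-0000-4000-8000-000000000000" // made-up UID for illustration
	fmt.Printf("hash=0x%016x lower=%v upper=%v\n",
		hashKey(uid), keep(uid, lower), keep(uid, upper))
}
```

Because the two ranges are half-open and share the boundary 0x8000000000000000, every hash value belongs to exactly one of them.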
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
n/a
Unit tests
- k8s.io/apimachinery/pkg/apis/meta/v1: 2026-02-01 - TBD (Validation logic)
- k8s.io/apiserver/pkg/storage: 2026-02-01 - TBD (Filtering logic)
Integration tests
- Watch Sharding Test: Ensure that sharded watches with different ranges function.
e2e tests
Graduation Criteria
Alpha
- Feature implemented behind ShardedListAndWatch feature gate.
- Basic unit and integration tests passing.
Beta
- Benchmarks showing performance improvements for sharded clients.
- Scalability tests verifying no regression in API server throughput.
- Informer and reflector framework will be updated to support sharded watches.
Deprecation
Upgrade / Downgrade Strategy
Version Skew Strategy
- Clients must be updated to send the new parameters.
- If a client sends sharding parameters to an old API server, the old server will ignore the unknown query parameters and send the full, un-sharded stream.
- To allow clients to safely distinguish between a filtered stream and a full stream, the API Server will return a new ShardInfo struct within the ListMeta of the initial LIST response (and initial sync of a WATCH).
  - The ShardInfo struct mirrors the applied selector back to the client via a selector string field.
  - If a client requests a shard and observes a matching ShardInfo.selector, it can safely construct partial lists, process incoming events, or merge responses across multiple shards.
  - If ShardInfo is absent, the client knows the server ignored the parameter and can take appropriate action to avoid breaking mutual exclusion, such as falling back to client-side sharding.
- Clients can also deterministically check for server-side sharding support by querying the OpenAPI v3 discovery document for the presence of the sharding query parameters.
- To enable safe client-side fallback, the hash algorithm and range evaluation logic will be placed in a common library (k8s.io/apimachinery).
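The client-side check described above could look roughly like the following. The ShardInfo and ListMeta types here are simplified stand-ins for the proposed API fields, serverSharded is a hypothetical helper, and comparing the echoed selector by exact string equality is an assumption; the server may normalize the selector before returning it.

```go
package main

import "fmt"

// ShardInfo is a simplified stand-in for the proposed struct.
type ShardInfo struct {
	Selector string
}

// ListMeta is a simplified stand-in for meta/v1 ListMeta.
type ListMeta struct {
	ResourceVersion string
	ShardInfo       *ShardInfo // nil when the server ignored the parameter
}

// serverSharded reports whether the server applied exactly the selector the
// client asked for. A false result means the client must filter locally.
func serverSharded(meta ListMeta, requested string) bool {
	return meta.ShardInfo != nil && meta.ShardInfo.Selector == requested
}

func main() {
	requested := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"

	responses := []struct {
		name string
		meta ListMeta
	}{
		{"old server", ListMeta{ResourceVersion: "100"}},
		{"new server", ListMeta{ResourceVersion: "100", ShardInfo: &ShardInfo{Selector: requested}}},
	}

	for _, r := range responses {
		if serverSharded(r.meta, requested) {
			fmt.Printf("%s: stream is already filtered\n", r.name)
		} else {
			fmt.Printf("%s: falling back to client-side filtering\n", r.name)
		}
	}
}
```

Treating a mismatched selector the same as an absent ShardInfo is the conservative choice, since the client then never assumes a filter that the server did not apply.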
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: ShardedListAndWatch
  - Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?
No. Default behavior (no sharding parameters) remains unchanged.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling the feature gate will cause the API server to stop honoring sharding parameters (either ignoring them as unknown fields or rejecting them, depending on validation). Clients will then either receive full streams (if they fall back) or receive an error.
What happens if we reenable the feature if it was previously rolled back?
Clients can resume sending sharding parameters. The API server will immediately start respecting them again. No state is persisted, so re-enablement is instantaneous.
Are there any tests for feature enablement/disablement?
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
- Significant increase in apiserver_request_duration_seconds for LIST/WATCH requests using sharding (indicating expensive hashing/filtering).
- apiserver_watch_filtered_events_total remaining 0 despite active sharded watches (indicating feature malfunction).
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
n/a
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
By checking the apiserver_watch_shards_total metric. If > 0, sharded watches are active.
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Latency for sharded watches should be comparable to standard watches.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: apiserver_watch_shards_total
    - Components exposing the metric: kube-apiserver
  - Metric name: apiserver_watch_filtered_events_total
    - Components exposing the metric: kube-apiserver
Are there any missing metrics that would be useful to have to improve observability of this feature?
n/a
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
No. It uses standard LIST and WATCH verbs with new query parameters.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No. ListOptions is not persisted.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Negligible. Hashing the UID is a very fast operation (nanoseconds). We already perform similar list filtering for label and field selectors.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
- API Server: Slight increase in CPU for hashing. However, significant downstream savings in network I/O and serialization for filtered events.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
It’s an apiserver feature and will not function if apiserver is unavailable.
What are other known failure modes?
- Hot Shards: If the hash distribution is uneven (unlikely with FNV-1a on UUIDs) or if the
keyspace is partitioned unevenly by clients, some shards may receive significantly more
traffic than others.
  - Detection: apiserver_watch_filtered_events_total showing uneven rates across shards.
  - Mitigation: Clients can adjust their shard ranges to balance the load.
  - Diagnostics: Metrics.
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
Drawbacks
- Complexity in API Server filtering logic.
- Clients need to implement ring logic to calculate ranges.
Alternatives
Virtual Buckets
Instead of arbitrary lexical ranges, we could partition the keyspace into a fixed number of “Virtual Buckets” (e.g., 1024).
- Pros: Easier for clients to reason about.
- Cons: Fixed granularity. Resizing is hard.
Client-side Filtering
Clients are already able to filter events on their side. We could provide a library to help them with the filtering logic but it doesn’t solve the fundamental problem of too much churn in the watch stream for controllers that need to watch resources like Pod.
- Pros: No API Server changes.
- Cons: Network and CPU waste.
Label-Based Sharding
Clients could compute shard assignment themselves and label objects with their assigned shard
(e.g., controller-shard: "0").
- Pros: Works with existing LabelSelector.
- Cons: Write amplification. Every time the number of shards changes, every object in the system might need to be relabeled. This generates massive write load on the API server and etcd.
Explicit Query Parameters
Instead of a new selector grammar, we could expose explicit query parameters for sharding components, e.g., ?shardingKey=uid&shardRangeStart=0&shardRangeEnd=8.
- Pros: Relies on standard query parameter parsing without requiring a custom grammar. The fields map cleanly to strongly typed properties in ListOptions, making validation straightforward.
- Cons: Increases the surface area of meta/v1.ListOptions with fields that are specific only to watch sharding. This introduces new combinations of parameters and increases the risk of test coverage gaps. It is also less flexible for future evolution.
Lightweight Functional Grammar (e.g. selector=...)
Instead of explicit query parameters, we could introduce a generic selector parameter that uses an expression-based functional grammar (e.g., selector=shardRange(object.metadata.uid, 0, 8)). This avoids permanently bloating the core API with niche feature flags.
If this route is taken, we would deliberately start with a simple functional grammar func(args...) rather than full CEL. Whether this grammar would eventually support CEL is dependent on further testing to ensure the watch cache is not slowed down for complex expressions.
Similar to labels and field selectors, the parameter itself would be passed as a string over the wire, but the representation would be strongly typed on the client side (syntactic correctness) and server side (validation and execution).
// example types.go
type ShardRangeRequirement struct {
	Key   string // e.g. "object.metadata.uid"
	Start string // hex string
	End   string // hex string
}

// example client side usage
req := selectors.NewShardRangeRequirement("object.metadata.uid", "0", "8")
listOptions := metav1.ListOptions{
	// ... other options
	Selector: req.String(),
}
- Pros: Future-proof and extensible primitive applicable to more than just sharding. Matches the existing label/field selector patterns.
- Cons: Requires building and maintaining a new parser. The grammar definition must be carefully designed to prevent execution overhead on the watch cache.
Extended Field Selectors
We considered extending the fieldSelector grammar to support functions or hash ranges, e.g.,
fieldSelector=range(object.metadata.uid, 0, 100).
- Pros: Reuses the existing concept of field-based filtering without adding new ListOptions query parameters.
- Cons: The fieldSelector grammar has been around for a long time, relying heavily on exact matches and non-matches. Modifying it to support functional syntax (range(...)) can be fragile because the broader Kubernetes ecosystem may have built-in assumptions about this grammar.
Infrastructure Needed (Optional)
None.