KEP-5866: Server-side Sharded List and Watch

Implementation History
ALPHA Implementable
Created 2026-02-01
Latest v1.36
Milestones
Alpha v1.36
Ownership
Primary Authors

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposal introduces server-side sharded LIST and WATCH to the Kubernetes API Server. By allowing clients to specify a sharding strategy and range in their LIST and WATCH requests, the API Server can filter events at the source, ensuring that horizontally scalable controllers (like kube-state-metrics) only receive the traffic and data they are responsible for.

Motivation

As Kubernetes clusters grow, the volume of events for core resources like Pods increases significantly due to high churn. Many controllers need to scale to handle this load.

Historically, most controllers choose to scale vertically (e.g., kube-controller-manager), as there is no native support for sharding or partitioning the watch stream. Some specialized controllers (e.g., kube-state-metrics) have implemented their own client-side horizontal sharding to distribute work.

However, client-side sharding has a critical limitation: it does not reduce the incoming event volume per replica. Every replica still receives the full stream of events, paying the CPU and network cost to deserialize everything, only to discard items not belonging to their shard. Functionally, this makes horizontal scaling of the watch stream impossible. This results in:

  • Wasted network bandwidth and overhead (N replicas * Full Stream).
  • Wasted CPU and Memory on clients for processing irrelevant events.

The goal of this proposal is to address this bottleneck. By moving filtering to the API server, we provide the primitive needed to:

  1. Allow vertically scaled controllers to eventually adopt sharded architectures.
  2. Enable existing horizontally scalable controllers to scale efficiently without the “full stream” penalty.

We propose moving the filtering logic “upstream” to the Kubernetes API Server. By filtering events at the source, we ensure that each controller replica receives only the data it is responsible for.

Goals

  • Reduce Network Traffic: Clients only receive events for their assigned shard.
  • Reduce Client Resource Usage: Clients do not need to deserialize or process irrelevant events.
  • Extensible Framework: Support future sharding keys beyond the initial implementation.

Non-Goals

  • Coordination: This proposal does not implement the coordination logic for clients. Clients are still responsible for determining their shard ranges.
  • Resharding: The API server does not manage shard rebalancing strategies. Because the API exposes raw hash ranges, clients can implement their own consistent hashing strategies if needed. Server-managed resharding is future work that we are interested in, but it is out of scope for this KEP.
  • Sharding KCM: Sharding KCM is a complex topic that is outside the scope of this KEP. We are working on the fundamental building blocks that can enable sharding for KCM in the future. We want to eventually move towards a full sharding system, and this KEP acts as the initial step.

Proposal

We propose enhancements to the Kubernetes API to support Server-Side Watch Sharding. This allows clients to request a filtered stream of events based on a consistent hashing strategy determined by the client. Clients specify a new shardSelector parameter in their ListOptions, using a small functional grammar (e.g., shardSelector=shardRange(object.metadata.uid, start, end)). The API server computes the hash of the target field for each object and only dispatches events whose hash falls within the requested range. This enables efficient horizontal scaling of controllers by ensuring each replica only processes the data it owns, reducing network ingress and deserialization overhead.

User Stories (Optional)

Story 1: Horizontal Scaling of Controllers

A user wants to deploy a sharded controller (e.g., kube-state-metrics) that monitors Pods across a large cluster. Instead of each replica watching all Pods and filtering client-side—which consumes excessive bandwidth and CPU—the user configures each replica to watch a specific range of UIDs (e.g., Replica 0 watches 00-7f, Replica 1 watches 80-ff). The API server only sends events matching these ranges, allowing the monitoring system to scale linearly with the number of shards.

Story 2 (Optional)

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

API Extensibility: Sharding Parameters

Clients will request shards via query parameters in their LIST and WATCH requests.

Sharding is requested through a dedicated shardSelector query parameter, which maps to the ShardSelector field in meta/v1.ListOptions. This parameter accepts a lightweight CEL-based functional grammar, specifically a shardRange() function.

Syntax Details:

  • shardRange(fieldPath, hexStart, hexEnd)
  • Bounds are 64-bit hash values written as hexadecimal strings with a '0x' prefix (e.g. '0x0000000000000000', '0x8000000000000000').
  • Supported field paths currently include object.metadata.uid and object.metadata.namespace.

The parameters are strongly typed internally on both the client side (for syntactic correctness in client-go) and the server side (for validation and execution).
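
As a rough illustration of the grammar above, the following Go sketch renders a typed range into a shardRange(...) expression and validates the field path. The ShardRange type, its Validate and String methods, and the supportedPaths table are hypothetical names chosen for this example; they are not the client-go API. The bounds are held as uint64 here, so an exclusive end of 2^64 cannot be represented in this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// ShardRange is a hypothetical typed form of one shardRange() call.
type ShardRange struct {
	FieldPath string // e.g. "object.metadata.uid"
	Start     uint64 // inclusive lower bound of the hash range
	End       uint64 // exclusive upper bound of the hash range
}

// supportedPaths lists the field paths named in the syntax details.
var supportedPaths = map[string]bool{
	"object.metadata.uid":       true,
	"object.metadata.namespace": true,
}

// Validate rejects unsupported field paths and empty or inverted ranges.
func (r ShardRange) Validate() error {
	if !supportedPaths[r.FieldPath] {
		return errors.New("unsupported field path: " + r.FieldPath)
	}
	if r.Start >= r.End {
		return errors.New("start must be less than end")
	}
	return nil
}

// String renders the shardRange(fieldPath, hexStart, hexEnd) form.
func (r ShardRange) String() string {
	return fmt.Sprintf("shardRange(%s, '0x%016x', '0x%016x')", r.FieldPath, r.Start, r.End)
}

func main() {
	r := ShardRange{FieldPath: "object.metadata.uid", Start: 0, End: 1 << 63}
	if err := r.Validate(); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println(r.String())
}
```

Keeping the typed form separate from the wire string lets a client catch an inverted range or an unsupported key before the request is sent.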

Shard Key

While several attributes (such as UID, Namespace, and OwnerReference) are viable candidates for sharding keys, the initial implementation will support UID and Namespace only.

kube-state-metrics already uses UID-based partitioning, and it is a natural candidate for this feature because it currently filters on the client side. Based on feedback from users, we will then expand the framework to support configurable sharding across other fields (e.g., OwnerReference, NodeName).

Consistent Hashing (Key Range)

To support seamless scaling, we avoid binding objects directly to specific shards. Instead of using fixed virtual buckets, we partition the keyspace directly using prefix-based ranges.

The total keyspace is treated as a continuous ring or line. We assign ownership by defining start and end prefixes for the hash output. Each shard is configured to watch a specific lexicographical range of the hash output.

In a 2 shard example:

  • Shard 1 covers hashes whose leading hex digit is 0 through 7.
  • Shard 2 covers hashes whose leading hex digit is 8 through f.

This has the benefit that only a small fraction of the keyspace moves between replicas when a reshard occurs.
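
To make the range arithmetic concrete, here is a minimal Go sketch that splits the 64-bit keyspace into n contiguous ranges. The names Bounds, hexBound and SplitKeyspace are assumptions made for this example, and equal-width ranges are only one possible client policy. The sketch uses math/big because the exclusive end of the last range is 2^64, which does not fit in a uint64.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// Bounds is one contiguous slice of the 64-bit hash keyspace.
type Bounds struct {
	Start string // inclusive, '0x'-prefixed hex
	End   string // exclusive, '0x'-prefixed hex
}

// hexBound formats floor(i * 2^64 / n) as hex, zero-padded to at least 16 digits.
func hexBound(i, n int) string {
	v := new(big.Int).Lsh(big.NewInt(1), 64) // 2^64, the size of the keyspace
	v.Mul(v, big.NewInt(int64(i)))
	v.Quo(v, big.NewInt(int64(n)))
	s := v.Text(16)
	if len(s) < 16 {
		s = strings.Repeat("0", 16-len(s)) + s
	}
	return "0x" + s
}

// SplitKeyspace divides the keyspace into n contiguous, non-overlapping ranges.
func SplitKeyspace(n int) []Bounds {
	if n <= 0 {
		return nil
	}
	out := make([]Bounds, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, Bounds{Start: hexBound(i, n), End: hexBound(i+1, n)})
	}
	return out
}

func main() {
	for i, b := range SplitKeyspace(2) {
		fmt.Printf("shard %d: [%s, %s)\n", i+1, b.Start, b.End)
	}
}
```

Each range ends exactly where the next begins, so the ranges cover the keyspace without overlap. The final end bound is 2^64 and therefore prints with 17 hex digits; how the server treats such a bound is not something this sketch can confirm.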

Client Request

Clients will append the new parameter to their LIST and WATCH requests to subscribe to a specific slice of the stream.

Example Request: GET /api/v1/pods?watch=true&shardSelector=shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')

Clients can also specify multiple hash ranges simultaneously using the logical OR operator (||) implemented by the CEL evaluator.

Multiple Range Example: GET /api/v1/pods?watch=true&shardSelector=shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000') || shardRange(object.metadata.uid, '0x8000000000000000', '0x10000000000000000')

  • shardSelector: Introduces a new query parameter specifically for shard selection logic.
  • shardRange(...): A CEL function that specifies the field to hash and the start (inclusive) / end (exclusive) hex bounds (hexStart <= x < hexEnd) of hash values.
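
The example requests above are written unescaped for readability. On the wire, the spaces, quotes, parentheses and || in the expression need percent-encoding. The Go sketch below shows one way a client might build such a URL with the standard net/url package; WatchURL and SelectorFromURL are hypothetical helper names, not part of client-go.

```go
package main

import (
	"fmt"
	"net/url"
)

// WatchURL builds a watch request path with the selector percent-encoded.
func WatchURL(resourcePath, shardSelector string) string {
	q := url.Values{}
	q.Set("watch", "true")
	q.Set("shardSelector", shardSelector)
	return resourcePath + "?" + q.Encode()
}

// SelectorFromURL decodes the shardSelector parameter from a request URL.
func SelectorFromURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	return u.Query().Get("shardSelector"), nil
}

func main() {
	// Two ranges joined with the logical OR operator.
	sel := "shardRange(object.metadata.uid, '0x0000000000000000', '0x4000000000000000')" +
		" || " +
		"shardRange(object.metadata.uid, '0x8000000000000000', '0xc000000000000000')"
	encoded := WatchURL("/api/v1/pods", sel)
	fmt.Println(encoded)

	decoded, err := SelectorFromURL(encoded)
	fmt.Println(decoded == sel, err)
}
```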

Server Design

Currently, the Cacher broadcasts events to all watchers that match a simple Label/Field selector. We will extend this pipeline to support Hash-Based Filtering.

We enhance the SelectionPredicate to carry a new sharding configuration, and introduce a new filter function.

flowchart TD
    Event["Inbox Event"] --> Extract["Extraction: Get metadata.UID"]
    Extract --> Hash["Hash: Compute FNV-1a(UID)"]
    Hash --> Check{"Range Check"}
    Check -- "Matches [Start, End)" --> Keep["Keep Event"]
    Check -- "Outside Range" --> Drop["Discard Event"]
    Keep --> Dispatch["Dispatch to Watcher"]
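
The pipeline in the flowchart can be sketched in Go as follows. This is a simplified model and not the apiserver code: it does not use the real Cacher or SelectionPredicate types, and Event, ShardFilter and Dispatch are hypothetical names. The range check is half-open (start inclusive, end exclusive), and a ToMax flag stands in for an exclusive end of 2^64.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Event is a simplified stand-in for a watch event; only the UID matters here.
type Event struct {
	Type string
	UID  string
}

// ShardFilter keeps events whose hashed UID h satisfies Start <= h < End.
// ToMax marks a range that runs to the top of the keyspace, because an
// exclusive bound of 2^64 does not fit in a uint64.
type ShardFilter struct {
	Start, End uint64
	ToMax      bool
}

// hashUID computes the FNV-1a 64-bit hash of the UID string.
func hashUID(uid string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(uid)) // writes to a hash.Hash never return an error
	return h.Sum64()
}

// Matches is the range check step: keep the event or discard it.
func (f ShardFilter) Matches(e Event) bool {
	h := hashUID(e.UID)
	return h >= f.Start && (f.ToMax || h < f.End)
}

// Dispatch returns the events a watcher with filter f would receive.
func Dispatch(events []Event, f ShardFilter) []Event {
	var kept []Event
	for _, e := range events {
		if f.Matches(e) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	events := []Event{{"ADDED", "uid-a"}, {"MODIFIED", "uid-b"}, {"DELETED", "uid-c"}}
	lower := ShardFilter{Start: 0, End: 1 << 63}
	upper := ShardFilter{Start: 1 << 63, ToMax: true}
	fmt.Println(len(Dispatch(events, lower)), len(Dispatch(events, upper)))
}
```

Because the two filters in main partition the keyspace, every event is delivered to exactly one of the two watchers.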

Hashing Implementation

For speed and efficiency, we will use the FNV-1a hash algorithm.

  • Input: Selected field value (string).
  • Output: 64-bit integer (represented as hex/string for range comparison).
  • Note on Randomness: Kubernetes UUIDs (uuidv4) are already uniformly distributed, but explicit hashing ensures distribution uniformity even if inputs become sequential (e.g., uuidv7 in the future).
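
A minimal sketch of the hashing step, using the FNV-1a implementation in Go's standard hash/fnv package. HashHex and InRange are hypothetical helper names. The sketch also illustrates why a string comparison can stand in for a numeric one: when the hash and both bounds are 16-digit lowercase hex, lexicographic order and numeric order agree.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// HashHex returns the FNV-1a 64-bit hash of s as 16 lowercase hex digits.
func HashHex(s string) string {
	h := fnv.New64a()
	h.Write([]byte(s)) // writes to a hash.Hash never return an error
	return fmt.Sprintf("%016x", h.Sum64())
}

// InRange reports whether start <= hash < end using plain string comparison.
// This is only valid when all three values are fixed-width lowercase hex.
func InRange(hash, start, end string) bool {
	return start <= hash && hash < end
}

func main() {
	uid := "8f2c1f0e-5d0b-4c1e-9a37-2f1d6f3b7c11" // made-up UID for illustration
	h := HashHex(uid)
	fmt.Println(h, InRange(h, "0000000000000000", "8000000000000000"))
}
```

The fixed width matters: a bound with a different number of digits (for example a 17-digit exclusive end of 2^64) would sort incorrectly as a string and would need a numeric comparison instead.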

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

n/a

Unit tests
  • k8s.io/apimachinery/pkg/apis/meta/v1: 2026-02-01 - TBD (Validation logic)
  • k8s.io/apiserver/pkg/storage: 2026-02-01 - TBD (Filtering logic)
Integration tests
  • Watch Sharding Test: Ensure that sharded watches with different ranges each receive only the events that fall within their own range.
e2e tests

Graduation Criteria

Alpha

  • Feature implemented behind ShardedListAndWatch feature gate.
  • Basic unit and integration tests passing.

Beta

  • Benchmarks showing performance improvements for sharded clients.
  • Scalability tests verifying no regression in API server throughput.
  • Informer and reflector framework will be updated to support sharded watches.

Deprecation

Upgrade / Downgrade Strategy

Version Skew Strategy

  • Clients must be updated to send the new parameters.
  • If a client sends sharding parameters to an old API server, the old server will ignore the unknown query parameters and send the full, un-sharded stream.
  • To allow clients to safely distinguish between a filtered stream and a full stream, the API Server will return a new ShardInfo struct within the ListMeta of the initial LIST response (and initial sync of a WATCH).
    • The ShardInfo struct mirrors the applied selector back to the client via a selector string field.
    • If a client requests a shard and observes a matching ShardInfo.selector, it can safely construct partial lists, process incoming events, or merge responses across multiple shards.
    • If ShardInfo is absent, the client knows the server ignored the parameter and can take appropriate action, such as falling back to client-side sharding, to avoid breaking mutual exclusion.
  • Clients can also deterministically check for server-side sharding support by querying the OpenAPI v3 discovery document for the presence of the sharding query parameters.
  • To enable safe client-side fallback, the hash algorithm and range evaluation logic will be placed in a common library (k8s.io/apimachinery).
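
The client-side check described above might look roughly like the Go sketch below. The ShardInfo and ListMeta types here are trimmed, hypothetical stand-ins for the proposed API types, and DecideMode is an invented helper. Comparing the echoed selector by exact string equality and treating a mismatch as an error are both assumptions, since the text only describes the matching and absent cases.

```go
package main

import "fmt"

// ShardInfo is a hypothetical stand-in for the struct echoed by the server.
type ShardInfo struct {
	Selector string // the selector the server applied
}

// ListMeta is a trimmed stand-in for the list metadata in a response.
type ListMeta struct {
	ResourceVersion string
	ShardInfo       *ShardInfo // nil when the server did not apply sharding
}

// Mode describes how the client should treat the response.
type Mode string

const (
	ServerSharded  Mode = "server-sharded"  // stream is already filtered
	ClientFallback Mode = "client-fallback" // full stream, filter locally
	Mismatch       Mode = "mismatch"        // assumption: treat as an error
)

// DecideMode compares the requested selector with what the server echoed.
func DecideMode(requested string, meta ListMeta) Mode {
	switch {
	case meta.ShardInfo == nil:
		return ClientFallback
	case meta.ShardInfo.Selector == requested:
		return ServerSharded
	default:
		return Mismatch
	}
}

func main() {
	requested := "shardRange(object.metadata.uid, '0x0000000000000000', '0x8000000000000000')"
	fmt.Println(DecideMode(requested, ListMeta{}))
	fmt.Println(DecideMode(requested, ListMeta{ShardInfo: &ShardInfo{Selector: requested}}))
}
```

In the fallback case the client would apply the same hash and range check locally, which is why sharing that logic in a common library is useful.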

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: ShardedListAndWatch
    • Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?

No. Default behavior (no sharding parameters) remains unchanged.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling the feature gate will cause the API server to stop honoring sharding parameters (processing them as an unknown/ignored field or erroring, depending on validation). Clients will revert to receiving full streams (if falling back) or receive an error.

What happens if we reenable the feature if it was previously rolled back?

Clients can resume sending sharding parameters. The API server will immediately start respecting them again. No state is persisted, so re-enablement is instantaneous.

Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
  • Significant increase in apiserver_request_duration_seconds for LIST/WATCH requests using sharding (indicating expensive hashing/filtering).
  • apiserver_watch_filtered_events_total remaining 0 despite active sharded watches (indicating feature malfunction).
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

n/a

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By checking the apiserver_watch_shards_total metric: a value greater than 0 means sharded watches are active.

How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Latency for sharded watches should be comparable to standard watches.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: apiserver_watch_shards_total
    • Components exposing the metric: kube-apiserver
    • Metric name: apiserver_watch_filtered_events_total
    • Components exposing the metric: kube-apiserver
Are there any missing metrics that would be useful to have to improve observability of this feature?

n/a

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No. It uses standard LIST and WATCH verbs with new query parameters.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No. ListOptions is not persisted.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Negligible. Hashing the UID is a very fast operation (nanoseconds). We already perform similar list filtering for label and field selectors.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
  • API Server: Slight increase in CPU for hashing. However, significant downstream savings in network I/O and serialization for filtered events.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

It’s an apiserver feature and will not function if apiserver is unavailable.

What are other known failure modes?
  • Hot Shards: If the hash distribution is uneven (unlikely with FNV-1a on UUIDs) or if the keyspace is partitioned unevenly by clients, some shards may receive significantly more traffic than others.
    • Detection: apiserver_watch_filtered_events_total showing uneven rates across shards.
    • Mitigation: Clients can adjust their shard ranges to balance the load.
    • Diagnostics: Metrics.
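
As a rough offline complement to the metrics, a client author could estimate how a set of keys spreads across equal-width shards before choosing ranges. The Go sketch below is one such check under stated assumptions: CountPerShard is a hypothetical helper, the keys are synthetic, and equal-width shards are only one possible partitioning.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountPerShard hashes each key with FNV-1a and counts how many keys land in
// each of n equal-width shards of the 64-bit keyspace.
func CountPerShard(keys []string, n int) []int {
	if n < 1 {
		return nil
	}
	if n == 1 {
		return []int{len(keys)}
	}
	counts := make([]int, n)
	// Shard width, rounded up so that every hash maps to an index below n.
	width := ^uint64(0)/uint64(n) + 1
	for _, k := range keys {
		h := fnv.New64a()
		h.Write([]byte(k)) // writes to a hash.Hash never return an error
		counts[h.Sum64()/width]++
	}
	return counts
}

func main() {
	keys := make([]string, 0, 1000)
	for i := 0; i < 1000; i++ {
		keys = append(keys, fmt.Sprintf("synthetic-uid-%d", i))
	}
	fmt.Println(CountPerShard(keys, 4))
}
```

A large gap between the biggest and smallest counts would suggest adjusting the ranges, as described in the mitigation above.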
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

  • Complexity in API Server filtering logic.
  • Clients need to implement ring logic to calculate ranges.

Alternatives

Virtual Buckets

Instead of arbitrary lexical ranges, we could partition the keyspace into a fixed number of “Virtual Buckets” (e.g., 1024).

  • Pros: Easier for clients to reason about.
  • Cons: Fixed granularity. Resizing is hard.
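
For comparison, the virtual-bucket alternative could be sketched as below. This is illustrative only: BucketFor and OwnerOf are hypothetical names, 1024 is the example bucket count from the text, and round-robin ownership is just one possible assignment policy.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// NumBuckets is fixed at design time, which is the source of the
// fixed-granularity drawback noted above.
const NumBuckets = 1024

// BucketFor maps a key to one of NumBuckets virtual buckets.
func BucketFor(key string) int {
	h := fnv.New64a()
	h.Write([]byte(key)) // writes to a hash.Hash never return an error
	return int(h.Sum64() % NumBuckets)
}

// OwnerOf assigns a bucket to a replica round-robin (one hypothetical policy).
func OwnerOf(bucket, replicas int) int {
	return bucket % replicas
}

func main() {
	b := BucketFor("synthetic-uid-1")
	fmt.Printf("bucket %d is owned by replica %d of 3\n", b, OwnerOf(b, 3))
}
```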

Client-side Filtering

Clients are already able to filter events on their side. We could provide a library to help them with the filtering logic but it doesn’t solve the fundamental problem of too much churn in the watch stream for controllers that need to watch resources like Pod.

  • Pros: No API Server changes.
  • Cons: Network and CPU waste.

Label-Based Sharding

Clients could compute shard assignment themselves and label objects with their assigned shard (e.g., controller-shard: "0").

  • Pros: Works with existing LabelSelector.
  • Cons: Write amplification. Every time the number of shards changes, every object in the system might need to be relabeled. This generates massive write load on the API server and etcd.

Explicit Query Parameters

Instead of a new selector grammar, we could expose explicit query parameters for sharding components, e.g., ?shardingKey=uid&shardRangeStart=0&shardRangeEnd=8.

  • Pros: Relies on standard query parameter parsing without requiring a custom grammar. The fields map cleanly to strongly typed properties in ListOptions, making validation straightforward.
  • Cons: Increases the surface area of meta/v1.ListOptions with fields that are specific only to watch sharding. This introduces new combinations of parameters and increases the risk of test coverage gaps. It is also less flexible for future evolution.

Lightweight Functional Grammar (e.g. selector=...)

Instead of explicit query parameters, we could introduce a generic selector parameter that uses an expression-based functional grammar (e.g., selector=shardRange(object.metadata.uid, 0, 8)). This avoids permanently bloating the core API with niche feature flags.

If this route is taken, we would deliberately start with a simple functional grammar func(args...) rather than full CEL. Whether this grammar would eventually support CEL is dependent on further testing to ensure the watch cache is not slowed down for complex expressions.

Similar to labels and field selectors, the parameter itself would be passed as a string over the wire, but the representation would be strongly typed on the client side (syntactic correctness) and server side (validation and execution).

// example types.go
type ShardRangeRequirement struct {
    Key   string // field path to hash, e.g. "object.metadata.uid"
    Start string // hex string
    End   string // hex string
}

// example client side usage (the selectors package is illustrative)
req := selectors.NewShardRangeRequirement("object.metadata.uid", "0", "8")
listOptions := metav1.ListOptions{
    // ... other options
    Selector: req.String(),
}
  • Pros: Future-proof and extensible primitive applicable to more than just sharding. Matches the existing label/field selector patterns.
  • Cons: Requires building and maintaining a new parser. The grammar definition must be carefully designed to prevent execution overhead on the watch cache.

Extended Field Selectors

We considered extending the fieldSelector grammar to support functions or hash ranges, e.g., fieldSelector=range(object.metadata.uid, 0, 100).

  • Pros: Reuses the existing concept of field-based filtering without adding new ListOptions query parameters.
  • Cons: The fieldSelector grammar has been around for a long time, relying heavily on exact matches and non-matches. Modifying it to support functional syntax (range(...)) can be fragile because the broader Kubernetes ecosystem may have built-in assumptions about this grammar.

Infrastructure Needed (Optional)

None.