KEP-5808: Native Histogram Support for Kubernetes Metrics

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Increase Classic Histogram Bucket Count
Infrastructure Needed (Optional)
References

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding support for Prometheus Native Histograms to Kubernetes component metrics. Native histograms are a stable feature as of Prometheus v3.8.0. They use exponential bucket boundaries instead of fixed boundaries, which provides significant storage efficiency (~10x reduction in time series count per histogram), improved query performance, and finer-grained visibility into distributions. Full backward compatibility with existing monitoring infrastructure is maintained through a dual exposition strategy.

The implementation introduces a feature gate (NativeHistograms) to provide safe rollout and rollback capabilities. When it is enabled, Kubernetes components expose histogram metrics in both classic and native formats simultaneously whenever the scraper requests an exposition format that supports native histograms, such as the Prometheus protobuf format. Existing dashboards and alerts continue to function while users migrate to native histograms at their own pace. Rollback is handled primarily through Prometheus-side configuration (for Prometheus 3.x users) or via the K8s feature gate.

Motivation

Kubernetes exposes hundreds of histogram metrics across its control plane components. These metrics are essential for monitoring cluster health, debugging performance issues, and ensuring service level objectives are met. However, classic Prometheus histograms have inherent limitations:

Storage overhead: Each classic histogram creates multiple time series (one per bucket plus _count and _sum), leading to high storage costs at scale
Fixed bucket boundaries: Predefined buckets may not align well with actual data distributions, which hurts accuracy and can make the boundaries uninformative. For example, if a histogram uses default buckets like [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, a request completing in 1µs (0.000001s) falls into the same le="0.005" bucket as a request completing in 4ms, so a 4000x difference in latency becomes indistinguishable. Similarly, all requests between 1s and 2.5s are grouped together, hiding important performance variations
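Both limitations can be made concrete with a small Go sketch. It is illustrative only: defBuckets mirrors the default boundaries quoted above, while classicSeries and classicBucket are hypothetical helper names, not part of any Kubernetes or Prometheus API.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// defBuckets mirrors the default bucket boundaries quoted above, in seconds.
var defBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// classicSeries returns the number of time series one classic histogram
// produces per label combination: one per configured bucket, the implicit
// +Inf bucket, plus _count and _sum.
func classicSeries(bounds []float64) int {
	return len(bounds) + 1 + 2
}

// classicBucket returns the "le" upper bound of the first bucket that
// contains v, or +Inf when v is larger than every configured boundary.
// bounds must be sorted in ascending order.
func classicBucket(bounds []float64, v float64) float64 {
	i := sort.SearchFloat64s(bounds, v) // smallest i with bounds[i] >= v
	if i == len(bounds) {
		return math.Inf(1)
	}
	return bounds[i]
}

func main() {
	fmt.Println("series per classic histogram:", classicSeries(defBuckets))
	for _, v := range []float64{0.000001, 0.004, 1.2, 2.4} {
		fmt.Printf("%gs -> le=%g\n", v, classicBucket(defBuckets, v))
	}
}
```

With these boundaries, the 1µs and 4ms observations share a bucket, as do the 1.2s and 2.4s observations. A native histogram, by contrast, is stored as a single series per label combination.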

Prometheus Native Histograms, introduced in Prometheus 2.40, address these limitations using exponential bucket boundaries with automatic adjustment. Kubernetes should support this modern, more efficient histogram format.

Goals

Enable Kubernetes components to expose metrics in Prometheus Native Histogram format
Maintain full backward compatibility with existing monitoring infrastructure
Provide a safe, gradual rollout path with extended testing periods

Non-Goals

Remove classic histogram exposition format
Remove existing histogram metrics

Proposal

Add native histogram support to the component-base/metrics package with:

Feature Gate: A new NativeHistograms feature gate controlling whether K8s components expose metrics with native histogram format
Global Defaults: Use sensible global defaults for native histogram configuration
Dual Exposition: When enabled, expose both classic and native histogram formats

The control model is intentionally simple:

K8s-side: Feature gate controls whether native histograms are exposed
Prometheus-side: Per-job scrape_native_histograms controls what Prometheus ingests (Prometheus 3.x)

User Stories

Story 1: Platform Engineer Optimizing Monitoring Costs

As a platform engineer managing a large Kubernetes fleet, I want to reduce the storage costs of my Prometheus infrastructure. With native histograms, I can achieve ~10x reduction in time series count for histogram metrics, significantly reducing storage and improving query performance.

Story 2: SRE Detecting Performance Regressions

As an SRE responsible for cluster reliability, I need to detect performance regressions accurately. With classic histograms, a latency regression from 60ms to 90ms might go unnoticed because, with bucket boundaries at 0.05 and 0.1, both values fall into the same le="0.1" bucket. Native histograms’ exponential buckets provide much finer resolution, enabling me to reliably detect even small performance regressions and set precise SLO thresholds.

Notes/Constraints/Caveats

External Dependency: Native histogram support in Kubernetes depends on Prometheus server scrape settings. To receive both formats during the transition, users must configure the following in their Prometheus scrape config:
1. scrape_native_histograms: true
2. always_scrape_classic_histograms: true

Risks and Mitigations

Note: Using native histograms is opt-in from the user's perspective: Prometheus only collects them if scrape_native_histograms: true is set in the scrape config. Enabling the Kubernetes feature gate by default will only affect users who have already opted in (i.e., configured scrape_native_histograms: true). Those using classic histograms will see no change until they update their Prometheus configuration.

Silent Dashboard/Alert Failures on Upgrade: When upgrading to a Kubernetes version where NativeHistograms feature gate becomes default ON, users with scrape_native_histograms: true in Prometheus who forget to also set always_scrape_classic_histograms: true can experience silent failures:
- Classic _bucket, _count, _sum metrics will no longer be ingested
- Existing dashboards using histogram_quantile(..._bucket...) queries will show no data or stale data
- Alerts based on classic histogram queries will stop firing

The dashboard breakage risk depends on a combination of Prometheus settings:

| `scrape_native_histograms` | `always_scrape_classic_histograms` | Result |
| --- | --- | --- |
| `false` (default) | N/A | SAFE: Classic only |
| `true` | `true` | SAFE: Both formats (recommended during migration) |
| `true` | `false` (default) | RISK: Native only (safe only after full dashboard migration) |
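The table can also be read as a small decision function. The Go sketch below encodes it for a target that exposes both formats; ingestResult is a hypothetical name used for illustration and is not a Prometheus API.

```go
package main

import "fmt"

// ingestResult describes what Prometheus stores for a target that exposes
// both classic and native histograms, given the two per-job scrape settings.
func ingestResult(scrapeNative, alwaysScrapeClassic bool) string {
	switch {
	case !scrapeNative:
		// The second setting is irrelevant when native scraping is off.
		return "classic only"
	case alwaysScrapeClassic:
		return "classic and native"
	default:
		return "native only"
	}
}

func main() {
	for _, c := range []struct{ native, classic bool }{
		{false, false}, {true, true}, {true, false},
	} {
		fmt.Printf("scrape_native_histograms=%t always_scrape_classic_histograms=%t -> %s\n",
			c.native, c.classic, ingestResult(c.native, c.classic))
	}
}
```

Only the last combination drops the classic series, which is why it is safe only after dashboards and alerts have been migrated.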

Migration workflow:

The always_scrape_classic_histograms setting addresses a chicken-and-egg problem: users cannot migrate dashboards to native histogram queries without first enabling native histogram ingestion, but enabling ingestion without classic format would break existing dashboards.

Recommended approach:

Enable both formats: Set scrape_native_histograms: true AND always_scrape_classic_histograms: true
Migrate dashboards/alerts: Update queries from classic (histogram_quantile(..._bucket...)) to native histogram functions
Verify in staging: Ensure all dashboards and alerts work with native histogram queries
Disable classic scraping: Once migration is complete and verified, set always_scrape_classic_histograms: false to reduce storage overhead
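For the query migration step, the difference between the two query shapes is small. The Go sketch below builds both PromQL expressions for comparison; classicQuantile and nativeQuantile are hypothetical helpers, and the metric name and window in main are only examples.

```go
package main

import "fmt"

// classicQuantile builds a quantile query over classic _bucket series,
// which must be aggregated while preserving the "le" label.
func classicQuantile(metric string, q float64, window string) string {
	return fmt.Sprintf("histogram_quantile(%g, sum by (le) (rate(%s_bucket[%s])))", q, metric, window)
}

// nativeQuantile builds the equivalent query over a native histogram
// series: there is no _bucket suffix and no "le" label to preserve.
func nativeQuantile(metric string, q float64, window string) string {
	return fmt.Sprintf("histogram_quantile(%g, sum(rate(%s[%s])))", q, metric, window)
}

func main() {
	const metric = "apiserver_request_duration_seconds"
	fmt.Println("classic:", classicQuantile(metric, 0.99, "5m"))
	fmt.Println("native: ", nativeQuantile(metric, 0.99, "5m"))
}
```

While both formats are ingested, the two queries can be run side by side in staging to compare results before the classic series are dropped.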

Mitigation:

Verify Prometheus version (3.x recommended for per-job control)
Set always_scrape_classic_histograms: true for all K8s scrape jobs during migration
Test dashboard queries in staging before production upgrade
Docs and release notes must clearly state that users enabling scrape_native_histograms should also set always_scrape_classic_histograms: true until dashboard migration is complete

Design Details

Kubernetes metrics use the component-base/metrics package which wraps prometheus/client_golang. Currently:

HistogramOpts only supports classic Buckets []float64
No configuration path for native histogram options
Hundreds of histogram metrics across control plane components

Dual Exposition Strategy

When native histograms are enabled, Kubernetes will expose BOTH formats. The format returned depends on the client’s Accept header:

Text format (text/plain, OpenMetrics 1.0):

Remains backward compatible; contains only classic histogram buckets

# Classic histogram buckets (always present)
apiserver_request_duration_seconds_bucket{le="0.005"} 1000
apiserver_request_duration_seconds_bucket{le="0.01"} 2000
...
apiserver_request_duration_seconds_bucket{le="+Inf"} 10000
apiserver_request_duration_seconds_count 10000
apiserver_request_duration_seconds_sum 450.5

Protobuf format (application/vnd.google.protobuf):

Contains both classic buckets AND native histogram data
Native histograms use the Prometheus protobuf schema with Histogram message containing exponential bucket spans
Binary format, not human-readable

Prometheus negotiates the format via content negotiation. When scrape_native_histograms: true, Prometheus requests protobuf format to receive native histogram data.

This ensures:

Existing dashboards continue to work
Users can migrate queries at their own pace
Prometheus stores whichever format it’s configured for
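The content negotiation described above can be sketched from the scraper's side. This is a minimal illustration using only the Go standard library, not the code Prometheus or Kubernetes runs: newScrapeRequest and acceptsProtobuf are hypothetical helpers, the URL is a placeholder, and real servers built on client_golang negotiate the format inside the metrics handler.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// protoAccept is the media type used to request the Prometheus protobuf
// exposition format, the only format that can carry native histograms.
const protoAccept = "application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited"

// newScrapeRequest builds a GET request for a metrics endpoint. When
// wantNative is true it prefers protobuf and falls back to text.
func newScrapeRequest(url string, wantNative bool) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if wantNative {
		req.Header.Set("Accept", protoAccept+",text/plain;version=0.0.4;q=0.5")
	} else {
		req.Header.Set("Accept", "text/plain;version=0.0.4")
	}
	return req, nil
}

// acceptsProtobuf reports whether the request headers list the protobuf
// media type, i.e. whether the response may include native histograms.
func acceptsProtobuf(h http.Header) bool {
	return strings.Contains(h.Get("Accept"), "application/vnd.google.protobuf")
}

func main() {
	// Placeholder URL; no network call is made here.
	for _, native := range []bool{false, true} {
		req, err := newScrapeRequest("https://kube-apiserver.example:6443/metrics", native)
		if err != nil {
			panic(err)
		}
		fmt.Printf("wantNative=%t protobuf=%t\n", native, acceptsProtobuf(req.Header))
	}
}
```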

Implementation Phases

We will use sensible global defaults for all native histograms without exposing configuration options to developers. This keeps the implementation simple while we gather feedback. The HistogramOpts struct in component-base/metrics remains unchanged, with no new fields. Configuration options may be added in future phases if user feedback and real-world usage patterns show a need.

We will update the conversion function to pass native histogram options to the underlying Prometheus library when the feature gate is enabled:

// toPromHistogramOpts converts the Kubernetes HistogramOpts into the
// client_golang equivalent, adding native histogram settings when the
// NativeHistograms feature gate is enabled.
func (o *HistogramOpts) toPromHistogramOpts() prometheus.HistogramOpts {
    opts := prometheus.HistogramOpts{
        Namespace:   o.Namespace,
        Subsystem:   o.Subsystem,
        Name:        o.Name,
        Help:        o.Help,
        ConstLabels: o.ConstLabels,
        Buckets:     o.Buckets, // Always keep classic buckets
    }

    if utilfeature.DefaultFeatureGate.Enabled(features.NativeHistograms) {
        // Use fixed global defaults.
        // Default bucket growth factor.
        opts.NativeHistogramBucketFactor = 1.1
        // Default max buckets, based on the OTel SDK recommendation:
        // https://opentelemetry.io/docs/specs/otel/metrics/sdk/#base2-exponential-bucket-histogram-aggregation
        opts.NativeHistogramMaxBucketNumber = 160
    }

    return opts
}
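To show what a bucket factor of 1.1 means in practice, the Go sketch below derives the resolution it implies. It assumes the standard native histogram scheme, in which bucket boundaries grow by 2^(2^-schema) and the library picks the largest such growth factor that does not exceed the requested one. schemaFor, growth, and bucketIndex are hypothetical names, and the real client library computes bucket indices differently for speed and exactness at boundaries.

```go
package main

import (
	"fmt"
	"math"
)

// schemaFor returns the schema for a requested bucket growth factor: the
// coarsest schema whose actual growth factor, 2^(2^-schema), does not
// exceed the requested one. Schemas are clamped to the range -4..8.
func schemaFor(factor float64) int {
	if factor <= 1 {
		panic("bucket factor must be greater than 1")
	}
	floor := math.Floor(math.Log2(math.Log2(factor)))
	switch {
	case floor <= -8:
		return 8
	case floor >= 4:
		return -4
	default:
		return -int(floor)
	}
}

// growth returns the effective ratio between consecutive bucket boundaries.
func growth(schema int) float64 {
	return math.Exp2(math.Exp2(float64(-schema)))
}

// bucketIndex returns the index of the bucket holding a positive value v;
// bucket i covers (base^(i-1), base^i] with base = growth(schema).
func bucketIndex(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

func main() {
	s := schemaFor(1.1)
	fmt.Printf("schema=%d effective growth=%.4f\n", s, growth(s))
	for _, v := range []float64{0.06, 0.09} {
		fmt.Printf("%gs -> native bucket index %d\n", v, bucketIndex(v, s))
	}
}
```

Under these assumptions, a requested factor of 1.1 maps to schema 3, meaning each bucket boundary is roughly 9% larger than the previous one, so observations of 60ms and 90ms land in different buckets.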

Prometheus Version Compatibility

Native histogram support and configuration vary significantly across Prometheus versions:

Prometheus < 2.40:

Cannot ingest native histograms at all
K8s exposing native histograms has no effect—Prometheus ignores them
Classic _bucket, _count, _sum metrics continue to work
Action needed: None, but no benefit from native histograms

Prometheus 2.40 - 2.x:

# Enable native histogram support globally
prometheus --enable-feature=native-histograms

# This is all-or-nothing: ALL scrape jobs will attempt to ingest native histograms
# No per-job control available

Higher risk: Cannot selectively enable for K8s while keeping classic for other targets
If enabled, dashboards for ALL targets using classic histogram queries may break

Prometheus 3.0 - 3.7:

# Per-job configuration (recommended)
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    scrape_native_histograms: true
    always_scrape_classic_histograms: true  # Keep classic during transition

# OR use feature flag for global default (still supported)
# prometheus --enable-feature=native-histograms

Per-job control available
Can enable native histograms for K8s while keeping other jobs on classic only
Prometheus feature flag (--enable-feature=native-histograms) still works as global default

Prometheus 3.8:

# Per-job configuration (required for fine-grained control)
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    scrape_native_histograms: true
    always_scrape_classic_histograms: true

# Feature flag now ONLY changes the global default value of scrape_native_histograms to true FOR ALL JOBS
# Individual jobs can still override with explicit scrape_native_histograms: false
# prometheus --enable-feature=native-histograms  # Changes default for all jobs, explicit per-job config preferred

The Prometheus feature flag (--enable-feature=native-histograms) has only one remaining effect: it sets scrape_native_histograms: true as the default for all jobs
Per-job settings override the default
Transition period: migrate from flag to explicit per-job config

Prometheus 3.9+:

# Per-job configuration (only method)
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    scrape_native_histograms: true
    always_scrape_classic_histograms: true

Prometheus feature flag (--enable-feature=native-histograms) fully deprecated and removed
Must use per-job scrape_native_histograms and always_scrape_classic_histograms
Default for both settings is false

Special Concern: Prometheus 2.x Users

Prometheus 2.x users with --enable-feature=native-histograms enabled are in a difficult position:

Scenario:

User has prometheus --enable-feature=native-histograms enabled for other workloads that benefit from native histograms
K8s starts exposing native histograms when the feature gate is enabled

Problem:

Prometheus 2.x global flag means ALL scrape jobs (including K8s) now ingest native format only
Other workloads that were prepared for this change continue working
But K8s dashboards that weren’t updated for native histograms suddenly break
No way to selectively keep classic format just for K8s while maintaining native for other apps

Mitigation options:

Turn off Prometheus feature flag
- Loses native histograms for ALL workloads (not just K8s)
Disable K8s feature gate: --feature-gates=NativeHistograms=false
- Requires K8s component restarts (may be slow/disruptive)
- For managed K8s, may not be possible
Upgrade to Prometheus 3.x
- Major version upgrade, may not be quick/easy
- Then can use per-job scrape_native_histograms: false to keep K8s on classic only
- OR use always_scrape_classic_histograms: true to get both native and classic formats

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Existing histogram metric tests have been extended to verify dual exposition behavior when the feature gate is enabled.

Unit tests

Test coverage before Beta graduation
- k8s.io/component-base/metrics: 2026-06-10 - 73.6%
- k8s.io/component-base/metrics/features: 2026-06-10 - 0.0% (contains only feature gate definitions; exercised by other packages’ tests)
- k8s.io/component-base/metrics/internal: 2026-06-10 - 0.0% (holds internal enabling state; tested via the main metrics package tests)

Integration tests

TestAPIServerNativeHistogramMetrics: integration master, triage search
TestSchedulerNativeHistogramMetrics: integration master, triage search
TestControllerManagerNativeHistogramMetrics: integration master, triage search

e2e tests

[sig-instrumentation] NativeHistograms should export both classic and native histograms (in protobuf format) from apiserver /metrics: SIG Instrumentation, triage search

Graduation Criteria

Alpha

Feature implemented behind NativeHistograms feature gate
Initial unit tests completed and enabled
Basic documentation available

Beta

Gather feedback from early adopters
Comprehensive integration tests in place
E2E tests covering upgrade/downgrade scenarios
Documentation updated with:
- Migration guide
- Troubleshooting procedures
- Prometheus configuration examples
- Clear Prometheus 2.x limitations
Performance benchmarks completed showing no regression (Migrate select performance tests to use native histograms for early feedback)
Refactor histograms created in init() functions to use lazy initialization (e.g., sync.Once with getter functions) so native histogram options are properly applied after feature gates are parsed
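The lazy initialization item can be sketched as follows. This is a simplified stand-in using only the standard library: nativeHistogramsEnabled models the parsed feature gate, and histogram and getRequestLatency are hypothetical names rather than component-base APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// nativeHistogramsEnabled stands in for the parsed NativeHistograms feature
// gate. In a real component its value is only known after flag parsing,
// which happens after package-level init() functions have run.
var nativeHistogramsEnabled bool

// histogram is a hypothetical stand-in for a registered histogram metric.
type histogram struct {
	name   string
	native bool // whether native histogram options were applied
}

var (
	requestLatencyOnce sync.Once
	requestLatency     *histogram
)

// getRequestLatency returns the histogram, constructing it on first use so
// the feature gate is read after flags are parsed rather than in init().
func getRequestLatency() *histogram {
	requestLatencyOnce.Do(func() {
		requestLatency = &histogram{
			name:   "request_duration_seconds",
			native: nativeHistogramsEnabled,
		}
	})
	return requestLatency
}

func main() {
	// Simulate flag parsing enabling the gate before the first observation.
	nativeHistogramsEnabled = true
	fmt.Println("native options applied:", getRequestLatency().native)
}
```

A histogram built in init() would have captured the gate's zero value instead. The trade-off is that every call site must go through the getter, and later gate changes are not picked up because sync.Once runs the constructor only once.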

GA

Consider whether to make Native Histogram options configurable

Upgrade / Downgrade Strategy

Upgrade:

Kubernetes upgrade does not change monitoring behavior if feature gate is off
When feature gate is enabled:
- Classic histogram format continues to be exposed
- Native format is additionally exposed

Downgrade: When downgrading from a Kubernetes version with native histogram support to an older version:

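Metrics automatically revert to classic histogram format only (_bucket, _count, _sum)
Prometheus behavior:
- If always_scrape_classic_histograms: true: Prometheus continues to scrape classic histograms without interruption
- If always_scrape_classic_histograms: false: Prometheus falls back to ingesting the classic series, because the setting only applies to histograms that are also exposed in native format. This matches the fallback described under "Were upgrade and rollback tested?". The classic series have a gap covering the period in which only native histograms were ingested
Dashboard/Alert impact:
- Classic histogram queries: keep working if always_scrape_classic_histograms: true; otherwise they return data only from the downgrade onward
- Native histogram queries stop receiving new data and will break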

Enabling Native Histograms (Opt-in):

# 1. Ensure Prometheus is ready (Prometheus 3.x)
# prometheus.yml
scrape_configs:
  - job_name: 'kubernetes-apiservers'
    scrape_native_histograms: true           # Ingest native histograms
    always_scrape_classic_histograms: true   # CRITICAL: Keep classic during transition

# For older Prometheus (2.40-2.x), use global feature flag:
# --enable-feature=native-histograms

# 2. Enable feature gate in Kubernetes
--feature-gates=NativeHistograms=true

Version Skew Strategy

Native histogram support is independent per component, so no coordination is required. Some components may expose native histograms while others do not. This is acceptable because Prometheus scrapes each target independently.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: NativeHistograms
- Components depending on the feature gate: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy
Other
- Describe the mechanism: Prometheus 3.x per-job scrape_native_histograms: false stops ingestion without K8s changes
- Will enabling / disabling the feature require downtime of the control plane? For feature gate changes, yes (component restart required). For Prometheus config changes, no.
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No

Does enabling the feature change any default behavior?

When enabled, the metrics endpoint will expose an additional native histogram encoding alongside the existing classic histogram format. The classic format (_bucket, _count, _sum) remains unchanged and always present.

Users with Prometheus configured to prefer native histograms will see the data stored in native format. Users without native histogram support enabled in Prometheus see no change.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling can be done via:

Prometheus config: scrape_native_histograms: false per job (fastest, Prometheus 3.x only, no K8s restart)
Feature gate: --feature-gates=NativeHistograms=false (requires component restart)

When K8s feature is disabled, only classic histogram format is exposed. When Prometheus stops ingesting native histograms, it resumes scraping classic format on next scrape interval. No data loss occurs; historical data in Prometheus remains queryable.

What happens if we reenable the feature if it was previously rolled back?

Native histogram exposition resumes. No special handling required. Prometheus will begin storing native histograms again if configured to do so.

Are there any tests for feature enablement/disablement?

Yes, unit tests verify:

toPromHistogramOpts() returns correct configuration based on feature gate state
Toggling feature gate changes histogram configuration appropriately
Classic buckets are always present regardless of feature gate state

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout failure scenarios:

Prometheus too old to understand native histograms - Prometheus ignores native format, stores classic
Dashboard queries not updated - Dashboards continue to work with classic format, provided Prometheus still ingests the classic series (see Risks and Mitigations)
Memory pressure from additional histogram storage

Impact on workloads: None. This feature only affects metrics exposition, not workload behavior.

What specific metrics should inform a rollback?

Prometheus scrape errors increasing for Kubernetes targets
Significant increase in process_resident_memory_bytes for control plane components
Increase in /metrics endpoint latency
Dashboard queries failing

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Because all metrics are stored strictly in memory, the upgrade/rollback path was verified at the API/scrape layer by simulating the feature gate transition in benchmarks. With the feature gate enabled, native histograms are exposed alongside classic histograms and ingested by Prometheus (if scrape settings are configured). With the feature gate disabled, only classic histograms are exposed, and Prometheus successfully falls back to classic scraping (or stops scraping native histograms without errors, continuing to scrape classic if always_scrape_classic_histograms is true).

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No. Classic histogram format is not deprecated.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check component logs for “native histograms enabled” message
Query kubernetes_feature_enabled metric with label name=NativeHistograms (value 1 = enabled)
Scrape /metrics endpoint with protobuf format and verify native histogram encoding is present
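The kubernetes_feature_enabled check can be automated against the text output of /metrics. The Go sketch below is a simplified parser for illustration: featureGateState is a hypothetical helper, the sample input is made up rather than captured from a cluster, and the parser assumes samples carry no trailing timestamp.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// featureGateState scans Prometheus text exposition output for the
// kubernetes_feature_enabled sample of the named gate. found is false when
// no such sample exists.
func featureGateState(metricsText, gate string) (enabled, found bool) {
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "kubernetes_feature_enabled{") ||
			!strings.Contains(line, `name="`+gate+`"`) {
			continue
		}
		// The sample value is the last whitespace-separated field.
		fields := strings.Fields(line)
		return fields[len(fields)-1] == "1", true
	}
	return false, false
}

func main() {
	// Illustrative sample input, not captured from a real cluster.
	sample := `# TYPE kubernetes_feature_enabled gauge
kubernetes_feature_enabled{name="NativeHistograms",stage="ALPHA"} 1
`
	enabled, found := featureGateState(sample, "NativeHistograms")
	fmt.Printf("found=%t enabled=%t\n", found, enabled)
}
```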

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details: Query the metrics endpoint with Accept: application/vnd.google.protobuf header and verify native histogram encoding is present for histogram metrics. Note: Native histograms are only supported in protobuf exposition format, not in text-based formats.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Metrics endpoint latency should not increase significantly when native histograms are enabled
All existing classic histogram queries continue to function

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: process_resident_memory_bytes (existing, monitor for unexpected increases)
- Components exposing the metric: All control plane components

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No cluster services required. However, to utilize native histograms:

Prometheus 2.40+ (experimental) or Prometheus 3.8+ (stable, as noted in the Summary)
- Usage description: Required to scrape and store native histogram format
- Configuration:
  - Prometheus 2.40-2.x: --enable-feature=native-histograms (global)
  - Prometheus 3.x: scrape_native_histograms: true per scrape job (recommended)

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

The /metrics endpoint may take slightly longer to serialize when exposing both formats. Benchmarks show that under a high-cardinality load (~80,000 active series), serving both formats in Protobuf only increases the binary payload by 200 KB (+11.7%), representing a negligible, sub-millisecond serialization latency increase. For legacy text format scrapes, native histograms are skipped entirely, resulting in 0% serialization overhead.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Memory: Small increase for native histogram bucket storage, bounded by the fixed NativeHistogramMaxBucketNumber default (160) described in Design Details. Heap profiling shows an overhead of ~142 bytes per active timeseries, allocated during observation operations (totaling +7.34 MB for 54,000 active timeseries). In a large production cluster with 500k active timeseries, this represents ~70 MB, which is <1% of typical API server memory.

CPU: Negligible increase for histogram operations.

Network: Slight increase in /metrics response size when exposing both formats. According to benchmarks, exposing native histograms in Protobuf format adds only 200 KB (+11.7%) to the binary payload compared to classic Protobuf.

Collector (Prometheus) Memory: Under a live 80k active series load, ingesting both formats simultaneously increased Prometheus heap memory usage by only 13.38 MB (+9.9%), according to benchmarks.
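The component memory estimate above is a linear extrapolation from the per-series overhead. The Go sketch below reproduces that arithmetic so it can be rerun for other cluster sizes; estimateOverheadBytes is a hypothetical helper, and the 142-byte figure is taken from the heap profile quoted above.

```go
package main

import "fmt"

// estimateOverheadBytes scales a per-series overhead to a series count.
func estimateOverheadBytes(bytesPerSeries, activeSeries int64) int64 {
	return bytesPerSeries * activeSeries
}

func main() {
	const perSeries = 142 // bytes per active series, from the heap profile above
	for _, n := range []int64{54_000, 500_000} {
		b := estimateOverheadBytes(perSeries, n)
		// Print both decimal (MB) and binary (MiB) units, since profilers
		// and dashboards differ in which one they report.
		fmt.Printf("%d series -> %.2f MB (%.2f MiB)\n",
			n, float64(b)/1e6, float64(b)/1048576.0)
	}
}
```

The extrapolation assumes the per-series overhead stays constant as cardinality grows, which the benchmarks do not establish for much larger series counts.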

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. This feature only affects in-memory histogram representation and metrics endpoint output.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No impact. Metrics exposition is independent of API server and etcd availability (except for the API server’s own metrics).

What are other known failure modes?

Failure mode: Prometheus too old
- Detection: Prometheus logs errors about unknown metric format
- Mitigations: Upgrade Prometheus to 2.40+ or disable native histograms
- Diagnostics: Check Prometheus version; verify --enable-feature=native-histograms is set
Failure mode: Memory pressure from histogram storage
- Detection: process_resident_memory_bytes increasing; OOMKilled events
- Mitigations: Disable native histogram ingestion in Prometheus; disable K8s feature gate
- Diagnostics: Compare memory usage before/after enabling feature

What steps should be taken if SLOs are not being met to determine the problem?

Check if native histograms are enabled
Compare memory usage with baseline
Check /metrics endpoint latency
If issues detected, disable via Prometheus config (scrape_native_histograms: false) or K8s feature gate
File issue with memory/latency profiles

Implementation History

v1.36: Initial KEP created (Alpha)
v1.36: Enabled native-histograms in apiserver, scheduler, kubelet, kube-controller-manager and kube-proxy

Drawbacks

Increased complexity: Two histogram formats to maintain and test
External dependency: Full benefit requires Prometheus upgrade by users
Memory overhead: Small additional memory for native histogram storage

Alternatives

Increase Classic Histogram Bucket Count

Instead of adopting native histograms, we could increase the number of buckets in classic histograms to achieve finer granularity.

Cons:

Each additional bucket creates a new time series, significantly increasing cardinality
This directly increases Prometheus storage costs, memory usage, and query latency
The cardinality explosion from more classic buckets negates any observability benefits

Infrastructure Needed (Optional)

None.
