KEP-5808: Native Histogram Support for Kubernetes Metrics
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
- References
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes adding support for Prometheus Native Histograms to Kubernetes component metrics. Starting with Prometheus v3.8.0, native histograms are supported as a stable feature. Native histograms use exponential bucket boundaries instead of fixed boundaries, providing significant storage efficiency (~10x reduction in time series count per histogram), improved query performance, and finer-grained visibility into distributions while maintaining full backward compatibility with existing monitoring infrastructure through a dual exposition strategy.
The implementation introduces a feature gate (NativeHistograms) to provide safe rollout and rollback capabilities. When enabled, Kubernetes components will expose histogram metrics in both classic and native formats simultaneously (when requesting a format that supports Native Histograms, such as PrometheusProto), ensuring existing dashboards and alerts continue to function while users can migrate to native histograms at their own pace. Rollback is handled primarily through Prometheus-side configuration (for Prometheus 3.x users) or via the K8s feature gate.
Motivation
Kubernetes exposes hundreds of histogram metrics across its control plane components. These metrics are essential for monitoring cluster health, debugging performance issues, and ensuring service level objectives are met. However, classic Prometheus histograms have inherent limitations:
- Storage overhead: Each classic histogram creates multiple time series (one per bucket plus _count and _sum), leading to high storage costs at scale
- Fixed bucket boundaries: Predefined buckets may not align well with actual data distributions, which hurts accuracy and can make the chosen boundaries uninformative. For example, if a histogram uses default buckets like [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds, a request completing in 1µs (0.000001s) falls into the same le="0.005" bucket as a request completing in 4ms, so a 4000x difference in latency becomes indistinguishable. Similarly, all requests between 1s and 2.5s are grouped together, hiding important performance variations
Prometheus Native Histograms, introduced in Prometheus 2.40, address these limitations using exponential bucket boundaries with automatic adjustment. Kubernetes should support this modern, more efficient histogram format.
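The resolution difference can be illustrated with a small, standard-library-only Go sketch. It is not the Prometheus implementation: the helper names are hypothetical, and schema 3 (base 2^(1/8), roughly 1.09) is an assumption chosen for illustration. The sketch looks up the classic bucket for a value and computes the index of the exponential bucket that would hold the same value.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// defaultBuckets mirrors the classic default boundaries quoted above (seconds).
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// classicUpperBound returns the "le" boundary of the first classic bucket
// that contains v, or +Inf if v exceeds every boundary.
func classicUpperBound(buckets []float64, v float64) float64 {
	i := sort.SearchFloat64s(buckets, v) // smallest index with buckets[i] >= v
	if i == len(buckets) {
		return math.Inf(1)
	}
	return buckets[i]
}

// exponentialIndex returns the index of the exponential bucket containing a
// positive value v, where bucket i covers (base^(i-1), base^i] and
// base = 2^(2^-schema). The schema value is an illustrative assumption.
func exponentialIndex(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

func main() {
	for _, v := range []float64{0.000001, 0.004} {
		fmt.Printf("v=%gs classic le=%g exponential index=%d\n",
			v, classicUpperBound(defaultBuckets, v), exponentialIndex(v, 3))
	}
}
```

Both sample values share the classic le="0.005" bucket, while their exponential bucket indices differ, which is the extra resolution this KEP relies on.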
Goals
- Enable Kubernetes components to expose metrics in Prometheus Native Histogram format
- Maintain full backward compatibility with existing monitoring infrastructure
- Provide a safe, gradual rollout path with extended testing periods
Non-Goals
- Remove classic histogram exposition format
- Remove existing histogram metrics
Proposal
Add native histogram support to the component-base/metrics package with:
- Feature Gate: A new NativeHistograms feature gate controlling whether K8s components expose metrics with native histogram format
- Global Defaults: Use sensible global defaults for native histogram configuration (alpha phase)
- Dual Exposition: When enabled, expose both classic and native histogram formats
The control model is intentionally simple:
- K8s-side: Feature gate controls whether native histograms are exposed
- Prometheus-side: Per-job scrape_native_histograms controls what Prometheus ingests (Prometheus 3.x)
User Stories
Story 1: Platform Engineer Optimizing Monitoring Costs
As a platform engineer managing a large Kubernetes fleet, I want to reduce the storage costs of my Prometheus infrastructure. With native histograms, I can achieve ~10x reduction in time series count for histogram metrics, significantly reducing storage and improving query performance.
Story 2: SRE Detecting Performance Regressions
As an SRE responsible for cluster reliability, I need to detect performance regressions accurately. With classic histograms, a latency regression from 1ms to 50ms might go unnoticed when both values fall into the same bucket (for example, le="0.1" in a histogram whose smallest boundary is 0.1s). Native histograms’ exponential buckets provide much finer resolution, enabling me to reliably detect even small performance regressions and set precise SLO thresholds.
Notes/Constraints/Caveats
- External Dependency: Native histogram support in Kubernetes depends on Prometheus server scrape settings. Users must configure the following in their Prometheus scrape config during transition to receive both formats:
  - scrape_native_histograms: true
  - always_scrape_classic_histograms: true
Risks and Mitigations
Note: Using native histograms is opt-in from the user's perspective: Prometheus only collects them if scrape_native_histograms: true is set in the scrape config. Enabling the Kubernetes feature gate by default will only affect users who have already opted in (i.e., configured scrape_native_histograms: true). Those using classic histograms will see no change until they update their Prometheus configuration.
- Silent Dashboard/Alert Failures on Upgrade: When upgrading to a Kubernetes version where the NativeHistograms feature gate becomes default ON, users with scrape_native_histograms: true in Prometheus who forget to also set always_scrape_classic_histograms: true can experience silent failures:
  - Classic _bucket, _count, _sum metrics will no longer be ingested
  - Existing dashboards using histogram_quantile(..._bucket...) queries will show no data or stale data
  - Alerts based on classic histogram queries will stop firing
The dashboard breakage risk depends on a combination of Prometheus settings:
| scrape_native_histograms | always_scrape_classic_histograms | Result |
|---|---|---|
| false (default) | N/A | SAFE: Classic only |
| true | true | SAFE: Both formats (recommended during migration) |
| true | false (default) | RISK: Native only (safe only after full dashboard migration) |
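The table can be read as a small decision function. The Go sketch below is illustrative only: the ScrapeJob type, its field names, and the job name kubernetes-nodes are hypothetical and are not part of Prometheus or Kubernetes. It flags scrape jobs whose settings fall into the RISK row.

```go
package main

import "fmt"

// ScrapeJob holds the two per-job Prometheus settings discussed above.
// The type and its field names are hypothetical, for illustration only.
type ScrapeJob struct {
	Name                          string
	ScrapeNativeHistograms        bool
	AlwaysScrapeClassicHistograms bool
}

// Result returns the table row that matches the job's settings.
func (j ScrapeJob) Result() string {
	switch {
	case !j.ScrapeNativeHistograms:
		return "SAFE: classic only"
	case j.AlwaysScrapeClassicHistograms:
		return "SAFE: both formats"
	default:
		return "RISK: native only"
	}
}

// riskyJobs lists the jobs that would stop ingesting classic series.
func riskyJobs(jobs []ScrapeJob) []string {
	var names []string
	for _, j := range jobs {
		if j.ScrapeNativeHistograms && !j.AlwaysScrapeClassicHistograms {
			names = append(names, j.Name)
		}
	}
	return names
}

func main() {
	jobs := []ScrapeJob{
		{Name: "kubernetes-apiservers", ScrapeNativeHistograms: true, AlwaysScrapeClassicHistograms: true},
		{Name: "kubernetes-nodes", ScrapeNativeHistograms: true}, // hypothetical job name
	}
	for _, j := range jobs {
		fmt.Printf("%s: %s\n", j.Name, j.Result())
	}
	fmt.Println("needs attention:", riskyJobs(jobs))
}
```

A check of this kind could run against a parsed scrape configuration before a Kubernetes upgrade, although parsing the configuration is outside the scope of this sketch.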
Migration workflow:
The always_scrape_classic_histograms setting addresses a chicken-and-egg problem: users cannot migrate dashboards to native histogram queries without first enabling native histogram ingestion, but enabling ingestion without classic format would break existing dashboards.
Recommended approach:
- Enable both formats: Set scrape_native_histograms: true AND always_scrape_classic_histograms: true
- Migrate dashboards/alerts: Update queries from classic (histogram_quantile(..._bucket...)) to native histogram functions
- Verify in staging: Ensure all dashboards and alerts work with native histogram queries
- Disable classic scraping: Once migration is complete and verified, set always_scrape_classic_histograms: false to reduce storage overhead
Mitigation:
- Verify Prometheus version (3.x recommended for per-job control)
- Set always_scrape_classic_histograms: true for all K8s scrape jobs during migration
- Test dashboard queries in staging before production upgrade
- Docs and release notes must clearly state that users enabling scrape_native_histograms should also set always_scrape_classic_histograms: true until dashboard migration is complete
Design Details
Kubernetes metrics use the component-base/metrics package which wraps prometheus/client_golang. Currently:
- HistogramOpts only supports classic Buckets []float64
- No configuration path for native histogram options
- Hundreds of histogram metrics across control plane components
Dual Exposition Strategy
When native histograms are enabled, Kubernetes will expose BOTH formats. The format returned depends on the client’s Accept header:
Text format (text/plain, OpenMetrics 1.0):
- Remains backward compatible; contains only classic histogram buckets
# Classic histogram buckets (always present)
apiserver_request_duration_seconds_bucket{le="0.005"} 1000
apiserver_request_duration_seconds_bucket{le="0.01"} 2000
...
apiserver_request_duration_seconds_bucket{le="+Inf"} 10000
apiserver_request_duration_seconds_count 10000
apiserver_request_duration_seconds_sum 450.5
Protobuf format (application/vnd.google.protobuf):
- Contains both classic buckets AND native histogram data
- Native histograms use the Prometheus protobuf schema with a Histogram message containing exponential bucket spans
- Binary format, not human-readable
Prometheus selects the format via content negotiation. When scrape_native_histograms: true is set, Prometheus requests the protobuf format to receive native histogram data.
This ensures:
- Existing dashboards continue to work
- Users can migrate queries at their own pace
- Prometheus stores whichever format it’s configured for
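The format decision described in this section can be sketched on the server side as a check of the Accept header. The following is a simplified illustration using only the Go standard library. It is not how client_golang performs negotiation, the helper name acceptsProtobuf is hypothetical, and q-values and wildcard media types are deliberately ignored.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// protobufType is the media type used for the protobuf exposition format.
const protobufType = "application/vnd.google.protobuf"

// acceptsProtobuf reports whether an Accept header lists the protobuf media
// type. Simplification: q-values and wildcard types are ignored, and the
// header is split on commas without handling quoted parameter values.
func acceptsProtobuf(accept string) bool {
	for _, part := range strings.Split(accept, ",") {
		mediaType, _, err := mime.ParseMediaType(strings.TrimSpace(part))
		if err != nil {
			continue // skip entries that are empty or malformed
		}
		if mediaType == protobufType {
			return true
		}
	}
	return false
}

func main() {
	headers := []string{
		"application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.6,text/plain;version=0.0.4;q=0.5",
		"text/plain;version=0.0.4",
	}
	for _, h := range headers {
		fmt.Printf("protobuf=%v for Accept: %s\n", acceptsProtobuf(h), h)
	}
}
```

Only a client for which this check succeeds would receive native histogram data; every other client keeps receiving the classic text exposition.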
Implementation Phases
For the alpha phase, we will use sensible global defaults for all native histograms without exposing configuration options to developers. This keeps the initial implementation simple while we gather feedback. The HistogramOpts struct in component-base/metrics will remain unchanged - no new fields will be added (configuration options may be added in future phases if a need arises based on user feedback and real-world usage patterns).
We will update the conversion function to pass native histogram options to the underlying Prometheus library when the feature gate is enabled:
func (o *HistogramOpts) toPromHistogramOpts() prometheus.HistogramOpts {
	opts := prometheus.HistogramOpts{
		Namespace:   o.Namespace,
		Subsystem:   o.Subsystem,
		Name:        o.Name,
		Help:        o.Help,
		ConstLabels: o.ConstLabels,
		Buckets:     o.Buckets, // Always keep classic buckets
	}
	if utilfeature.DefaultFeatureGate.Enabled(features.NativeHistograms) {
		// Use fixed global defaults for alpha phase
		opts.NativeHistogramBucketFactor = 1.1 // Default bucket growth factor
		// Default max buckets (based on OTel SDK recommendation:
		// https://opentelemetry.io/docs/specs/otel/metrics/sdk/#base2-exponential-bucket-histogram-aggregation)
		opts.NativeHistogramMaxBucketNumber = 160
	}
	return opts
}
Prometheus Version Compatibility
Native histogram support and configuration vary significantly across Prometheus versions:
Prometheus < 2.40:
- Cannot ingest native histograms at all
- K8s exposing native histograms has no effect—Prometheus ignores them
- Classic _bucket, _count, _sum metrics continue to work
- Action needed: None, but no benefit from native histograms
Prometheus 2.40 - 2.x:
# Enable native histogram support globally
prometheus --enable-feature=native-histograms
# This is all-or-nothing: ALL scrape jobs will attempt to ingest native histograms
# No per-job control available
- Higher risk: Cannot selectively enable for K8s while keeping classic for other targets
- If enabled, dashboards for ALL targets using classic histogram queries may break
Prometheus 3.0 - 3.7:
# Per-job configuration (recommended)
scrape_configs:
- job_name: 'kubernetes-apiservers'
scrape_native_histograms: true
always_scrape_classic_histograms: true # Keep classic during transition
# OR use feature flag for global default (still supported)
# prometheus --enable-feature=native-histograms
- Per-job control available
- Can enable native histograms for K8s while keeping other jobs on classic only
- Prometheus feature flag (--enable-feature=native-histograms) still works as global default
Prometheus 3.8:
# Per-job configuration (required for fine-grained control)
scrape_configs:
- job_name: 'kubernetes-apiservers'
scrape_native_histograms: true
always_scrape_classic_histograms: true
# Feature flag now ONLY changes the global default value of scrape_native_histograms to true FOR ALL JOBS
# Individual jobs can still override with explicit scrape_native_histograms: false
# prometheus --enable-feature=native-histograms # Changes default for all jobs, explicit per-job config preferred
- Prometheus feature flag (--enable-feature=native-histograms) only remaining effect: sets scrape_native_histograms: true as default for all jobs
- Per-job settings override the default
- Transition period: migrate from flag to explicit per-job config
Prometheus 3.9+:
# Per-job configuration (only method)
scrape_configs:
- job_name: 'kubernetes-apiservers'
scrape_native_histograms: true
always_scrape_classic_histograms: true
- Prometheus feature flag (--enable-feature=native-histograms) fully deprecated and removed
- Must use per-job scrape_native_histograms and always_scrape_classic_histograms
- Default for both settings is false
Special Concern: Prometheus 2.x Users
Prometheus 2.x users with --enable-feature=native-histograms enabled are in a difficult position:
Scenario:
- User has prometheus --enable-feature=native-histograms enabled for other workloads that benefit from native histograms
- K8s starts exposing native histograms when the feature gate is enabled
Problem:
- Prometheus 2.x global flag means ALL scrape jobs (including K8s) now ingest native format only
- Other workloads that were prepared for this change continue working
- But K8s dashboards that weren’t updated for native histograms suddenly break
- No way to selectively keep classic format just for K8s while maintaining native for other apps
Mitigation options:
- Turn off Prometheus feature flag
  - Loses native histograms for ALL workloads (not just K8s)
- Disable K8s feature gate: --feature-gates=NativeHistograms=false
  - Requires K8s component restarts (may be slow/disruptive)
  - For managed K8s, may not be possible
- Upgrade to Prometheus 3.x
  - Major version upgrade, may not be quick/easy
  - Then can use per-job scrape_native_histograms: false to keep K8s on classic only
  - OR use always_scrape_classic_histograms: true to get both native and classic formats
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Existing histogram metric tests should be extended to verify dual exposition behavior when the feature gate is enabled.
Unit tests
- staging/src/k8s.io/component-base/metrics: Test toPromHistogramOpts() with feature gate enabled/disabled
- Test that global native histogram defaults are applied when feature gate is enabled
- Test that classic buckets are always present
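The shape of these unit tests can be sketched without the Kubernetes or Prometheus packages. Everything in the block below is a stand-in: promOpts only mimics the relevant fields of prometheus.HistogramOpts, and toPromOpts takes the gate state as a plain parameter instead of reading a real feature gate. The real tests would exercise toPromHistogramOpts() directly.

```go
package main

import (
	"fmt"
	"reflect"
)

// promOpts is a hypothetical stand-in for the relevant fields of
// prometheus.HistogramOpts.
type promOpts struct {
	Buckets                        []float64
	NativeHistogramBucketFactor    float64
	NativeHistogramMaxBucketNumber uint32
}

// toPromOpts mirrors the conversion described in Design Details, with the
// gate state passed explicitly so it can be toggled per test case.
func toPromOpts(buckets []float64, nativeEnabled bool) promOpts {
	opts := promOpts{Buckets: buckets}
	if nativeEnabled {
		opts.NativeHistogramBucketFactor = 1.1
		opts.NativeHistogramMaxBucketNumber = 160
	}
	return opts
}

// checkConversion returns an error describing the first violated expectation.
func checkConversion(buckets []float64, nativeEnabled bool) error {
	got := toPromOpts(buckets, nativeEnabled)
	if !reflect.DeepEqual(got.Buckets, buckets) {
		return fmt.Errorf("classic buckets changed: got %v, want %v", got.Buckets, buckets)
	}
	wantFactor, wantMax := 0.0, uint32(0)
	if nativeEnabled {
		wantFactor, wantMax = 1.1, 160
	}
	if got.NativeHistogramBucketFactor != wantFactor || got.NativeHistogramMaxBucketNumber != wantMax {
		return fmt.Errorf("unexpected native options with gate=%v: %+v", nativeEnabled, got)
	}
	return nil
}

func main() {
	buckets := []float64{0.005, 0.01, 0.025}
	for _, enabled := range []bool{false, true} {
		fmt.Printf("gate=%v err=%v\n", enabled, checkConversion(buckets, enabled))
	}
}
```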
Integration tests
- Verify metrics endpoint serves both formats when enabled
- Verify classic buckets are always present regardless of feature state
e2e tests
- Scrape metrics with Prometheus (native histogram support enabled)
- Verify both formats are queryable
Graduation Criteria
Alpha
- Feature implemented behind NativeHistograms feature flag
- Initial unit tests completed and enabled
- Basic documentation available
Beta
- Gather feedback from early adopters
- Comprehensive integration tests in place
- E2E tests covering upgrade/downgrade scenarios
- Documentation updated with:
- Migration guide
- Troubleshooting procedures
- Prometheus configuration examples
- Clear Prometheus 2.x limitations
- Performance benchmarks completed showing no regression (Migrate select performance tests to use native histograms for early feedback)
- Refactor histograms created in init() functions to use lazy initialization (e.g., sync.Once with getter functions) so native histogram options are properly applied after feature gates are parsed
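A minimal sketch of the lazy-initialization pattern mentioned in the last Beta item follows. It uses only the standard library. The histogram type, the nativeHistogramsEnabled variable, and the RequestDuration getter are hypothetical stand-ins for a real metric, the parsed feature gate, and a component's accessor.

```go
package main

import (
	"fmt"
	"sync"
)

// nativeHistogramsEnabled stands in for the parsed NativeHistograms feature
// gate; in a real component its value is only known after flag parsing.
var nativeHistogramsEnabled bool

// histogram is a hypothetical stand-in for a registered histogram metric.
type histogram struct {
	name         string
	bucketFactor float64 // 0 means classic buckets only
}

var (
	requestDurationOnce sync.Once
	requestDuration     *histogram
)

// RequestDuration builds the histogram on first use, so the gate value is
// read after it has been parsed rather than at package init time.
func RequestDuration() *histogram {
	requestDurationOnce.Do(func() {
		h := &histogram{name: "request_duration_seconds"}
		if nativeHistogramsEnabled {
			h.bucketFactor = 1.1
		}
		requestDuration = h
	})
	return requestDuration
}

func main() {
	nativeHistogramsEnabled = true // simulates feature gate parsing
	fmt.Println("bucket factor:", RequestDuration().bucketFactor)
}
```

A histogram constructed in init() would have read the gate before parsing and kept the classic-only configuration. With the getter, the first caller after parsing fixes the configuration, and later calls return the same instance.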
GA
- Consider whether to make Native Histogram options configurable
Upgrade / Downgrade Strategy
Upgrade:
- Kubernetes upgrade does not change monitoring behavior if feature gate is off
- When feature gate is enabled:
- Classic histogram format continues to be exposed
- Native format is additionally exposed
Downgrade: When downgrading from a Kubernetes version with native histogram support to an older version:
- Metrics automatically revert to classic histogram format only (_bucket, _count, _sum)
- Prometheus behavior:
  - If always_scrape_classic_histograms: true: Prometheus continues to scrape classic histograms
  - If always_scrape_classic_histograms: false: Prometheus will NOT scrape histograms (data loss until config is updated)
- Dashboard/Alert impact:
  - Classic histogram queries: work only if always_scrape_classic_histograms: true, otherwise break
  - Native histogram queries stop receiving new data and will break
Enabling Native Histograms (Opt-in):
# 1. Ensure Prometheus is ready (Prometheus 3.x)
# prometheus.yml
scrape_configs:
- job_name: 'kubernetes-apiservers'
scrape_native_histograms: true # Ingest native histograms
always_scrape_classic_histograms: true # CRITICAL: Keep classic during transition
# For older Prometheus (2.40-2.x), use global feature flag:
# --enable-feature=native-histograms
# 2. Enable feature gate in Kubernetes
--feature-gates=NativeHistograms=true
Version Skew Strategy
Native histogram support is independent per component, so no coordination is required. Some components may expose native histograms while others do not. This is acceptable because Prometheus scrapes each target independently.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: NativeHistograms
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, kube-proxy
- Other
  - Describe the mechanism: Prometheus 3.x per-job scrape_native_histograms: false stops ingestion without K8s changes
  - Will enabling / disabling the feature require downtime of the control plane? For feature gate changes, yes (component restart required). For Prometheus config changes, no.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? No
Does enabling the feature change any default behavior?
When enabled, the metrics endpoint will expose an additional native histogram encoding alongside the existing classic histogram format. The classic format (_bucket, _count, _sum) remains unchanged and always present.
Users with Prometheus configured to prefer native histograms will see the data stored in native format. Users without native histogram support enabled in Prometheus see no change.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling can be done via:
- Prometheus config: scrape_native_histograms: false per job (fastest, Prometheus 3.x only, no K8s restart)
- Feature gate: --feature-gates=NativeHistograms=false (requires component restart)
When K8s feature is disabled, only classic histogram format is exposed. When Prometheus stops ingesting native histograms, it resumes scraping classic format on next scrape interval. No data loss occurs; historical data in Prometheus remains queryable.
What happens if we reenable the feature if it was previously rolled back?
Native histogram exposition resumes. No special handling required. Prometheus will begin storing native histograms again if configured to do so.
Are there any tests for feature enablement/disablement?
Yes, unit tests will verify:
- toPromHistogramOpts() returns correct configuration based on feature gate state
- Toggling feature gate changes histogram configuration appropriately
- Classic buckets are always present regardless of feature gate state
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout failure scenarios:
- Prometheus too old to understand native histograms - Prometheus ignores native format, stores classic
- Dashboard queries not updated - Dashboards continue to work with classic format
- Memory pressure from additional histogram storage
Impact on workloads: None. This feature only affects metrics exposition, not workload behavior.
What specific metrics should inform a rollback?
- Prometheus scrape errors increasing for Kubernetes targets
- Significant increase in process_resident_memory_bytes for control plane components
- Increase in /metrics endpoint latency
- Dashboard queries failing
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Will be tested as part of beta graduation.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. Classic histogram format is not deprecated.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
- Check component logs for “native histograms enabled” message
- Query kubernetes_feature_enabled metric with label name=NativeHistograms (value 1 = enabled)
- Scrape /metrics endpoint with protobuf format and verify native histogram encoding is present
How can someone using this feature know that it is working for their instance?
- Other (treat as last resort)
  - Details: Query the metrics endpoint with Accept: application/vnd.google.protobuf header and verify native histogram encoding is present for histogram metrics. Note: Native histograms are only supported in protobuf exposition format, not in text-based formats.
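The manual check above can be scripted. The Go sketch below sends a request with the protobuf Accept header and reports the Content-Type of the response. It runs against a fake server (newFakeMetricsServer, a hypothetical stand-in that echoes the Accept header) so that it stays self-contained. Pointing it at a real component endpoint would additionally need authentication, and decoding the payload to confirm native histogram fields would need a protobuf library, both of which are omitted here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// protobufAccept is the media type Prometheus requests for protobuf scrapes.
const protobufAccept = "application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited"

// fetchMetrics requests url with the protobuf Accept header and returns the
// response Content-Type and the raw body.
func fetchMetrics(client *http.Client, url string) (string, []byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", nil, err
	}
	req.Header.Set("Accept", protobufAccept)
	resp, err := client.Do(req)
	if err != nil {
		return "", nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.Header.Get("Content-Type"), body, err
}

// newFakeMetricsServer is a hypothetical stand-in for a component's /metrics
// endpoint: it echoes the Accept header back as the Content-Type.
func newFakeMetricsServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", r.Header.Get("Accept"))
		w.Write([]byte("placeholder payload"))
	}))
}

func main() {
	srv := newFakeMetricsServer()
	defer srv.Close()
	contentType, body, err := fetchMetrics(srv.Client(), srv.URL+"/metrics")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Content-Type: %s (%d bytes)\n", contentType, len(body))
}
```

If a real endpoint answers with a text Content-Type despite this Accept header, the response cannot contain native histogram data.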
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- Metrics endpoint latency should not increase significantly when native histograms are enabled
- All existing classic histogram queries continue to function
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: process_resident_memory_bytes (existing, monitor for unexpected increases)
  - Components exposing the metric: All control plane components
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
Dependencies
Does this feature depend on any specific services running in the cluster?
No cluster services required. However, to utilize native histograms:
- Prometheus 2.40+ (experimental) or Prometheus 3.8.0+ (stable)
  - Usage description: Required to scrape and store native histogram format
  - Configuration:
    - Prometheus 2.40-2.x: --enable-feature=native-histograms (global)
    - Prometheus 3.x: scrape_native_histograms: true per scrape job (recommended)
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
The /metrics endpoint may take slightly longer to serialize when exposing both formats. This will be benchmarked during alpha/beta.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Memory: Small increase for native histogram bucket storage. Bounded per histogram by the fixed NativeHistogramMaxBucketNumber default (160).
CPU: Negligible increase for histogram operations.
Network: Slight increase in /metrics response size when exposing both formats.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. This feature only affects in-memory histogram representation and metrics endpoint output.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
No impact. Metrics exposition is independent of API server and etcd availability (except for the API server’s own metrics).
What are other known failure modes?
Failure mode: Prometheus too old
- Detection: Prometheus logs errors about unknown metric format
- Mitigations: Upgrade Prometheus to 2.40+ or disable native histograms
- Diagnostics: Check Prometheus version; verify --enable-feature=native-histograms is set
Failure mode: Memory pressure from histogram storage
- Detection: process_resident_memory_bytes increasing; OOMKilled events
- Mitigations: Disable native histogram ingestion in Prometheus; disable K8s feature gate
- Diagnostics: Compare memory usage before/after enabling feature
What steps should be taken if SLOs are not being met to determine the problem?
- Check if native histograms are enabled
- Compare memory usage with baseline
- Check /metrics endpoint latency
- If issues detected, disable via Prometheus config (scrape_native_histograms: false) or K8s feature gate
- File issue with memory/latency profiles
Implementation History
- 2026-01-16: Initial KEP created
Drawbacks
- Increased complexity: Two histogram formats to maintain and test
- External dependency: Full benefit requires Prometheus upgrade by users
- Memory overhead: Small additional memory for native histogram storage
Alternatives
Increase Classic Histogram Bucket Count
Instead of adopting native histograms, we could increase the number of buckets in classic histograms to achieve finer granularity.
Cons:
- Each additional bucket creates a new time series, significantly increasing cardinality
- This directly increases Prometheus storage costs, memory usage, and query latency
- The cardinality explosion from more classic buckets negates any observability benefits
Infrastructure Needed (Optional)
None.