KEP-4205: Expose PSI Metrics

Implementation History
STABLE Implemented
Created 2023-05-25
Latest v1.36
Milestones
Alpha v1.33
Beta v1.34
Stable v1.36
Ownership
Owning SIG
SIG Node
Participating SIGs


Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding support in the kubelet for reading Pressure Stall Information (PSI) metrics for CPU, memory and IO resources, as exposed by cAdvisor and runc.

Motivation

PSI metrics provide a quantifiable way to see resource pressure increases as they develop, with new pressure metrics for three major resources (memory, CPU, IO). These pressure metrics are useful for detecting resource shortages and give nodes the opportunity to respond intelligently, for example by updating the node condition.

In short, PSI metrics are like barometers: they give fair warning of impending resource shortages on the node and enable nodes to take more proactive, granular and nuanced steps when major resources (memory, CPU, IO) start becoming scarce.

Goals

This proposal aims to:

  1. Enable the kubelet to read the cgroup v2 PSI metrics exposed by cAdvisor and runc.
  2. Enable pod-level PSI metrics and expose them in the Summary API.

Non-Goals

  • Using PSI metrics for pod evictions, userspace OOM kills, and similar actions. These are left to future KEPs.

Proposal

User Stories (Optional)

Story 1

Today, to identify disruptions caused by resource crunches, Kubernetes users need to install node exporter to read PSI metrics. With the feature proposed in this enhancement, PSI metrics will be available to users in the Kubernetes metrics API.

Risks and Mitigations

There are no significant risks associated with integrating PSI metrics into the kubelet, whether they come from cAdvisor's use of the runc libcontainer library or from the kubelet's CRI runc libcontainer implementation, since neither path involves shelling out to external binaries.

Design Details

  1. Add new data structures, PSIData and PSIStats, corresponding to the PSI metric output format shown below:
```
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
```
```go
// PSIData contains PSI data for an individual resource.
type PSIData struct {
	// Total time that tasks in the cgroup have waited due to congestion.
	// Unit: nanoseconds.
	Total uint64 `json:"total"`
	// Average share of time (in %) that tasks have waited due to congestion over a 10-second window.
	Avg10 float64 `json:"avg10"`
	// Average share of time (in %) that tasks have waited due to congestion over a 60-second window.
	Avg60 float64 `json:"avg60"`
	// Average share of time (in %) that tasks have waited due to congestion over a 300-second window.
	Avg300 float64 `json:"avg300"`
}

// PSIStats contains PSI statistics for an individual resource.
type PSIStats struct {
	// PSI data for time during which at least some tasks in the cgroup were stalled.
	Some PSIData `json:"some,omitempty"`
	// PSI data for time during which all non-idle tasks in the cgroup were stalled simultaneously.
	Full PSIData `json:"full,omitempty"`
}
```
  2. The Summary API includes stats for both system-level and kubepods-level cgroups. Extend the Summary API to include PSI data for each resource, obtained from cAdvisor. Note: if cAdvisor-less stats collection is implemented before this enhancement, the PSI data will be available through CRI instead.
CPU
```go
type CPUStats struct {
	// ... existing fields omitted ...

	// PSI stats of the overall node
	PSI *PSIStats `json:"psi,omitempty"`
}
```
Memory
```go
type MemoryStats struct {
	// ... existing fields omitted ...

	// PSI stats of the overall node
	PSI *PSIStats `json:"psi,omitempty"`
}
```
IO
```go
// IOStats contains data about IO usage.
type IOStats struct {
	// The time at which these stats were updated.
	Time metav1.Time `json:"time"`

	// PSI stats of the overall node
	PSI *PSIStats `json:"psi,omitempty"`
}

type NodeStats struct {
	// ... existing fields omitted ...

	// Stats about the IO pressure of the node
	IO *IOStats `json:"io,omitempty"`
}
```
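To illustrate how the two-line PSI format maps onto the PSIData and PSIStats types, here is a minimal Go sketch that parses the contents of a PSI file (for example a cgroup v2 cpu.pressure file). It redeclares trimmed copies of the two types so it compiles on its own. The parsePSI function and its choice to ignore unknown keys are assumptions made for this example. In the design above, the kubelet receives these values from cAdvisor and runc rather than parsing the files itself.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// PSIData mirrors the KEP's PSIData type.
type PSIData struct {
	Total  uint64  `json:"total"`
	Avg10  float64 `json:"avg10"`
	Avg60  float64 `json:"avg60"`
	Avg300 float64 `json:"avg300"`
}

// PSIStats mirrors the KEP's PSIStats type.
type PSIStats struct {
	Some PSIData `json:"some,omitempty"`
	Full PSIData `json:"full,omitempty"`
}

// parsePSI (hypothetical helper) parses "some ..." / "full ..." lines into PSIStats.
func parsePSI(content string) (PSIStats, error) {
	var stats PSIStats
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		var data *PSIData
		switch fields[0] {
		case "some":
			data = &stats.Some
		case "full":
			data = &stats.Full
		default:
			return stats, fmt.Errorf("unexpected PSI line prefix %q", fields[0])
		}
		for _, kv := range fields[1:] {
			key, value, ok := strings.Cut(kv, "=")
			if !ok {
				return stats, fmt.Errorf("malformed PSI field %q", kv)
			}
			var err error
			switch key {
			case "avg10":
				data.Avg10, err = strconv.ParseFloat(value, 64)
			case "avg60":
				data.Avg60, err = strconv.ParseFloat(value, 64)
			case "avg300":
				data.Avg300, err = strconv.ParseFloat(value, 64)
			case "total":
				data.Total, err = strconv.ParseUint(value, 10, 64)
			default:
				// Unknown keys are ignored (an assumption of this sketch).
			}
			if err != nil {
				return stats, fmt.Errorf("parsing %q: %w", kv, err)
			}
		}
	}
	return stats, scanner.Err()
}

func main() {
	input := "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n" +
		"full avg10=0.50 avg60=0.25 avg300=0.05 total=65432\n"
	stats, err := parsePSI(input)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", stats)
}
```

Treating unknown keys as non-fatal tolerates future additions to the format. Malformed values still return an error.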

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/kubelet/server/stats: 2023-10-04 - 74.4%
  • k8s.io/kubernetes/pkg/kubelet/stats: 2025-06-10 - 77.4%
Integration tests

Within Kubernetes, the feature is implemented solely in kubelet. Therefore a Kubernetes integration test doesn’t apply here.

Any identified external consumers of these endpoints (Prometheus, metrics-server) should be tested to make sure they are not broken by the new fields in the API response.
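Whether new fields break an existing consumer depends largely on how it decodes responses. With Go's encoding/json, the default decoder ignores unknown keys such as psi. A decoder configured with DisallowUnknownFields rejects them. The sketch below contrasts the two behaviors. legacyCPUStats and decodeLegacy are hypothetical consumer-side names, not part of any real client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// legacyCPUStats is a hypothetical consumer type written before PSI fields existed.
type legacyCPUStats struct {
	UsageNanoCores *uint64 `json:"usageNanoCores,omitempty"`
}

// decodeLegacy decodes a CPU stats object. strict mimics consumers that reject unknown fields.
func decodeLegacy(body []byte, strict bool) (legacyCPUStats, error) {
	var out legacyCPUStats
	dec := json.NewDecoder(bytes.NewReader(body))
	if strict {
		dec.DisallowUnknownFields()
	}
	err := dec.Decode(&out)
	return out, err
}

func main() {
	withPSI := []byte(`{"usageNanoCores":250000000,"psi":{"some":{"total":42,"avg10":0.1,"avg60":0.05,"avg300":0.01}}}`)

	lenient, err := decodeLegacy(withPSI, false)
	if err != nil {
		panic(err)
	}
	fmt.Println("lenient decode ok, usageNanoCores =", *lenient.UsageNanoCores)

	_, err = decodeLegacy(withPSI, true)
	fmt.Println("strict decode error:", err)
}
```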

e2e tests
  • test/e2e_node/summary_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=test%2Fe2e_node%2Fsummary_test.go

Graduation Criteria

Alpha

  • PSI integrated into the kubelet behind a feature gate.
  • Unit tests to check that the fields are populated in the Summary API response.

Beta

  • Feature gate is enabled by default.
  • Extend e2e test coverage.
  • Allow time for feedback.
  • Performance testing to verify:
    • Enabling PSI on nodes doesn't introduce excessive CPU or memory usage in the kernel
    • PSI metrics collection doesn't cause an excessive increase in kubelet CPU or memory usage

GA

  • Quantify the cAdvisor and kubelet-level overhead of PSI metric collection, especially where PSI is disabled at the kernel level.
  • Validate with SIG Node that collection overhead is acceptable for general use cases, or include opt-out knobs.
  • Expanded stress testing with diverse environments and scenarios, while keeping resource consumption within the acceptable bounds outlined in the Beta performance testing.
  • Gather evidence of real-world usage from beta users.
  • No major issues reported.

Deprecation

  • Announce deprecation and support policy of the existing flag
  • Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
  • Address feedback on usage/changed behavior, provided on GitHub issues
  • Deprecate the flag

Upgrade / Downgrade Strategy

No impact. The runc dependency will be upgraded to version 1.2.0 as a prerequisite for this feature, and all other components will already be at the expected levels, so upgrading or downgrading should not be a problem. Besides, it is always possible to upgrade or downgrade to a different kubelet version.

Version Skew Strategy

N/A

PSI stats will be available only after CRI and cAdvisor have been updated to use runc 1.2.0 in K8s 1.29. Since PSI-based node conditions depend on the kubelet version, and CRI and the kubelet are generally updated in tandem, a version skew strategy is not applicable.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate
    • Feature gate name: KubeletPSI
    • Components depending on the feature gate: kubelet
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, but starting in v1.36 where this feature graduates to GA, the KubeletPSI feature gate will be locked to true and can no longer be disabled.

What happens if we re-enable the feature if it was previously rolled back?

While the feature is rolled back, no PSI metrics are available in the kubelet Summary API or Prometheus metrics. Once it is re-enabled, the kubelet reports them again.

Are there any tests for feature enablement/disablement?

Unit tests

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The PSI metrics in the kubelet Summary API and Prometheus metrics are for monitoring purposes and are not used by Kubernetes itself to inform workload lifecycle decisions. Therefore, they should not impact running workloads.

If there is a bug and kubelet fails to serve the metrics during rollout, the kubelet Summary API and Prometheus metrics could be corrupted, and other components that depend on those metrics could be impacted. Disabling the feature gate / rolling back the feature should be safe.

What specific metrics should inform a rollback?

PSI metrics exposed at the kubelet /metrics/cadvisor endpoint:

container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total

kubelet Summary API at the /stats/summary endpoint.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Test plan:

  • Create pods when the feature is alpha and disabled
  • Upgrade kubelet so the feature is beta and enabled
    • Pods should continue to run
    • PSI metrics should be reported in kubelet Summary API and Prometheus metrics
  • Roll back kubelet to previous version
    • Pods should continue to run
    • PSI metrics should no longer be reported
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Use kubectl get --raw "/api/v1/nodes/${nodeName}/proxy/stats/summary" to call the Summary API. If psi fields (of type PSIStats) appear in the API response, the feature is available to be used by workloads.

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: The feature is only about metrics surfacing. One can know that it is working by reading the metrics.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The kubelet Summary API and Prometheus metrics should continue serving traffic while meeting their originally targeted SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Yes, it depends on runc version 1.2.0. This KEP could be implemented only after runc 1.2.0 was released (originally estimated for Q1 2024).

Scalability

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

Yes, PSIStats (together with PSIData) is a new API type added to the Summary API.

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

No

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No. The only change is that one additional metric, PSI, is read from cAdvisor.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

Troubleshooting

NA

How does this feature react if the API server and/or etcd is unavailable?
  • NA.
What are other known failure modes?

NA

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2023/09/13: Initial proposal
  • 2025/06/10: Drop Phase 2 from this KEP. Phase 2 will be tracked in its own KEP to allow separate milestone tracking
  • 2025/06/10: Update the proposal with Beta requirements
  • 2026/04/29: Mark KEP as implemented (GA graduation)

Drawbacks

No drawbacks identified. There is no reason the enhancement should not be implemented. This enhancement makes it possible to read PSI metrics without installing additional dependencies.

Infrastructure Needed (Optional)

No new infrastructure is needed.