KEP-3498: Extending Metrics Stability

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The metric stability framework was originally introduced with the intent of safeguarding significant metrics from being broken downstream. Metrics could be deemed stable or alpha, and only stable metrics would have stability guarantees.

This KEP proposes additional stability classes that extend the existing metrics stability framework, so that we can achieve parity with the various stages of the feature release cycle.

Motivation

It has become clear that we need additional stability classes, particularly with respect to the various stages of feature releases. The need has grown with the advent of PRR (production readiness reviews) and mandated production readiness metrics.

Goals

Introduce two more metric classes: beta, corresponding to the beta stage of feature release, and internal which corresponds to internal development related metrics.

Non-Goals

Establishing whether specific metrics fall into a given stability class. This exercise is left to component owners, who own their own metrics.

Proposal

We’re proposing adding additional metadata fields to Kubernetes metrics. Specifically we want to add the following stability levels:

Internal - representing internal usages of metrics (i.e. classes of metrics which do not correspond to features) or low-level metrics that a typical operator would not understand (or would not be able to react to them properly).
Beta - representing a more mature stage in a feature metric, with greater stability guarantees than alpha or internal metrics, but less than Stable

We also propose amending the semantic meaning of an Alpha metric such that it represents the nascent stage of a KEP-proposed feature, rather than the entire class of metrics without stability guarantees.
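As a rough illustration of the four levels described above, the following sketch models them as a Go enumeration ordered by strength of guarantee. The type and function names (`StabilityLevel`, `guaranteeRank`, `StrongerThan`) are hypothetical and chosen for this example only; they are not taken from the Kubernetes metrics framework.

```go
package main

import "fmt"

// StabilityLevel is a hypothetical stand-in for the stability metadata
// attached to a metric. The names are illustrative only.
type StabilityLevel string

const (
	Internal StabilityLevel = "INTERNAL"
	Alpha    StabilityLevel = "ALPHA"
	Beta     StabilityLevel = "BETA"
	Stable   StabilityLevel = "STABLE"
)

// guaranteeRank orders levels by the strength of their guarantees.
// Internal and Alpha share rank 0 because neither carries guarantees;
// Beta sits above them and below Stable.
func guaranteeRank(l StabilityLevel) (int, error) {
	switch l {
	case Internal, Alpha:
		return 0, nil
	case Beta:
		return 1, nil
	case Stable:
		return 2, nil
	}
	return 0, fmt.Errorf("unknown stability level %q", l)
}

// StrongerThan reports whether level a carries stronger guarantees than b.
func StrongerThan(a, b StabilityLevel) (bool, error) {
	ra, err := guaranteeRank(a)
	if err != nil {
		return false, err
	}
	rb, err := guaranteeRank(b)
	if err != nil {
		return false, err
	}
	return ra > rb, nil
}

func main() {
	for _, l := range []StabilityLevel{Internal, Alpha, Beta, Stable} {
		r, _ := guaranteeRank(l)
		fmt.Printf("%-8s rank=%d\n", l, r)
	}
}
```

Treating Internal and Alpha as equal in rank reflects that the two differ in meaning (internal usage versus a nascent feature) rather than in the guarantees they offer.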

Additionally, we propose forced upgrades of metric stability classes, in the same vein that features are not allowed to languish in the alpha or beta stages. This will not be available until the beta version of this KEP. For the alpha version of this KEP, we will implement the necessary changes to the Kubernetes metrics framework so that it supports the additional classes of metrics, without changing any existing metrics or their stability levels. As such, this KEP proposes changes to the metrics pipeline and the static analysis pieces of the Kubernetes metrics framework.

Risks and Mitigations

The primary risk is that these changes break our existing (and working) metrics infrastructure. The mitigation should be straightforward, i.e. roll back the changes to the metrics framework.

Design Details

Our plan is to add functionality to our static analysis framework which is hosted in the main k8s/k8s repo, under test/instrumentation. Specifically, we will need to support:

parsing variables
multi-line strings
evaluating buckets
buckets which are defined via variables and consts
evaluation of simple consts
evaluation of simple variables

We will not attempt to parse metrics which:

are constructed dynamically, i.e. through function calls which use function arguments as parameters in metric definitions, since some of those cannot be resolved until runtime.
are constructed using custom prometheus collectors, for the same reasons as above.
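To make the distinction between resolvable and unresolvable definitions concrete, here is a minimal sketch built on the standard `go/parser` and `go/ast` packages. It is not the tooling under test/instrumentation: the input source, the local `CounterOpts` type, and the `resolveNames` helper are all assumptions made for this example. It resolves a metric name given as a string literal or as a simple package-level const, and skips a name that comes from a function argument.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// src is a made-up input file. CounterOpts is a local type defined for
// the example, not the real framework's options struct.
const src = `package demo

type CounterOpts struct{ Name, Help string }

const requestsName = "requests_total"

var a = CounterOpts{Name: "errors_total", Help: "literal name"}
var b = CounterOpts{Name: requestsName, Help: "name via const"}

func dynamic(n string) CounterOpts { return CounterOpts{Name: n} }
`

// resolveNames returns the metric names that can be resolved statically,
// plus the number of Name fields that had to be skipped.
func resolveNames(source string) (names []string, skipped int, err error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", source, 0)
	if err != nil {
		return nil, 0, err
	}

	// First pass: record package-level string constants.
	consts := map[string]string{}
	for _, decl := range file.Decls {
		gen, ok := decl.(*ast.GenDecl)
		if !ok || gen.Tok != token.CONST {
			continue
		}
		for _, spec := range gen.Specs {
			vs := spec.(*ast.ValueSpec)
			for i, name := range vs.Names {
				if i >= len(vs.Values) {
					continue
				}
				lit, ok := vs.Values[i].(*ast.BasicLit)
				if !ok || lit.Kind != token.STRING {
					continue
				}
				if s, uerr := strconv.Unquote(lit.Value); uerr == nil {
					consts[name.Name] = s
				}
			}
		}
	}

	// Second pass: inspect every "Name: ..." entry in composite literals.
	ast.Inspect(file, func(n ast.Node) bool {
		kv, ok := n.(*ast.KeyValueExpr)
		if !ok {
			return true
		}
		key, ok := kv.Key.(*ast.Ident)
		if !ok || key.Name != "Name" {
			return true
		}
		switch v := kv.Value.(type) {
		case *ast.BasicLit: // string literal
			s, uerr := strconv.Unquote(v.Value)
			if v.Kind == token.STRING && uerr == nil {
				names = append(names, s)
			} else {
				skipped++
			}
		case *ast.Ident: // resolvable only if it names a known const
			if s, ok := consts[v.Name]; ok {
				names = append(names, s)
			} else {
				skipped++
			}
		default: // calls, concatenations, and so on are out of scope here
			skipped++
		}
		return true
	})
	return names, skipped, nil
}

func main() {
	names, skipped, err := resolveNames(src)
	if err != nil {
		panic(err)
	}
	fmt.Println("resolved:", names, "skipped:", skipped)
}
```

The `dynamic` function in the sample input stands in for the dynamically constructed case: its `Name` is a function parameter, so nothing in the syntax tree determines its value.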

As an aside, much of this work has already been done, but is stashed in a local repo.

Semantic of Stability Levels

Internal Metrics

Internal metrics have no stability guarantees and are not parseable by the static analysis framework. As such, Internal metrics will NOT be included in metric auto-documentation.

Alpha Metrics

Alpha metrics have no stability guarantees but are parseable by the static analysis framework. As such, Alpha metrics will be included in metric auto-documentation.

Beta Metrics

Beta metrics have some stability guarantees. Specifically, we guarantee that:

Beta metrics will not be removed without first being explicitly deprecated.
- Beta metrics can be deprecated at any point:
  - if changes in the underlying code or feature make the metric impossible to compute, it can be removed after one release
  - if the metric can technically still be exposed (we just think it’s not the right one, e.g. we want to remove some label), we leave it deprecated for 3 releases
Furthermore, Beta metrics are guaranteed to be forward compatible in respect to alerts and queries which may be written against them. By “forward compatible”, we mean that queries and alerts which are written against the metric and its labels will continue to work in the future. We ensure forward compatibility by ensuring that labels can only be added, and not removed, from Beta metrics.
Beta metrics will be included in metric auto-documentation
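The two Beta deprecation windows above can be written down as a small helper. This is only a sketch: it assumes that "after one release" means the next minor release following the one in which the metric was deprecated, and the function name `earliestRemovalMinor` is hypothetical.

```go
package main

import "fmt"

// earliestRemovalMinor returns the earliest minor release in which a
// deprecated Beta metric could be removed, under the assumption that
// windows are counted in minor releases from the deprecating release.
//
// stillComputable is false when changes to the underlying code or
// feature make the metric impossible to compute.
func earliestRemovalMinor(deprecatedInMinor int, stillComputable bool) int {
	if stillComputable {
		// The metric can still be exposed: keep it deprecated for 3 releases.
		return deprecatedInMinor + 3
	}
	// The metric can no longer be computed: removable after one release.
	return deprecatedInMinor + 1
}

func main() {
	fmt.Println(earliestRemovalMinor(26, false))
	fmt.Println(earliestRemovalMinor(26, true))
}
```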

Stable Metrics

Stable metrics have stability guarantees. Specifically, we guarantee that:

Stable metrics will not be removed without first being explicitly deprecated. After deprecation, the metric will be removed in 12 months or 3 releases.
Furthermore, Stable metrics are guaranteed to not change in respect to labels. This means labels can neither be added nor removed from a Stable metric.
Stable metrics will be included in metric auto-documentation
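The label and documentation rules for the four levels can be summarized in code. The sketch below is an illustration under stated assumptions rather than the framework's implementation: `StabilityLevel`, `CheckLabelChange`, `diffLabels`, and `IncludedInAutoDocs` are hypothetical names introduced here.

```go
package main

import "fmt"

// StabilityLevel is a hypothetical stand-in for a metric's stability class.
type StabilityLevel string

const (
	Internal StabilityLevel = "INTERNAL"
	Alpha    StabilityLevel = "ALPHA"
	Beta     StabilityLevel = "BETA"
	Stable   StabilityLevel = "STABLE"
)

// diffLabels returns the labels added to and removed from a metric.
func diffLabels(oldLabels, newLabels []string) (added, removed []string) {
	oldSet := make(map[string]bool, len(oldLabels))
	for _, l := range oldLabels {
		oldSet[l] = true
	}
	newSet := make(map[string]bool, len(newLabels))
	for _, l := range newLabels {
		newSet[l] = true
		if !oldSet[l] {
			added = append(added, l)
		}
	}
	for _, l := range oldLabels {
		if !newSet[l] {
			removed = append(removed, l)
		}
	}
	return added, removed
}

// CheckLabelChange returns an error if changing a metric's labels from
// oldLabels to newLabels would violate the guarantees of its level:
// Internal and Alpha allow any change, Beta allows additions only, and
// Stable allows no change at all.
func CheckLabelChange(level StabilityLevel, oldLabels, newLabels []string) error {
	added, removed := diffLabels(oldLabels, newLabels)
	switch level {
	case Internal, Alpha:
		return nil
	case Beta:
		if len(removed) > 0 {
			return fmt.Errorf("beta metric: labels removed: %v", removed)
		}
		return nil
	case Stable:
		if len(added)+len(removed) > 0 {
			return fmt.Errorf("stable metric: labels changed (added %v, removed %v)", added, removed)
		}
		return nil
	}
	return fmt.Errorf("unknown stability level %q", level)
}

// IncludedInAutoDocs reports whether metrics at this level appear in
// metric auto-documentation; only Internal metrics are left out.
func IncludedInAutoDocs(level StabilityLevel) bool {
	return level != Internal
}

func main() {
	err := CheckLabelChange(Beta, []string{"code", "verb"}, []string{"code"})
	fmt.Println("beta, label removed:", err)
	err = CheckLabelChange(Beta, []string{"code"}, []string{"code", "verb"})
	fmt.Println("beta, label added:", err)
}
```

Adding a label to a Beta metric keeps existing queries working because they never reference the new label, which is the sense of "forward compatible" used above.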

Test Plan

We have static analysis testing for stable metrics. We will extend our test coverage to include ALPHA and BETA metrics, while ignoring INTERNAL metrics.

[ X ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

We already have thorough testing for the stability framework which has been GA for years.

Unit tests

[X] parsing variables
[X] multi-line strings
[X] evaluating buckets
[X] buckets which are defined via variables and consts
[X] evaluation of simple consts
[X] evaluation of simple variables

test/instrumentation: 09/20/2022 - full coverage of existing stability framework

Integration tests

We will test the static analysis parser on a test directory with all permutations of metrics which we expect to parse (and variants we expect not to be able to parse)

e2e tests

The static analysis tooling runs in a precommit pipeline and is therefore exempt from runtime tests.

Graduation Criteria

Alpha

Kubernetes metrics framework will be enhanced to support additional stability classes
The static analysis pipeline of the metrics framework will be enhanced to parse the additional constructs listed under Design Details

Beta

Kubernetes metrics framework will be enhanced to support marking Alpha and Beta metrics with release version. The semantics of this are yet to be determined. This version will be used to statically determine whether or not that metric should be deprecated automatically or promoted.

For the beta version of this KEP, we begin permitting metrics to be promoted to the Beta stability class.

GA

We will allow bake time before promoting this feature to GA
At this stage, we will promote our meta-metric for registered metrics to Stable
We also require an update to the prometheus golang client such that we can add process start time to a header, so that scraping clients do not have to parse the entire metrics payload in order to properly process counter metrics. Please see this PR for more details.

Deprecation

This section will pertain to the deprecation policy of deprecated Alpha and Beta metrics which will be determined in the Beta version of this KEP.

Upgrade / Downgrade Strategy

The static analysis code does not run in Kubernetes runtime code, with the exception of the registered_metrics metric.

Version Skew Strategy

This feature does not require a version skew strategy.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This feature cannot be enabled or rolled back. It is built into the infrastructure of metrics, which will support two additional values for the enumeration of metric stability classes.

How can this feature be enabled / disabled in a live cluster?

It cannot. This is purely infrastructure based and requires adding additional enumeration values to metrics stability classes.

Does enabling the feature change any default behavior?

It will cause metrics previously annotated as Alpha metrics to be denoted as Internal.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

No.

What happens if we reenable the feature if it was previously rolled back?

N/A

Are there any tests for feature enablement/disablement?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This should not affect rollout. It could affect workloads that depended on Alpha metrics, which will be recategorized as Internal. That said, we have already explicitly stated that Alpha metrics do not have stability guarantees.

What specific metrics should inform a rollback?

registered_metrics_total summing to zero.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This should not affect upgrade/rollback paths.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Alpha metrics will be recategorized as Internal.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

We’ve introduced a metric (i.e. registered_metrics_total) which should serve to indicate this feature is enabled.

How can someone using this feature know that it is working for their instance?

They will be able to see metrics.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This tooling runs in precommit. It does not affect runtime SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

N/A

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Prometheus and the Kubernetes metric framework.

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The API server needs to be available for its metrics to be scraped. If etcd is not available, you may still be able to scrape metrics from the API server.

What are other known failure modes?

Runaway cardinality of metrics, but that is orthogonal to the scope of this KEP.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

This introduces complexity to metrics stability levels. However, various community members have asked for this over the past few years. As a community, we are also moving towards requiring metrics as a prerequisite for KEPs, which this should align with.

Alternatives

Doing nothing is a viable alternative. However, it leaves feature metrics in an awkward spot, where they either have no guarantees or are fully stable.
