KEP-3498: Extending Metrics Stability
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The metric stability framework was originally introduced with the intent of safeguarding significant metrics from being broken downstream. Metrics could be deemed stable or alpha, and only stable metrics would have stability guarantees.
This KEP intends to propose additional stability classes to extend on the existing metrics stability framework, such that we can achieve parity with the various stages of the feature release cycle.
Motivation
It has become clear that we need additional stability classes, particularly with respect to the various stages of feature releases. The need has grown with the advent of PRR (production readiness reviews) and mandated production readiness metrics.
Goals
Introduce two more metric classes: beta, which corresponds to the beta stage of a feature release, and internal, which corresponds to internal development-related metrics.
Non-Goals
- Establishing whether specific metrics fall into a given stability class. This exercise is left to component owners, who own their own metrics.
Proposal
We’re proposing adding additional metadata fields to Kubernetes metrics. Specifically we want to add the following stability levels:
- Internal: representing internal usages of metrics (i.e. classes of metrics which do not correspond to features) or low-level metrics that a typical operator would not understand (or would not be able to react to properly).
- Beta: representing a more mature stage in a feature metric, with greater stability guarantees than alpha or internal metrics, but less than Stable.
We also propose amending the semantic meaning of an Alpha metric such that it represents the nascent stage of a KEP-proposed feature, rather than the entire class of metrics without stability guarantees.
Additionally, we propose forced upgrades of metrics stability classes, in the same vein that features are not allowed to languish in alpha or beta stages. This capability will not be available until the beta version of this KEP. For the alpha version of this KEP, we will implement the necessary changes to the Kubernetes metrics framework so that it supports the additional classes of metrics, without making changes to any existing metrics or their stability levels. As such, this KEP proposes changes to the metrics pipeline and the static analysis pieces of the Kubernetes metrics framework.
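As a rough illustration of the four classes described in this section, the following sketch models the stability level as a small enumeration. The type name, constant names, string values, and helper functions are assumptions made for this example and are not taken from the framework itself.

```go
package main

import "fmt"

// StabilityLevel is a hypothetical stand-in for the stability metadata
// attached to a metric; the real framework type may differ.
type StabilityLevel string

const (
	Internal StabilityLevel = "INTERNAL"
	Alpha    StabilityLevel = "ALPHA"
	Beta     StabilityLevel = "BETA"
	Stable   StabilityLevel = "STABLE"
)

// guarantee ranks levels by the strength of their stability guarantees.
// Internal and Alpha share rank 0 because neither carries guarantees.
var guarantee = map[StabilityLevel]int{Internal: 0, Alpha: 0, Beta: 1, Stable: 2}

// ParseStabilityLevel rejects any value outside the four proposed classes.
func ParseStabilityLevel(s string) (StabilityLevel, error) {
	level := StabilityLevel(s)
	if _, ok := guarantee[level]; !ok {
		return "", fmt.Errorf("unknown stability level %q", s)
	}
	return level, nil
}

// StrongerThan reports whether l carries stricter guarantees than other.
func (l StabilityLevel) StrongerThan(other StabilityLevel) bool {
	return guarantee[l] > guarantee[other]
}

func main() {
	level, err := ParseStabilityLevel("BETA")
	fmt.Println(level, err, level.StrongerThan(Alpha), level.StrongerThan(Stable))
}
```

Ranking Internal and Alpha equally reflects that both lack guarantees; they differ in meaning, not in strength.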
Risks and Mitigations
The primary risk is that these changes break our existing (and working) metrics infrastructure. The mitigation should be straightforward: roll back the changes to the metrics framework.
Design Details
Our plan is to add functionality to our static analysis framework which is hosted in the main k8s/k8s repo, under test/instrumentation. Specifically, we will need to support:
- parsing variables
- multi-line strings
- evaluating buckets
- buckets which are defined via variables and consts
- evaluation of simple consts
- evaluation of simple variables
We will not attempt to parse metrics which:
- are constructed dynamically, i.e. through function calls which use function arguments as parameters in metric definitions, since some of those cannot be resolved until runtime.
- are constructed using custom prometheus collectors, for the same reasons as above.
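To make the distinction between resolvable and unresolvable definitions concrete, here is a minimal sketch built on the standard `go/ast` and `go/parser` packages. It is not the tooling under test/instrumentation: the `opts` struct, the `sample` source, and all helper names are hypothetical, and it only handles string literals and simple top-level constants.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// sample is hypothetical input; opts stands in for a metric options struct.
const sample = `package sample

const queueDepth = "queue_depth"

type opts struct{ Name string }

var a = opts{Name: "requests_total"}
var b = opts{Name: queueDepth}

func dynamic(name string) opts { return opts{Name: name} }
`

// stringConsts collects top-level constants whose value is a string literal.
func stringConsts(f *ast.File) map[string]string {
	consts := map[string]string{}
	for _, decl := range f.Decls {
		gd, ok := decl.(*ast.GenDecl)
		if !ok || gd.Tok != token.CONST {
			continue
		}
		for _, spec := range gd.Specs {
			vs, ok := spec.(*ast.ValueSpec)
			if !ok {
				continue
			}
			for i, name := range vs.Names {
				if i < len(vs.Values) {
					if s, ok := literal(vs.Values[i]); ok {
						consts[name.Name] = s
					}
				}
			}
		}
	}
	return consts
}

// literal unquotes a string literal; raw (multi-line) strings work too.
func literal(e ast.Expr) (string, bool) {
	lit, ok := e.(*ast.BasicLit)
	if !ok || lit.Kind != token.STRING {
		return "", false
	}
	s, err := strconv.Unquote(lit.Value)
	return s, err == nil
}

// metricNames resolves every `Name:` field it can and counts the rest.
// Simplification: identifiers are matched by name only, ignoring scope.
func metricNames(src string) (resolved []string, skipped int, err error) {
	f, err := parser.ParseFile(token.NewFileSet(), "metrics.go", src, 0)
	if err != nil {
		return nil, 0, err
	}
	consts := stringConsts(f)
	ast.Inspect(f, func(n ast.Node) bool {
		kv, ok := n.(*ast.KeyValueExpr)
		if !ok {
			return true
		}
		if key, ok := kv.Key.(*ast.Ident); !ok || key.Name != "Name" {
			return true
		}
		if s, ok := literal(kv.Value); ok {
			resolved = append(resolved, s)
		} else if id, ok := kv.Value.(*ast.Ident); ok && consts[id.Name] != "" {
			resolved = append(resolved, consts[id.Name])
		} else {
			skipped++ // e.g. a function argument, known only at runtime
		}
		return true
	})
	return resolved, skipped, nil
}

func main() {
	names, skipped, err := metricNames(sample)
	fmt.Println(names, skipped, err)
}
```

In the sample, the literal and the constant-backed name resolve, while the name passed in as a function argument is counted as skipped, which matches the "will not attempt to parse" cases above.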
As an aside, much of this work has already been done, but is stashed in a local repo.
Semantics of Stability Levels
Internal Metrics
Internal metrics have no stability guarantees and are not parseable by the static analysis framework. As such, Internal metrics will NOT be included in metric auto-documentation.
Alpha Metrics
Alpha metrics have no stability guarantees but are parseable by the static analysis framework. As such, Alpha metrics will be included in metric auto-documentation.
Beta Metrics
Beta metrics have some stability guarantees. Specifically, we guarantee that:
- Beta metrics will not be removed without first being explicitly deprecated.
  - You can deprecate Beta metrics at any point:
    - If changes in the underlying code or feature make it impossible to compute the metric, the metric can be removed after one release.
    - If the metric can technically still be exposed (we just think it is not the right one, e.g. we want to remove some label), we leave it deprecated for 3 releases.
- Furthermore, Beta metrics are guaranteed to be forward compatible with respect to alerts and queries which may be written against them. By “forward compatible”, we mean that queries and alerts which are written against the metric and its labels will continue to work in the future. We ensure forward compatibility by ensuring that labels can only be added to, and not removed from, Beta metrics.
- Beta metrics will be included in metric auto-documentation.
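The two removal windows for deprecated Beta metrics can be read as simple arithmetic on minor release numbers. The sketch below assumes that "after one release" and "for 3 releases" are counted from the release in which the metric was deprecated; that counting convention and the function name are assumptions for illustration, not something this KEP specifies.

```go
package main

import "fmt"

// EarliestBetaRemoval returns the first minor release in which a Beta metric
// deprecated in deprecatedMinor may be removed, under the assumed counting
// convention that windows start at the deprecating release.
func EarliestBetaRemoval(deprecatedMinor int, stillComputable bool) int {
	if stillComputable {
		// The metric could still be exposed, so it stays deprecated for 3 releases.
		return deprecatedMinor + 3
	}
	// The metric can no longer be computed, so one release of notice suffices.
	return deprecatedMinor + 1
}

func main() {
	fmt.Println(EarliestBetaRemoval(26, true), EarliestBetaRemoval(26, false))
}
```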
Stable Metrics
Stable metrics have stability guarantees. Specifically, we guarantee that:
- Stable metrics will not be removed without first being explicitly deprecated. After deprecation, the metric will be removed in 12 months or 3 releases.
- Furthermore, Stable metrics are guaranteed to not change with respect to labels. This means labels can neither be added to nor removed from a Stable metric.
- Stable metrics will be included in metric auto-documentation.
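The per-level rules in this section (documentation inclusion and permitted label changes) can be summarized in a short sketch. The type, constants, and function names below are hypothetical and only illustrate the stated semantics; they are not the framework's actual checks.

```go
package main

import "fmt"

// StabilityLevel is a hypothetical stand-in for the framework's level type.
type StabilityLevel string

const (
	Internal StabilityLevel = "INTERNAL"
	Alpha    StabilityLevel = "ALPHA"
	Beta     StabilityLevel = "BETA"
	Stable   StabilityLevel = "STABLE"
)

// IncludedInDocs mirrors the rule that only Internal metrics are left out
// of metric auto-documentation.
func IncludedInDocs(level StabilityLevel) bool {
	return level != Internal
}

// toSet turns a label slice into a set for membership checks.
func toSet(labels []string) map[string]bool {
	set := make(map[string]bool, len(labels))
	for _, l := range labels {
		set[l] = true
	}
	return set
}

// CheckLabelChange returns an error when changing a metric's labels from
// oldLabels to newLabels would violate the guarantee of its level:
// Beta may only gain labels, Stable may neither gain nor lose them.
func CheckLabelChange(level StabilityLevel, oldLabels, newLabels []string) error {
	if level != Beta && level != Stable {
		return nil // Internal and Alpha carry no label guarantees.
	}
	prev, next := toSet(oldLabels), toSet(newLabels)
	for _, l := range oldLabels {
		if !next[l] {
			return fmt.Errorf("%s metric: label %q was removed", level, l)
		}
	}
	if level == Stable {
		for _, l := range newLabels {
			if !prev[l] {
				return fmt.Errorf("%s metric: label %q was added", level, l)
			}
		}
	}
	return nil
}

func main() {
	fmt.Println(CheckLabelChange(Beta, []string{"verb"}, []string{"verb", "code"}))
	fmt.Println(CheckLabelChange(Stable, []string{"verb"}, []string{"verb", "code"}))
}
```

Adding a label is accepted for the Beta metric because existing queries keep working, while the same change is rejected for the Stable metric.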
Test Plan
We have static analysis testing for stable metrics; we will extend our test coverage to include metrics which are ALPHA and BETA while ignoring INTERNAL metrics.
[ X ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
We already have thorough testing for the stability framework which has been GA for years.
Unit tests
- [X] parsing variables
- [X] multi-line strings
- [X] evaluating buckets
- [X] buckets which are defined via variables and consts
- [X] evaluation of simple consts
- [X] evaluation of simple variables
- test/instrumentation: 09/20/2022 - full coverage of existing stability framework
Integration tests
We will test the static analysis parser on a test directory with all permutations of metrics which we expect to parse (and variants we expect not to be able to parse)
e2e tests
The static analysis tooling runs in a precommit pipeline and is therefore exempt from runtime tests.
Graduation Criteria
Alpha
- Kubernetes metrics framework will be enhanced to support additional stability classes
- The static analysis pipeline of the metrics framework will be enhanced to parse the additional constructs listed under Design Details
Beta
- Kubernetes metrics framework will be enhanced to support marking Alpha and Beta metrics with a release version. The semantics of this are yet to be determined. This version will be used to statically determine whether that metric should be deprecated automatically or promoted.
For the beta version of this KEP, we begin permitting metrics to be promoted to the Beta stability class.
GA
- We will allow bake time before promoting this feature to GA
- At this stage, we will promote our meta-metric for registered metrics to Stable
- We also require an update to the prometheus golang client such that we can add process start time to a header, so that scraping clients do not have to parse the entire metrics payload in order to properly process counter metrics. Please see this PR for more details.
Deprecation
- This section will pertain to the deprecation policy of deprecated Alpha and Beta metrics, which will be determined in the Beta version of this KEP.
Upgrade / Downgrade Strategy
The static analysis code does not run in Kubernetes runtime code, with the exception of the registered_metrics metric.
Version Skew Strategy
This feature does not require a version skew strategy.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
This feature cannot be enabled or rolled back. It is built into the infrastructure of metrics, which will support two additional values for the enumeration of metric stability classes.
How can this feature be enabled / disabled in a live cluster?
It cannot. This is purely infrastructure based and requires adding additional enumeration values to metrics stability classes.
Does enabling the feature change any default behavior?
It will cause metrics previously annotated as Alpha metrics to be denoted as Internal.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
No.
What happens if we reenable the feature if it was previously rolled back?
N/A
Are there any tests for feature enablement/disablement?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This should not affect rollout. It could affect workloads that depended on Alpha metrics, which will be recategorized as Internal. But to be fair, we’ve already explicitly laid out the fact that Alpha metrics do not have stability guarantees.
What specific metrics should inform a rollback?
registered_metrics_total summing to zero.
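One way to check this signal by hand is to sum every `registered_metrics_total` sample in a scraped metrics payload. The sketch below uses only the standard library and assumes samples carry no trailing timestamp; the payload and its label name are made up for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in a Prometheus text
// exposition payload. It assumes samples have no trailing timestamp, so the
// last whitespace-separated field on each line is the sample value.
func sumMetric(payload, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(payload))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") || !strings.HasPrefix(line, name) {
			continue // blank line, HELP/TYPE comment, or unrelated metric
		}
		rest := strings.TrimPrefix(line, name)
		if !strings.HasPrefix(rest, "{") && !strings.HasPrefix(rest, " ") {
			continue // a different metric that merely shares the prefix
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	// Hypothetical payload; the label name is illustrative only.
	payload := `# TYPE registered_metrics_total counter
registered_metrics_total{level="ALPHA"} 0
registered_metrics_total{level="STABLE"} 0
registered_metrics_total_other 7
`
	total, err := sumMetric(payload, "registered_metrics_total")
	fmt.Println(total == 0, err)
}
```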
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
This should not affect upgrade/rollback paths.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Alpha metrics will be recategorized as Internal.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
We’ve introduced a metric (i.e. registered_metrics_total) which should serve to indicate this feature is enabled.
How can someone using this feature know that it is working for their instance?
They will be able to see metrics.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This tooling runs in precommit. It does not affect runtime SLOs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
N/A
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Prometheus and the Kubernetes metric framework.
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The apiserver needs to be available to scrape metrics. If etcd is not available, you may still be able to scrape metrics from the apiserver.
What are other known failure modes?
Runaway cardinality of metrics, but that is orthogonal to the scope of this KEP.
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
Drawbacks
This introduces complexity to metrics stability levels; however, this has been asked for by various community members over the past few years. As a community, we are also moving towards requiring metrics as a prerequisite for KEPs, which this should align with.
Alternatives
Doing nothing is a viable alternative. However, we would end up in an awkward spot with feature metrics, where they either have no guarantees or are fully stable.