KEP-3466: Kubernetes Component Health SLIs
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP intends to allow us to emit SLI data in a structured and consistent way, so that monitoring agents can scrape this data more frequently and create SLOs (service level objectives) and alerts off of these SLIs (service level indicators).
Motivation
Healthchecking data is currently surfaced in unstructured format and is scraped by monitoring agents (as well as the kubelet), which must be configured to interpret the health data and to act upon it. This process does not lend itself readily to the creation of availability SLOs, since we basically require an outside agent to parse the health endpoint and convert this signal into an SLI.
Goals
- Create a uniform interface by which we can consume health checking information
- Allow SLIs to be created without a specialized monitoring agent
- Allow for increased granularity (by configuring a more frequent interval) of SLI metric data
- Minimize the diff involved for each Kubernetes component
Non-Goals
- Creation of SLOs is out of scope. This KEP specifically targets the creation of signals which can be used as SLIs.
Proposal
We propose adding a new endpoint, /metrics/slis, to Kubernetes components. It returns
SLI data in Prometheus format.
Risks and Mitigations
In the alpha phase of this KEP, we propose adding health check data to the metrics/slis
endpoint. Since this is a separate endpoint, it does not have to be used. The risk of
adding metrics is generally cardinality, but in this case we are proposing known dimensions
to the metrics, specifically:
- status - one of Success, Error, Pending
- type - one of livez, readyz, healthz
- name - the known name of the health check. AFAIK, these are all static strings in the Kubernetes codebase, therefore bounded in cardinality
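To make the cardinality argument concrete, here is a minimal sketch that computes the upper bound on series for one metric carrying the three labels above and renders a single sample line in the Prometheus text format. The metric name `example_healthcheck` and the check names are placeholders for illustration; this KEP text does not define them.

```go
package main

import "fmt"

// Label values taken from the list above.
var (
	statuses   = []string{"Success", "Error", "Pending"}
	checkTypes = []string{"livez", "readyz", "healthz"}
)

// maxSeries returns the upper bound on the number of time series for one
// metric labeled by (status, type, name), given a fixed set of check names.
func maxSeries(checkNames []string) int {
	return len(statuses) * len(checkTypes) * len(checkNames)
}

// formatSample renders one sample line in the Prometheus text format.
// %q quoting matches Prometheus label escaping for plain ASCII values,
// which is all this sketch uses.
func formatSample(metric, name, status, checkType string, value float64) string {
	return fmt.Sprintf("%s{name=%q,status=%q,type=%q} %g", metric, name, status, checkType, value)
}

func main() {
	names := []string{"etcd", "ping", "log"} // hypothetical check names
	fmt.Println(maxSeries(names))            // 3 statuses * 3 types * 3 names
	fmt.Println(formatSample("example_healthcheck", "etcd", "Success", "readyz", 1))
}
```

Because every label draws from a fixed set, the bound grows only with the number of statically named health checks.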
Design Details
When the healthz/livez/readyz paths are accessed (not on a timer), the component records whatever they return in a gauge metric and increments a counter. The downside is staleness: the health check data can be as stale as the length of the kubelet scrape interval. However, given that our e2e tests configure the apiserver with 1s intervals, it is reasonable to assume that cloud providers configure similarly small scrape intervals, so staleness should not realistically be much of an issue. If the kubelet gets stuck, one can alert off of the counter that we expose: if the counter stops incrementing, then we know that the health endpoint is not being hit and that our gauge data is too stale. It would therefore be prudent to set a staleness alert off of the counter.
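The record-on-access behavior and the counter-based staleness check can be sketched as follows. This is an in-memory illustration using only the standard library, not the component-base implementation; the `Recorder` type, its methods, and the `Stale` helper are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sync"
)

var statuses = []string{"Success", "Error", "Pending"}

// series identifies one labeled time series.
type series struct{ Type, Name, Status string }

// Recorder holds an in-memory gauge and counter (hypothetical type).
type Recorder struct {
	mu      sync.Mutex
	gauge   map[series]float64
	counter map[series]uint64
}

func NewRecorder() *Recorder {
	return &Recorder{gauge: map[series]float64{}, counter: map[series]uint64{}}
}

// Observe would be called from the health handler each time a check is served.
// It resets the gauge for every status of this check, sets the current status
// to 1, and increments the counter for the observed status.
func (r *Recorder) Observe(checkType, name, status string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, s := range statuses {
		r.gauge[series{checkType, name, s}] = 0
	}
	r.gauge[series{checkType, name, status}] = 1
	r.counter[series{checkType, name, status}]++
}

// Gauge returns the current gauge value for one series.
func (r *Recorder) Gauge(checkType, name, status string) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.gauge[series{checkType, name, status}]
}

// Total returns the number of observations of a check across all statuses.
func (r *Recorder) Total(checkType, name string) uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	var sum uint64
	for _, s := range statuses {
		sum += r.counter[series{checkType, name, s}]
	}
	return sum
}

// Stale reports whether the counter did not advance between two readings,
// meaning the health endpoint was not hit and the gauge may be out of date.
func Stale(prev, curr uint64) bool { return curr == prev }

func main() {
	r := NewRecorder()
	before := r.Total("readyz", "etcd")
	r.Observe("readyz", "etcd", "Success")
	r.Observe("readyz", "etcd", "Error")
	after := r.Total("readyz", "etcd")
	fmt.Println(r.Gauge("readyz", "etcd", "Success"), r.Gauge("readyz", "etcd", "Error"))
	fmt.Println(after, Stale(before, after))
}
```

The gauge reflects only the most recent result, while the counter retains every result seen, which is what makes it usable as a staleness signal.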
We considered fetching metric data when the metrics endpoint was hit, but this would introduce extra load against the health endpoint, which we took care to avoid. Alternatively, we considered periodic polling of the metrics endpoint such that the metrics would be incremented only during this periodic poll, but that change would be larger and would need to be implemented in each component for each of their health check endpoints.
Using a gaugeFunc would also preclude making the metric stable, since gaugeFuncs are
dynamic by nature and therefore cannot be parsed at compile time by the stability framework.
Since these metrics are intended to be used as component health SLIs, we want them to be
able to be promoted to stable status.
Test Plan
[ X ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
N/A. The existing health check feature is already thoroughly tested and has been GA for several years without any issues.
Unit tests
[ X ] ensure that healthcheck states are reset for gauges on write
[ X ] ensure that counters properly retain state of all seen healthchecks
- staging/src/k8s.io/component-base/metrics/prometheus/health/metrics_test.go: 09-21-2022 - existing battery of tests for testing the metrics endpoint
Integration tests
[ X ] ensure existence of healthcheck endpoint (beta requirement)
e2e tests
Given this feature is purely in-memory, no enablement/disablement tests are needed.
[ X ] ensure existence of healthcheck endpoint (beta requirement)
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- Feature implemented for apiserver
- unit tests covering aspects of the feature
Beta
- e2e tests completed and enabled (this needs to be beta due to feature flag enablement)
- Gather feedback from developers
- Feature implemented for other Kubernetes Components
GA
- Several cycles of bake-time
- Graduation of metrics to stable status
Deprecation
- Announce deprecation and support policy of the existing flag
- Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
- Address feedback on usage/changed behavior, provided on GitHub issues
- Deprecate the flag
Upgrade / Downgrade Strategy
This is a new metrics endpoint and should not affect upgrade/downgrade strategy with the exception that if you are scraping this endpoint, downgrading may remove this endpoint from the Kubernetes components and you may end up missing these metrics.
Version Skew Strategy
We do not plan to modify these metrics, so it should be safe for version skew.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
We will put this feature behind a feature gate, ComponentSLIs.
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: ComponentSLIs
  - Components depending on the feature gate:
    - apiserver
    - kubelet
    - scheduler
    - cloud-controller-manager
    - kube-controller-manager
    - kube-proxy
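As a rough illustration of what gating means for the endpoint, the sketch below registers /metrics/slis on an HTTP mux only when a boolean standing in for the ComponentSLIs gate is true. It uses only the Go standard library; `newMux` and `statusFor` are hypothetical helpers, and real components wire feature gates through their own machinery rather than a plain boolean.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux builds a mux that always serves /metrics and serves /metrics/slis
// only when the (stand-in) ComponentSLIs gate is enabled.
func newMux(componentSLIsEnabled bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "# existing metrics")
	})
	if componentSLIsEnabled {
		mux.HandleFunc("/metrics/slis", func(w http.ResponseWriter, _ *http.Request) {
			fmt.Fprintln(w, "# SLI metrics")
		})
	}
	return mux
}

// statusFor issues an in-process GET against the handler and returns the
// HTTP status code, without opening a network listener.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(newMux(true), "/metrics/slis"))  // gate on
	fmt.Println(statusFor(newMux(false), "/metrics/slis")) // gate off
	fmt.Println(statusFor(newMux(false), "/metrics"))      // unaffected by the gate
}
```

With the gate off, a scrape of /metrics/slis gets a 404 while the existing /metrics endpoint is untouched.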
Does enabling the feature change any default behavior?
Yes, it will expose a new metrics endpoint.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, but doing so removes the metrics endpoint.
What happens if we reenable the feature if it was previously rolled back?
It will expose the metrics endpoint again.
Are there any tests for feature enablement/disablement?
Given this feature is purely in-memory, no enablement/disablement tests are needed.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This feature should not cause rollout failures. If it does, we can disable the feature. In the worst case, it is possible it could cause runtime failures, but it is highly unlikely we would not detect this with existing tests.
What specific metrics should inform a rollback?
This KEP introduces health metrics, and those metrics can themselves be used to inform a rollback.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
This should not be necessary: we are adding a new metrics endpoint with no dependencies. The rollback simply removes the endpoint, so any scrapes that were happening will just fail.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
I am proposing a metric/series of metrics here.
How can an operator determine if the feature is in use by workloads?
They can check their Prometheus scrape configs.
How can someone using this feature know that it is working for their instance?
They can curl the /metrics/slis endpoint of any of the Kubernetes components (except etcd).
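For a programmatic equivalent of the curl check, here is a minimal sketch in Go. It runs against a local stand-in server so that it is self-contained; the payload and the metric name `example_healthcheck` are made up for illustration, and a real component would typically also require TLS and credentials, which are omitted here. `parseSamples` and `fetchSLIs` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// parseSamples returns the sample lines of a Prometheus text payload,
// skipping blank lines and "#" comment lines (HELP/TYPE metadata).
func parseSamples(payload string) []string {
	var out []string
	for _, line := range strings.Split(payload, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		out = append(out, line)
	}
	return out
}

// fetchSLIs performs the equivalent of `curl <baseURL>/metrics/slis`.
func fetchSLIs(baseURL string) ([]string, error) {
	resp, err := http.Get(baseURL + "/metrics/slis")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseSamples(string(body)), nil
}

func main() {
	// Stand-in for a component; the payload below is invented for this sketch.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/metrics/slis" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, "# TYPE example_healthcheck gauge\nexample_healthcheck{name=\"etcd\",type=\"readyz\"} 1\n")
	}))
	defer srv.Close()

	lines, err := fetchSLIs(srv.URL)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

A non-200 response, such as the 404 returned when the feature is disabled, surfaces as an error rather than an empty result.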
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This is intended to allow people to establish SLOs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
This KEP introduces SLIs.
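As an illustration of how an operator might turn the counter into an availability SLI, the sketch below computes the fraction of health check results that were Success between two counter readings. The `Counts` type and `availability` function are hypothetical, and choosing the window and any SLO target remains outside the scope of this KEP.

```go
package main

import "fmt"

// Counts is a snapshot of one health check's counter, broken down by status.
type Counts struct{ Success, Error, Pending uint64 }

func (c Counts) total() uint64 { return c.Success + c.Error + c.Pending }

// availability returns the fraction of results in the window (start, end]
// that were Success. ok is false when nothing was observed in the window or
// when the counters went backwards (for example after a component restart).
func availability(start, end Counts) (ratio float64, ok bool) {
	if end.Success < start.Success || end.total() <= start.total() {
		return 0, false
	}
	deltaSuccess := end.Success - start.Success
	deltaTotal := end.total() - start.total()
	return float64(deltaSuccess) / float64(deltaTotal), true
}

func main() {
	start := Counts{Success: 10, Error: 0}
	end := Counts{Success: 13, Error: 1}
	ratio, ok := availability(start, end) // 3 successes out of 4 results
	fmt.Println(ratio, ok)
}
```

Working from counter deltas rather than the gauge means every result recorded between the two readings counts toward the ratio, not just the latest one.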
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes, the exact metrics that this KEP proposes.
Dependencies
Prometheus client and the Kubernetes metrics framework.
Does this feature depend on any specific services running in the cluster?
Yes, it depends on the Kubernetes components running in the cluster.
Scalability
We already hit the readyz/healthz/livez endpoints of our control-plane components frequently; this KEP only adds instrumentation of these endpoints’ results.
Will enabling / using this feature result in any new API calls?
Yes, we are proposing that these health metrics be surfaced in each component under /metrics/slis, which
will have to be consumed for the feature to be useful. However, this should be relatively innocuous since
it will be an isolated endpoint strictly for the purpose of surfacing health metrics.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No. This should, in theory, reduce calls to health endpoints since SLIs need to be calculated currently by directly hitting and parsing out the results of our existing health check endpoints, which adds to the total number of calls (since kubelet also hits these endpoints).
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If a component is unavailable, then you will not be able to ingest the metrics from that component. However, in the case of the apiserver, a failure of etcd should still allow you to scrape the metrics from the apiserver, so long as it is otherwise healthy.
What are other known failure modes?
If the metric's label values are unbounded, it can cause a memory leak. However, we are only proposing bounded label values, so this should not be a problem.
What steps should be taken if SLOs are not being met to determine the problem?
This KEP is what makes it possible to establish Kubernetes component health SLOs in the first place.
Implementation History
Drawbacks
Slight increase of memory usage for components (i.e. the breadth of the prometheus metric label values).
Alternatives
Status quo, which means this functionality has to be implemented in an external component.