KEP-1972: Kubelet Exec Probe Timeouts

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Implementation History
Drawbacks
Alternatives
Production Readiness Review Questionnaire

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
(R) Graduation criteria is in place
(R) Production readiness review completed
Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubelet today does not respect exec probe timeouts. This is considered a bug we should fix since the timeout value is supported in the Container Probe API. Because exec probe timeouts were never respected by kubelet, a new feature gate ExecProbeTimeout will be introduced. With this feature, nodes can be configured to preserve the current behavior while the proper timeouts are enabled for exec probes.

Motivation

Kubelet not respecting the probe timeout is a bug and should be fixed.

Goals

treat exec probe timeouts as probe failures in kubelet

Non-Goals

ensuring exec processes that timed out have been killed by kubelet.
introducing CRI errors for handling scenarios such as time outs.

Proposal

Risks and Mitigations

existing workloads on Kubernetes that relied on this bug may unexpectedly see their probes timeout

Design Details

Changes to kubelet:

Ensure kubelet handles timeout errors and registers them as failing probes.
Add feature gate ExecProbeTimeout that is GA and on by default.
If the feature gate ExecProbeTimeout is disabled and an exec probe timeout is reached, add warning event to inform users that exec probes are timing out.
Introduce the probe duration metric
- metric dimension cardinality must be reviewed and approved by SIG Instrumentation
Re-enable existing exec liveness probe e2e test.
Add new exec readiness probe e2e test.

Test Plan

E2E tests:

re-enable existing exec liveness probe e2e test that is currently being skipped
add new exec readiness probe e2e test.
exec probe tests are promotes to Conformance (#97619 ).

Graduation Criteria

This is a bug fix so the feature gate will be GA and on by default from the start.

Documentation on the migration steps must be provided at kubernetes documentation site offering tips on detecting and updating affected workloads.

The feature flag should be kept available till we get a sufficient evidence of people not being affected by this bug fix - either directly (adjusting the timeouts in pod definition), or indirectly, when the timeout is not specified in some third party templates and products that cannot be easily fixed by end user.

Tentative timeline is to lock the feature flag to true in 1.25.

Upgrade / Downgrade Strategy

N/A

Version Skew Strategy

N/A

Implementation History

2020-09-08 - the KEP was merged as implementable for v1.20
2020-12-08 - Timeout is respected in Kubernetes 1.20: The Raddest Release , and can be disabled with the feature flag

Drawbacks

Existing workloads may depend on the fact that exec probe timeouts were never respected. Introducing the timeout now may result in unexpected behavior for some workloads.

Alternatives

Some alternatives that were considered:

Increasing the default timeout for exec probes
Continuing to ignore the exec probe timeout

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: ExecProbeTimeouts
- Components depending on the feature gate: kubelet

Does enabling the feature change any default behavior?

Yes, all workloads that were not accounting for the timeout affect the probe behavior will experience the problem.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by resetting the feature gate back.

What happens if we reenable the feature if it was previously rolled back?

Behavior will restore back immediately.

Are there any tests for feature enablement/disablement?

N/A, trivial

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout and rollback are straightforward and are not expected to fail.

What specific metrics should inform a rollback?

Pods entering crashloopbackoff because of exec timeout failure.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A, trivial

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

The only mechanism currently implemented is warning logs in kubelet. The KEP was updated to introduce the warning events for the cases when timeout was exceeded. With these events, operator may ensure that no workloads are affected by this bug currently by analyzing events.

How can an operator determine if the feature is in use by workloads?

Before migration, analyze events indicating that the timeout was exceeded by exec probe. There is no way to determine if exceed timeout failure of exec probes were intentional or not once the feature gate was enabled.

How can someone using this feature know that it is working for their instance?

No, there is no way to determine if exceed timeout failure of exec probes were intentional or not once the feature gate was enabled.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

SLO of the feature: exec probes must fail when timeout is exceeded. This can be checked by reviewing that Probe duration metric not exceeding significantly the timeout value.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: probe_duration_seconds

Are there any missing metrics that would be useful to have to improve observability of this feature?

The metric probe duration metric was not implemented yet.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

Kubelet.log may be used for all the probes behavior troubleshooting.

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

None

What steps should be taken if SLOs are not being met to determine the problem?

None. It is a core functionality of kubelet