KEP-5468: Invariant Testing

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Implement outside of test framework
- Collect data, and scan for it externally
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

With this KEP, we introduce an approach to testing invariants within the cluster across the entire end-to-end test run, with rules in place to minimize the risk of “tragedy of the commons” flakes across CI.

Motivation

SIG Testing has received multiple requests to enable testing of invariants that span entire end-to-end test suite runs, rather than being isolated to individual tests.

For example, declarative validation introduces a metric that is incremented whenever declarative and imperative validation disagree. Any increment indicates a bug.

A test that spans the entire suite can start failing in any number of jobs that its authors may not be monitoring, which raises concerns about flaky tests and a “tragedy of the commons”. This KEP therefore formalizes the approach and introduces rules and guardrails.

See:

Goals

Enable end-to-end tests to check invariants after the rest of the test suite
Prevent unowned tests
Allow opting in to or out of this testing
Prevent flaking, unreliable tests
Ensure result reporting is structured
Must not impact the conformance test suite

Non-Goals

Enable completely arbitrary checks
Targeting integration tests.
- We are specifically aiming for end to end tests for this purpose.

Proposal

First, all invariants MUST have clearly documented owners, who will be surfaced with the failure results. No invariant tests may roll up to the general test owners. There is precedent for this in the documented owners of feature gates and metrics.

Similarly, the associated SIG(s) must be registered, for use in tagging related issues.
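As a minimal sketch of how the ownership rule could be enforced at registration time, the following rejects any invariant that lacks owners or SIGs. The `Invariant` and `Registry` types and all field names are hypothetical and only illustrate the requirement; they are not part of the e2e framework.

```go
package main

import (
	"errors"
	"fmt"
)

// Invariant describes one suite-wide check. All names here are hypothetical.
type Invariant struct {
	Name   string   // name of the sentinel test that enables the check
	Owners []string // handles surfaced with failure results
	SIGs   []string // SIGs used when tagging related issues
}

// Registry holds registered invariants keyed by name.
type Registry struct {
	byName map[string]Invariant
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byName: map[string]Invariant{}}
}

// Register rejects invariants without a name, owners, or SIGs,
// and refuses to register the same name twice.
func (r *Registry) Register(inv Invariant) error {
	switch {
	case inv.Name == "":
		return errors.New("invariant must have a name")
	case len(inv.Owners) == 0:
		return fmt.Errorf("invariant %q must list at least one owner", inv.Name)
	case len(inv.SIGs) == 0:
		return fmt.Errorf("invariant %q must list at least one SIG", inv.Name)
	}
	if _, exists := r.byName[inv.Name]; exists {
		return fmt.Errorf("invariant %q is already registered", inv.Name)
	}
	r.byName[inv.Name] = inv
	return nil
}

func main() {
	r := NewRegistry()
	// Missing owners: registration fails instead of rolling up to general test owners.
	err := r.Register(Invariant{Name: "example-invariant", SIGs: []string{"sig-example"}})
	fmt.Println(err)
}
```

Failing at registration, rather than at report time, would surface an unowned invariant during review instead of in CI.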

A shared system will be introduced to the e2e framework to enable this form of testing.

All utilities for this will be owned by the SIG Testing Leads and thoroughly vetted for reliability.

New invariants will be tested many times before merge, and may be demoted at any time by tagging the associated sentinel test [Flaky]. Failure to respond to flaking invariant tests in a timely fashion will result in demoting or removing them.

Risks and Mitigations

If implemented poorly, this could result in tests flaking in any number of e2e test CI jobs that are now running these tests.

We will mitigate this by thoroughly reviewing and testing the invariant checking utilities, and by limiting to only portable, reliable mechanisms.

If these tests do not run in many CI jobs, the signal will be of limited benefit. We will generally aim to introduce them as default-selected tests.

Design Details

Ginkgo, which we use for the e2e tests, does not have a facility for “this test must run after all other tests”. It does have ordering, but only within a group of tests. Alternatively, we could run arbitrary code after the suite, but that would not let us opt in to or out of the tests.

We will introduce a package under test/e2e that contains sentinel test(s) which do nothing other than register with Ginkgo that they have run, if selected.

These tests can be labeled appropriately, allowing the existing test selection mechanisms to select / skip them.

We can then use ginkgo.ReportAfterSuite to inspect which, if any, of these sentinel invariant tests ran, and run the actual test logic, reporting pass/fail.

This allows us the flexibility of treating them like regular tests while ensuring that the actual invariant check occurs after all other tests.
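The control flow described above can be sketched without Ginkgo itself. In the sketch below, `specResult` is a simplified stand-in for the per-test entries that the suite report would provide, and `invariantCheck` is a hypothetical pairing of a sentinel name with its check. In the real framework this logic would sit inside the `ginkgo.ReportAfterSuite` callback; the names and types here are assumptions for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// specResult is a simplified stand-in for one test entry in the suite report.
type specResult struct {
	Name string
	Ran  bool // false when the test was skipped or not selected
}

// invariantCheck pairs a sentinel test name with the check it enables.
type invariantCheck struct {
	Sentinel string
	Check    func() error
}

// runAfterSuite runs only the checks whose sentinel test actually ran and
// returns the failures keyed by sentinel name.
func runAfterSuite(results []specResult, checks []invariantCheck) map[string]error {
	ran := map[string]bool{}
	for _, r := range results {
		if r.Ran {
			ran[r.Name] = true
		}
	}
	failures := map[string]error{}
	for _, c := range checks {
		if !ran[c.Sentinel] {
			continue // sentinel deselected, so the invariant is opted out
		}
		if err := c.Check(); err != nil {
			failures[c.Sentinel] = err
		}
	}
	return failures
}

// passingCheck and failingCheck are placeholder checks for the example.
func passingCheck() error { return nil }
func failingCheck() error { return errors.New("invariant violated") }

func main() {
	results := []specResult{
		{Name: "sentinel-a", Ran: true},
		{Name: "sentinel-b", Ran: false}, // deselected by the job's test selection
		{Name: "sentinel-c", Ran: true},
	}
	checks := []invariantCheck{
		{Sentinel: "sentinel-a", Check: failingCheck},
		{Sentinel: "sentinel-b", Check: failingCheck},
		{Sentinel: "sentinel-c", Check: passingCheck},
	}
	for name, err := range runAfterSuite(results, checks) {
		fmt.Printf("%s: %v\n", name, err)
	}
}
```

Only `sentinel-a` is reported here: the check for `sentinel-b` never runs because its sentinel was not selected, and the check for `sentinel-c` passes.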

It also ensures that while default-enabled, these tests will not impact conformance testing, since the sentinel tests will not be tagged [Conformance] and therefore will not run, in turn disabling the actual after-suite checks.
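To illustrate why an untagged sentinel drops out of a conformance run, here is a small model of name-based selection. It assumes the job selects tests with a focus pattern such as `\[Conformance\]` and optionally skips with a pattern such as `\[Flaky\]`; the `selected` helper and the test names are hypothetical and far simpler than the real selection mechanisms.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected reports whether a test name would run under the given focus and
// skip patterns. An empty pattern means "no constraint".
func selected(name, focus, skip string) (bool, error) {
	if focus != "" {
		ok, err := regexp.MatchString(focus, name)
		if err != nil {
			return false, fmt.Errorf("bad focus pattern: %w", err)
		}
		if !ok {
			return false, nil
		}
	}
	if skip != "" {
		ok, err := regexp.MatchString(skip, name)
		if err != nil {
			return false, fmt.Errorf("bad skip pattern: %w", err)
		}
		if ok {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	names := []string{
		"[sig-example] Example API should behave [Conformance]",
		"[sig-testing] Invariants sentinel for example metric",
	}
	for _, n := range names {
		ok, err := selected(n, `\[Conformance\]`, "")
		fmt.Println(ok, err, n)
	}
}
```

Under this model the sentinel is not selected by the conformance focus, so its after-suite check would not run. Demoting a sentinel by tagging it [Flaky] would exclude it in the same way from jobs that skip flaky tests.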

Invariant checks will be limited to read-only operations, such as fetching metrics. Invariant checks should be simple, reliable, cheap code. They ideally run in most e2e test runs for useful coverage.

Failure logs for these invariant checks will include instructions about how to handle the test failure and whom to report bugs to, drawn from the documented owners.
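A read-only check of this kind could look like the following sketch, which sums the samples of one counter from metrics already fetched in the Prometheus text format and fails if the total is non-zero. The metric name `hypothetical_mismatch_total` is made up for illustration, the parser is deliberately simplified (it does not handle every escaping rule of the format), and fetching the metrics from the cluster is out of scope here.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumCounter adds up all samples of the named metric in text-format metrics.
// A metric that does not appear at all sums to zero.
func sumCounter(exposition, metric string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the label set or at the first space.
		name, rest := line, ""
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name, rest = line[:i], line[i:]
		}
		if name != metric {
			continue
		}
		// Drop the label set, if any, so the value is the first field left.
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("no value in sample %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("bad value in sample %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

// checkCounterIsZero fails when the counter was incremented during the run.
func checkCounterIsZero(exposition, metric string) error {
	total, err := sumCounter(exposition, metric)
	if err != nil {
		return err
	}
	if total != 0 {
		return fmt.Errorf("invariant violated: %s = %v, want 0", metric, total)
	}
	return nil
}

func main() {
	metrics := "# TYPE hypothetical_mismatch_total counter\n" +
		"hypothetical_mismatch_total{resource=\"a\"} 0\n" +
		"hypothetical_mismatch_total{resource=\"b\"} 2\n"
	fmt.Println(checkCounterIsZero(metrics, "hypothetical_mismatch_total"))
}
```

The check only reads data and has no cluster side effects, which keeps it cheap enough to run in most jobs. Whether an absent metric should pass, as it does in this sketch, is a decision each invariant's owners would need to make.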

Something like this, for the failure message:

Invariant failed for missing metric:
If this failed on a pull request, please check whether the PR changes may be related to the failure. If not, search for an existing GitHub issue before filing a new one.
If this failed in a periodic CI job, please file a bug and assign the owners.
Owners for this metric:
Associated Special Interest Groups:
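A message of this shape could be assembled from the registered owners and SIGs, roughly as sketched below. The `invariantInfo` type, the `failureMessage` helper, and the exact wording are assumptions loosely modeled on the draft message above, not a settled format.

```go
package main

import (
	"fmt"
	"strings"
)

// invariantInfo carries the documented ownership data for one invariant.
// The type and its fields are hypothetical.
type invariantInfo struct {
	Metric string
	Owners []string
	SIGs   []string
}

// failureMessage builds the text shown when an invariant check fails,
// including triage instructions and the documented owners and SIGs.
func failureMessage(info invariantInfo, detail string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Invariant failed for metric %s: %s\n", info.Metric, detail)
	b.WriteString("If this failed on a pull request, please check whether the PR changes may be related to the failure. ")
	b.WriteString("If not, search for an existing GitHub issue before filing a new one.\n")
	b.WriteString("If this failed in a periodic CI job, please file a bug and assign the owners.\n")
	fmt.Fprintf(&b, "Owners for this metric: %s\n", strings.Join(info.Owners, ", "))
	fmt.Fprintf(&b, "Associated Special Interest Groups: %s\n", strings.Join(info.SIGs, ", "))
	return b.String()
}

func main() {
	info := invariantInfo{
		Metric: "hypothetical_mismatch_total",
		Owners: []string{"alice", "bob"},
		SIGs:   []string{"sig-example"},
	}
	fmt.Print(failureMessage(info, "value 2, want 0"))
}
```

Generating the message from the same data used at registration would keep the owners shown in failures from drifting away from the documented owners.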

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

This KEP is itself entirely testing updates.

Unit tests

Not applicable, these are e2e tests.

Integration tests

Not applicable, these are e2e tests.

e2e tests

TBD: We will need to settle on the tests and names.

To start we will add something to cover the declarative validation metric.

Graduation Criteria

N/A. Alpha/Beta/GA does not apply to a KEP of this nature. This KEP is about decision making around the e2e tests; it does not have any feature-gated changes or user-facing API changes.

Upgrade / Downgrade Strategy

Not applicable.

Version Skew Strategy

Not applicable.