KEP-5468: KEP Template

Implementation History
STABLE Implemented
Created 2025-09-04
Latest v1.35
Milestones
Alpha v1.35
Beta v1.35
Stable v1.35
Ownership
Owning SIG
SIG Testing
Primary Authors

KEP-5468: Invariant Testing

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

With this KEP, we introduce an approach to testing invariants within the cluster across the entire end-to-end test run, with rules in place to minimize the risk of “tragedy of the commons” flakes across CI.

Motivation

SIG Testing has received multiple requests to enable testing invariants that span entire end-to-end test suite runs, rather than being isolated to individual tests.

For example, declarative validation introduces a metric that, when incremented, indicates a mismatch between declarative and imperative validation, which in turn indicates a bug.

A test that spans the entire suite can start failing in any number of jobs that its authors may not be monitoring, which raises concerns about flaky tests and a “tragedy of the commons”. This KEP therefore formalizes the approach and introduces rules and guardrails.

See:

Goals

  • Enable end to end tests to check invariants after the rest of the test suite
  • Prevent unowned tests
  • Allow opting into / out-of this testing
  • Prevent flaking, unreliable tests
  • Ensure result reporting is structured
  • Must not impact the conformance test suite

Non-Goals

  • Enable completely arbitrary checks
  • Targeting integration tests.
    • We are specifically aiming for end to end tests for this purpose.

Proposal

First, all invariants MUST have clearly documented owners, who will be surfaced with the failure results. No invariant tests may roll up to the general test owners. There is precedent for this in documenting featuregate and metrics owners.

Similarly, the associated SIG(s) must be registered, for use in tagging related issues.
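
The ownership rule above could be enforced at registration time. The following is a minimal sketch under stated assumptions, not the framework's actual API: the Invariant and Registry types, their fields, and the example invariant name are all hypothetical and exist only to illustrate rejecting an invariant that lacks owners or SIGs.

```go
package main

import (
	"errors"
	"fmt"
)

// Invariant describes one suite-wide check. The type and its field names
// are hypothetical; the KEP does not define a concrete type.
type Invariant struct {
	Name   string   // name of the sentinel test
	Owners []string // people responsible for triaging failures
	SIGs   []string // SIG(s) to tag on related issues
}

// Registry holds the invariants known to the (hypothetical) shared system.
type Registry struct {
	byName map[string]Invariant
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{byName: map[string]Invariant{}}
}

// Register rejects any invariant without owners or SIGs, so that no
// invariant can silently roll up to the general test owners.
func (r *Registry) Register(inv Invariant) error {
	switch {
	case inv.Name == "":
		return errors.New("invariant must have a name")
	case len(inv.Owners) == 0:
		return fmt.Errorf("invariant %q has no documented owners", inv.Name)
	case len(inv.SIGs) == 0:
		return fmt.Errorf("invariant %q has no associated SIG", inv.Name)
	}
	if _, dup := r.byName[inv.Name]; dup {
		return fmt.Errorf("invariant %q is already registered", inv.Name)
	}
	r.byName[inv.Name] = inv
	return nil
}

func main() {
	r := NewRegistry()
	// Missing owners and SIGs: registration is refused.
	fmt.Println(r.Register(Invariant{Name: "example-metric-stays-zero"}))
}
```

Failing at registration rather than at report time means an unowned invariant never reaches CI in the first place.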

A shared system will be introduced to the e2e framework to enable this form of testing.

All utilities for this will be owned by the SIG Testing Leads and thoroughly vetted for reliability.

New invariants will be tested many times before merge, and may be demoted at any time by tagging the associated sentinel test [Flaky]. Failure to respond to flaking invariant tests in a timely fashion will result in demoting or removing them.

Risks and Mitigations

If implemented poorly, this could result in tests flaking in any number of e2e test CI jobs that are now running these tests.

We will mitigate this by thoroughly reviewing and testing the invariant checking utilities, and by limiting to only portable, reliable mechanisms.

If these tests do not run in many CI jobs, their signal will be of limited benefit. We will generally aim to introduce them as default-selected tests.

Design Details

Ginkgo, which we use for the e2e tests, does not have a facility for “this test must run after all other tests”. It does have ordering, but only within a group of tests. Alternatively, we could run arbitrary code after the suite, but that would not let us opt in to or out of the tests.

We will introduce a package under test/e2e that contains sentinel test(s) which do nothing other than register to Ginkgo that they have run, if selected.

These tests can be labeled appropriately, allowing the existing test selection mechanisms to select / skip them.

We can then use ginkgo.ReportAfterSuite to inspect which, if any, of these sentinel invariant tests ran, and run the actual test logic, reporting pass/fail.

This allows us the flexibility of treating them like regular tests while ensuring that the actual invariant check occurs after all other tests.

It also ensures that while default-enabled, these tests will not impact conformance testing, since the sentinel tests will not be tagged [Conformance] and therefore will not run, in turn disabling the actual after-suite checks.
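
The sentinel flow described above can be modeled in plain Go. This sketch deliberately does not use Ginkgo: SpecResult is a simplified stand-in for the per-spec information that the report passed to ginkgo.ReportAfterSuite would provide, and every name here (SpecResult, InvariantCheck, RunAfterSuite, errViolated) is a hypothetical chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errViolated is a placeholder failure used by the example checks.
var errViolated = errors.New("invariant violated")

// SpecResult is a simplified stand-in for what a test framework reports
// about one spec after the suite. It is not Ginkgo's real report type.
type SpecResult struct {
	Name string
	Ran  bool // false if test selection skipped the spec
}

// InvariantCheck pairs a sentinel test name with the actual check.
type InvariantCheck struct {
	Sentinel string
	Check    func() error // read-only check, run after all other tests
}

// RunAfterSuite runs only the checks whose sentinel spec actually ran
// and returns one entry per executed check, keyed by sentinel name.
func RunAfterSuite(results []SpecResult, checks []InvariantCheck) map[string]error {
	ran := map[string]bool{}
	for _, r := range results {
		if r.Ran {
			ran[r.Name] = true
		}
	}
	outcome := map[string]error{}
	for _, c := range checks {
		if !ran[c.Sentinel] {
			continue // sentinel was not selected, so the check is disabled
		}
		outcome[c.Sentinel] = c.Check()
	}
	return outcome
}

func main() {
	// Pretend the suite selected one sentinel and skipped the other,
	// for example because a conformance-only selection excluded it.
	results := []SpecResult{
		{Name: "invariant-a", Ran: true},
		{Name: "invariant-b", Ran: false},
	}
	checks := []InvariantCheck{
		{Sentinel: "invariant-a", Check: func() error { return nil }},
		{Sentinel: "invariant-b", Check: func() error { return errViolated }},
	}
	outcome := RunAfterSuite(results, checks)
	for _, c := range checks {
		err, checked := outcome[c.Sentinel]
		fmt.Printf("%s: checked=%v err=%v\n", c.Sentinel, checked, err)
	}
}
```

Because the check for a skipped sentinel never executes, existing test selection doubles as the opt-in and opt-out switch, with no separate configuration to maintain.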

Invariant checks will be limited to read-only operations, such as fetching metrics. Invariant checks should be simple, reliable, cheap code. Ideally, they run in most e2e test runs for useful coverage.
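
As an illustration of a simple, read-only check, the sketch below sums the samples of one counter in Prometheus text-format output and fails if the total is non-zero. It assumes the metrics text has already been fetched; the function names and the metric name example_validation_mismatch_total are hypothetical, and the parsing is simplified (it does not handle every corner of the exposition format).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// SumMetric adds up every sample of the named metric in Prometheus
// text-format output, across all label sets.
func SumMetric(exposition, name string) (float64, error) {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		// The name must be followed by labels or whitespace; otherwise
		// this is a different metric that merely shares the prefix.
		rest := line[len(name):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ' && rest[0] != '\t') {
			continue
		}
		if rest[0] == '{' {
			end := strings.LastIndex(rest, "}")
			if end < 0 {
				return 0, fmt.Errorf("malformed sample: %q", line)
			}
			rest = rest[end+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, fmt.Errorf("sample has no value: %q", line)
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", line, err)
		}
		total += v
	}
	return total, sc.Err()
}

// CheckCounterIsZero is the invariant: the counter must never have moved.
func CheckCounterIsZero(exposition, name string) error {
	total, err := SumMetric(exposition, name)
	if err != nil {
		return err
	}
	if total != 0 {
		return fmt.Errorf("invariant violated: %s = %v, want 0", name, total)
	}
	return nil
}

func main() {
	text := "# TYPE example_validation_mismatch_total counter\n" +
		"example_validation_mismatch_total{resource=\"pods\"} 0\n" +
		"example_validation_mismatch_total{resource=\"jobs\"} 2\n"
	fmt.Println(CheckCounterIsZero(text, "example_validation_mismatch_total"))
}
```

The check only reads data and never mutates cluster state, which keeps it cheap and safe to run in most jobs.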

Failure logs for these invariant checks will include instructions about how to handle the test failure and who to report bugs to, from the documented owners.

Something like this, for the failure message:

Invariant failed for missing metric:

If this failed on a pull request, please check whether the PR changes may be related to the failure. If not, search for an existing GitHub issue before filing a new one.

If this failed in a periodic CI, please file a bug and assign the owners.

Owners for this metric:

Associated Special Interest Groups:
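
A message of this shape could be assembled from the registered owners and SIGs. The sketch below is one possible way to do that; the FailureMessage function, its parameters, and the example metric, owner, and SIG values are hypothetical and not part of the e2e framework.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureMessage renders triage instructions for a failed invariant,
// ending with the documented owners and SIGs so that reports reach
// the right people.
func FailureMessage(metric string, owners, sigs []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Invariant failed for missing metric: %s\n\n", metric)
	b.WriteString("If this failed on a pull request, please check whether the PR changes may be related to the failure. ")
	b.WriteString("If not, search for an existing GitHub issue before filing a new one.\n\n")
	b.WriteString("If this failed in a periodic CI, please file a bug and assign the owners.\n\n")
	fmt.Fprintf(&b, "Owners for this metric: %s\n", strings.Join(owners, ", "))
	fmt.Fprintf(&b, "Associated Special Interest Groups: %s\n", strings.Join(sigs, ", "))
	return b.String()
}

func main() {
	// All values below are placeholders for illustration.
	fmt.Print(FailureMessage(
		"example_validation_mismatch_total",
		[]string{"@example-owner"},
		[]string{"sig-example"},
	))
}
```

Generating the text from the same data used at registration keeps the owner and SIG lists in the failure output from drifting out of date.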

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

This KEP is itself entirely testing updates.

Unit tests

Not applicable; these are e2e tests.

Integration tests

Not applicable; these are e2e tests.

e2e tests

TBD: We will need to settle on the tests and names.

To start we will add something to cover the declarative validation metric.

Graduation Criteria

N/A. Alpha/Beta/GA does not apply to a KEP of this nature. This KEP is about decision making around the e2e tests; it does not have any feature-gated changes or user-facing API changes.

Upgrade / Downgrade Strategy

Not applicable.

Version Skew Strategy

Not applicable.

Production Readiness Review Questionnaire

This entire section is N/A because this does not exist in a production cluster component.

Feature Enablement and Rollback

Not applicable. Does not exist in a cluster component.

How can this feature be enabled / disabled in a live cluster?

Not applicable. Does not exist in a cluster component.

Does enabling the feature change any default behavior?

Not applicable. Does not exist in a cluster component.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Not applicable. Does not exist in a cluster component.

What happens if we reenable the feature if it was previously rolled back?

Not applicable. Does not exist in a cluster component.

Are there any tests for feature enablement/disablement?

Not applicable. Does not exist in a cluster component.

Rollout, Upgrade and Rollback Planning

Not applicable. Does not exist in a cluster component.

How can a rollout or rollback fail? Can it impact already running workloads?

Not applicable. Does not exist in a cluster component.

What specific metrics should inform a rollback?

Not applicable. Does not exist in a cluster component.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not applicable. Does not exist in a cluster component.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Not applicable. Does not exist in a cluster component.

Monitoring Requirements

Not applicable. Does not exist in a cluster component.

How can an operator determine if the feature is in use by workloads?

Not applicable. Does not exist in a cluster component.

How can someone using this feature know that it is working for their instance?

Not applicable. Does not exist in a cluster component.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Not applicable. Does not exist in a cluster component.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Not applicable. Does not exist in a cluster component.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Not applicable. Does not exist in a cluster component.

Dependencies

None.

Does this feature depend on any specific services running in the cluster?

Not applicable. Does not exist in a cluster component.

Scalability

Will enabling / using this feature result in any new API calls?

Not applicable. Does not exist in a cluster component.

Will enabling / using this feature result in introducing new API types?

Not applicable. Does not exist in a cluster component.

Will enabling / using this feature result in any new calls to the cloud provider?

Not applicable. Does not exist in a cluster component.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Not applicable. Does not exist in a cluster component.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Not applicable. Does not exist in a cluster component.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Not applicable. Does not exist in a cluster component.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Not applicable. Does not exist in a cluster component.

Troubleshooting

Not applicable. Does not exist in a cluster component.

How does this feature react if the API server and/or etcd is unavailable?

Not applicable. Does not exist in a cluster component.

What are other known failure modes?

Not applicable. Does not exist in a cluster component.

What steps should be taken if SLOs are not being met to determine the problem?

Not applicable. Does not exist in a cluster component.

Implementation History

Drawbacks

It introduces additional complexity and a risk for flaking tests. I believe we can sufficiently mitigate these.

Alternatives

Implement outside of test framework

We could implement checks in some specific CI jobs, outside of the e2e test framework. This would give us less coverage, and be less maintainable.

Collect data, and scan for it externally

We could make sure CI jobs are exporting relevant metrics and other invariant data to persisted CI results, and then scan those in some external process.

This was discussed and considered, but deemed redundant to our existing CI result pipelines and dashboards, which already need more active maintainers (testgrid, triage, prow, gcsweb …). It is also difficult to do efficiently, and we’ve already tackled that problem with collecting JUnit results from the test runs.

Infrastructure Needed (Optional)