KEP-3041: NodeConformance and NodeFeature labels cleanup

Implementation History
BETA Implementable
Created 2021-11-08
Latest v1.26
Ownership
Owning SIG
SIG Testing
Participating SIGs
Primary Authors

KEP-3041: NodeConformance, NodeFeature, and Feature Gate labels cleanup

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This document started as an analysis of whether the NodeFeature label should be renamed to simply Feature, since the two carry identical semantics (kubernetes/kubernetes#94289). While looking into it, the scope had to be extended to other labels currently in use; see also this discussion.

The document proposes changes to the description of the Feature label, and introduces additional labels that make it clearer for contributors how to apply them.

Motivation

Tests have various qualifiers: a test may exercise a feature that is still in development and depends on a feature gate, it may require a special environment, or it may only work with specific hardware. Today a few labels are used to represent combinations of these “dimensions”; in most cases a single label, [Feature:], is used.

The universal nature of a single [Feature:] label makes it hard to apply the label consistently and to query tests.

There are a few specific problems we saw recently:

  • There is no way to disable tests for a specific feature gate (example where it’s needed: https://github.com/kubernetes/kubernetes/issues/99854). Such a test covers a “feature” guarded by a feature gate, but it is not marked as a “feature”.
  • There are no tests validating that k8s works with ALL beta feature gates disabled. Creating such a test would be hard with today’s test labeling.
  • On some tests the use of the [Feature] and [NodeFeature] labels has degraded over time (for example: https://github.com/kubernetes/kubernetes/pull/105921).
  • The labels we apply depend heavily on the environment the tests run on. For example, the test in https://github.com/kubernetes/kubernetes/pull/104803 is not marked as Feature even though it depends on the environment: if test-handler is not installed, the test will not succeed. The AppArmor tests are in a similar situation.

Goals

Split the special environment and feature development labels:

  • Clarify the semantics of the [Feature:] label.
  • Introduce the [FeatureGate:] label, accompanied by a stage label ([Alpha], [Beta], etc.), and use it to replace tags like [NodeAlphaFeature].

Clean up labels that indicate whether a special environment is needed:

  • Eliminate the [NodeFeature:] label by renaming it to [Feature:] when applicable.
  • Document the meaning of the [NodeConformance] label.
  • Remove the [NodeSpecialFeature:] label.
  • Introduce the [Environment:Foo] label to indicate that a test needs a special environment, i.e. one that differs from the standard OSS test infrastructure machines.

Non-Goals

  • [NodeConformance] may be a confusing term because it can be mistaken for the k8s Conformance tests, but renaming it is not in the scope of this KEP.
  • Converting textual tags into ginkgo v2 tags. This is optional and can be handled outside of this KEP.

Proposal

Clarify the semantics of the [Feature:] label

The current definition of the Feature label is:

  • [Feature:.+]: If a test has non-default requirements to run or targets some non-core functionality, and thus should not be run as part of the standard suite, it receives a [Feature:.+] label, e.g. [Feature:Performance] or [Feature:Ingress]. [Feature:.+] tests are not run in our core suites, instead running in custom suites. If a feature is experimental or alpha and is not enabled by default due to being incomplete or potentially subject to breaking changes, it does not block PR merges, and thus should run in some separate test suites owned by the feature owner(s) (see Continuous Integration below).

Feature will indicate that things may NOT work on all k8s distros and/or with all node capabilities. This label will not indicate the Feature Gate state.

Proposed change:

  • [Feature:.+]: If a test validates functionality that may only work outside the minimal conformant installation of Kubernetes (e.g. when specific node capabilities are enabled, when certain addons such as a load balancer integration are available, or when the test depends on functionality of underlying components like the container runtime) and thus may need to be skipped in certain environments, it receives a [Feature:.+] label, e.g. [Feature:AppArmor]. This label can also be applied to signify that a test validates non-core functionality, like [Feature:Ingress]. [Feature:.+] tests are not run in the core suites, but they can typically run together in the standard environment, as many capabilities and much of the functionality of underlying components are pre-enabled there. Use [Environment:Foo] to signify the specific environment configuration needed to run the test, for example when a test requires a GPU to be configured but exercises a non-optional feature like a device plugin. The [Feature:.+] label should not be confused with the [FeatureGate:.+] label.

Two more labels are introduced:

  • [FeatureGate:.+]: If a test only works when a certain feature gate is enabled, it receives a [FeatureGate:.+] label. [FeatureGate:.+] tests must also be marked with the status of this feature gate: [Alpha], [Beta], [Stable], or [Deprecated]. This label helps to skip tests that are not expected to work on a specific k8s distribution that has a certain feature gate disabled. The label has to be removed when the feature gate value is “locked” or the feature gate is removed.
  • [Environment:.+]: If a test requires a non-standard environment (different from the standard OSS test machines) to run, it receives the [Environment:.+] tag. Typically only tests with matching [Environment:.+] tags can run together. Examples include environments where a GPU needs to be provisioned, where memory swap is enabled, or where high-memory or high-CPU machines are needed for a Performance environment.
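The rule that a [FeatureGate:.+] test must also carry a stage label could be checked mechanically. The following is a minimal sketch, not part of the proposal: the helper names, the sample test names, and the reading "exactly one stage label per gated test" are all assumptions made for illustration.

```python
import re

# Stage labels listed in the proposal for [FeatureGate:.+] tests.
STAGES = {"Alpha", "Beta", "Stable", "Deprecated"}

# Matches one bracketed label such as [FeatureGate:Foo] or [Beta].
LABEL_RE = re.compile(r"\[([^\[\]]+)\]")


def parse_labels(test_name):
    """Return the bracketed labels found in a test name, in order."""
    return LABEL_RE.findall(test_name)


def check_feature_gate_labels(test_name):
    """Return a list of labelling problems (empty when none are found).

    Assumption: a gated test carries exactly one stage label.
    """
    labels = parse_labels(test_name)
    gates = [label for label in labels if label.startswith("FeatureGate:")]
    stages = [label for label in labels if label in STAGES]
    problems = []
    if gates and len(stages) != 1:
        problems.append("FeatureGate label needs exactly one stage label")
    return problems


# Hypothetical test names, used only to exercise the helper.
print(check_feature_gate_labels("[sig-node] Foo [FeatureGate:Foo][Alpha] works"))
print(check_feature_gate_labels("[sig-node] Foo [FeatureGate:Foo] works"))
```

A check like this could run as a lint step over test names, so that a missing stage label is caught before a job definition silently selects the wrong set of tests.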

Feature label clean up

It is already true today that only a handful of tests marked as Feature do not also have the Serial or Disruptive label. Of all the tests that have the Feature label without Serial or Disruptive, most have simply degraded and no longer need the Feature label. Going forward, features that require a special environment beyond the predefined LinuxOnly, Serial, etc. will need to define their own “custom” labels, each commented with a description of the special environment it needs.

Clean up labels that indicate whether a special environment is needed

NodeSpecialFeature

The label NodeSpecialFeature: was introduced in this document, but was never used consistently. Today there are only a couple of uses of it, which can be cleaned up in favor of the Feature: tag when applicable.

NodeFeature

The NodeFeature label was introduced to indicate that a feature may not work the same way on different container runtimes or environments. It is NOT used today as a direct analog of the Feature label, as the Feature label indicates that a special environment configuration is needed to run tests in the “standard” CI. See the previous section and the definition here.

The way these labels are used in tests today reflects their different meanings. In test configurations, the wildcard [NodeFeature:* is always used in focus, and individual NodeFeatures are used in skip. The opposite holds for the Feature label: the wildcard [Feature:* is always used in skip, while individual Feature labels are listed in focus.

The reason for this difference is that the runtimes being tested today support all the NodeFeatures, so there is no need to list all NodeFeatures individually in test definitions. If support for NodeFeatures becomes more fragmented across the runtimes tested in CI, the two labels will have exactly the same semantics.

Also, both labels, NodeFeature and Feature, degraded over time: they were not always applied according to these semantics.

This KEP proposes to adjust the definition of the Feature: label so that the two definitions become compatible. The proposal is to unify NodeFeature and Feature and to start relying on alternative labels for filtering in environments.
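The opposite use of wildcards in focus and skip can be illustrated with a small simulation. This is a sketch only: it assumes the usual Ginkgo-style rule that a test runs when its name matches the focus expression and does not match the skip expression, and the test names and label values (Foo, Bar, Baz, Qux) are hypothetical.

```python
import re


def selected(test_name, focus=None, skip=None):
    """Run a test if it matches focus (when given) and does not match skip."""
    if focus is not None and not re.search(focus, test_name):
        return False
    if skip is not None and re.search(skip, test_name):
        return False
    return True


# Hypothetical test names.
tests = [
    "Pods should start [NodeConformance]",
    "Probes should restart [NodeFeature:Foo]",
    "Probes should do bar [NodeFeature:Bar]",
    "Devices should schedule [Feature:Baz]",
    "Traffic should route [Feature:Qux]",
]

# NodeFeature today: wildcard in focus, individual labels in skip.
node_job = [t for t in tests
            if selected(t, focus=r"\[NodeFeature:.+\]", skip=r"\[NodeFeature:Bar\]")]

# Feature today: wildcard in skip, individual labels in focus.
core_job = [t for t in tests if selected(t, skip=r"\[Feature:.+\]")]
feature_job = [t for t in tests if selected(t, focus=r"\[Feature:Baz\]")]

print(node_job)
print(core_job)
print(feature_job)
```

Both patterns use the same mechanism; they differ only in which side lists labels individually, which is why the two labels would converge if runtime support for NodeFeatures became fragmented.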

  • Rename NodeFeature to Feature. https://testgrid.k8s.io/sig-node-containerd#node-e2e-features will execute tests for all Features, excluding everything that needs a special environment (see the labels above).
  • Document the meaning of NodeConformance. NodeConformance will indicate ALL NODE tests that test functionality that is enabled out of the box and is not specific to a runtime or k8s distro. These tests may still require additional environment setup. NodeConformance tests may carry any of the labels specifying their environment requirements, like Slow, Disruptive, Special, etc.
  • Introduce a new periodic job that runs all NodeConformance tests with all FeatureGates turned off.

NodeConformance

The NodeConformance label represents tests that have to work in all environments and on all k8s distros. The name is similar to the [Conformance] cluster tests that are part of the Kubernetes Conformance Tests, which serve as the base for the Kubernetes certification program. The similarity in the names may be confusing, but this confusion will not be addressed in this KEP.

One difference proposed in this document is to allow NodeConformance to be applied to [Beta] features. Since Beta features are enabled out of the box, and in most deployments all Beta features stay enabled, applying the Feature label to these tests may be misleading. This difference is intentional: we want to make sure that Beta features, being enabled out of the box, are tested on PR validation and across different environments. This will give a better signal for graduating the feature to GA.

Proposed definition of NodeConformance:

  • [NodeConformance]: Node-level tests that validate behavior that doesn’t depend on specific Node capabilities being present, on hardware, or on the feature set of a dependency (like a container runtime) must be labeled as [NodeConformance]. For ease of test querying, each node-level test that is not testing an alpha feature (marked as [FeatureGate:Foo][Alpha]) is supposed to be either NodeConformance or Feature.
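The querying rule in this definition could be enforced with a small check. The sketch below is illustrative only: the function name is hypothetical, and it reads "either NodeConformance or Feature" as "exactly one of the two", which is an assumption rather than something the definition states explicitly.

```python
import re

# Matches one bracketed label such as [NodeConformance] or [Feature:Foo].
LABEL_RE = re.compile(r"\[([^\[\]]+)\]")


def needs_classification(test_name):
    """Return True when a node-level test violates the either/or rule.

    Alpha tests (marked [FeatureGate:Foo][Alpha]) are exempt. Every other
    test is expected to carry NodeConformance or a Feature: label, and,
    by assumption, not both.
    """
    labels = LABEL_RE.findall(test_name)
    gated = any(label.startswith("FeatureGate:") for label in labels)
    if gated and "Alpha" in labels:
        return False
    has_conformance = "NodeConformance" in labels
    # "FeatureGate:Foo" does not start with "Feature:", so it is not counted.
    has_feature = any(label.startswith("Feature:") for label in labels)
    return has_conformance == has_feature


# Hypothetical test names.
print(needs_classification("Pods should start [NodeConformance]"))
print(needs_classification("Swap should work [FeatureGate:Foo][Beta]"))
```

The second call reports a Beta test that has not yet been classified as NodeConformance or Feature, which is the situation the definition is meant to prevent.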

See also

User Stories (Optional)

Story 1

  1. A new feature gate is introduced as Alpha.
  2. Tests for this functionality are marked as [FeatureGate:Foo][Alpha].
  3. A new test grid tab is added to run these tests with the feature gate explicitly enabled.
  4. The feature gate is promoted to Beta and enabled by default.
  5. Tests for this functionality are marked as [FeatureGate:Foo][Beta].
  6. Depending on whether a test targets all environments or specific ones, the NodeConformance or Feature: label is added. Note that the Feature: label may (but does not have to) match the FeatureGate: name.
  7. Test infra runs all these tests as part of the Features or NodeConformance runs to ensure that a default installation of k8s has all the features working.
  8. Test infra runs all tests with every feature gate disabled, catching potential dependencies of GA features on the new functionality.
  9. The feature gate is promoted to GA. The feature gate is locked to its value and cannot be disabled.
  10. Tests for this functionality drop the labels [FeatureGate:Foo][Beta], while keeping either the NodeConformance or the Feature: label.
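The label changes across this lifecycle can be summarised in a few lines. This is a sketch under stated assumptions: the function name and the "GA" stage string are hypothetical, and `base_label` stands for whichever of NodeConformance or Feature:X was chosen in step 6.

```python
def labels_for_stage(gate, stage, base_label):
    """Return the labels a test carries at each stage of Story 1.

    gate: feature gate name, e.g. "Foo"
    stage: "Alpha", "Beta" or "GA"
    base_label: "NodeConformance" or a "Feature:X" label chosen at Beta
    """
    if stage == "Alpha":
        # Steps 1-3: gated, run only in a dedicated job.
        return [f"FeatureGate:{gate}", "Alpha"]
    if stage == "Beta":
        # Steps 4-6: still gated, now also classified.
        return [f"FeatureGate:{gate}", "Beta", base_label]
    if stage == "GA":
        # Steps 9-10: gate is locked, gate and stage labels are dropped.
        return [base_label]
    raise ValueError(f"unknown stage: {stage}")


for stage in ("Alpha", "Beta", "GA"):
    print(stage, labels_for_stage("Foo", stage, "NodeConformance"))
```

Only the gate-related labels change over time; the classification chosen at Beta is what remains once the gate is locked.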

Story 2

Decision tree for the test labels:

  • Does the test only work on Linux? Apply [LinuxOnly].
  • Does the test validate functionality that is controlled by a Feature Gate? Apply [FeatureGate:Foo][Alpha|Beta|Deprecated].
  • Does the test validate a Core API that is enabled by default, GA, and works in any environment? Apply [Conformance]. See more at https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md
  • Does the test validate node-specific functionality that is enabled by default and works on “any” node? Apply [NodeConformance].
  • Does the test only work when the underlying container runtime has a specific feature enabled or a specific node configuration is set? Apply [Feature:Foo] to describe this feature or configuration.
  • Can the test run on the “default” test infra node? If not, apply [Environment:Foo] to describe the specific environment that needs to be pre-configured.
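The questions above can be read as independent checks that each contribute a label. The following sketch encodes them that way; the `Traits` fields and the `suggest_labels` helper are hypothetical names introduced for illustration, and treating the questions as cumulative rather than mutually exclusive is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Traits:
    """Hypothetical answers to the decision-tree questions for one test."""
    linux_only: bool = False
    feature_gate: Optional[str] = None   # gate name, if the test is gated
    gate_stage: Optional[str] = None     # "Alpha", "Beta" or "Deprecated"
    core_api_conformance: bool = False   # core API, default-on, GA, any environment
    node_default_any_node: bool = False  # node functionality, default-on, any node
    feature: Optional[str] = None        # runtime feature or node configuration needed
    environment: Optional[str] = None    # special environment, if not the default node


def suggest_labels(t: Traits) -> List[str]:
    """Walk the questions in order and collect the matching labels."""
    labels = []
    if t.linux_only:
        labels.append("[LinuxOnly]")
    if t.feature_gate:
        if t.gate_stage not in ("Alpha", "Beta", "Deprecated"):
            raise ValueError("a gated test needs a stage label")
        labels += [f"[FeatureGate:{t.feature_gate}]", f"[{t.gate_stage}]"]
    if t.core_api_conformance:
        labels.append("[Conformance]")
    if t.node_default_any_node:
        labels.append("[NodeConformance]")
    if t.feature:
        labels.append(f"[Feature:{t.feature}]")
    if t.environment:
        labels.append(f"[Environment:{t.environment}]")
    return labels


print(suggest_labels(Traits(linux_only=True, feature_gate="Foo",
                            gate_stage="Beta", node_default_any_node=True)))
```

Keeping each question as a separate field makes it easy to see which answer produced which label when a test ends up in an unexpected job.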

Notes

PRR and test plan sections are not applicable to this KEP.

See https://kubernetes.slack.com/archives/CPNHUMN74/p1665177817264029

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Existing test definitions

Existing test definitions may be affected, as they may start running a different set of tests. Skew tests in particular may be affected as the labels are modified.

Mitigation:

We will take an iterative approach.

  • Add the new labels. FeatureGate and Environment bring additive value and should not affect existing tests.
  • Relabel Feature tests that should be NodeConformance or Conformance. This is typically an easy transition, as NodeConformance and Conformance tests are run more often than Feature tests, so it is unlikely to result in fewer tests being run overall.
  • We don’t expect many NodeConformance and Conformance tests to be reverted to Feature, so this should not be an issue.

Design Details

Test Plan

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name:
    • Components depending on the feature gate:
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

Infrastructure Needed (Optional)