KEP-5241: Beta Feature Gate Promotion Requirements
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Feature gates must meet all functional, security, monitoring, and testing requirements, and all identified issues and gaps must be resolved, before being enabled by default. The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.
Motivation
Feature gates that are enabled by default are enabled in every production Kubernetes cluster in the world. We must avoid turning every production cluster into a testing ground for unstable or incomplete features. Even feature gates that only make flags accessible, and require a secondary configuration to use, must be stable, because it is unrealistic to expect everyone to understand the graduation stage of every flag in each release: the only stages that really matter are “takes enabling an explicit alpha feature gate” and “my production cluster accepts this as valid by default”.
Goals
- Feature gates must meet all functional, security, monitoring, and testing requirements, and all identified issues and gaps must be resolved, before being enabled by default.
- The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”.
Non-Goals
- Changing the rule that beta APIs are off by default.
- Changing the imperfect mechanisms we have for API evolution.
Proposal
Kubernetes feature gates have four levels: GA (locked-on), GA (disable-able), Beta, and Alpha.
- GA (locked-on) means that a feature gate is unconditionally enabled in all production Kubernetes clusters and that the feature cannot be disabled.
- GA (disable-able) is only for feature gates that include a new API serialization that cannot be enabled by default until the API reaches stable. This means that the first time the API is enabled in production, the feature will be GA, but it can also be disabled. This is a less common state and does not apply to most features.
- Beta means that a feature gate is usually enabled in all production Kubernetes clusters by default and that the feature can be disabled. Exceptions exist for entirely new APIs and some node features, but this is broadly the case.
- Alpha means that a feature gate is disabled in all production Kubernetes clusters by default and can be optionally enabled by setting a --feature-gates command line argument.
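The four levels above can be sketched as a small Go model. This is only an illustration under stated assumptions: the Stage, Gate, Resolve, and ErrLocked names and the OffByDefaultException field are hypothetical and are not the Kubernetes feature gate API. The sketch shows which levels are on by default and which accept an operator's request to disable them.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is a hypothetical enum mirroring the four levels described above.
type Stage int

const (
	Alpha Stage = iota
	Beta
	GADisableable
	GALockedOn
)

// Gate is an illustrative model, not the Kubernetes featuregate API.
type Gate struct {
	Name  string
	Stage Stage
	// OffByDefaultException marks the Beta exceptions mentioned above
	// (entirely new APIs and some node features).
	OffByDefaultException bool
}

// ErrLocked is returned when an operator tries to disable a locked-on gate.
var ErrLocked = errors.New("feature gate is locked on and cannot be disabled")

// DefaultEnabled reports whether the gate is on without any operator action.
func (g Gate) DefaultEnabled() bool {
	switch g.Stage {
	case Alpha:
		return false
	case Beta:
		return !g.OffByDefaultException
	default: // both GA levels are on by default
		return true
	}
}

// Resolve applies an optional operator override; nil means "no override".
func (g Gate) Resolve(override *bool) (bool, error) {
	if override == nil {
		return g.DefaultEnabled(), nil
	}
	if g.Stage == GALockedOn && !*override {
		return true, ErrLocked
	}
	return *override, nil
}

func main() {
	off := false
	gates := []Gate{
		{Name: "ExampleAlpha", Stage: Alpha},
		{Name: "ExampleBeta", Stage: Beta},
		{Name: "ExampleGADisableable", Stage: GADisableable},
		{Name: "ExampleGALocked", Stage: GALockedOn},
	}
	for _, g := range gates {
		enabled, err := g.Resolve(&off)
		fmt.Printf("%s: default=%t afterDisable=%t err=%v\n",
			g.Name, g.DefaultEnabled(), enabled, err)
	}
}
```

In this model only the locked-on level rejects a disable request, which is why the first default-on release of a gate should be at a level where Resolve still honors the override.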
Making the jump to GA (cannot be disabled) without actual field experience is irresponsible. The first time a feature gate is enabled by default in production Kubernetes clusters, we must have a way to disable the feature in case of unexpected stability, performance, or security issues.
Enabling incomplete features in production Kubernetes clusters by default is irresponsible. Features that are known to be incomplete naturally bring with them additional stability, performance, and security issues. Once a feature has been enabled in a production Kubernetes cluster by default, adding to it carries greater risk to upgrading clusters and the ecosystem. The feature can easily have become relied upon by workloads and other platform extensions. If adding those capabilities introduces a stability, performance, or security problem, the cost of disabling the feature in a cluster becomes significantly greater, and disabling it breaks existing clusters, workloads, and use cases. This posture makes upgrades higher risk than necessary.
To balance these concerns, we are changing how we evaluate Beta and GA stability criteria. The only valid GA criteria are “all issues and gaps identified as feedback during beta are resolved”. Promotion from Beta to GA must have no significant change for the release. This means that Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified prior to beta.
Phasing in larger features over time can be done by bringing separate feature gates through alpha, beta, and GA. Each feature gate needs to meet the beta and GA criteria for completeness, functionality, security, monitoring, and testing. After meeting the criteria for being enabled by default, and at the SIG’s discretion, the new feature gate could be set to enabled by default in the release in which it is introduced. Importantly, the features need to behave in a way that allows old and new clients to interoperate, and new additions to larger features need to be independently disablable, each with its own path to GA.
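As a minimal sketch of phasing with separate gates, the Go snippet below selects behavior from two independent gates. The gate names ExampleFeature and ExampleFeatureAddition, the Gates type, and the Behavior function are hypothetical and exist only for illustration. It assumes the addition builds on the base feature, so the addition has no effect when the base gate is off.

```go
package main

import "fmt"

// Gates maps hypothetical feature gate names to their effective state.
type Gates map[string]bool

const (
	// BaseGate guards the original feature (hypothetical name).
	BaseGate = "ExampleFeature"
	// AdditionGate guards a later capability with its own alpha, beta,
	// and GA path (hypothetical name).
	AdditionGate = "ExampleFeatureAddition"
)

// Behavior selects a code path from the two gates. Turning off only the
// addition falls back to the base behavior from the previous release.
func Behavior(g Gates) string {
	switch {
	case g[BaseGate] && g[AdditionGate]:
		return "base+addition"
	case g[BaseGate]:
		return "base"
	default:
		return "legacy"
	}
}

func main() {
	// Upgraded cluster with both gates on by default.
	fmt.Println(Behavior(Gates{BaseGate: true, AdditionGate: true}))
	// Admin disables only the new capability after a regression.
	fmt.Println(Behavior(Gates{BaseGate: true, AdditionGate: false}))
}
```

Keeping the addition behind its own gate means an administrator can return to the earlier behavior without disabling the base feature that workloads may already rely on.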
Risks and Mitigations
What if I need to add capability to my feature?
To handle this situation, we described above how to add a second feature gate for the new behavior. This provides a mechanism for adding needed capability, while ensuring that cluster admins never end up stuck after an upgrade because they rely on v1.Y-1 behavior that new capability in v1.Y broke under the same feature gate.
Who will make sure that new KEPs follow the promotion rules?
We’ll adjust the KEP template to indicate the allowed criteria, so authors should notice. SIG approvers should enforce those standards. PRR approvers can be a final backstop.
Graduation Criteria
Once merged, this document is our position until it is superseded by another position statement.
Drawbacks
This may slow the rate at which new features are promoted.
For this to be true, we would have to have previously enabled feature gates in production that were known to be incomplete in function, security, monitoring, or testing, or that had known bugs. We hope this was not the common case, but if it was common enough to have an impact, we’re pleased that the result is preventing incomplete feature gates from being enabled in production clusters.
Alternatives
None proposed so far.