KEP-2420: Reducing Kubernetes Build Maintenance

Implementation History
STABLE Implemented
Created 2021-02-03
Updated 2022-02-03
Latest v1.23
Milestones
Alpha v1.21
Beta v1.21
Stable v1.23
Ownership
Owning SIG
SIG Testing
Participating SIGs
Primary Authors

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
    • N/A, this is not user facing
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubernetes currently maintains multiple build systems, an ongoing burden and a source of contributor friction and confusion. Much has changed since Bazel was first introduced as an additional build system; upon re-evaluating the project, it is clear that we should consolidate on a single build system. More details on why can be found in Motivation and Drawbacks.

Motivation

Goals

  • Remove the toil of maintaining multiple build systems for the Kubernetes repo maintainers
  • Eliminate the friction of generating BUILD files for Kubernetes contributors who use Go natively
  • Simplify Golang upgrades for SIG Release
  • Remove qualification gaps caused by duplicate-but-slightly-different binary builds
  • Remove testing gaps caused by duplicate-but-slightly-different test-invocation methods
  • Empower broader community maintenance of tests by converging on Golang testing standards

Non-Goals

  • Support importing github.com/kubernetes/kubernetes as a library
    • This has never been supported, and is orthogonal to the decision of maintaining duplicate build systems
    • Code under kubernetes/kubernetes/staging/src/k8s.io/foo is intended to be imported as k8s.io/foo from the staged copies at github.com/kubernetes/foo; anything else is internal-only and not supported for import, via Bazel or otherwise.
  • Removing Bazel from sub-projects other than the core repo
    • Bazel comes with tradeoffs, and each subproject can make its own decision in this regard. This KEP only covers the main Kubernetes repository, and nothing else.
  • Improving the previously existing make build
    • We should strongly consider improving the implementation and behavior of this build in the future, but this is largely orthogonal to whether we should consider maintaining two build systems. An anticipated outcome of this KEP is increased bandwidth available to improve our single build system.

Proposal

For the development (main/master) branch only, NOT the existing release branches:

  1. Switch remaining CI usage (mostly a few presubmits) to use the make build.
    • Most of CI already uses the make builds, excluding some presubmits; we will need to switch these (generally a flag flip in the CI configuration).
    • Most of periodic testing consumes pre-uploaded binaries from the make builds, and does not build at all. These will require no changes.
    • In areas where the make build generates fewer artifacts or exercises fewer paths than bazel, we will err on the side of parity with artifacts that end up in a kubernetes/kubernetes release.
  2. Remove the bazel build and associated tooling.
    • There are multiple scripts and LOTS of files related wholly to the bazel build in Kubernetes. Once we are confident that CI is no longer reliant on them, we can remove these and relieve the maintenance toil.

No changes should be made to the release branches or their CI.

Risks and Mitigations

Design Details

Test Plan

Non-blocking “make build” equivalents to kubernetes/kubernetes “bazel” CI jobs will be introduced (if they don’t already exist). When the new jobs provide equivalent signal, they will be moved to blocking, and the old jobs will be retired.

This is relevant for at least the following release-blocking and merge-blocking jobs:

  • pull-kubernetes-bazel-test (this can be converted to roughly make test)
  • pull-kubernetes-bazel-build (apart from testing roughly bazel build //..., this largely overlaps with other presubmits and can likely be removed)
  • periodic-bazel-build-<branch> (this can likely already be removed in favor of ci-kubernetes-build-<branch>)
  • periodic-bazel-test-<branch>
  • post-kubernetes-bazel-build (this can likely already be removed; it’s unclear what depends on this job)

Again, this should not apply to existing release branches.

Graduation Criteria

This will be declared stable/GA when:

  • All kubernetes/kubernetes master branch CI jobs use the preexisting make build system
  • Bazel-related source files and related tooling are removed from the kubernetes/kubernetes repository on currently supported release branches and the current development branch
    • This will only happen for release branches as we phase out support for older releases, rotating in new supported releases that never contained the Bazel build
  • Bazel-related configuration/presets are removed from kubernetes/kubernetes CI jobs in kubernetes/test-infra

As bazel-built artifacts are not built or distributed as part of a kubernetes/kubernetes release, there is no deprecation window required.

Upgrade / Downgrade Strategy

n/a. Not relevant to upgrades. Existing release builds and upgrade CI use make.

Version Skew Strategy

n/a.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    N/A

  • Does enabling the feature change any default behavior?

    N/A

  • Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

    N/A

  • What happens if we reenable the feature if it was previously rolled back?

    N/A

  • Are there any tests for feature enablement/disablement?

    N/A

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads?

    N/A

  • What specific metrics should inform a rollback?

    N/A

  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

    N/A

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

    N/A

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads?

    N/A

  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

    N/A

  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

    N/A

  • Are there any missing metrics that would be useful to have to improve observability of this feature?

    N/A

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster?

    N/A

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls?

    N/A

  • Will enabling / using this feature result in introducing new API types?

    N/A

  • Will enabling / using this feature result in any new calls to the cloud provider?

    N/A

  • Will enabling / using this feature result in increasing size or count of the existing API objects?

    N/A

  • Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?

    N/A

  • Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

    N/A

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable?

    N/A

  • What are other known failure modes?

    N/A

  • What steps should be taken if SLOs are not being met to determine the problem?

    N/A

Implementation History

  • 2021-02-04 - Initial KEP draft / provisional #2421
  • 2021-02-08 - KEP Implementable #2469
  • 2021-04-01 - KEP Alpha, Beta in Kubernetes 1.21
    • There is no distinct alpha/beta for this KEP, only alpha/beta (implemented at HEAD) vs stable (all supported branches)
  • 2021-08-16 - Retroactive implemented declaration

See also PR listing: https://github.com/kubernetes/enhancements/issues/2420#issuecomment-791024902

Drawbacks

  • The make system works best from x86 (though it can cross compile all platforms), largely due to a [bug in the kube-cross image build] and some other small oversights. Improving the existing make system is an explicit non-goal of this KEP, but we support addressing these oversights, and expect the outcome of this KEP will naturally increase available bandwidth to do so.

    • A few other issues related to CGO / $CC on non-amd64 build hosts have already received pull requests / fixes.
    • Remaining issues largely appear trivial, and should be fixed regardless, especially with the proliferation of non-amd64 developer hardware, where contributors will want to build releases matching the official release (with make).
  • Some contributors may be used to the Bazel CLI, but indications are that this is not true for the majority of contributors. Existing discussions on removing Bazel from the Kubernetes project have received overwhelming support; most (but certainly not all) of the few suggestions against it do not appear to come from community members / active upstream contributors. See: [kubernetes/kubernetes#88533]

  • In Kubernetes’ CI we have bazel remote caching enabled, which theoretically reduces our resource consumption and improves build times. In practice this gain is currently reduced because we run unit tests multiple times to eliminate flakiness. Current measurements show equivalent runtime for the two builds.

    • go build has developed high-quality caching that we can leverage if needed; we have a prototype of this already
    • Kubernetes has enabled repeated runs of “unit” tests to surface flakes, which causes both go and Bazel to not cache test results
    • We’ve disabled caching for large, poorly cacheable objects like Docker images anyway

Alternatives

  1. Continue maintaining both build systems

This has major drawbacks:

  • This continues to eat developer time without significant return: for the most part, releases, the bulk of CI (periodic tests), and contributor development use the make build(s).

  • Key components of our development setup do not easily port to bazel, so it will continue to be an “also”:

    • code generators: Kubernetes uses a lot of generated code. Bazel can run code generators just fine without checking in the generated sources, but we need to check them in for consumption by non-bazel users (i.e. external projects), and as a result we have never leveraged this. These generators are also largely incompatible with the bazel “philosophy”:
      • much of the k8s codegen uses fake go build tags, which would require an external process (i.e. gazelle extension) to turn into build files
      • additionally, some codegen relies on certain dependencies, which the gazelle extension would need to figure out
      • many of the code generators are optimized to load the entire Go tree, then generate lots of code at once, which is somewhat incompatible with Bazel’s approach to working on a package-by-package level. Generating code package-by-package is much slower (due to having to reparse the tree each time).
    • separate “hack/tools” go module for linters etc: not only do linters not work well (because any source change busts the cache anyhow), but rules_go does not handle multiple go modules well. Multiple go modules allow us to isolate development dependencies (like linters) from release binary dependencies, easing dependency management
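To make the point about “fake go build tags” concrete, the following is a minimal sketch of how a tool might discover comment tags of the form // +key=value in a Go source file. It is illustrative only and is not the Kubernetes code generators’ actual implementation: the function name extractTags, the file name doc.go, and the example package are hypothetical, and the sketch only assumes the general +k8s:...=... comment convention.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strings"
)

// exampleSource is a hypothetical doc.go file carrying two comment tags
// in the "+key=value" style used to steer code generation.
const exampleSource = `// +k8s:deepcopy-gen=package
// +k8s:openapi-gen=true

// Package example is a stand-in for an API package.
package example
`

// extractTags parses src as a Go file and returns every "+key=value"
// comment tag it finds, keyed by the text before the "=" sign.
// This is an illustrative helper, not part of any Kubernetes tooling.
func extractTags(src string) (map[string]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "doc.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	tags := map[string]string{}
	for _, group := range file.Comments {
		for _, c := range group.List {
			// Strip the comment marker, then keep only lines starting with "+".
			line := strings.TrimSpace(strings.TrimPrefix(c.Text, "//"))
			if !strings.HasPrefix(line, "+") {
				continue
			}
			key, value, _ := strings.Cut(strings.TrimPrefix(line, "+"), "=")
			tags[key] = value
		}
	}
	return tags, nil
}

func main() {
	tags, err := extractTags(exampleSource)
	if err != nil {
		panic(err)
	}
	fmt.Println(tags["k8s:deepcopy-gen"], tags["k8s:openapi-gen"])
}
```

Because these tags live in comments rather than in the Go import graph, a Bazel setup would need an external step (such as the gazelle extension mentioned above) to perform a scan like this and translate the result into build files.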
  2. Improve bazel integration and drop the make based build

In addition to the points made under alternative 1 above as to why this is not particularly viable for some of our existing development patterns:

  • We would need to improve support for CGO or eliminate CGO in Kubernetes, both of which would be somewhat expensive to develop
    • At minimum kubelet definitely requires CGO for OS integration (e.g. selinux)
    • CGO pkg-config directives do not work in rules_go, requiring brittle work-arounds
    • Cross compiling with CGO under bazel is tricky and, despite @ixdy getting close at one point, it never shipped in Kubernetes, let alone portably or in a form capable of shipping a full release.
      • @ixdy suspects this would largely have to be reimplemented now
  • This is less likely to grow the amount of potential contributor bandwidth available to improve the kubernetes/kubernetes build system. We have a larger pool of contributors today who have demonstrated experience updating our existing make build system vs. updating bazel’s components and our usage of it. Traction with our contributor base is important to ongoing project health.
  • Large portions of Kubernetes are intended for dual use internally and as exported go libraries (“staging”) to be consumed by other projects, where we’d still need to support use without bazel (i.e. checked-in generated code etc.).

Infrastructure Needed (Optional)

No additional infrastructure is necessary. The existing infrastructure largely uses and hosts make-based builds as-is.

We may in fact consider turning down some of the caching infrastructure if no remaining projects are using it. At least one subproject does use bazel with a different caching deployment than Kubernetes, but it is not apparent that any other subprojects use the Kubernetes build cache implementation (greenhouse).

[kubernetes/kubernetes#88533] https://github.com/kubernetes/kubernetes/issues/88553