KEP-2818: Reducing Build Maintenance in CIP

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Caveats
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Alternatives
- Ko Image Builder
- Kaniko Image Builder

Release Signoff Checklist

Summary

Deprecate Bazel within the k8s-container-image-promoter.

Motivation

In February 2020, the Kubernetes community produced a proposal to remove Bazel-based build infrastructure from kubernetes/kubernetes. The change was justified by fewer dependencies and a simpler build process, and several of the project’s subprojects and repositories now rely primarily on make for builds. To attain the same benefits, CIP aims to reduce its own build maintenance by removing Bazel.

Goals

Prow jobs must remain stable - While replacing existing Bazel infrastructure, it’s important that changes do not affect any Prow jobs. More specifically, the presubmit, postsubmit, and periodic jobs defined for CIP in kubernetes/test-infra are required for CI/CD and are crucial for future development and deployment. Any proposed solution should avoid interfering with these existing operations.

Preserving functionality of e2e tests - When removing Bazel, it’s important to preserve the business logic of the e2e tests. Both tests rely on the predictability of Docker digests when verifying image promotions, so proposed solutions must maintain static golden image digests. This may warrant adopting another technology that mimics Bazel’s behavior, or avoiding image building altogether. Either way, both e2e tests must remain unaffected.

Completely remove Bazel from CIP - The result of this project must remove all references to Bazel. Machines running the CIP source code, such as existing Prow jobs, should therefore no longer require Bazel to be installed for compilation or execution. Since reducing build maintenance is the primary reason for removing Bazel, adding new tools to the build process should be avoided.

Non-Goals

Improve performance - The execution of e2e tests and the building of containers or code do not need to speed up. Perceived or measured performance gains from proposed solutions are not a focus of this project.

Complicate e2e test behavior - The behavior of e2e tests should change as little as possible when removing Bazel. Although deterministic image digests are required, desirable solutions must avoid adding complexity to the setup of e2e tests.

Proposal

Bazel’s removal will leverage existing project dependencies (Docker and Golang) to manage binary and container builds. Since Make is already relied upon for triggering specific behaviors, its targets will be reimplemented with functional replacements for their existing Bazel commands.

For example:

bazel build //cmd/cip

can be replaced with:

go build ./cmd/cip

In addition, Go’s dependency management system (Go modules) is already set up within the project. This removes the need for Gazelle to generate and update Bazel BUILD files.

For containerization, Docker’s CLI is more than capable of scripting the image bundles previously defined in BUILD files. Existing make targets can invoke docker directly to pull, build, or push CIP images. End-to-end tests, which rely on static image definitions, can use local archives committed to source control. When testing begins, docker will load these images from tarballs, removing the need to build images with Bazel altogether.

Caveats

Tarball archives, by nature, conceal the information they contain. Once unpacked, Docker archives reveal multiple JSON files which define the layers, tag information, and versions of the image. Docker can read these files to reliably reproduce the saved image. However, since this information is stored inside an archive, developers cannot see directly what these golden images contain; they will have to untar the archives to inspect them, or run docker load manually.
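
As a rough illustration of the inspection step, the sketch below peeks inside an archive without loading it into Docker. It assumes the archive was produced by docker save and therefore carries a top-level manifest.json. The function names are hypothetical and not part of CIP.

```shell
#!/usr/bin/env bash
# Sketch: inspect a `docker save` archive without loading it into Docker.
# Function names are illustrative only and are not part of CIP.

# List every file stored in the archive (layer data and JSON metadata).
list_archive() {
    tar -tf "$1"
}

# Print the top-level manifest.json, which records the image config,
# repo tags and layer paths for each saved image.
show_manifest() {
    tar -xOf "$1" manifest.json
}
```

For example, `show_manifest archive.tar` prints the tag and layer information for the images saved in archive.tar.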

Risks and Mitigations

In a scenario where the contents of the golden images are lost, recreating them from the archives would be a great challenge. Realistically, though, the CIP tool does not need to know the contents of the images it promotes. The e2e tests only require the test images to be static, because the same digests are used across multiple e2e test runs. In the event that all golden image backups are lost, the test manifests would need to be modified to match the digests of new test images.
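
Should that recovery ever be needed, updating a manifest amounts to a text substitution. The following is a minimal sketch under that assumption. The function name is hypothetical, and the real test manifests may need additional changes beyond swapping the digest string.

```shell
#!/usr/bin/env bash
# Sketch: swap an old image digest for a new one in a test manifest file.
# The function name is illustrative only and is not part of CIP.

replace_digest() {
    local old="$1" new="$2" file="$3" tmp
    tmp=$(mktemp) || return 1
    # Digests contain only [a-z0-9:], so they are safe inside a sed pattern.
    # Writing back with cat keeps the permissions of the original file.
    sed "s/${old}/${new}/g" "$file" > "$tmp" && cat "$tmp" > "$file"
    local status=$?
    rm -f "$tmp"
    return "$status"
}
```

A call takes the form `replace_digest sha256:<old> sha256:<new> <manifest file>`.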

Design Details

The CIP repository is written entirely in Golang, which simplifies the compilation process. Since the goal is to strip all Bazel dependencies, this section outlines the function of existing Bazel rules and recommends alternatives as a functional replacement.

Golang

Building all Go code in CIP should only use existing targets from the Makefile. Most targets, however, use a Bazel wrapper to handle code compilation, meaning Bazel invokes either the build or the run command. Fortunately, Golang’s CLI is a direct replacement for such tasks.

For example, this:

# Makefile (with Bazel)
REPO_ROOT:=$(shell dirname $(abspath $(lastword $(MAKEFILE_LIST))))

.PHONY: build
build: ## Bazel build
	bazel build //cmd/cip:cip \
		//test-e2e/cip:e2e \
		//test-e2e/cip-auditor:cip-auditor-e2e

.PHONY: install
install: ## Install
	bazel run //:install-cip -c opt -- $(shell go env GOPATH)/bin

can be transformed into this:

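# Makefile (without Bazel)
REPO_ROOT:=$(shell dirname $(abspath $(lastword $(MAKEFILE_LIST))))

# Binary names mirror the former Bazel target names; the bin/ output
# directory is illustrative. Go package paths take no ":target" suffix.
.PHONY: build
build: ## Build the cip and e2e test binaries
	go build -o $(REPO_ROOT)/bin/cip $(REPO_ROOT)/cmd/cip && \
	go build -o $(REPO_ROOT)/bin/e2e $(REPO_ROOT)/test-e2e/cip && \
	go build -o $(REPO_ROOT)/bin/cip-auditor-e2e $(REPO_ROOT)/test-e2e/cip-auditor

.PHONY: install
install: ## Install
	go install $(REPO_ROOT)/cmd/cip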

This one-to-one replacement works when building, installing, running and testing Go code, since these Bazel rules just use Go tools behind the scenes. Therefore, BUILD files containing go_library, go_binary and go_test rules can be eliminated.

Gazelle is currently used to generate and update Bazel build files for Go projects that follow the conventional “go build” project layout. It is intended to simplify the maintenance of Bazel Go projects as much as possible. However, since Bazel is being removed, Gazelle will be removed as well. Fortunately, the CIP tool already uses Go modules, which handle dependency management and vendoring. Removing Gazelle alongside Bazel therefore lets Go module commands such as go mod tidy handle importing and pruning dependencies before build time.

Docker

The go_image, container_image, container_layer, container_pull and container_bundle rules allow Bazel to define and build Docker containers. However, since Docker is already a dependency of CIP, it can act as a functional replacement for some of these commands.

Existing Rule(s)	Docker Equivalent	Without Dockerfile
container_bundle	docker save [IMAGE…]	Yes
container_pull	docker pull NAME[:TAG|@DIGEST]	Yes
container_layer, container_image, go_image	docker build PATH | URL	No

Though the first two Bazel rules have straightforward equivalents, Docker’s build command does not cover everything Bazel accomplished. For instance, a Dockerfile must be provided for every image to be built. Additionally, Bazel builds images deterministically: digests remain constant across builds. Docker does not behave this way; each build records a creation timestamp in the image, so every build yields a different digest. Therefore, replacing bazel build with docker build poses a problem for golden images.
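
The effect of a build timestamp on a digest can be illustrated without Docker. The sketch below hashes a simplified, made-up stand-in for an image config. Real image configs contain many more fields, but an image ID is likewise a SHA-256 over the config JSON, which includes a created timestamp. The function name and JSON layout are assumptions for illustration, and sha256sum is assumed to be available (GNU coreutils).

```shell
#!/usr/bin/env bash
# Sketch: why a creation timestamp changes a content-addressed digest.
# The JSON below is a simplified stand-in for an image config, not a real one.

# Print the SHA-256 of a toy config whose only varying field is "created".
config_digest() {
    printf '{"architecture":"amd64","created":"%s","os":"linux"}' "$1" \
        | sha256sum | cut -d' ' -f1
}

first=$(config_digest "2021-06-01T10:00:00Z")
second=$(config_digest "2021-06-01T10:00:05Z")

# Identical content built five seconds apart still hashes differently.
if [ "$first" != "$second" ]; then
    echo "digests differ"
fi
```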

E2E Tests

Since Docker can’t reproducibly build images with static digests, swapping it in directly would cause test failures. Replacing bazel build with docker build would generate unique images whose digests clash with the existing test manifests.

Non-Approach: Generate Test Manifests

If docker build produces new golden image digests each time, couldn’t we also generate new test manifests to use these new digests? This would suggest the following behavior:

[IMAGE 1]

This adds two new steps to the behavior of the e2e tests. Before each test, old test manifest files are deleted, as they contain old image digests. After every golden image is built with docker, its unique digest is saved into new test manifest files. Such an approach would allow e2e tests to pass, but would produce some unwanted side effects.

First, repeating the same e2e test twice is impossible, as the prior test fixtures are discarded. This makes debugging quite difficult since tests could no longer be repeated with the same images or manifests. Additionally, adding two extra steps to the behavior of the e2e tests complicates the testing process which is a non-goal of this project. Therefore, adopting docker build with this approach is non-viable.

Approach #1: Static Hosting

A simpler approach would be to host static golden images in a project-owned image repository. This would remove the step of building the same images for every PR. Instead, these images could be built once and live permanently in an isolated directory. Assuming these images now reside permanently in the source image repository, below is an example of the modified e2e test behavior.

[IMAGE 2]

In this scenario, there’s no need to clear the src repo since the golden images will already reside there. The destination still needs to be purged in order to remove any residual testing artifacts. However, this approach avoids image building altogether, thus removing the need for pushing images in setup.

Pros

This simplification of the e2e tests reduces the number of steps needed to set up promotion. Fewer steps improve the robustness of the testing procedure as a whole. Additionally, since both the cip and cip-auditor tests run multiple sub-tests with multiple golden images, this modified testing strategy may yield performance gains. Though faster e2e tests cannot be verified without implementation, they may be a desirable side effect of this approach.

Cons

Since golden images will never be built or pushed to src in the test cycle, they must remain static at all times. Though image migrations are very rare for CIP, moving or changing images in the src repository would result in immediate Prow job failures, halting the merging of PRs. Such a hangup would disrupt the development of CIP and should be avoided at all costs.

To protect the integrity of the golden images, CIP’s test service account should have read-only permissions on the src repository. This would prevent Prow from modifying the test fixtures during testing. Developers could also tamper with these static images, so it’s imperative that the specific testing directory within the src repository is well documented in the CIP source code.

Approach #2: Tarball Image Loading (recommended)

An even better approach, which also avoids dynamic image builds, makes use of Docker’s save command. Since all golden images should remain static, they can be archived and kept under source control in the CIP repository. Whenever needed, these tarball archives can be loaded back into container images. As container images, they can be pushed to GCR or used locally. What’s desirable about this process is that it circumvents building an image from source, which would have modified the image digest. Saving and loading archives retains all digest information and solves the problem of non-deterministic image builds.

Using the busybox image as an example, the following script outlines this behavior:

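#!/usr/bin/env bash
set -euo pipefail
# one-time step: pull the image and archive it
docker pull busybox
docker save -o archive.tar busybox
# at test time: restore the image from the archive
docker load -i archive.tar
# push image
docker tag busybox gcr.io/testing/example
docker push gcr.io/testing/example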

Loading the same archive always produces the same image digest, no matter how many times it is loaded. Therefore, docker save and docker load from tarballs provide reproducibility similar to Bazel’s. Below is the behavior of the e2e tests when adopting this approach:

[IMAGE 3]

Notice how this series of events looks almost identical to the original flow of e2e tests. It replaces the second build step with loading images from local archives. This simple adjustment removes the need for any extra setup modification while preserving existing business logic.
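
The load step could be sketched as a small helper that a make target invokes before the tests run. This is an assumption about how the step might be wired up, not the project’s implementation. The function name and the DOCKER override variable are hypothetical; the override exists only so the loop can be dry-run without a Docker daemon.

```shell
#!/usr/bin/env bash
# Sketch: load every golden image archive found in a directory.
# The function name and the DOCKER variable are illustrative only.

load_golden_images() {
    local dir="$1" archive
    for archive in "$dir"/*.tar; do
        # Skip the unexpanded glob when the directory holds no archives.
        [ -e "$archive" ] || continue
        # DOCKER defaults to the real docker CLI; stop at the first failure.
        "${DOCKER:-docker}" load -i "$archive" || return 1
    done
}
```

Running `DOCKER=echo load_golden_images <dir>` prints the docker load commands that would be executed for that directory instead of running them.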

Pros

This approach preserves the existing test behavior, minimizing the complexity of implementing this change. Tests will only need to replace the existing manifest creation with loading from a tarball. Of the two proposed approaches, this is the simpler one.

Cons

The existing tar files must not be modified for tests to work. This means committing all archives to source control within the CIP repository. It’s imperative that the function of these files is well documented and that they are not moved or modified. If these tarballs were moved to another directory or project, the e2e tests that use them would fail and raise concern. Such a PR would not be allowed to merge, making this a low risk.
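
One optional way to surface an accidental modification early is to keep a checksum file next to the archives. The sketch below is an illustration rather than part of the proposal. The function names and the SHA256SUMS file name are hypothetical, and it assumes GNU sha256sum is available.

```shell
#!/usr/bin/env bash
# Sketch: detect modified golden archives with a checksum file.
# Function names and the SHA256SUMS file name are illustrative only.

# Record a checksum for every archive in the given directory.
record_checksums() {
    ( cd "$1" && sha256sum -- *.tar > SHA256SUMS )
}

# Verify the archives against the recorded checksums.
# Returns non-zero if any archive is missing or has changed.
verify_checksums() {
    ( cd "$1" && sha256sum --check --quiet SHA256SUMS )
}
```

record_checksums would run once when the archives are committed, and verify_checksums could run before the archives are loaded.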

Test Plan

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name:
- Components depending on the feature gate:
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

What happens if we reenable the feature if it was previously rolled back?

N/A

Are there any tests for feature enablement/disablement?

Existing tests are available as make targets, triggered by Prow.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Docker
Golang

Does this feature depend on any specific services running in the cluster?

For Prow Jobs running particular make targets that require docker, the docker-in-docker feature must be enabled.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Alternatives

Ko Image Builder

Ko is a simplified container image builder specifically designed for Go applications. This tool makes it easy to build, name, and publish Docker images for Go applications without even requiring Docker as a dependency. Such a lightweight tool seemed to be a promising replacement for existing Bazel build tools.

However, the deal breaker is that the existing golden images do not contain actual Go programs, only small data files. Therefore, Ko would not work for our use case. Additionally, Ko doesn’t help with deterministic image digests. Docker lacks this feature as well, so adding Ko as a dependency would not solve any problems. If anything, migrating to another container build system would add complexity.

Kaniko Image Builder

Kaniko is a build tool which converts Dockerfiles to container images. Despite a variety of features for automation and defining build context, it does not help us reduce the build maintenance of the project. Since Docker can already accomplish all of this behavior, adding this tool doesn’t provide a useful substitution for Bazel.