Eliminating Kubernetes Image Signature Replication
The image promoter rewrite laid the groundwork for simplifying how Kubernetes delivers container image signatures. One of the rewrite phases (Phase 6) separated image signing from signature replication into distinct pipeline stages. This follow-up covers the next step: eliminating signature replication entirely.
The problem
After promoting container images to registry.k8s.io, the promoter signs them
using cosign with keyless (OIDC) signatures. These signatures are stored as OCI
artifacts alongside the images, tagged with the convention sha256-<digest>.sig
and sha256-<digest>.att.
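To make the tag convention concrete, here is a minimal Python sketch that derives both tags from an image digest. The function name `signature_tags` and the dictionary keys are illustrative only; they are not taken from cosign or the promoter.

```python
def signature_tags(digest: str) -> dict[str, str]:
    """Map a digest like 'sha256:<hex>' to its tag-based signature names.

    Follows the sha256-<digest>.sig / sha256-<digest>.att convention
    described above. Hypothetical helper, not part of any real tool.
    """
    algorithm, sep, hex_value = digest.partition(":")
    if sep != ":" or algorithm != "sha256" or not hex_value:
        raise ValueError(f"expected a sha256:<hex> digest, got {digest!r}")
    # The ':' in the digest is replaced by '-' because ':' is not valid in a tag.
    base = f"{algorithm}-{hex_value}"
    return {"signature": f"{base}.sig", "attestation": f"{base}.att"}
```

Because the tags are derived purely from the digest, any client that knows the image digest can compute where to look for the signature without an extra lookup.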
The registry.k8s.io domain is backed by archeio, a thin redirector that routes
container image requests to the nearest regional Google Artifact Registry
backend.
When a user in Europe pulls an image, archeio redirects them to
europe-west2-docker.pkg.dev; a user in Asia gets redirected to
asia-east1-docker.pkg.dev, and so on across 22 regional backends.
This geo-routing is great for image layers, where download locality matters for
performance. But it created a problem for signatures: if the promoter only wrote
a signature to one region, cosign verify would fail for users redirected to
any other region. The solution was a dedicated replication pipeline that copied
every .sig and .att tag to all 22 regional backends. This pipeline ran as a
periodic Prow job
every 2 hours on weekdays, performing thousands of API calls per run: listing
tags across all repositories, diffing what existed where, and copying the
missing signatures.
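The diffing step can be pictured with a small Python sketch. It is not the promoter's code (which was written in Go and has since been removed); the function name, the input shape, and the sample tags are assumptions made for illustration.

```python
def missing_signature_copies(
    tags_by_registry: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (registry, tag) pairs for signature tags a registry lacks.

    tags_by_registry is a hypothetical snapshot: registry host -> tags
    found there after the listing step.
    """
    def is_signature_tag(tag: str) -> bool:
        return tag.startswith("sha256-") and tag.endswith((".sig", ".att"))

    # Union of every signature/attestation tag seen in any region.
    all_signature_tags: set[str] = set()
    for tags in tags_by_registry.values():
        all_signature_tags |= {t for t in tags if is_signature_tag(t)}

    # Each registry must receive whatever it is missing from that union.
    missing: list[tuple[str, str]] = []
    for registry in sorted(tags_by_registry):
        for tag in sorted(all_signature_tags - tags_by_registry[registry]):
            missing.append((registry, tag))
    return missing
```

Even in this toy form, the work grows with the number of regions times the number of signature tags, which hints at why a real run needed thousands of API calls.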
The insight
Signatures and attestations are small metadata artifacts, typically a few kilobytes each. Unlike image layers, where geo-locality meaningfully improves download performance, fetching a signature from a non-local region adds negligible latency. The entire replication pipeline existed to optimize for a latency difference that users would never notice.
The solution
Instead of replicating signatures everywhere, archeio was taught to route
signature requests to a single canonical upstream. The change is
straightforward: when archeio receives a manifest request for a tag matching
sha256-*.sig or sha256-*.att, it redirects to
us-central1-docker.pkg.dev (the canonical region) instead of the
caller’s nearest regional backend. All other requests continue to use
geo-routing as before.
Normal image pull:
registry.k8s.io ⟶ (geo-routing) ⟶ europe-west2-docker.pkg.dev
Signature verification:
registry.k8s.io ⟶ (canonical) ⟶ us-central1-docker.pkg.dev
This is configured through a new SIGNATURE_UPSTREAM_ENDPOINT environment
variable on each Cloud Run instance that runs archeio.
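The routing rule can be sketched in a few lines of Python. This is an illustration of the decision described above, not archeio's actual code (archeio is written in Go). The function names, the fallback behaviour when the variable is unset, and the assumption that the variable holds a base URL are all hypothetical.

```python
import fnmatch
import os
from collections.abc import Mapping


def is_signature_manifest_request(path: str) -> bool:
    """True for /v2/<name>/manifests/<tag> where the tag is a .sig or .att tag."""
    prefix, sep, reference = path.rpartition("/manifests/")
    if not sep or not prefix.startswith("/v2/"):
        return False  # not a manifest request (for example, a blob request)
    return any(
        fnmatch.fnmatchcase(reference, pattern)
        for pattern in ("sha256-*.sig", "sha256-*.att")
    )


def pick_upstream(
    path: str,
    nearest_upstream: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Choose the backend to redirect to for one request path."""
    # Assumption: an unset or empty variable means "keep geo-routing everything".
    signature_upstream = environ.get("SIGNATURE_UPSTREAM_ENDPOINT", "")
    if signature_upstream and is_signature_manifest_request(path):
        return signature_upstream
    return nearest_upstream
```

Keeping the canonical endpoint in configuration rather than in code means the canonical region could be changed with a redeploy, without touching the matching logic.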
On the promoter side, the signing target was updated to explicitly use
us-central1-docker.pkg.dev as the canonical registry, instead of relying on
alphabetical sorting of registry names (which would have picked
asia-east1-docker.pkg.dev). The replicate-signatures subcommand was then
removed along with all supporting code.
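The difference between the implicit and explicit choice is easy to show in Python. The registry list below is a three-entry subset chosen for illustration, and both function names are hypothetical; the promoter's real selection logic is not reproduced here.

```python
# Illustrative subset of the regional backends named in this post.
REGISTRIES = [
    "us-central1-docker.pkg.dev",
    "europe-west2-docker.pkg.dev",
    "asia-east1-docker.pkg.dev",
]
CANONICAL_REGISTRY = "us-central1-docker.pkg.dev"


def implicit_signing_target(registries: list[str]) -> str:
    """Old-style choice: whichever registry name sorts first."""
    return sorted(registries)[0]


def explicit_signing_target(
    registries: list[str], canonical: str = CANONICAL_REGISTRY
) -> str:
    """New-style choice: a named canonical registry, validated against the list."""
    if canonical not in registries:
        raise ValueError(f"{canonical} is not a configured registry")
    return canonical
```

With this subset, the sort-based choice lands on the asia-east1 host, while the explicit choice returns the us-central1 host regardless of how the list is ordered or extended.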
What changed
The rollout was sequenced to ensure signature verification kept working at every step:
- kubernetes/registry.k8s.io#321: Added SIGNATURE_UPSTREAM_ENDPOINT support to archeio
- kubernetes/k8s.io#9413: Deployed the new environment variable to all Cloud Run instances and updated the archeio image digest
- Verified that cosign verify works against registry.k8s.io and that .sig/.att requests redirect to us-central1-docker.pkg.dev
- kubernetes-sigs/promo-tools#1829: Removed the replication pipeline and updated the signing target, released as kpromo v4.5.0
- kubernetes/test-infra#36909: Removed the periodic Prow replication job
Impact
Removing signature replication:
- Eliminates thousands of API calls that were spent listing tags and copying signatures across 22 regions every 2 hours
- Removes a source of transient failures, since the replication job was susceptible to Artifact Registry rate limits
- Simplifies the promoter codebase by deleting the two-phase tag listing, multi-registry grouping logic, and concurrent copy orchestration (over 1,200 lines removed)
- Removes a periodic Prow job that ran on weekdays
End users see no change. cosign verify against registry.k8s.io continues to
work exactly as before:
cosign verify registry.k8s.io/kube-apiserver:v1.36.0 \
--certificate-identity krel-trust@k8s-releng-prod.iam.gserviceaccount.com \
--certificate-oidc-issuer https://accounts.google.com
Trade-offs
Routing all signature requests to a single region means that if
us-central1 is unavailable, cosign verify for images served through
registry.k8s.io would fail until the region recovers. This is the main
trade-off of the approach.
A few mitigating factors make this acceptable in practice:
- Artifact Registry is a managed Google Cloud service with high regional availability. An outage of us-central1 would likely affect far more than just signature serving.
- Signatures are small metadata (a few KB). Even during normal operation, cosign already depends on registry availability for verification, whether the manifest comes from a regional or central backend.
- Image pulls themselves are unaffected. Geo-routing for image layers continues to work independently of signature availability.
What’s next
The broader Kubernetes ecosystem is moving toward OCI 1.1 referrers for signature discovery, replacing the tag-based convention that cosign has used historically. cosign v3 defaults to storing signatures as OCI referrers. As this migration progresses, the tag-matching logic in archeio can eventually be replaced with referrer-aware routing.
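As a rough idea of what a referrer-aware check might start from, the OCI Distribution Specification 1.1 defines a referrers endpoint of the form /v2/<name>/referrers/<digest>. The Python sketch below only recognises that path shape. How archeio would actually route such requests, and how it would handle the follow-up fetches of the referenced manifests, is not described here, and the names used are hypothetical.

```python
import re

# Matches /v2/<name>/referrers/sha256:<64 hex chars>; <name> may contain slashes.
REFERRERS_PATH = re.compile(
    r"^/v2/(?P<name>.+)/referrers/(?P<digest>sha256:[0-9a-f]{64})$"
)


def is_referrers_request(path: str) -> bool:
    """True when the path has the shape of an OCI 1.1 referrers API call."""
    return REFERRERS_PATH.match(path) is not None
```

A check like this would replace the tag glob, since referrers are discovered through the subject digest rather than through sha256-*.sig style tags.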
Getting involved
This work is tracked in kubernetes-sigs/promo-tools#1762. If you are interested in contributing to SIG Release, join our weekly meeting or reach out on the #sig-release Slack channel.