KEP-2033: Rootless mode

Implementation History
BETA Implementable
Created 2019-06-04
Latest v1.37
Milestones
Alpha v1.22
Beta v1.37
Ownership
Owning SIG
SIG Node
Primary Authors

KEP-2033: Kubelet-in-UserNS (aka Rootless mode)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • [N/A] e2e Tests for all Beta API Operations (endpoints)
    • [N/A] (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • [N/A] (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP allows running all Kubernetes node components (kubelet, CRI, OCI, CNI, and all kube-*) as a non-root user on the host, by running them in a user namespace. See Notes/Constraints/Caveats for the caveats.

TL;DR: Most things work without modifying Kubernetes, but we need to modify just a few lines of kubelet and kube-proxy to ignore errors when setting some sysctl and rlimit values. See “Required changes to Kubernetes”.

Resources:

Motivation

  • Protect the host from potential container-breakout vulnerabilities. This is the main motivation.
  • Allow users of shared machines (especially HPC) to run Kubernetes without the risk of accidentally breaking their colleagues’ environments. Not recommended for real multi-tenancy where the users cannot be trusted.
    • Safe kind : Kubernetes inside Rootless Docker/Podman.
    • Safe Kubernetes-on-Kubernetes, to isolate workloads more strictly than Kubernetes API namespaces.

FAQ: why not use admission controllers?

Admission controllers like PSP can restrict containers to use extra security options like AppArmor/SELinux, gVisor/Kata, and also potentially Node-level UserNS in the future.

However, these are not effective at mitigating vulnerabilities of the node components themselves (kubelet, CRI, OCI…).

e.g.

  • CVE-2017-1002102 : kubelet could delete files on the host during syncing secret/configMap/downwardAPI volumes
  • CVE-2019-11245 : Dockerfile USER instruction was ignored by kubelet
  • CVE-2018-11235 : kubelet could execute an arbitrary command as the root via gitRepo volumes
  • Potential image extraction zip-slip vulnerabilities in CRI runtimes. Both containerd and CRI-O are working on implementing support for new archive formats like zstd, imgcrypt, and stargz. These implementations could contain such vulnerabilities.
  • And lots of CRI/OCI vulnerabilities in the past.

Goals

Non-Goals

The Node-level UserNS KEP is similar to this KEP, but out of scope for this KEP.

While Node-level UserNS executes only containers inside UserNS, this KEP executes all the node components inside UserNS to mitigate vulnerabilities of all components.

Node-level UserNS and this KEP do not conflict and can be stacked together. (Node-level UserNS inside Kubelet’s UserNS.)

Proposal

User Stories (Optional)

Story 1: Production cluster

A user is scared of the past vulnerabilities of kubelet/CRI/OCI, and looking for a way to mitigate such potential vulnerabilities.

So the user would want this KEP to be implemented.

The user may face difficulties in deploying stateful workloads because block-based and NFS-based persistent volumes mostly do not work (see Notes/Constraints/Caveats), but this is not a major problem when the user can use managed object storage such as Amazon S3, or managed RDBs such as Amazon RDS, for storing persistent data.

If the user really needs to run an application that requires root privileges, the user would create a mixed cluster composed of rootful nodes and rootless nodes, and set the node selector to ensure that privileged pods are scheduled on rootful nodes. However, it is preferable to create a separate cluster for rootful nodes.

Story 2: HPC cluster

A user wants to deploy a Kubernetes cluster using shared HPC machines to run scientific research workloads.

However, the machine administrator does not want to allow the user to gain the root privileges, because the admin thinks that the user may accidentally break other users’ environments.

And yet, the admin hesitates to deploy a shared Kubernetes cluster and to create RBAC-restricted accounts for users, because user management in Kubernetes is very difficult.

The user would want this KEP to be implemented so that he/she can deploy Kubernetes without convincing the admin.

Story 3: kind with Rootless Docker/Podman

A user wants to run a test cluster inside Docker/Podman on his/her laptop using kind.

However, the user doesn’t want Kubernetes/kind/Docker/Podman to gain the root privileges because these components may accidentally break the host environment, e.g. Docker may modify the host iptables in an unexpected way and break the user’s VPN connectivity.

The user would want this KEP to be implemented so that he/she could run kind with Rootless Docker/Podman, which won’t break the host.

Story 4: Temporary initial cluster for bootstrapping

A user needs a temporary initial cluster to bootstrap an actual cluster with Cluster API.

The user wants to avoid having the root privileges.

Notes/Constraints/Caveats (Optional)

TL;DR: Things that work with Rootless Docker 20.10 and Rootless Podman 2.1 will work with Rootless Kubernetes as well. Other things will not.

cgroup:

  • No support for cgroup v1.
  • Hugepages cannot be supported because systemd doesn’t support delegation of the hugetlb controller: https://github.com/systemd/systemd/issues/16325
  • The device controller cannot be supported either, but this is not a major problem, because non-root users don’t have permission to access insecure devices anyway.

Network:

  • kube-proxy needs the following KubeProxyConfiguration to avoid errors when setting sysctl values:
conntrack:
  # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
  maxPerCore: 0
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
  tcpEstablishedTimeout: 0s
  # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close_wait"
  tcpCloseWaitTimeout: 0s
  • Some CNI plugins might not work. Flannel (VXLAN) is known to work.
  • Limited network performance due to the slirp4netns overhead. Mitigation: Install lxc-user-nic (SETUID binary).
  • NodePort less than 1024 cannot be exposed. This is not a problem with the default --service-node-port-range configuration (30000-32767). Mitigation: set CAP_NET_BIND_SERVICE file capability on rootlesskit binary.

Volumes:

  • Block device volumes and (kernel-mode) NFS do not work, because a user namespace only supports tmpfs, bind, and FUSE filesystems. emptyDir, hostPath, local, and API volumes (configMap, secret, downwardAPI, …) are known to work without any issue. FUSE-based CSI volumes can be supported, but are not recommended. Mitigation: Use managed object storage services such as Amazon S3/Google Cloud Storage/Azure Blob Storage, or use managed database services for storing persistent data.

SecurityContext:

  • A container with securityContext.privileged cannot gain the real root privileges, obviously.
  • runAsUser: supported, but the number of the UID is limited by /etc/subuid.
  • sysctls: some sysctl parameters are supported, but others fail with EPERM. Creating a Pod with such sysctl parameters would fail. If this behavior is problematic, users should write a Mutating Admission Webhook to remove such sysctl parameters from Pod manifests.
  • seccomp: supported
  • AppArmor: unsupported. Creating a Pod with an AppArmor profile would fail.
  • SELinux: Same as Rootless Podman. Applying an existing profile would be ok, but creating a new profile would not.
  • Node-level UserNS KEP : can be supported. This UserNS will be nested inside Kubelet’s UserNS.
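The runAsUser limit noted above comes from the subordinate UID ranges granted to the unprivileged user. As a rough illustration, the sketch below sums the range sizes for one user from a subuid-format file (lines of the form name:start:count). The function name subuid_count is hypothetical, and the sketch assumes that entries are keyed by login name rather than by numeric UID.

```shell
# subuid_count FILE USER
# Print the total number of subordinate UIDs granted to USER in FILE.
# Assumption: FILE uses the "name:start:count" layout of /etc/subuid.
subuid_count() {
  awk -F: -v user="$2" '$1 == user { total += $3 } END { print total + 0 }' "$1"
}

# Example (not run here): subuid_count /etc/subuid "$(id -un)"
```

A result of 65536, for instance, suggests that only container UIDs up to roughly that value can be mapped, so a Pod requesting a larger runAsUser would be expected to fail.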

Risks and Mitigations

If the Linux kernel has vulnerabilities in its user namespace implementation, root in the user namespace might be able to escape from the user namespace and gain real root privileges on the host.

So, it is still preferred to run pods with sandbox technologies like gVisor to mitigate potential kernel vulnerabilities.

Design Details

Running Kubernetes inside Rootless Docker/Podman (kind, minikube)

When Kubernetes is executed inside Rootless Docker/Podman, the namespaces and cgroups are already configured by Docker/Podman. So there is essentially no additional setup, but we still have to modify a few lines of kubelet and kube-proxy to ignore minor sysctl and rlimit errors. See “Required changes to Kubernetes”.

It should be noted that kind already works with unmodified Kubernetes, but kind currently relies on a hack that mounts fake files under /proc/sys to avoid hitting sysctl errors.

Running Kubernetes directly on the host

The node components need to be executed inside a user namespace along with other namespaces (mount namespace, network namespace, etc.) to gain fake-root privileges, mostly for mount and network operations.

To run Rootless Kubernetes directly on the host, RootlessKit can be used for creating namespaces. In a nutshell, RootlessKit is an extended version of unshare for rootless containers. RootlessKit has already been adopted by Docker, BuildKit, Usernetes, k3s, and partially by Podman.

All Kubernetes components including CRI runtime, kubelet, kube-proxy, and CNI daemon need to be executed in RootlessKit’s namespaces.

$ rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run --copy-up=/var --pidns --cgroupns --ipcns --utsns -- containerd &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a kubelet ... &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a kube-proxy ... &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a flanneld ... &

Paths

Some paths like /var/log/pods are hardcoded in Kubernetes and hard to change.

Although these directories are not writable by unprivileged users, Kubernetes does NOT need to be changed to use unprivileged home directories, because RootlessKit can bind-mount writable directories on these paths without the root privileges. (rootlesskit --copy-up=/var)

Network

The node components need to be executed in RootlessKit’s network namespace, because an unprivileged user cannot do privileged operations in the host network namespace. As the components are executed inside a network namespace, NodePorts are not directly accessible from other hosts.

An external controller should watch changes on corev1.Service resources and call RootlessKit API to set up port forwarding for the node ports.

k3s implementation: https://github.com/rancher/k3s/blob/v1.17.2+k3s1/pkg/rootlessports/controller.go#L92-L96
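As a sketch of what such a controller has to compute, the helper below turns a Service node port and protocol into a forwarding spec that maps the same port number on the host to the namespace. The function name nodeport_spec and the HOSTIP:HOSTPORT:CHILDPORT/PROTO layout are assumptions made for illustration; check the port-forwarding documentation of your RootlessKit version before relying on this format.

```shell
# nodeport_spec PORT PROTOCOL
# Print a port-forwarding spec for one NodePort, listening on all host
# addresses and forwarding to the same port inside the namespace.
# Assumption: the consumer expects HOSTIP:HOSTPORT:CHILDPORT/PROTO with a
# lower-case protocol name.
nodeport_spec() {
  proto=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  printf '0.0.0.0:%s:%s/%s\n' "$1" "$1" "$proto"
}

# Example (not run here; the socket path is a placeholder):
#   rootlessctl --socket="$STATE_DIR/api.sock" add-ports "$(nodeport_spec 30080 TCP)"
```

A real controller would also need to remove forwards when a Service is deleted or its node port changes, which this sketch does not cover.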

RootlessKit network drivers

RootlessKit supports two kinds of network stacks:

  • TAP with pure usermode network stack (either slirp4netns or VPNKit)
  • vEth with setuid binary lxc-user-nic

slirp4netns is preferred for security; lxc-user-nic is preferred for performance.

These stacks are used for the namespace in which the node components are executed, not for the containers’ namespaces. CNI plugins such as Flannel are expected to be used for the containers’ namespaces.

CNI plugins

Flannel (VXLAN) is known to work.

cgroup

cgroup v2 and systemd are required. cgroup v1 won’t be supported due to security concerns.

containerd supports cgroup v2 for rootless mode since containerd v1.4. The master branch of CRI-O also supports cgroup v2 for rootless mode. It will be included in CRI-O v1.22.

No code change is required in kubelet for managing cgroups, because we can use cgroup namespaces along with mount namespaces to create a writable /sys/fs/cgroup filesystem.
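Before starting the node components, it can be useful to confirm that the host uses cgroup v2 and that systemd has delegated the controllers you expect. The following is a minimal sketch: the function names are hypothetical, stat -fc %T assumes GNU coreutils, and the controller names and systemd user-service path in the example are only illustrative.

```shell
# is_cgroup_v2 [MOUNTPOINT]
# Succeed when MOUNTPOINT (default /sys/fs/cgroup) is a cgroup v2 mount.
is_cgroup_v2() {
  [ "$(stat -fc %T "${1:-/sys/fs/cgroup}" 2>/dev/null)" = "cgroup2fs" ]
}

# has_controller FILE NAME
# Succeed when NAME is listed in a cgroup.controllers-style FILE
# (a single line of space-separated controller names).
has_controller() {
  for controller in $(cat "$1"); do
    [ "$controller" = "$2" ] && return 0
  done
  return 1
}

# Example (not run here; path assumes a systemd user session):
#   f="/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
#   is_cgroup_v2 && has_controller "$f" memory && echo "memory controller delegated"
```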

Required changes to Kubernetes

Most things work without modifying Kubernetes, but we need to modify just a few lines of kubelet and kube-proxy to ignore errors when setting some sysctl and rlimit values.

kubelet

Patch: “kubelet/cm: ignore sysctl error when running in userns”

The patch modifies kubelet to ignore errors that occur when setting the following sysctl keys:

  • vm.overcommit_memory
  • vm.panic_on_oom
  • kernel.panic
  • kernel.panic_on_oops
  • kernel.keys.root_maxkeys
  • kernel.keys.root_maxbytes

Note: These sysctl parameters are set for kubelet itself. They are unrelated to .spec.securityContext.sysctls in Pod manifests.
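The idea of "ignore the error only when running in a user namespace" can be illustrated outside kubelet with a small shell sketch. It is not the kubelet patch itself: the function names are hypothetical, and the detection is a common heuristic that treats any UID map other than the identity map of the initial namespace ("0 0 4294967295") as a user namespace.

```shell
# in_userns [UID_MAP_FILE]
# Succeed when the UID map (default /proc/self/uid_map) is not the identity
# mapping of the initial user namespace. Heuristic only.
in_userns() {
  map_file="${1:-/proc/self/uid_map}"
  # The first line has three fields: inside-ID, outside-ID, range length.
  read -r inside outside count < "$map_file" || return 1
  if [ "$inside" = 0 ] && [ "$outside" = 0 ] && [ "$count" = 4294967295 ]; then
    return 1
  fi
  return 0
}

# set_sysctl_tolerant KEY VALUE
# Try to set a sysctl; inside a user namespace, downgrade a failure to a warning.
set_sysctl_tolerant() {
  if sysctl -w "$1=$2" >/dev/null 2>&1; then
    return 0
  fi
  if in_userns; then
    echo "warning: ignoring failure to set $1 (running in a user namespace)" >&2
    return 0
  fi
  return 1
}

# Example (not run here; the value is illustrative):
#   set_sysctl_tolerant vm.overcommit_memory 1
```

Outside a user namespace the failure is still reported, which keeps the rootful behavior unchanged.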

kube-proxy

Patch: “kube-proxy: allow running in userns”

The patch modifies kube-proxy (userspace mode) to ignore an error during setting RLIMIT_NOFILE. No change is needed for non-userspace mode.

Note: the userspace proxy mode was removed in v1.26.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

See e2e tests below.

Additional tests are present in several subproject repos and third party repos:

Prerequisite testing updates
Unit tests

N/A. Unit tests do not make sense here, as the relevant code depends on sysctl:

The feature can be tested only by running the entire node components in UserNS.

See e2e tests below for how the feature is actually tested.

Integration tests

N/A, as integration tests do not make sense here, for the same reason as explained above for the unit tests.

See e2e tests below for how the feature is actually tested.

e2e tests

NodeConformance tests are executed using kubetest2-kindinv.

“kindinv” stands for “Kubernetes in (Rootless) Docker in (GCE) VM”. GCE VM is used for enabling systemd that is required by Rootless Docker to set up cgroup v2.

exec kubetest2 kindinv \
  --boskos-location=http://boskos.test-pods.svc.cluster.local \
  --gcp-zone=us-central1-b \
  --instance-image=ubuntu-os-cloud/ubuntu-2404-lts-amd64 \
  --instance-type=n2-standard-4 \
  --kind-rootless \
  --user=rootless \
  --build \
  --up \
  --down \
  --test=ginkgo \
  -- \
  --focus-regex='\[NodeConformance\]' \
  --skip-regex='\[Environment:NotInUserNS\]|\[Slow\]' \
  --parallel=8

Graduation Criteria

Upgrade / Downgrade Strategy

This feature is new, so there is no upgrade path from existing nodes.

Version Skew Strategy

N/A. This KEP only affects the internals of kubelet, and does not affect any API.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: KubeletInUserNamespace
    • Components depending on the feature gate: kubelet
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?

Enabling the KubeletInUserNamespace feature gate does not automatically execute kubelet in a user namespace. The user namespace has to be created by RootlessKit before running kubelet. For the kind use case, the namespace is provided by Rootless Docker or Rootless Podman (they internally use RootlessKit).

Note that this feature gate does not support separating kubelet’s user namespace from the user namespace of other node components such as CRI. All the node components must run in the same user namespace.
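As a minimal sketch of turning the gate on, the helper below writes a KubeletConfiguration file that enables KubeletInUserNamespace. The function name and file path are hypothetical, and a real node will need further settings that are not shown here.

```shell
# write_kubelet_config PATH
# Write a minimal KubeletConfiguration that enables the feature gate.
write_kubelet_config() {
  cat > "$1" <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  KubeletInUserNamespace: true
EOF
}

# Example (not run here): kubelet must still be started inside the
# namespaces created by RootlessKit, for instance via nsenter.
#   write_kubelet_config ./kubelet-config.yaml
#   kubelet --config=./kubelet-config.yaml ...
```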

Does enabling the feature change any default behavior?

The limitations are the same as those of Rootless Docker, Podman, etc. See https://rootlesscontaine.rs/caveats/.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by turning off the feature gate.

What happens if we reenable the feature if it was previously rolled back?

The rootless functionality is again available in kubelet.

Are there any tests for feature enablement/disablement?

Yes. See Test Plan.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout: Rolling out requires recreating a new node instance, in a UserNS. Typical failures:

Rollback: this question is not applicable. Rolling back requires recreating a new node instance.

What specific metrics should inform a rollback?

An increase in node_collector_unhealthy_nodes_in_zone.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This question is not applicable. Rolling out and rolling back require recreating a new node instance.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Nodes will have kubernetes.io/running-in-user-namespace: <BOOL> labels. NodeSystemInfo will be updated too to have RunningInUserNamespace *bool.
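Assuming the label is published as described and that its value is the string "true" on rootless nodes (the text above only says <BOOL>), an operator could list such nodes with a label selector. The wrapper name list_userns_nodes and the KUBECTL override variable are hypothetical conveniences.

```shell
# list_userns_nodes
# Print the names of nodes labeled as running in a user namespace.
# Set KUBECTL to use a different kubectl binary or wrapper.
list_userns_nodes() {
  "${KUBECTL:-kubectl}" get nodes \
    -l 'kubernetes.io/running-in-user-namespace=true' -o name
}

# Example (not run here): list_userns_nodes
```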

How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name: Nodes will have kubernetes.io/running-in-user-namespace: <BOOL> labels. NodeSystemInfo will be updated to have RunningInUserNamespace *bool.
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

In default Kubernetes installation with the feature enabled, 99th percentile per cluster-day of node_collector_unhealthy_nodes_in_zone <= X where X depends on the size of the cluster.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Are there any missing metrics that would be useful to have to improve observability of this feature?

None

Dependencies

  • Kernel: 5.2 or later is recommended. At least 4.15 or later is required. (Reason )
  • Systemd: 244 or later is recommended.
  • CRI: containerd >= 1.4, or CRI-O >= 1.22 is required.
  • OCI: runc >= 1.0-rc91 is required. runc >= 1.0-rc93 is recommended. crun works, too.
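The kernel requirement above can be checked with a small helper. This is a sketch under the assumption that the release string starts with MAJOR.MINOR, as in the output of uname -r; the function name kernel_at_least is hypothetical.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Succeed when RELEASE (e.g. "5.15.0-91-generic") is at least MAJOR.MINOR.
kernel_at_least() {
  rel_major=${1%%.*}             # text before the first dot
  rest=${1#*.}                   # text after the first dot
  rel_minor=${rest%%[!0-9]*}     # leading digits of the remainder
  [ "$rel_major" -gt "$2" ] ||
    { [ "$rel_major" -eq "$2" ] && [ "$rel_minor" -ge "$3" ]; }
}

# Example (not run here):
#   kernel_at_least "$(uname -r)" 4 15 || echo "kernel too old for rootless mode"
#   kernel_at_least "$(uname -r)" 5 2  || echo "kernel older than the recommended 5.2"
```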
Does this feature depend on any specific services running in the cluster?
  • [RootlessKit]
    • Usage description: sets up namespaces, and forwards incoming TCP & UDP packets
      • Impact of its outage on the feature: kubelet, kube-proxy, CRI, and all container processes will crash, and will be restarted by systemd.
      • Impact of its degraded performance or high-error rates on the feature: Incoming packet forwarding will be slow.
  • [slirp4netns]
    • Usage description: forwards outgoing TCP & UDP packets via a virtual router
      • Impact of its outage on the feature: Outgoing packets will be dropped.
      • Impact of its degraded performance or high-error rates on the feature: Outgoing packet forwarding will be slow.

When a cluster is being created in a kind container with Rootless Docker/Rootless Podman provider, the user namespace is already created by Rootless Docker/Rootless Podman, so RootlessKit and slirp4netns do not need to be installed in the kind container.

Both Docker and Podman use RootlessKit and slirp4netns (or VPNkit, optionally) internally.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

User-mode implementations of TCP/IP (RootlessKit, slirp4netns, pasta, etc.) may incur high CPU and memory consumption.

“Figure 8: CPU utilization while running iperf3 client” in https://arxiv.org/pdf/2402.00365 shows that a configuration with RootlessKit for incoming packets and slirp4netns for outgoing packets may incur roughly 20% CPU usage.

This issue can be addressed by using lxc-user-nic (SETUID helper) or bypass4netns (seccomp-based network accelerator).

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Same as traditional rootful Kubernetes.

What are other known failure modes?

Same as traditional rootful Kubernetes.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

The primary drawback of this KEP is its complexity. It also heavily relies on third-party, out-of-tree components.

Alternatives

The Node-level UserNS KEP is often considered to be an alternative, but it is actually not, because it can’t mitigate vulnerabilities of kubelet, CRI, OCI, and their relevant components. See Non-goals section.

Infrastructure Needed (Optional)

CI infra for cgroup v2 is needed.