KEP-2033: Rootless mode
KEP-2033: Kubelet-in-UserNS (aka Rootless mode)
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [N/A] e2e Tests for all Beta API Operations (endpoints)
- [N/A] (R) Ensure GA e2e tests meet requirements for Conformance Tests
- [N/A] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- [N/A] (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP allows running all Kubernetes components (kubelet, CRI, OCI, CNI, and all kube-*) as a non-root user on the host,
by running them in a user namespace.
See Notes/Constraints/Caveats for the caveats.
TLDR: Most things work without modifying Kubernetes. However, a few lines of kubelet and kube-proxy need to be modified to ignore errors when setting some sysctl and rlimit values. See “Required changes to Kubernetes”.
Resources:
- POC: Usernetes
- A presentation at KubeCon NA 2020: https://sched.co/fGWc
- Kubernetes PR: https://github.com/kubernetes/kubernetes/pull/92863
- Rootless k3s: https://github.com/k3s-io/k3s/blob/master/k3s-rootless.service
- kind with Rootless Docker/Rootless Podman: https://kind.sigs.k8s.io/docs/user/rootless/ (It already works with unmodified Kubernetes, but contains a dirty hack to fake procfs)
- Proposal in minikube repo, for running Kubernetes in Rootless Docker: https://github.com/kubernetes/minikube/issues/9495
- Proposal in minikube repo, for running Kubernetes in Rootless Podman: https://github.com/kubernetes/minikube/issues/8719
Motivation
- Protect the host from potential container-breakout vulnerabilities. This is the main motivation.
- Allow users of shared machines (especially HPC) to run Kubernetes without the risk of accidentally breaking their colleagues’ environments.
Not recommended for real multi-tenancy where the users cannot be trusted.
- Safe kind: Kubernetes inside Rootless Docker/Podman.
- Safe Kubernetes-on-Kubernetes, to isolate workloads more strictly than Kubernetes API namespaces.
FAQ: why not use admission controllers?
Admission controllers like PSP can restrict containers to use extra security options like AppArmor/SELinux, gVisor/Kata, and also potentially Node-level UserNS in the future.
However, these are not effective at mitigating vulnerabilities of the node components themselves (kubelet, CRI, OCI…).
e.g.
- CVE-2017-1002102 : kubelet could delete files on the host during syncing secret/configMap/downwardAPI volumes
- CVE-2019-11245 : Dockerfile USER instruction was ignored by kubelet
- CVE-2018-11235 : kubelet could execute an arbitrary command as root via gitRepo volumes
- Potential image extraction zip-slip vulnerabilities in CRI runtimes. Both containerd and CRI-O are working on support for new archive formats like zstd, imgcrypt, and stargz, and these implementations could contain such vulnerabilities.
- And lots of CRI/OCI vulnerabilities in the past.
Goals
- Allow kubelet and kube-proxy to be executed inside user namespaces created by a non-root user. See “Required changes to Kubernetes”.
Non-Goals
The Node-level UserNS KEP is similar to this KEP, but is out of scope here.
Node-level UserNS executes only containers inside UserNS, whereas this KEP executes all the node components inside UserNS to mitigate vulnerabilities of all components.
Node-level UserNS and this KEP do not conflict and can be stacked together. (Node-level UserNS inside Kubelet’s UserNS.)
Proposal
User Stories (Optional)
Story 1: Production cluster
A user is concerned about the past vulnerabilities of kubelet/CRI/OCI, and is looking for a way to mitigate similar vulnerabilities in the future.
So the user would want this KEP to be implemented.
The user may face difficulties deploying stateful workloads because block-based and NFS-based persistent volumes mostly do not work (see Notes/Constraints/Caveats), but this is not a major problem when the user can use managed object storage such as Amazon S3, or managed RDBs such as Amazon RDS, for storing persistent data.
If the user really needs to run an application that requires root privileges, the user can create a mixed cluster composed of rootful nodes and rootless nodes, and set a node selector to ensure that privileged pods are scheduled on rootful nodes. However, creating a separate cluster for rootful nodes is preferable.
Story 2: HPC cluster
A user wants to deploy a Kubernetes cluster using shared HPC machines to run scientific research workloads.
However, the machine administrator does not want to give the user root privileges, because the admin thinks that the user may accidentally break other users’ environments.
And yet, the admin hesitates to deploy a shared Kubernetes cluster and to create RBAC-restricted accounts for users, because user management in Kubernetes is very difficult.
The user would want this KEP to be implemented so that he/she can deploy Kubernetes without convincing the admin.
Story 3: kind with Rootless Docker/Podman
A user wants to run a test cluster inside Docker/Podman on his/her laptop using kind.
However, the user doesn’t want Kubernetes/kind/Docker/Podman to gain root privileges because these components may accidentally break the host environment, e.g. Docker may modify the host iptables in an unexpected way and break the user’s VPN connectivity.
The user would want this KEP to be implemented so that he/she could run kind with Rootless Docker/Podman, which won’t break the host.
Story 4: Temporary initial cluster for bootstrapping
A user needs a temporary initial cluster to bootstrap an actual cluster with Cluster API.
The user wants to avoid having the root privileges.
Notes/Constraints/Caveats (Optional)
TL;DR: Things that work with Rootless Docker 20.10 and Rootless Podman 2.1 will work with Rootless Kubernetes as well. Other things will not.
cgroup:
- No support for cgroup v1.
- Hugepages cannot be supported because systemd doesn’t support delegation of the hugetlb controller: https://github.com/systemd/systemd/issues/16325
- Device controller cannot be supported as well, but it is not a huge deal, because non-root users don’t have permission to access insecure devices anyway.
Network:
- kube-proxy needs the following KubeProxyConfiguration to avoid hitting errors while setting sysctl values:
  conntrack:
    # Skip setting sysctl value "net.netfilter.nf_conntrack_max"
    maxPerCore: 0
    # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_established"
    tcpEstablishedTimeout: 0s
    # Skip setting "net.netfilter.nf_conntrack_tcp_timeout_close_wait"
    tcpCloseWaitTimeout: 0s
- Some CNI plugins might not work. Flannel (VXLAN) is known to work.
- Limited network performance due to the slirp4netns overhead.
  Mitigation: Install lxc-user-nic (SETUID binary).
- NodePorts below 1024 cannot be exposed. This is not a problem with the default --service-node-port-range configuration (30000-32767).
  Mitigation: set the CAP_NET_BIND_SERVICE file capability on the rootlesskit binary.
Volumes:
- Block device volumes and (kernel-mode) NFS do not work, because a user namespace only supports tmpfs, bind, and FUSE filesystems.
  emptyDir, hostPath, local, and API volumes (configMap, secret, downwardAPI, …) are known to work without any issue.
  FUSE-based CSI volumes can be supported, but are not recommended.
  Mitigation: Use managed object storage services such as Amazon S3/Google Cloud Storage/Azure Blob Storage, or use managed database services for storing persistent data.
SecurityContext:
- A container with securityContext.privileged cannot gain the real root privileges, obviously.
- runAsUser: supported, but the number of UIDs is limited by /etc/subuid.
- sysctls: some sysctl parameters are supported, but some would fail with EPERM. Creating a Pod manifest with such sysctl parameters would fail. If this behavior is problematic, the user should write a Mutating Admission Webhook to remove such sysctl parameters from Pod manifests.
- seccomp: supported
- AppArmor: unsupported. Creating a Pod with an AppArmor profile would fail.
- SELinux: Same as Rootless Podman. Applying an existing profile would be ok, but creating a new profile would not.
- Node-level UserNS KEP : can be supported. This UserNS will be nested inside Kubelet’s UserNS.
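The Mutating Admission Webhook suggested above for sysctls could compute its patch roughly as follows. This is a minimal sketch and not part of the KEP: the function name and the set of rejected sysctl names are hypothetical, and the AdmissionReview HTTP handling that a real webhook needs is omitted.

```python
def build_sysctl_removal_patch(pod, unsupported_sysctls):
    """Return JSON Patch operations that drop the given sysctls from a Pod.

    pod is a Pod manifest as a plain dict. unsupported_sysctls is a
    hypothetical, operator-maintained set of names known to fail with EPERM.
    """
    spec = pod.get("spec") or {}
    security_context = spec.get("securityContext") or {}
    sysctls = security_context.get("sysctls") or []

    operations = []
    # Emit removals from the highest index down, so that applying one
    # operation does not shift the index used by the next one.
    for index in range(len(sysctls) - 1, -1, -1):
        if sysctls[index].get("name") in unsupported_sysctls:
            operations.append(
                {"op": "remove", "path": f"/spec/securityContext/sysctls/{index}"}
            )
    return operations
```

Which sysctl names belong in the set depends on the host and kernel, so the set would have to be determined per environment.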
Risks and Mitigations
If the Linux kernel has a vulnerability in its user namespace implementation, root in the user namespace might be able to escape from the user namespace and gain real root privileges on the host.
So, it is still preferable to run pods with sandbox technologies like gVisor to mitigate potential kernel vulnerabilities.
Design Details
Running Kubernetes inside Rootless Docker/Podman (kind, minikube)
When Kubernetes is executed inside Rootless Docker/Podman, the namespaces and cgroups are already configured by Docker/Podman. So there is essentially no additional setup, but a few lines of kubelet and kube-proxy still have to be modified to ignore minor sysctl and rlimit errors. See “Required changes to Kubernetes”.
It should be noted that kind already works with unmodified Kubernetes, but kind currently uses a very dirty hack that mounts fake files under /proc/sys to avoid hitting sysctl errors.
Running Kubernetes directly on the host
The node components need to be executed inside a user namespace along with other namespaces (mount namespace, network namespace, etc.) to gain fake-root privileges, mostly for mount and network operations.
To run Rootless Kubernetes directly on the host, RootlessKit can be used for creating namespaces.
In a nutshell, RootlessKit is an extended version of unshare for rootless containers.
RootlessKit has already been adopted by Docker, BuildKit, Usernetes, k3s, and partially by Podman.
All Kubernetes components including CRI runtime, kubelet, kube-proxy, and CNI daemon need to be executed in RootlessKit’s namespaces.
$ rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run --copy-up=/var --pidns --cgroupns --ipcns --utsns -- containerd &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a kubelet ... &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a kube-proxy ... &
$ nsenter -t $ROOTLESSKIT_CHILD_PID -a flanneld ... &
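The commands above leave open where $ROOTLESSKIT_CHILD_PID comes from. One possible way to obtain it is sketched below. The sketch assumes that RootlessKit was started with a --state-dir option and writes the child PID to a file named child_pid in that directory; this should be verified against the RootlessKit version in use. The helper name is hypothetical.

```python
from pathlib import Path


def nsenter_argv(state_dir, command):
    """Build an nsenter command line that joins RootlessKit's namespaces.

    Assumes state_dir contains a child_pid file written by RootlessKit.
    """
    child_pid = int((Path(state_dir) / "child_pid").read_text().strip())
    # "-a" enters all namespaces of the target, as in the commands above.
    return ["nsenter", "-t", str(child_pid), "-a", *command]
```

The resulting list can be passed to subprocess.Popen to start kubelet, kube-proxy, or flanneld next to the CRI runtime.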
Paths
Some paths like /var/log/pods are hardcoded in Kubernetes and hard to change.
Although these directories are not writable by unprivileged users, Kubernetes does NOT need to be changed to use unprivileged home directories,
because RootlessKit can bind-mount writable directories on these paths without the root privileges. (rootlesskit --copy-up=/var)
Network
The node components need to be executed in RootlessKit’s network namespace, because an unprivileged user cannot do privileged operations in the host network namespace.
As the components are executed inside a network namespace, NodePorts are not directly accessible from other hosts.
An external controller should watch changes on corev1.Service resources and call the RootlessKit API to set up port forwarding for the node ports.
k3s implementation: https://github.com/rancher/k3s/blob/v1.17.2+k3s1/pkg/rootlessports/controller.go#L92-L96
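The reconciliation step of such a controller can be sketched as follows. This is an illustration under stated assumptions, not the k3s implementation: Services are plain dicts in the usual manifest shape, the function names are hypothetical, and both the watch loop and the actual RootlessKit API calls are left out.

```python
def desired_forwards(services):
    """Collect (protocol, port) pairs for every node port in the Services."""
    forwards = set()
    for service in services:
        spec = service.get("spec") or {}
        for port in spec.get("ports") or []:
            node_port = port.get("nodePort")
            if node_port:  # unset or 0 means no node port is allocated
                protocol = (port.get("protocol") or "TCP").lower()
                forwards.add((protocol, node_port))
    return forwards


def plan_changes(current, desired):
    """Return (to_add, to_remove) as sorted lists of (protocol, port)."""
    return sorted(desired - current), sorted(current - desired)
```

A controller would call plan_changes on every Service event and translate the two lists into add and remove requests against the port forwarding API.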
RootlessKit network drivers
RootlessKit supports two kinds of network stacks:
- TAP with pure usermode network stack (either slirp4netns or VPNKit)
- vEth with setuid binary lxc-user-nic
slirp4netns is preferred for security; lxc-user-nic is preferred for performance.
These stacks are used for the namespace where the node components are executed in, not for the containers’ namespaces. CNI plugins such as Flannel are expected to be used for the containers’ namespace.
CNI plugins
Flannel (VXLAN) is known to work.
cgroup
cgroup v2 and systemd are required. cgroup v1 won’t be supported due to security concerns.
containerd supports cgroup v2 for rootless mode since containerd v1.4. The master branch of CRI-O also supports cgroup v2 for rootless mode. It will be included in CRI-O v1.22.
No code change is required in kubelet for managing cgroups, because cgroup namespaces can be used along with mount namespaces to create a writable /sys/fs/cgroup filesystem.
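Since cgroup v2 is a hard requirement, a node setup script might verify it before starting the node components. The sketch below checks for the cgroup.controllers file, which exists only in a cgroup v2 hierarchy, and reports controllers that are not available. The function names and the default list of wanted controllers are assumptions for illustration, not requirements stated by this KEP.

```python
from pathlib import Path


def delegated_controllers(cgroup_dir):
    """Return the controllers listed in cgroup.controllers, or None.

    None means the file is absent, which suggests the directory is not
    part of a cgroup v2 (unified) hierarchy.
    """
    controllers_file = Path(cgroup_dir) / "cgroup.controllers"
    if not controllers_file.is_file():
        return None
    return set(controllers_file.read_text().split())


def missing_controllers(cgroup_dir, wanted=("cpu", "memory", "pids")):
    """List wanted controllers that are not available in cgroup_dir."""
    available = delegated_controllers(cgroup_dir)
    if available is None:
        raise RuntimeError(f"{cgroup_dir} does not look like a cgroup v2 directory")
    return sorted(set(wanted) - available)
```

On a systemd host the directory to inspect would typically be the user's own delegated cgroup rather than /sys/fs/cgroup itself; the exact path depends on the host setup.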
Required changes to Kubernetes
Most things work without modifying Kubernetes. However, a few lines of kubelet and kube-proxy need to be modified to ignore errors when setting some sysctl and rlimit values.
kubelet
Patch: “kubelet/cm: ignore sysctl error when running in userns”
The patch modifies kubelet to ignore errors that happen while setting the following sysctl keys:
- vm.overcommit_memory
- vm.panic_on_oom
- kernel.panic
- kernel.panic_on_oops
- kernel.keys.root_maxkeys
- kernel.keys.root_maxbytes
Note: These sysctl parameters are set for kubelet itself. They are unrelated to .spec.securityContext.sysctls in Pod manifests.
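The behavior of the patch can be illustrated with a small sketch. This is not the kubelet code (kubelet is written in Go); it only shows the pattern of tolerating a failed sysctl write when running in a user namespace. The function names are hypothetical, and tolerate_errors stands for "the feature gate is enabled and the process is inside a user namespace".

```python
def set_sysctl(key, value, write, tolerate_errors):
    """Try to set one sysctl; return True if it was applied.

    write is the function that performs the write. With tolerate_errors
    set, an OSError (such as EPERM) is logged and swallowed instead of
    being propagated.
    """
    try:
        write(key, value)
    except OSError as err:
        if not tolerate_errors:
            raise
        print(f"ignoring sysctl error in user namespace: {key}={value}: {err}")
        return False
    return True


def write_proc_sys(key, value):
    """Write a sysctl through /proc/sys, mapping dots to path separators."""
    with open("/proc/sys/" + key.replace(".", "/"), "w") as handle:
        handle.write(str(value))
```

Passing the writer as an argument keeps the error-handling logic testable without touching /proc/sys.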
kube-proxy
Patch: “kube-proxy: allow running in userns”
The patch modifies kube-proxy (userspace mode) to ignore an error when setting RLIMIT_NOFILE.
No change is needed for non-userspace mode.
Note: The userspace proxy mode was removed in v1.26.
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
See e2e tests below.
Additional tests are present in several subproject repos and third party repos:
- https://github.com/kubernetes-sigs/kind/blob/v0.29.0/.github/workflows/vm.yaml#L24
- https://github.com/kubernetes/minikube/blob/v1.36.0/.github/workflows/pr.yml#L299-L415
- https://github.com/k3s-io/k3s/blob/v1.33.1%2Bk3s1/.github/workflows/e2e.yaml#L56
- https://github.com/rootless-containers/usernetes/blob/gen2-v20250501.0/.github/workflows/main.yaml
- Covers multi-node clusters with Flannel (VXLAN)
- Covers several host distributions (Ubuntu, CentOS Stream, and Fedora)
Prerequisite testing updates
Unit tests
N/A. Unit tests do not make sense here, as the relevant code depends on sysctl:
- https://github.com/kubernetes/kubernetes/blob/v1.34.1/pkg/kubelet/cm/container_manager_linux.go#L483-L485
- https://github.com/kubernetes/kubernetes/blob/v1.34.1/pkg/kubelet/kubelet.go#L559-L567
The feature can be tested only by running the entire node components in UserNS.
See e2e tests below for how the feature is actually tested.
Integration tests
N/A, as integration tests do not make sense here, for the same reason as explained above for the unit tests.
See e2e tests below for how the feature is actually tested.
e2e tests
NodeConformance tests are executed using kubetest2-kindinv.
“kindinv” stands for “Kubernetes in (Rootless) Docker in (GCE) VM”. A GCE VM is used to enable systemd, which Rootless Docker requires to set up cgroup v2.
exec kubetest2 kindinv \
--boskos-location=http://boskos.test-pods.svc.cluster.local \
--gcp-zone=us-central1-b \
--instance-image=ubuntu-os-cloud/ubuntu-2404-lts-amd64 \
--instance-type=n2-standard-4 \
--kind-rootless \
--user=rootless \
--build \
--up \
--down \
--test=ginkgo \
-- \
--focus-regex='\[NodeConformance\]' \
--skip-regex='\[Environment:NotInUserNS\]|\[Slow\]' \
--parallel=8
- Prow manifest: https://github.com/kubernetes/test-infra/blob/aefb999cad82965bd6fb7e3104525fe8d87e434f/config/jobs/kubernetes/sig-testing/kubernetes-kind-ci.yaml#L250-L314
- Logs: https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kind-rootless
Graduation Criteria
Alpha: Basic support for rootless mode on cgroups v2 hosts.
Beta: e2e test coverage. The tests are covered by NodeConformance tests (see above). Requirements:
- the cgroup v2 KEP to reach Beta or GA.
Open Source Usage:
- https://github.com/rootless-containers/usernetes/blob/gen2-v20250828.0/kubeadm-config.yaml#L45
- https://github.com/kubernetes-sigs/kind/blob/v0.30.0/pkg/cluster/internal/kubeadm/config.go#L501
- https://github.com/kubernetes/minikube/blob/v1.36.0/cmd/minikube/cmd/start_flags.go#L654
- https://github.com/k3s-io/k3s/blob/v1.33.4%2Bk3s1/pkg/daemons/agent/agent_linux.go#L26
- https://github.com/k3d-io/k3d/blob/v5.8.3/docs/usage/advanced/podman.md?plain=1#L141
- https://github.com/epinio/epinio/blob/v1.12.0/scripts/acceptance-cluster-setup.sh#L92
- https://github.com/lxc/cluster-api-provider-incus/blob/v0.7.0/docs/book/src/explanation/unprivileged-containers.md?plain=1#L23
- https://github.com/NVIDIA/aistore/blob/v1.3.31/deploy/dev/k8s/utils/ci/generate_kind_config.sh#L18
- https://github.com/GoogleCloudPlatform/anthos-samples/blob/8aff62c3f0bd835bda7479a01a591e1849c48fe9/anthos-attached-clusters/kind/main.tf#L37
- https://github.com/GoogleCloudPlatform/cloud-solutions/blob/pino-logging-gcp-config-v1.1.0/projects/k8s-hybrid-neg-controller/hack/kind-cluster-config.yaml#L24
- https://github.com/GoogleCloudPlatform/solutions-workshops/blob/grpc-xds/v0.5.0/grpc-xds/hack/kind-cluster-config-2.yaml#L24
In beta, nodes will have kubernetes.io/running-in-user-namespace: <BOOL> labels. NodeSystemInfo will also be updated to have RunningInUserNamespace *bool.
GA: Assuming no negative user feedback based on production experience, promote after >= 2 releases in beta. Requirements:
- the cgroup v2 KEP to reach GA.
Upgrade / Downgrade Strategy
This feature is new, so there is no upgrade path from existing nodes.
Version Skew Strategy
N/A. This KEP only affects the internals of kubelet, and does not affect any API.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: KubeletInUserNamespace
  - Components depending on the feature gate: kubelet
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Enabling the KubeletInUserNamespace feature gate does not automatically execute kubelet in a user namespace.
The user namespace has to be created by RootlessKit before running kubelet.
For the kind use case, the namespace is provided by Rootless Docker or Rootless Podman (they internally use RootlessKit).
Note that this feature gate does not support separating kubelet’s user namespace from the user namespace of other node components such as CRI. All the node components must run in the same user namespace.
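Because the feature gate does not create the namespace, a launcher may want to confirm that it really is inside a user namespace before starting kubelet. The sketch below uses a common heuristic based on /proc/self/uid_map: the initial user namespace maps the whole 32-bit UID range onto itself. The function names are hypothetical, and the heuristic is not claimed to be identical to the check kubelet performs.

```python
def running_in_user_namespace(uid_map_text):
    """Guess from the contents of /proc/self/uid_map whether we are in a userns."""
    fields = uid_map_text.split()
    if len(fields) != 3:
        # Empty (mapping not written yet) or several ranges: neither is
        # what the initial user namespace looks like.
        return True
    inside, outside, length = (int(field) for field in fields)
    # The initial user namespace maps the full 32-bit range onto itself.
    return not (inside == 0 and outside == 0 and length == 4294967295)


def current_process_in_user_namespace():
    """Apply the heuristic to the current process (Linux only)."""
    with open("/proc/self/uid_map") as handle:
        return running_in_user_namespace(handle.read())
```

Keeping the parsing separate from the file read makes the heuristic easy to exercise with sample uid_map contents.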
Does enabling the feature change any default behavior?
The limitations are the same as for Rootless Docker, Podman, etc. See https://rootlesscontaine.rs/caveats/ .
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, by turning off the feature gate.
What happens if we reenable the feature if it was previously rolled back?
The rootless functionality is again available in kubelet.
Are there any tests for feature enablement/disablement?
Yes. See Test Plan .
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout: Rolling out requires recreating a new node instance, in a UserNS. Typical failures:
Rollback: this question is not applicable. Rolling back requires recreating a new node instance.
What specific metrics should inform a rollback?
An increase of node_collector_unhealthy_nodes_in_zone.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
This question is not applicable. Rolling out and rolling back both require recreating a new node instance.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
Nodes will have kubernetes.io/running-in-user-namespace: <BOOL> labels.
NodeSystemInfo will also be updated to have RunningInUserNamespace *bool.
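An operator could use the label to list the affected nodes. The sketch below filters Node objects given as plain dicts, for example the items of a JSON node listing. It assumes that <BOOL> is serialized as the string "true" or "false", which is an assumption about the final label format; the function name is hypothetical.

```python
USERNS_LABEL = "kubernetes.io/running-in-user-namespace"


def nodes_in_user_namespace(nodes):
    """Return the names of Node objects labeled as running in a userns."""
    names = []
    for node in nodes:
        metadata = node.get("metadata") or {}
        labels = metadata.get("labels") or {}
        # Assumes the label value is the string "true" when set.
        if labels.get(USERNS_LABEL) == "true":
            names.append(metadata.get("name"))
    return names
```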
How can someone using this feature know that it is working for their instance?
- Events
  - Event Reason:
- API .status
  - Condition name: Nodes will have kubernetes.io/running-in-user-namespace: <BOOL> labels. NodeSystemInfo will be updated to have RunningInUserNamespace *bool.
  - Other field:
- Other (treat as last resort)
  - Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
In default Kubernetes installation with the feature enabled,
99th percentile per cluster-day of node_collector_unhealthy_nodes_in_zone <= X
where X depends on the size of the cluster.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: node_collector_unhealthy_nodes_in_zone
  - [Optional] Aggregation method:
  - Components exposing the metric: node-lifecycle-controller
- Other (treat as last resort)
  - Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
None
Dependencies
- Kernel: 5.2 or later is recommended. 4.15 or later is required. (Reason)
- Systemd: 244 or later is recommended.
- CRI: containerd >= 1.4, or CRI-O >= 1.22 is required.
- OCI: runc >= 1.0-rc91 is required. runc >= 1.0-rc93 is recommended. crun works, too.
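The kernel requirement above can be checked mechanically. The following sketch classifies a kernel release string against the two thresholds from the list; the function name and the three result labels are hypothetical.

```python
import re

REQUIRED = (4, 15)     # minimum kernel version from the list above
RECOMMENDED = (5, 2)   # recommended kernel version from the list above


def kernel_support_level(release):
    """Classify a kernel release string such as the output of `uname -r`."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    if version < REQUIRED:
        return "unsupported"
    if version < RECOMMENDED:
        return "supported"
    return "recommended"
```

On a running host the input can come from platform.release() in the Python standard library.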
Does this feature depend on any specific services running in the cluster?
- [RootlessKit]
  - Usage description: sets up namespaces, and forwards incoming TCP & UDP packets
  - Impact of its outage on the feature: kubelet, kube-proxy, CRI, and all container processes will crash, and will be restarted by systemd.
  - Impact of its degraded performance or high-error rates on the feature: Incoming packet forwarding will be slow.
- [slirp4netns]
  - Usage description: forwards outgoing TCP & UDP packets via a virtual router
  - Impact of its outage on the feature: Outgoing packets will be dropped.
  - Impact of its degraded performance or high-error rates on the feature: Outgoing packet forwarding will be slow.
When a cluster is being created in a kind container with Rootless Docker/Rootless Podman provider,
the user namespace is already created by Rootless Docker/Rootless Podman, so RootlessKit and slirp4netns do not need to be installed
in the kind container.
Both Docker and Podman use RootlessKit and slirp4netns (or VPNKit, optionally) internally.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
User-mode implementations of TCP/IP (RootlessKit, slirp4netns, pasta, etc.) may incur high CPU and memory consumption.
“Figure 8: CPU utilization while running iperf3 client” in https://arxiv.org/pdf/2402.00365 indicates that a configuration with RootlessKit for incoming packets and slirp4netns for outgoing packets may incur roughly 20% CPU usage.
This issue can be addressed by using lxc-user-nic (SETUID helper) or bypass4netns (seccomp-based network accelerator).
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Same as traditional rootful Kubernetes.
What are other known failure modes?
Same as traditional rootful Kubernetes.
What steps should be taken if SLOs are not being met to determine the problem?
- Make sure that supported versions of the components are used
- Make sure that more than 65536 subuids are allocated
- Make sure that cgroup v2 delegation is enabled
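The subuid check in the list above can be scripted. The sketch below sums the ranges assigned to a user in text in the /etc/subuid format (name:first_id:count). The function names are hypothetical, and the default treats 65536 IDs as sufficient, which is an assumption about how the figure above should be read; pass a larger minimum for a stricter check.

```python
def subordinate_id_count(subid_text, user):
    """Sum the sizes of all ranges assigned to user in subuid/subgid text."""
    total = 0
    for line in subid_text.splitlines():
        fields = line.strip().split(":")
        # Each entry has the form name:first_id:count; skip anything else.
        if len(fields) == 3 and fields[0] == user:
            total += int(fields[2])
    return total


def has_enough_subordinate_ids(subid_text, user, minimum=65536):
    """Report whether user has at least minimum subordinate IDs."""
    return subordinate_id_count(subid_text, user) >= minimum
```

The same functions work for /etc/subgid, since both files share the format.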
Implementation History
- 2018-07-20: Early POC implementation in Usernetes project
- 2019-04-10: k3s adopted the Usernetes patches (cgroupless version)
- 2019-06-04: Presented KEP to SIG-node (cgroupless version)
- 2019-07-08: Withdrew the cgroupless KEP
- 2019-11-19: @giuseppe submitted cgroup v2 KEP
- 2019-11-19: Presented KEP to SIG-node (cgroup v2 version)
- 2020-07-07: the cgroup v2 support is in implementable status
- 2021-08-04: Kubernetes v1.22 (Alpha)
- 2026-08-XX: Kubernetes v1.37 (Beta)
Drawbacks
The primary drawback of this KEP is its complexity. It also heavily relies on third-party, out-of-tree components.
Alternatives
The Node-level UserNS KEP is often considered to be an alternative, but it is actually not, because it can’t mitigate vulnerabilities of kubelet, CRI, OCI, and their relevant components. See Non-goals section.
Infrastructure Needed (Optional)
CI infra for cgroup v2 is needed