KEP-3857: Recursive read-only (RRO) mounts
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Make read-only volumes recursively read-only.
e.g., if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage should be read-only too.
Motivation
The current readOnly volumes are not recursively read-only, and may result in compromise of data;
e.g., even if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage are not read-only.
This issue can be fixed by utilizing the OCI Runtime’s “rro” bind mount option (https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config.md#linux-mount-options) to make read-only bind mounts recursively read-only.
The “rro” bind mount option is implemented by calling mount_setattr(2) with MOUNT_ATTR_RDONLY and AT_RECURSIVE.
Requires kernel >= 5.12, with one of the following OCI runtimes:
- runc >= 1.1
- crun >= 1.4
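The kernel requirement above can be checked programmatically. The following is a minimal sketch, assuming the kernel release string (as printed by `uname -r`) has already been obtained; `kernelAtLeast` is a hypothetical helper written for illustration and is not part of kubelet, runc, or crun.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "5.15.0-91-generic" is at least major.minor.
// Hypothetical helper for illustration only.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// The minor component may carry a suffix (e.g. "12-rc1"); keep the leading digits only.
	minorStr := parts[1]
	end := 0
	for end < len(minorStr) && minorStr[end] >= '0' && minorStr[end] <= '9' {
		end++
	}
	mnr, err := strconv.Atoi(minorStr[:end])
	if err != nil {
		return false, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return maj > major || (maj == major && mnr >= minor), nil
}

func main() {
	// Sample release strings; real code would read the value from the node.
	for _, r := range []string{"5.11.22", "5.12.0", "6.1.0-18-amd64"} {
		ok, err := kernelAtLeast(r, 5, 12)
		fmt.Println(r, ok, err)
	}
}
```

A version check alone is not sufficient: the OCI runtime must also support “rro”, so a real detection path would combine both conditions.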
Goals
Support recursive read-only mounts for kernel >= 5.12.
Non-Goals
Support recursive read-only mounts for old runc and old kernel releases.
Proposal
User Stories (Optional)
Story 1
A user wants to mount /mnt, including its submounts such as /mnt/usbstorage, as read-only.
Notes/Constraints/Caveats (Optional)
Constraints: needs runc >= 1.1 && kernel >= 5.12.
Risks and Mitigations
Increased API surface but still not secure-by-default, for the sake of compatibility.
- Mitigation: None
False sense of security when not implemented
- Mitigation: VolumeMountStatus indicating actual RRO setting
Design Details
Core API
Add RecursiveReadOnly: (Disabled|IfPossible|Enabled) to the VolumeMount struct.
A pod manifest will look like this:
spec:
  volumes:
    - name: foo
      hostPath:
        path: /mnt
        type: Directory
  containers:
    - volumeMounts:
        - mountPath: /mnt
          name: foo
          mountPropagation: None
          readOnly: true
          # NEW
          recursiveReadOnly: IfPossible
See the comment lines in the diff below for the constraints of the VolumeMount options:
diff --git a/pkg/apis/core/types.go b/pkg/apis/core/types.go
index e40b8bfa104..09c88222c2d 100644
--- a/pkg/apis/core/types.go
+++ b/pkg/apis/core/types.go
@@ -1914,6 +1914,31 @@ type VolumeMount struct {
// Optional: Defaults to false (read-write).
// +optional
ReadOnly bool
+ // RecursiveReadOnly specifies recursive-readonly mode.
+ //
+ // 1. If ReadOnly is false, RecursiveReadOnly must be unspecified.
+ // 2. If ReadOnly is true:
+ // 2.1. If RecursiveReadOnly is unspecified:
+ // 2.1.1. if it belongs to a Pod being created, it is initialized to Disabled.
+ // 2.1.2 if it belongs to a PodSpec under Deployment, Job, etc., it remains unspecified
+ // (and will be set to Disabled eventually, when the Pod is created).
+ // 2.2. If RecursiveReadOnly is set to Disabled, the mount is not made recursively read-only.
+ // 2.3. If RecursiveReadOnly is set to IfPossible, the mount is made recursively read-only,
+ // if it is supported by the runtime.
+ // If it is not supported by the runtime, the mount is not made recursively read-only.
+ // MountPropagation must be None or unspecified (which defaults to None).
+ // 2.4. If RecursiveReadOnly is set to Enabled, the mount is made recursively read-only.
+ // If it is not supported by the runtime, the Pod will be terminated by kubelet,
+ // and an error will be generated to indicate the reason.
+ // MountPropagation must be None or unspecified (which defaults to None).
+ // 2.5. If RecursiveReadOnly is set to unknown value, it will result in an error.
+ //
+ // When this property is recognized by kubelet and kube-apiserver,
+ // VolumeMountStatus.RecursiveReadOnly will be set to either Disabled or Enabled.
+ //
+ // +featureGate=RecursiveReadOnlyMounts
+ // +optional
+ RecursiveReadOnly *RecursiveReadOnlyMode
// Required. If the path is not an absolute path (e.g. some/path) it
// will be prepended with the appropriate root prefix for the operating
// system. On Linux this is '/', on Windows this is 'C:\'.
@@ -1926,6 +1951,8 @@ type VolumeMount struct {
// to container and the other way around.
// When not set, MountPropagationNone is used.
// This field is beta in 1.10.
+ // When RecursiveReadOnly is set to IfPossible or to Enabled, MountPropagation must be None or unspecified
+ // (which defaults to None).
// +optional
MountPropagation *MountPropagationMode
// Expanded path within the volume from which the container's volume should be mounted.
@@ -1961,6 +1988,18 @@ const (
MountPropagationBidirectional MountPropagationMode = "Bidirectional"
)
+// RecursiveReadOnlyMode describes recursive-readonly mode.
+type RecursiveReadOnlyMode string
+
+const (
+ // RecursiveReadOnlyDisabled disables recursive-readonly mode.
+ RecursiveReadOnlyDisabled RecursiveReadOnlyMode = "Disabled"
+ // RecursiveReadOnlyIfPossible enables recursive-readonly mode if possible.
+ RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
+ // RecursiveReadOnlyEnabled enables recursive-readonly mode, or raise an error.
+ RecursiveReadOnlyEnabled RecursiveReadOnlyMode = "Enabled"
+)
+
// VolumeDevice describes a mapping of a raw block device within a container.
type VolumeDevice struct {
// name must match the name of a persistentVolumeClaim in the pod
@@ -2591,6 +2630,10 @@ type ContainerStatus struct {
// +featureGate=InPlacePodVerticalScaling
// +optional
Resources *ResourceRequirements
+ // Status of volume mounts.
+ // +listType=atomic
+ // +optional
+ VolumeMounts []VolumeMountStatus
}
// PodPhase is a label for the condition of a pod at the current time.
@@ -2664,6 +2707,21 @@ const (
PodResizeStatusInfeasible PodResizeStatus = "Infeasible"
)
+// VolumeMountStatus shows status of volume mounts.
+type VolumeMountStatus struct {
+ // Name corresponds to the name of the original VolumeMount.
+ Name string
+ // ReadOnly corresponds to the original VolumeMount.
+ // +optional
+ ReadOnly bool
+ // RecursiveReadOnly must be set to Disabled, Enabled, or unspecified (for non-readonly mounts).
+ // An IfPossible value in the original VolumeMount must be translated to Disabled or Enabled,
+ // depending on the mount result.
+ // +featureGate=RecursiveReadOnlyMounts
+ // +optional
+ RecursiveReadOnly *RecursiveReadOnlyMode
+}
+
// RestartPolicy describes how the container should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one
@@ -4591,6 +4649,24 @@ type NodeDaemonEndpoints struct {
KubeletEndpoint DaemonEndpoint
}
+// RuntimeClassFeatures is a set of runtime features.
+type RuntimeClassFeatures struct {
+ // RecursiveReadOnlyMounts is set to true if the runtime class supports RecursiveReadOnlyMounts.
+ // +optional
+ RecursiveReadOnlyMounts *bool
+}
+
+// RuntimeClass is a set of runtime class information.
+type RuntimeClass struct {
+ // Runtime class name.
+ // Empty for the default runtime class.
+ // +optional
+ Name string
+ // Supported features.
+ // +optional
+ Features *RuntimeClassFeatures
+}
+
// NodeSystemInfo is a set of ids/uuids to uniquely identify the node.
type NodeSystemInfo struct {
// MachineID reported by the node. For unique machine identification
@@ -4701,6 +4777,9 @@ type NodeStatus struct {
// Status of the config assigned to the node via the dynamic Kubelet config feature.
// +optional
Config *NodeConfigStatus
+ // The available runtime classes.
+ // +optional
+ RuntimeClasses []RuntimeClass
}
// UniqueVolumeName defines the name of attached volume
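
The constraints listed in the VolumeMount comment (rules 1 to 2.5) can be sketched as a small validation function. This is an illustrative sketch only: the `VolumeMount` type here is a trimmed-down stand-in for the real struct, and `validateRRO` and `ptr` are hypothetical helpers, not the actual kube-apiserver validation code.

```go
package main

import (
	"errors"
	"fmt"
)

type RecursiveReadOnlyMode string
type MountPropagationMode string

const (
	RecursiveReadOnlyDisabled   RecursiveReadOnlyMode = "Disabled"
	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
	RecursiveReadOnlyEnabled    RecursiveReadOnlyMode = "Enabled"

	MountPropagationNone MountPropagationMode = "None"
)

// VolumeMount is a trimmed-down stand-in holding only the fields
// relevant to the rules above.
type VolumeMount struct {
	ReadOnly          bool
	RecursiveReadOnly *RecursiveReadOnlyMode
	MountPropagation  *MountPropagationMode
}

// validateRRO is a hypothetical helper mirroring rules 1 to 2.5.
func validateRRO(m VolumeMount) error {
	if m.RecursiveReadOnly == nil {
		return nil // Rule 2.1: defaulting to Disabled happens elsewhere.
	}
	if !m.ReadOnly {
		// Rule 1.
		return errors.New("recursiveReadOnly must be unspecified when readOnly is false")
	}
	switch *m.RecursiveReadOnly {
	case RecursiveReadOnlyDisabled:
		return nil // Rule 2.2.
	case RecursiveReadOnlyIfPossible, RecursiveReadOnlyEnabled:
		// Rules 2.3 and 2.4: propagation must be None or unspecified.
		if m.MountPropagation != nil && *m.MountPropagation != MountPropagationNone {
			return fmt.Errorf("mountPropagation must be None or unspecified, got %q", *m.MountPropagation)
		}
		return nil
	default:
		// Rule 2.5.
		return fmt.Errorf("unknown recursiveReadOnly value %q", *m.RecursiveReadOnly)
	}
}

// ptr returns a pointer to v (convenience for the examples below).
func ptr[T any](v T) *T { return &v }

func main() {
	ok := VolumeMount{ReadOnly: true, RecursiveReadOnly: ptr(RecursiveReadOnlyIfPossible)}
	bad := VolumeMount{ReadOnly: false, RecursiveReadOnly: ptr(RecursiveReadOnlyEnabled)}
	fmt.Println(validateRRO(ok))
	fmt.Println(validateRRO(bad))
}
```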
CRI API
Add bool recursive_read_only to the Mount message.
CRI implementations will also expose the availability of the feature via the RuntimeHandlerFeatures message.
As kubelet can inspect the availability of the feature via the RuntimeHandlerFeatures message,
there is no concept of “IfPossible” in the CRI API;
kubelet translates an “IfPossible” value in the Core API into true or false in the CRI API.
The RuntimeHandlerFeatures message is also propagated to the RuntimeClassFeatures struct of the Core API, which is exposed through NodeStatus.RuntimeClasses.
Diff:
diff --git a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
index e16688d8386..194d591c27f 100644
--- a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
+++ b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
@@ -235,6 +235,15 @@ message Mount {
repeated IDMapping uidMappings = 6;
// GidMappings specifies the runtime GID mappings for the mount.
repeated IDMapping gidMappings = 7;
+ // If set to true, the mount is made recursive read-only.
+ // In this CRI API, recursive_read_only is a plain true/false boolean, although its equivalent
+ // in the Kubernetes core API is a quaternary that can be nil, "Enabled", "IfPossible", or "Disabled".
+ // kubelet translates that quaternary value in the core API into a boolean in this CRI API.
+ // Remarks:
+ // - nil is just treated as false
+ // - when set to true, readonly must be explicitly set to true, and propagation must be PRIVATE (0).
+ // - (readonly == false && recursive_read_only == false) does not make the mount read-only.
+ bool recursive_read_only = 8;
}
// IDMapping describes host to container ID mappings for a pod sandbox.
@@ -1524,6 +1533,22 @@ message StatusRequest {
bool verbose = 1;
}
+message RuntimeHandlerFeatures {
+ // recursive_read_only_mounts is set to true if the runtime handler supports
+ // recursive read-only mounts.
+ // For runc-compatible runtimes, availability of this feature can be detected by checking whether
+ // the Linux kernel version is >= 5.12, and, `runc features | jq .mountOptions` contains "rro".
+ bool recursive_read_only_mounts = 1;
+}
+
+message RuntimeHandler {
+ // Name must be unique in StatusResponse.
+ // An empty string denotes the default handler.
+ string name = 1;
+ // Supported features.
+ RuntimeHandlerFeatures features = 2;
+}
+
message StatusResponse {
// Status of the Runtime.
RuntimeStatus status = 1;
@@ -1532,6 +1557,8 @@ message StatusResponse {
// debug, e.g. plugins used by the container runtime.
// It should only be returned non-empty when Verbose is true.
map<string, string> info = 2;
+ // Runtime handlers.
+ repeated RuntimeHandler runtime_handlers = 3;
}
message ImageFsInfoRequest {}
diff --git a/staging/src/k8s.io/cri-api/pkg/errors/errors.go b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
index a4538669122..c8e4a18dec5 100644
--- a/staging/src/k8s.io/cri-api/pkg/errors/errors.go
+++ b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
@@ -29,6 +29,9 @@ var (
// ErrSignatureValidationFailed - Unable to validate the image signature on the PullImage RPC call.
ErrSignatureValidationFailed = errors.New("SignatureValidationFailed")
+
+ // ErrRROUnsupported - Unable to enforce recursive readonly mounts
+ ErrRROUnsupported = errors.New("RROUnsupported")
)
// IsNotFound returns a boolean indicating whether the error
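
The translation from the Core API mode to the CRI boolean described in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual kubelet code: `criRecursiveReadOnly` and `statusMode` are hypothetical names, and the error value merely mirrors the ErrRROUnsupported error added in the diff above.

```go
package main

import (
	"errors"
	"fmt"
)

type RecursiveReadOnlyMode string

const (
	RecursiveReadOnlyDisabled   RecursiveReadOnlyMode = "Disabled"
	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
	RecursiveReadOnlyEnabled    RecursiveReadOnlyMode = "Enabled"
)

// ErrRROUnsupported mirrors the error added to the CRI errors package.
var ErrRROUnsupported = errors.New("RROUnsupported")

// criRecursiveReadOnly maps the Core API mode to the CRI boolean.
// handlerSupportsRRO stands for the value reported via RuntimeHandlerFeatures.
func criRecursiveReadOnly(readOnly bool, mode *RecursiveReadOnlyMode, handlerSupportsRRO bool) (bool, error) {
	if !readOnly || mode == nil {
		return false, nil // nil is treated as false
	}
	switch *mode {
	case RecursiveReadOnlyDisabled:
		return false, nil
	case RecursiveReadOnlyIfPossible:
		return handlerSupportsRRO, nil // silently falls back when unsupported
	case RecursiveReadOnlyEnabled:
		if !handlerSupportsRRO {
			return false, ErrRROUnsupported
		}
		return true, nil
	default:
		return false, fmt.Errorf("unknown recursiveReadOnly value %q", *mode)
	}
}

// statusMode derives the value reported in VolumeMountStatus:
// nil for non-read-only mounts, otherwise Enabled or Disabled.
func statusMode(readOnly, criRRO bool) *RecursiveReadOnlyMode {
	if !readOnly {
		return nil
	}
	m := RecursiveReadOnlyDisabled
	if criRRO {
		m = RecursiveReadOnlyEnabled
	}
	return &m
}

func main() {
	mode := RecursiveReadOnlyIfPossible
	for _, supported := range []bool{true, false} {
		rro, err := criRecursiveReadOnly(true, &mode, supported)
		fmt.Printf("supported=%v cri=%v status=%v err=%v\n", supported, rro, *statusMode(true, rro), err)
	}
}
```

Keeping “IfPossible” out of the CRI API means the CRI runtime never has to make a policy decision; it either applies the recursive flag or fails.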
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
The existing tests will continue to pass. New tests have to be added to cover the proposed feature.
Unit tests
- kubelet unit tests: take a CRI status and populate the RecursiveReadOnly field in the VolumeMountStatus struct. Implemented in https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/kubelet/kubelet_pods_test.go#L6080-L6201 . The unit test set covers 16 conditions as of Kubernetes v1.30.0. Coverage: k8s.io/kubernetes/pkg/kubelet: 2025-02-11 - 70.7%
- CRI test: similar to the e2e tests below but without using the Kubernetes Core API. Implemented in https://github.com/kubernetes-sigs/cri-tools/blob/v1.30.0/pkg/validate/container_linux.go#L311-L413 .
Integration tests
See e2e tests below.
e2e tests
- run a pod in each RecursiveReadOnly mode and verify that the status comes back correctly
- run RecursiveReadOnly=“Enabled” on a runtime that does not support it and verify that an error is reported
- run RecursiveReadOnly=“Enabled”, and verify that the mount is actually recursively read-only
- run RecursiveReadOnly=“Disabled”, and verify that the mount is actually not recursively read-only
The e2e_node tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.30.0/test/e2e_node/mount_rro_linux_test.go .
Test grid:
E2eNode Suite.[It] [sig-node] Mount recursive read-only [LinuxOnly] [Feature:RecursiveReadOnlyMounts] Mount recursive read-only when the runtime does not support recursive read-only mounts should accept non-recursive read-only mounts
E2eNode Suite.[It] [sig-node] Mount recursive read-only [LinuxOnly] [Feature:RecursiveReadOnlyMounts] Mount recursive read-only when the runtime does not support recursive read-only mounts should reject recursive read-only mounts
E2eNode Suite.[It] [sig-node] Mount recursive read-only [LinuxOnly] [Feature:RecursiveReadOnlyMounts] Mount recursive read-only when the runtime supports recursive read-only mounts should accept recursive read-only mounts
E2eNode Suite.[It] [sig-node] Mount recursive read-only [LinuxOnly] [Feature:RecursiveReadOnlyMounts] Mount recursive read-only when the runtime supports recursive read-only mounts should reject invalid recursive read-only mounts
k8s-triage:
0 clusters of 0 failures out of 127983 builds from 2025/1/28 9:00:38 to 2025/2/11 12:45:18.
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- Unit tests and CRI tests will pass
Beta
- e2e tests pass with containerd, CRI-O, and cri-dockerd
- https://github.com/containerd/containerd/pull/9787
- https://github.com/cri-o/cri-o/pull/7962
- https://github.com/Mirantis/cri-dockerd/pull/370
GA
- Two beta releases of Kubernetes at least
- containerd (v2.0) and CRI-O (v1.30) support the feature with their GA releases.
The feature has been implemented in the master branch of cri-dockerd too.
Upgrade / Downgrade Strategy
Upgrade: No action is needed. Existing readonly mounts will remain non-recursively readonly.
Downgrade:
- On downgrading kube-apiserver, the []volumeMounts.recursiveReadOnly property will be lost and will not be propagated to kubelet. If the mode was set to non-Disabled, this will result in producing writable mounts. It is the user’s responsibility to use the correct version of kube-apiserver when they need non-Disabled mode.
- On downgrading kubelet, the []volumeMounts.recursiveReadOnly properties will be lost, and the []containerStatuses.[]volumeMount.recursiveReadOnly status will not be updated. It is the user’s responsibility to use the correct version of kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly.
- On downgrading the CRI or OCI runtime, if the RecursiveReadOnly mode is set to Enabled, kubelet will raise an error. IfPossible will just be treated as Disabled.
Version Skew Strategy
- It is the user’s responsibility to use the correct version of kube-apiserver when they need non-Disabled mode. Otherwise the mode will not be propagated to kubelet.
- It is the user’s responsibility to use the correct version of kube-apiserver and kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly. Otherwise the property may have an inconsistent value.
- CRI and OCI runtimes have to be updated before kubelet; otherwise kubelet will not be aware of whether they support the feature, and it will assume that they do not.
- If only some nodes support the feature, Disabled and IfPossible will continue to work on all the nodes, but Enabled will fail on a node that does not support the feature. kube-scheduler does not take this into account, so it is the user’s responsibility to set nodeSelector, nodeAffinity, etc. to avoid scheduling a pod with Enabled to a node that does not support the feature.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: RecursiveReadOnlyMounts
  - Components depending on the feature gate: kube-apiserver, kubelet
Does enabling the feature change any default behavior?
No
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, by unsetting RecursiveReadOnly=Enabled.
Components can be downgraded too, but it should be noted that VolumeMountStatus
may still see an inconsistent state when kubelet was downgraded.
The pod manifest has to be recreated to get a consistent state in this case.
What happens if we reenable the feature if it was previously rolled back?
It works just the same as a fresh rollout, as long as the user has recreated the pod manifests. (See the “Can the feature be disabled once …” section above.)
Are there any tests for feature enablement/disablement?
Unit tests will run with and without the feature gate.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A rollout may fail when at least one of the following components is too old:
| Component | recursiveReadOnly value that will cause an error |
|---|---|
| kube-apiserver | any value |
| kubelet | any value |
| CRI runtime | Enabled |
| OCI runtime | Enabled |
| kernel | Enabled |
For example, an error will be returned like this if kube-apiserver is too old:
$ kubectl apply -f rro.yaml
Error from server (BadRequest): error when creating "rro.yaml": Pod in version "v1" cannot be handled as a Pod:
strict decoding error: unknown field "spec.containers[0].volumeMounts[0].recursiveReadOnly"
No impact on already running workloads.
What specific metrics should inform a rollback?
Look for an event indicating that RRO is not supported by the runtime.
$ kubectl get events -o json -w
...
{
...
"kind": "Event",
"message": "Error: RRONotSupported",
...
}
...
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
During the beta phase, the following test will be manually performed:
- Enable the RecursiveReadOnlyMounts feature gate for kube-apiserver and kubelet.
- Create a pod with recursiveReadOnly specified.
- Disable the RecursiveReadOnlyMounts feature gate for kube-apiserver, and confirm that the pod gets rejected.
- Enable the RecursiveReadOnlyMounts feature gate again, and confirm that the pod gets scheduled again.
- Do the same for kubelet too.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The feature is in use if the following jq command prints a non-zero number:
kubectl get pods -A -o json | jq '[.items[].spec.containers[].volumeMounts[]? | select(.recursiveReadOnly)] | length'
How can someone using this feature know that it is working for their instance?
- API .status
  - Condition name: volumeMountStatus.recursiveReadOnly
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- recursiveReadOnly=Enabled: 100% of pods that were scheduled into a node must run with recursive read-only mounts, or, 100% of them must fail to run.
- recursiveReadOnly=IfPossible: 100% of pods that were scheduled into a node must run with or without recursive read-only mounts.
- recursiveReadOnly=Disabled, or unset: 100% of pods that were scheduled into a node must run without recursive read-only mounts.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: Event
  - [Optional] Aggregation method: kubectl get events -o json -w
  - Components exposing the metric: kubelet -> kube-apiserver
If recursiveReadOnly is set to Enabled but it is not supported, kubelet will raise an event like this:
$ kubectl get events -o json -w
...
{
...
"kind": "Event",
"message": "Error: RRONotSupported",
...
}
...
If the OCI runtime claims that it supports recursive read only mounts but it actually fails to mount them, the pod will enter CrashLoopBackoff. The error from the OCI runtime can be inspected by running:
kubectl get pod -o json foo | jq '.status.containerStatuses[0].lastState.terminated.message'
Are there any missing metrics that would be useful to have to improve observability of this feature?
Potentially, kube-scheduler could be implemented to avoid scheduling a pod with recursiveReadOnly: Enabled
to a node running an old kernel.
In this way, the Event metric described above would not happen, and users would instead see Pending pods
as an error metric.
However, this is not planned to be implemented in kube-scheduler, as it seems to be over-engineering.
Users may use nodeSelector, nodeAffinity, etc. to work around this.
Dependencies
Does this feature depend on any specific services running in the cluster?
Specific version of CRI, OCI, and Linux kernel
Scalability
A pod with recursiveReadOnly: Enabled may be rejected by kubelet with a probability of $$B/A$$,
where $$A$$ is the number of all the nodes that may potentially accept the pod,
and $$B$$ is the number of those nodes that do not support RRO.
This may affect scalability.
To evaluate this risk, users may run
kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]'
to see how many nodes support RecursiveReadOnlyMounts: true.
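
As a worked illustration of the $$B/A$$ estimate above, the following sketch computes the rejection probability from a list of candidate nodes. The `node` type and the sample data are hypothetical; in practice the support flag would come from the node status reported by kubelet.

```go
package main

import "fmt"

// node is a hypothetical, simplified view of a schedulable node.
type node struct {
	name        string
	supportsRRO bool
}

// rejectionProbability returns B/A, where A is the number of candidate
// nodes and B is the number of candidates that do not support RRO.
// It returns 0 when there are no candidates.
func rejectionProbability(candidates []node) float64 {
	if len(candidates) == 0 {
		return 0
	}
	unsupported := 0
	for _, n := range candidates {
		if !n.supportsRRO {
			unsupported++
		}
	}
	return float64(unsupported) / float64(len(candidates))
}

func main() {
	// Made-up sample cluster: one of four nodes lacks RRO support.
	nodes := []node{
		{"node-a", true},
		{"node-b", true},
		{"node-c", false},
		{"node-d", true},
	}
	fmt.Printf("B/A = %.2f\n", rejectionProbability(nodes))
}
```

The estimate assumes the scheduler picks uniformly among candidate nodes, which is a simplification.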
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
About a dozen bytes
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
A pod cannot be created, just as with any other pod.
What are other known failure modes?
None
What steps should be taken if SLOs are not being met to determine the problem?
- Make sure that the node is running Linux kernel v5.12 or later.
- Make sure that runc features | jq .mountOptions contains “rro”. Otherwise update runc.
- Make sure that crictl info (with the latest crictl) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in the CRI runtime’s log.
- Make sure that kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]' (with the latest kubectl and control planes) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in kubelet’s log.
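
The runc check in the steps above can also be automated. The sketch below parses JSON in the shape printed by `runc features` and looks for “rro” in its mountOptions array; `hasRRO` is a hypothetical helper, and the sample input is an abbreviated, made-up excerpt rather than real runc output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hasRRO reports whether the given `runc features` JSON document lists
// "rro" among its mountOptions. Hypothetical helper for illustration.
func hasRRO(featuresJSON []byte) (bool, error) {
	var f struct {
		MountOptions []string `json:"mountOptions"`
	}
	if err := json.Unmarshal(featuresJSON, &f); err != nil {
		return false, fmt.Errorf("parsing runc features output: %w", err)
	}
	for _, opt := range f.MountOptions {
		if opt == "rro" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Abbreviated sample input; a real document contains many more fields.
	sample := []byte(`{"mountOptions": ["bind", "ro", "rro"]}`)
	ok, err := hasRRO(sample)
	fmt.Println(ok, err)
}
```

In real use, the input would be the captured standard output of `runc features`, and the result would be combined with the kernel version check.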
Implementation History
- v1.30: alpha
- v1.31: beta
- v1.33: GA
Drawbacks
See “Alternatives” below.
Alternatives
Plan B is to keep the Kubernetes Core API and the CRI API completely unmodified, and just let the CRI runtime treat “readonly” as “recursive readonly”.
This would be much easier to implement and adopt; however, a small portion of users may find this to be a breaking change.
In fact, containerd once adopted Plan B (https://github.com/containerd/containerd/pull/9713) in its main branch (not in any GA release), but it is now being reverted in favor of this KEP (https://github.com/containerd/containerd/pull/9747).
Infrastructure Needed (Optional)
runc >= 1.1 && kernel >= 5.12