KEP-6030: Dynamic Resize of Memory-Backed Volumes
KEP-6030: Dynamic Resize of Memory-Backed Volumes
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- API Changes
- Component Interaction & Design Flow
- 1. User/Control Plane
- 2. Kubelet Pod Workers & Allocation
- 3. Coordination & Actuation (KuberuntimeManager)
- 4. Observation & Feedback Loop
- Resource Coordination: Volume and Cgroup Ordering of Updates
- Shrinkage Safety
- Interaction with Ephemeral Storage: Resource Accounting and Eviction
- Interaction with Secrets, Projected Volumes, and DownwardAPI
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as
implementable - (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes a mechanism to dynamically resize emptyDir volumes with medium: Memory (tmpfs) without requiring Pod recreation or Pod restart, and while maintaining the files already written to the volume’s filesystem. It leverages the InPlacePodVerticalScaling (IPPVS) infrastructure to manage the transition from the desired state to the actual state via an allocation and actuation loop.
Motivation
Currently, increasing memory-backed storage (tmpfs emptyDir volumes) requires Pod recreation, which is disruptive for high-performance caches or in-memory databases that are otherwise capable of resizing their memory footprints dynamically.
This enhancement builds directly on the foundation laid by In-Place Pod Vertical Scaling (IPPVS) (KEP-1287). Just as IPPVS allows scaling container CPU and memory resources without Pod disruption, high-performance workloads frequently require scaling their scratch space or in-memory caches simultaneously. Since memory-backed emptyDir volumes consume the node’s memory (and can be bounded by the Pod’s memory limits), integrating their resize lifecycle with the IPPVS allocation and actuation phases ensures consistent resource accounting and prevents node exhaustion.
Goals
- Allow updating
emptyDir.sizeLimitfor existing Pods whenmediumisMemory. - Reflect resize progress in
Pod.Statususing an allocation/actuation model similar to IPPVS. - Actuate the resize via
tmpfsremount without Pod disruption and while maintaining the files already written to the volume’s filesystem. - Support dynamic resizing of memory-backed volumes on cgroup v2 nodes only (where memory charging and charge migration for
tmpfsare properly supported by the kernel).
Non-Goals
- Resizing non-memory-backed
emptyDirvolumes (disk-backed). - Resizing other volume types (such as PVCs, CSI ephemeral volumes etc.) through this mechanism.
- Supporting this feature on nodes running with cgroup v1 (due to limitations in memory accounting and tracking for shared
tmpfsvolumes).
Proposal
We propose extending the InPlacePodVerticalScaling pattern to memory-backed emptyDir volumes.
User Stories
Story 1
A user runs an in-memory database in a Pod and needs to increase the available cache size. The user updates the emptyDir.sizeLimit in the Pod spec. The Kubelet remounts the tmpfs volume with the new size limit without restarting the container, allowing the database to expand its cache immediately.
Story 2
An autoscaling controller dynamically scales the memory limits of workload worker pods (e.g., for machine learning or data processing) to handle varying demands. These workers utilize memory-backed emptyDir volumes for shared memory and caching, typically sizing the volume to a fixed proportion of the container’s total memory limit.
While the controller can use InPlacePodVerticalScaling to scale up a container’s memory limit without restarting the pod, the inability to dynamically resize the associated memory-backed volume limits the worker’s ability to utilize the expanded capacity for shared memory tasks without a disruptive restart. This feature allows the controller to update both the container memory limits and the emptyDir.sizeLimit simultaneously, providing seamless vertical scaling for distributed workloads.
Risks and Mitigations
OOM-kills
The user is allowed to set the volume sizeLimit higher than the pod or container memory limits. Attempting to add such a restriction is considered out of scope for this KEP (see Restricting sizeLimit to not exceed memory limits
). This means that the user is responsible for ensuring that the volume sizeLimit is set to be within the pod or container memory limits. Failure to do so may result in OOM-kills as applications within the pod fill up the volume.
The mitigation for this risk is that the API server can emit a warning when it sees a resize request that results in the volume sizeLimit being set to higher than the pod-level memory limits.
Design Details
API Changes
New Status Fields and State Mapping
To support dynamic resizing, the Pod.Spec.Volumes[].EmptyDir.SizeLimit field is made mutable when modified via the /resize subresource for existing Pods (when medium is Memory). This represents the Desired State.
We introduce a new sub-structure under corev1.VolumeStatus (in Pod.Status) to track the actual resize state of the volume:
- Desired State: Represented by the mutable
Pod.Spec.Volumes[].EmptyDir.SizeLimit. - Allocated State: Checkpointed and maintained by the Kubelet’s
AllocationManagerfor local resource accounting, but not exposed inPod.Status.- Definition: Represents the volume capacity acknowledged and admitted by the Kubelet on the node.
- Initial Admission: Admitted and checkpointed locally by the Kubelet once the initial resource admission phase completes.
- Admission during Resize: Upon admitting a resize update, the
AllocationManagerupdates the Kubelet’s local checkpoint with the newly acknowledged capacity. - Scope: Maintained only for memory-backed
emptyDirvolumes.
- Actual State: Represented by the new
VolumeStatus.EmptyDir.SizeLimit.- Definition: Represents the actual mounted filesystem capacity reported by the mount/remount call.
- Pod Creation: Remains unpopulated (
nil) until Kubelet completes the initial volume mount setup and receives capacity confirmation from the mounter. - Pod Resize: During an active resize operation, this field retains the old mounted size until the Kubelet successfully actuates the
tmpfsremount, at which point it is updated to the new actual capacity. - Default/Fallback: Remains unpopulated (
nil) for all non-memory-backed volumes.
The new status field Pod.Status.ContainerStatuses[].VolumeMounts[].VolumeStatus.EmptyDir.SizeLimit is only populated for memory-backed volumes, and remains unpopulated for other types of volumes. The VolumeStatus Go struct is updated as follows:
type VolumeStatus struct {
...
// emptyDir represents the status of an emptyDir volume.
// +optional
EmptyDir *EmptyDirVolumeStatus `json:"emptyDir,omitempty" protobuf:"bytes,2,opt,name=emptyDir"`
}
type EmptyDirVolumeStatus struct {
// sizeLimit represents the actual mounted capacity of the emptyDir volume.
// This is only populated for memory-backed emptyDir.
// +optional
SizeLimit *resource.Quantity `json:"sizeLimit,omitempty" protobuf:"bytes,1,opt,name=sizeLimit"`
}
API Validation and Restrictions
API validation logic ensures that Pod.Spec.Volumes[].EmptyDir.SizeLimit is only mutable when:
- The volume is memory-backed (meaning the
mediumfield is set toMemory). - Updates occur through the
resizesubresource.
Only modifications to an existing Pod.Spec.Volumes[].EmptyDir.SizeLimit are permitted.
- The addition or removal of a memory-backed volume in a pod spec is forbidden.
- For Alpha, adding or removing a
sizeLimitto an existing memory-backed volume is not allowed. This boundary is re-evaluated for Beta.
Resize Restart Policy
There are existing resize restart policies that control when a container is restarted during a resize operation, which are described in more detail in the InPlacePodVerticalScaling KEP .
We do not introduce an analogous policy nor honor the existing restart policies when resizing memory-backed emptyDir volumes. The resizing of memory-backed emptyDir volumes is fundamentally non-disruptive to running processes and does not require application-level awareness beyond standard handling of filesystem capacity. If a volume resize is accompanied by a container or pod resource resize, the container’s existing RestartPolicy governs container restarts.
Atomic Resize Principle
The atomic resize principle is preserved. The Kubelet treats updates to volume sizeLimit and container or pod-level resources (both requests and limits) as a single atomic transaction. This applies to requests that modify multiple volumes, multiple containers, pod-level resources, or any combination thereof.
The AllocationManager performs an All-or-Nothing Admission check, rejecting the entire update if the resize feasibility checks do not pass. The volume sizeLimit is also checkpointed atomically along with container and pod resources.
- The volume
sizeLimitis not taken into consideration during the resource-fit checks, but will fail the admission check atomically if the container/pod resize fails to be admitted. - On cgroups v1, volume resize is not supported, so any resize request containing a
sizeLimitupdate will be atomically rejected asInfeasibleduring the admission phase.
The Pod’s ResizeStatus (tracked via PodResizeInProgress condition) only transitions to completed once all container and volume states reach their respective desired values. If any component of the update fails, the status remains in a PodResizeInProgress state (with an error) to reflect the incomplete transition. This ensures resource integrity by preventing “partial resizes” that could leave a volume expanded within a container that fails to scale its memory budget.
See Non-atomic handling of volume and resource resizes for the alternative that was considered and rejected.
Component Interaction & Design Flow
The following diagram illustrates the proposed end-to-end flow.
graph TD
User["User / Controller"]
API["API Server"]
Kubelet["Kubelet Pod Worker"]
Alloc["AllocationManager"]
KRT["KuberuntimeManager"]
VM["Volume Manager"]
CRI["CRI"]
Kernel["Linux Kernel tmpfs"]
User -->|1. Updates Spec| API
API -->|2. Syncs Pod| Kubelet
Kubelet -->|3. Requests Allocation| Alloc
Alloc -->|4. Checkpoints State Locally| Alloc
Kubelet -->|"5. Updates Status (Container Resources)"| API
KRT -->|6. Reads Checkpointed Allocation| Alloc
KRT -->|7. Evaluates Resize Direction| KRT
KRT -->|"8a. Remounts Volume (Downsize)"| VM
KRT -->|8b. Updates Cgroups| CRI
KRT -->|"8c. Remounts Volume (Upsize)"| VM
VM -->|9. Executes Remount System Call| Kernel
KRT -->|"10. Updates Status (Volume Mount Actual State)"| APIThe following steps elaborate on this flow:
1. User/Control Plane
- The user or an external controller updates
Pod.Spec.Volumes[].EmptyDir.SizeLimit. - The API server preserves volume updates during the
dropNonResizeUpdatesphase.
2. Kubelet Pod Workers & Allocation
- The Kubelet detects the specification change in
HandlePodUpdates. - The
AllocationManagervalidates the new resize request. The changingsizeLimitdoes not affect the feasibility of the resize on cgroup v2, as pod admission takes only pod and container requests into consideration. If the node is running with cgroup v1, theAllocationManagerwill reject the resize request asInfeasible. - If allocation succeeds, the
AllocationManagerupdates the Kubelet’s internal allocated state. TheAllocatedSizeLimitis checkpointed locally (along with container and pod requests/limits) to ensure that memory-backed volume size limits survive Kubelet restarts.
3. Coordination & Actuation (KuberuntimeManager)
- The
KuberuntimeManagerreads theAllocatedSizeLimitfrom the local checkpointed state maintained byAllocationManager. - It acts as the central orchestrator for the actuation phase to enforce strict ordering. See Resource Coordination: Volume and Cgroup Ordering of Updates for the detailed rationale behind this ordering.
- The
KuberuntimeManagerchecks the direction of the volume resize (upsize vs. downsize) by comparing the newly allocated size with the checkpointed Actuated State (the target size that the Kubelet last attempted to apply, which is maintained in Kuberuntime’s internal checkpoint). Comparing against the actuated state rather than the live kernel state avoids unnecessary remount requests caused by kernel rounding differences. - The
KuberuntimeManagerdirectly calls a new interface on the Volume Manager (see Volume Manager interface ) to perform the remount at the correct time, bypassing the async reconciler loop. The Volume Manager executes the remount:mount -o remount,size=<limit> -t tmpfs tmpfs <path>. Once this actuation completes, theKuberuntimeManagercheckpoints this sizeLimit in its own Actuated State. This Actuated State represents what the Kubelet tries to actuate, which may diverge slightly from what is read directly from the kernel (Actual State). - Note: Bypassing the async reconciler loop is both safe and inconsequential. The
emptyDirreconciler today never remounts becauseRequiresRemountis hardcoded to always return false; this means that theKuberuntimeManageris the only component that makes this change. We considered alternatives such as relying on theemptyDirreconciler to detect changes asynchronously or movingemptyDirhandling entirely toKuberuntimeManager, but they are ruled out. See Alternatives for more details.
4. Observation & Feedback Loop
- Upon successful remount and cgroup update, the
KuberuntimeManagercheckpoints the new target values as the current ‘actuated resources’ (the baseline for future resizes). - The Kubelet updates
Pod.Status.ContainerStatuses[].VolumeMounts[].VolumeStatus.EmptyDir.SizeLimitto reflect the successful resize. - Once the actual size matches the allocated size checkpoint (and container resource updates are complete), the Kubelet clears the
PodResizeInProgresscondition.
Resource Coordination: Volume and Cgroup Ordering of Updates
Background: The Enforcement Divergence
Memory-backed volumes consume the same physical RAM as the container’s processes. While all usage is enforced by the container and pod cgroups, it is governed by two different enforcement thresholds:
- tmpfs
sizeoption: An internal quota for the volume. When a write exceeds the volume’ssizeLimit, the kernel returns a disk full (ENOSPC) error. This is a graceful I/O exception. - cgroup
memory.max: The external limit for the entire container. Sincetmpfsusage counts toward the cgroup total, hitting this limit triggers an OOM (Out of Memory) Kill, resulting in container termination.
Memory Attribution & cgroup Limits
Because memory-backed volumes are backed by tmpfs, their memory usage is charged directly to the cgroup of the process that writes the data, rather than to the volume itself.
- Container-level charging: If Container A writes to the shared volume, the memory is initially charged against Container A’s cgroup memory limit. If Container B writes to the shared volume, it is charged against Container B’s cgroup limit.
- cgroup v2 Page Charge Migration: In clusters running cgroup v2, if Container A writes files to a shared
tmpfsvolume and Container B later accesses or page-faults those same files, the kernel’s active memory controller can dynamically migrate or split the page cache attribution between Container A’s and Container B’s cgroups under memory pressure. This behavior is a key enhancement of cgroup v2 but makes individual container-level memory metrics unpredictable, as a container’s memory usage can rise or fall based on access to shared cached files.
- cgroup v2 Page Charge Migration: In clusters running cgroup v2, if Container A writes files to a shared
- Pod-level charging: Since container cgroups are nested under the parent Pod cgroup, all files written to the memory-backed volume always count against the Pod-level cgroup memory limit, regardless of any dynamic charge migration between containers.
This dynamic page migration has implications for how we validate the safety of resource down-sizing. See Shrinkage Safety for details on how this is taken into account.
Sharing Volumes Between Multiple Containers
When a single memory-backed emptyDir volume is mounted and shared by multiple containers under the same Pod:
- Volume Quota (
ENOSPC): The volume-levelsizeLimitacts as a hard boundary on the aggregate space consumed. If the sum of bytes written across all sharing containers exceeds this limit, standard disk-full (ENOSPC) errors are returned to subsequent writes. The containers are not OOM-killed at this boundary. - cgroup Enforcement (OOM-Kill): If a container’s write to the shared volume pushes its total memory usage (processes + files written) past its cgroup memory limit, the container’s processes are OOM-killed by the kernel—even if the shared volume usage is well below its
sizeLimit. - Persistence of Orphaned Space: If a container is terminated or OOM-killed, the files it wrote remain in host memory (tmpfs survives container lifecycles). This memory continues to consume headroom in the parent Pod cgroup.
Init Container Lifecycles & Handoff
Init containers are commonly used to pre-populate a memory-backed volume (e.g., pre-loading templates, caches, or datasets) before primary application containers start:
- Init Execution: During execution, files written by the Init container count against the Init container’s individual limit.
- Termination: When the Init container exits and its cgroup is cleaned up, the files remain in the memory volume. The memory usage of these files continues to count against the overall Pod cgroup memory limit.
- App Container Handoff: When the primary application container starts, it can read the pre-populated data. Under standard memory accounting rules, simply reading pre-existing files does not charge that memory to the app container’s individual limit. The app container can use the cache without it counting toward its own limit (it only consumes Pod-level cgroup headroom).
Priority of Enforcement
The volume sizeLimit acts as a safety “inner boundary” to ensure volume growth does not starve the container’s processes of execution RAM. Ideally, sizeLimit is lower than the container’s memory limit.
Kubernetes does not strictly enforce this “lower-than” relationship, allowing users to set a sizeLimit exceeding the pod’s memory limits or shrink pod memory limits below the volume sizeLimit. In these cases:
- The Kubelet does not block the resize but issues a Warning Event.
- If
sizeLimit> cgroup memory limits, the volume’sENOSPCprotection is effectively disabled, as the container is OOM-killed before the volume quota is reached. This is consistent with what pod creation permits today (see Restricting sizeLimit to not exceed memory limits ).
Order of Actuation
To prevent race-condition OOMs during a resize, the order of operations is driven primarily by the direction of the Volume sizeLimit change to ensure the container maintains the maximum possible “slack” during transition.
- When Increasing Volume
sizeLimit:- Adjust Cgroup First: Adjust
memory.maxto ensure the container envelope is prepared for potential volume expansion. - Remount Volume Second: Increase the
tmpfsquota.
- Adjust Cgroup First: Adjust
(Rationale: Ensures the physical budget is expanded before the internal quota allows more usage.)
- When Decreasing Volume
sizeLimit:- Remount Volume First: Shrink the
tmpfsquota to reclaim space. - Adjust Cgroup Second: Adjust
memory.max.
- Remount Volume First: Shrink the
(Rationale: Ensures the volume’s claim on memory is reduced before the cgroup envelope is tightened around it.)
In the case of multiple volumes, all volumes with a decreasing sizeLimit are downsized first, followed by the cgroup memory limit update, followed by all volumes with an increasing sizeLimit upsized.
Asynchronous Updates or Opposite Directions
It is possible that the resize requests for the container memory limits and the volume sizeLimit occur in separate updates or in opposite directions. Both scenarios are permitted. In these cases, the actuation order is less critical because the opposing nature of the updates means the sequence does not materially increase or decrease the risk of OOM kills. For implementation simplicity and predictability, actuation follows the standard order described above.
The user remains responsible for ensuring the container memory limit provides sufficient headroom for both volume data and process execution to prevent OOM kills.
Actuation Priority Matrix
The following table illustrates the finalized actuation priority matrix that the Kubelet follows, based on the direction of the volume’s sizeLimit change:
| Scenario | Volume sizeLimit | Container memory.max | 1st Action | 2nd Action | Rationale |
|---|---|---|---|---|---|
| Expansion | Increasing ⬆️ | Increasing ⬆️ | Cgroup | Volume | Expand envelope before expanding the internal quota. |
| Contraction | Decreasing ⬇️ | Decreasing ⬇️ | Volume | Cgroup | Shrink the internal quota before tightening the envelope. |
| Volume Only | Increasing / Decreasing | Unchanged | Volume | N/A | No cgroup update necessary. |
| Opposite (Vol Up) | Increasing ⬆️ | Decreasing ⬇️ | Cgroup | Volume | If this configuration results in an OOM kill, it does so regardless of ordering; follows “Vol Up” sequence for consistency. |
| Opposite (Vol Down) | Decreasing ⬇️ | Increasing ⬆️ | Volume | Cgroup | If this configuration results in an OOM kill it does so regardless of ordering; follows “Vol Down” sequence for consistency. |
| Cgroup Only | Unchanged | Increasing / Decreasing | Cgroup | N/A | Standard container resize; no volume remount required. |
Role of the Allocation Manager and Kuberuntime Manager
The AllocationManager’s role is focused on maintaining the source of truth for allocated resources (checkpointing) and ensuring atomic admission of resource updates. The volume manager should only ever see the allocated size limits, rather than the desired size limits.
The KuberuntimeManager serves as the coordination layer that understands the nested relationship between volume size and cgroup limits and enforces the strict ordering of updates.
Volume Manager interface
To support direct resizing from the KuberuntimeManager while maintaining separation of concerns, this proposal extends the Volume Manager and Volume Plugin interfaces:
- VolumeManager Extension: The
VolumeManagerinterface is extended withResizeEphemeralVolume(pod *v1.Pod, volumeName string, newSize *resource.Quantity) errorandGetVolumeSize(pod *v1.Pod, volumeName string) (*resource.Quantity, error). This allows theKuberuntimeManagerto query the current size and trigger volume operations synchronously at the correct time in its sync loop. - ResizableEphemeralVolumePlugin Interface: A new optional interface
ResizableEphemeralVolumePluginis introduced inpkg/volume/plugins.go. Volume plugins that support online, synchronous resizing (in this case, theemptyDirplugin for memory-backed volumes) can implement this interface. TheVolumeManagerchecks if the plugin for the volume implements this interface and delegates the call to it.
This design allows KuberuntimeManager to act as the central orchestrator for resource updates (ensuring strict ordering with cgroups) while leaving the actual filesystem manipulation logic encapsulated within the volume plugins.
We considered alternatives such as relying on the Volume Reconciler to detect changes asynchronously or moving emptyDir handling entirely to KuberuntimeManager, but they are ruled out. See Alternatives
for more details.
The Role of the Container Runtime
On startup, the container runtime helps to bind the Kubelet’s host path into the container’s mount namespace, but subsequent remount requests do not require intervention from the container runtime, nor does the container runtime need to be notified of changing volume sizes.
Shrinkage Safety
In the existing InPlacePodVerticalScaling feature, a best-effort safety logic ensures that container memory limits are not decreased below usage.
Alpha
For Alpha, shrinkage safety checks for memory-backed volumes are out of scope. If a user attempts to shrink a volume below its current usage, the tmpfs remount fails at the kernel level with an EINVAL error. This error is propagated back to the Kubelet and surfaced in the PodResizeInProgress condition as mount point not mounted or bad option. The Kubelet will periodically retry the resize operation until it succeeds, is cancelled, or the Pod terminates.
Beta
For Beta, we will leverage and align with the Kubelet’s existing resource validation framework to prevent OOMs:
- Existing Cgroup Validation: The Kubelet already contains validation logic (in
validateMemoryResizeAction) to ensure that proposed cgroup limit reductions do not fall below active memory usage. We will leverage this check:- Container-level Validation: Ensures the proposed container memory limit is not decreased below its current container cgroup usage. Note that this check is insufficient if dynamic kernel-level page charge migration has temporarily shifted memory charges to another container (see Memory Attribution & cgroup Limits ).
- Pod-level Validation: For pods with enforced Pod-level cgroup limits (e.g., pods with pod-level limits specified, Guaranteed pods, or Burstable pods where all containers specify limits), the Kubelet validates the proposed aggregate Pod limit against the Pod-level cgroup usage. This acts as a reliable backstop since it compares against the aggregate memory footprint regardless of charge migration.
- Volume-level Safety Options: For shrinking the volume
sizeLimititself, we will consider one of two options (to be finalized prior to Beta):- Error Reporting Enhancement: Assume the generic kernel
EINVALerror implies that the volume size limit shrink failed because the volume usage exceeds the new desired limit, and improve the error message accordingly. - Active Pre-Resize Validation: Actually add an additional validation check in
KuberuntimeManagerthat compares the actual bytes stored in the volume against the new desiredsizeLimitbefore attempting the remount. If this check fails, the Kubelet will block the resize. This check is subject to a TOCTOU (time-of-check to time-of-use) race condition.
- Error Reporting Enhancement: Assume the generic kernel
The Kubelet will periodically retry the resize operation until it succeeds, is cancelled, or the Pod terminates.
emptyDir Size Limit Monitoring and Eviction
The Kubelet runs a local storage capacity monitoring system (part of the eviction manager) to ensure that the space usage of emptyDir volumes does not exceed their volume size limits, and triggers evictions when necessary. The eviction manager currently checks the sizeLimit defined in the Pod spec and compares it to the volume usage of the Pod.
For memory-backed volumes, the eviction manager is updated to instead inspect the actual sizeLimit by reading the actual capacity from the pod volume stats. This ensures that the eviction manager does not trigger evictions due to changes in the desired sizeLimit, but only when the actual sizeLimit is exceeded.
Interaction with Ephemeral Storage: Resource Accounting and Eviction
Memory-backed emptyDir volumes are functionally distinct from disk-backed ephemeral storage. This KEP maintains the existing isolation between these resource pools to ensure that dynamic resizing does not interfere with node-level storage management.
While this KEP only deals with memory-backed volumes, we call out the following behavior to avoid confusion:
- Resource Accounting: Usage of
medium: Memoryvolumes is tracked via the cgroups, rather than theephemeral-storage(disk) resource. Resizing thesizeLimithas no effect on theephemeral-storagecapacity or quotas assigned to the node. - Exclusion from Disk Monitoring: The Kubelet’s eviction manager monitors disk-backed ephemeral storage usage using periodic filesystem scans. Because memory-backed volumes are mounted as
tmpfs, they are naturally excluded from these root-partition disk-usage calculations. - Eviction Logic: Scaling a memory-backed volume up or down never triggers
DiskPressureevictions. Pressure resulting from these volumes is handled via the OOM killer (at the container level),emptyDir limit evictionat the pod level, orMemoryPressureeviction thresholds (at the node level).
Interaction with Secrets, Projected Volumes, and DownwardAPI
Projected volumes, secrets, and downwardAPI are typically implemented as tmpfs mounts to ensure data is never persisted to physical storage. These share the same underlying kernel infrastructure as memory-backed emptyDir volumes.
While projected volumes are out of scope for this KEP, we call out their behavior here to avoid confusion:
- Implicit Memory Consumption: Memory used by projected volumes is accounted for in the container or Pod-level memory cgroup usage. The
AllocationManagerdoes not explicitly track or reserve space for these volumes, as they do not have user-defined resource requests orsizeLimits. - Resize Validation Boundaries: When validating a resize request, the
AllocationManagerensures the new container and pod requests are feasible based on node capacity. It does not account for the “hidden” overhead of projected volumes; users must ensure that their container memory limit provides sufficient headroom for both the expandedemptyDirand any projected volumes. - Actuation Isolation: Dynamic resizing of an
emptyDirvolume is a targetedremountoperation on a specific mount point. It does not interfere with the file descriptors or mount states of existing Secret or ConfigMap volumes. - Shared Fate in OOM Scenarios: Because all
tmpfsmounts share a memory budget, an over-provisionedemptyDirexpansion can push the cgroup usage to its limit. If a process attempts to access a large Secret or write to theemptyDirwhen the cgroup is exhausted, the kernel triggers an OOM kill based on the cgroup threshold, not the individual volume quota.
ConfigMaps
At the time of writing, ConfigMaps are implemented using disk-backed volumes rather than tmpfs. This means that they are currently independent of the memory-backed volumes implementation. There is a pending action item to move it (see the code comment here
), at which point they fall into the same category as Secrets and Projected Volumes described above.
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None.
Unit tests
Unit tests in the following packages are added or extended to cover the new logic:
pkg/kubelet/kuberuntime: TestcomputeVolumeResizeActionto ensure it correctly detects size differences, anddoPodResizeActionto verify the strict ordering of cgroup and volume updates.pkg/kubelet/volumemanager: TestResizeEphemeralVolumeandGetVolumeSizeto ensure correct delegation to supporting plugins.pkg/volume/emptydir: TestDirectResize(remount execution) andGetVolumeSize(parsing mount options).pkg/kubelet/eviction: TestemptyDirLimitEvictionto ensure it correctly reads from stats capacity instead of spec.
Integration tests
Unit and E2E tests provide sufficient coverage for Alpha. For Beta, the testing plan re-evaluates whether integration tests are required.
e2e tests
A new E2E test case is added (or existing In-Place Pod Resize tests are extended) in test/e2e/:
- Successful Upsize Test: Creates a pod with a memory volume (e.g., 100Mi), patches the size limit to 200Mi via the
resizesubresource, and verifies that the mount inside the container reflects the new size (e.g., checkingmountoutput via exec) and that thePodResizeInProgresscondition clears. - Failure on Shrink Test: Verifies that attempting to shrink a volume below its current data usage results in a remount failure, and that the error is surfaced in the pod status condition.
Graduation Criteria
Alpha
- The feature is implemented behind the
InPlacePodVerticalScalingMemoryBackedVolumesfeature gate. - Unit test coverage is complete.
- E2E tests implemented in CI verify successful upsize and downsize operations in a live cluster.
Beta
- Gather feedback from users.
- Metrics are defined and implemented.
- Kubelet emits a warning event when
sizeLimitexceeds pod-level memory limits. - The plan evaluates whether integration tests are required.
- Edge cases around “shrinkage safety” are resolved (e.g., reporting distinct errors when attempting to shrink a volume below current usage).
GA
- The feature is stable in Beta and enabled by default for at least one release.
- Critical bugs discovered during the Beta phase are resolved.
Upgrade / Downgrade Strategy
Upgrade
Upgrading is non-disruptive. Existing pods continue to run with their original volume sizes. To utilize the feature on existing pods, users send patch requests to the resize subresource.
Downgrade
If the control plane rolls back to a previous version where the feature is disabled:
- The API server stops accepting updates to
sizeLimitvia theresizesubresource. - Pods that are already resized retain their updated
sizeLimitin etcd. - If these pods are restarted on an older Kubelet, the old Kubelet still reads the updated
sizeLimitfrom the spec during the initial mount phase, starting the pod with the updated size limit.
Version Skew Strategy
Kubelet vs API Server
If the API Server supports the feature but the target Kubelet does not, the NodeDeclaredFeatures framework rejects the resize request during admission.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in
kep.yaml)- Feature gate name:
InPlacePodVerticalScalingMemoryBackedVolumes - Components depending on the feature gate:
kube-apiserver,kubelet
- Feature gate name:
Enabling or disabling this feature requires setting the InPlacePodVerticalScalingMemoryBackedVolumes feature gate flag on the respective components. This operation requires a restart of the control plane components (kube-apiserver) and the kubelet on each node. It cannot be enabled or disabled dynamically without component restarts.
Does enabling the feature change any default behavior?
No. Enabling this feature does not change any default behavior for existing workloads. It only introduces a new capability allowing the sizeLimit of memory-backed emptyDir volumes to be mutated via the resize subresource. Existing Pods and volumes continue to function as before without any impact unless users explicitly invoke the new resize capability.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. The feature can be disabled by turning off the feature gate and restarting the components.
Consequences of disablement:
- If a volume was already resized while the feature was enabled, the actual mounted
tmpfsvolume retains its new size on the node (we do not automatically shrink it back). - The API server rejects further mutations to
emptyDir.sizeLimitvia theresizesubresource. - For Pods that were already resized, the Pod spec retains the updated
sizeLimitvalue, but it becomes immutable again.
What happens if we reenable the feature if it was previously rolled back?
If the feature is re-enabled:
- The API server once again permits mutations to
emptyDir.sizeLimitvia theresizesubresource. - The Kubelet resumes its reconciliation logic. If a Pod’s actual mounted volume size differs from the target allocated capacity checkpointed locally (e.g., if a resize is pending or if the spec was updated but not actuated before rollback), the Kubelet proceeds to actuate the resize to match the desired state.
Are there any tests for feature enablement/disablement?
Yes.
- API Validation Tests: Unit tests in
pkg/apis/core/validationverify that mutatingemptyDir.sizeLimitis forbidden when the feature gate is disabled and permitted when it is enabled. - Kubelet Tests: Unit tests in the Kubelet (specifically in
KuberuntimeManagerand theevictionmanager) verify that actuation and enhanced monitoring logic are skipped or fall back to default behavior when the feature gate is disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout Failure Scenarios and Impact:
- Node-Declared Feature Gating (Skew Prevention): If a client attempts to mutate
emptyDir.sizeLimiton a Pod scheduled on a Kubelet that does not support the feature (or does not have the feature gate enabled), the API Server automatically rejects the patch via Node Declared Features. This prevents skew issues where a spec is modified but cannot be actuated by the Kubelet. - Kubelet Crash or Restart Mid-Actuation: If the Kubelet restarts or crashes after writing the local
AllocationManagercheckpoint but before executing the dynamictmpfsremount syscall, the Kubelet’s startup reconciliation loop automatically discovers the discrepancy between the checkpointed state (allocated limit) and the actual host mount parameters, and it will safely re-attempt and complete the remount. - Actuation Failure: If the mounter’s
remountsyscall fails, the volume stays at its previous size limit. The Pod’s actual volume status will remain out-of-sync with the desired spec, but the running workload continues to execute safely at the old size.
Rollback Failure Scenarios and Impact:
- Pending Resize State Mismatch (Disablement Skew): If the feature gate is disabled while a Pod has an active, pending resize operation (i.e., the Pod spec has been updated to a new
sizeLimitbut the Kubelet has not yet completed the mount/remount actuation):- Once the feature gate is disabled, the Kubelet stops processing dynamic volume updates. The actual volume size remains stuck at the old limit, creating a permanent discrepancy between the desired state in the Pod spec and the actual state on the node.
- The
PodResizeInProgresscondition will no longer reflect the discrepancy and may be cleared despite the desired and actual volume size mismatch, as the Kubelet will no longer check the volume size as part of its reconciliation loop.
- Workload Safety: Disabling the feature gate does not automatically shrink already-resized volumes back to their original limits.
What specific metrics should inform a rollback?
Operators should monitor the following signals to identify anomalies that might require a rollback:
- API Server Metrics:
- A high rate of client request errors (4xx or 5xx response codes) on the Pod
resizesubresource endpoint, tracked via theapiserver_request_total{subresource="resize", verb="PATCH"}counter.
- A high rate of client request errors (4xx or 5xx response codes) on the Pod
- Kubelet Volume Operation Metrics:
- An increase in failure rates for volume mount or filesystem resize operations, tracked via the
storage_operation_duration_seconds_count{volume_plugin="kubernetes.io/empty-dir", operation_name=~"volume_mount|volume_fs_resize", status=~"fail-.*"}metric. - A failure of actual reported capacity to transition to the newly desired capacity, tracked via the
kubelet_volume_stats_capacity_bytes{volume_plugin="kubernetes.io/empty-dir"}gauge.
- An increase in failure rates for volume mount or filesystem resize operations, tracked via the
- Node and Container Health Metrics:
- An increase in the Kubelet pod eviction rate, tracked via the
kubelet_evictionscounter (specifically monitoring for evictions driven by emptyDir limit enforcement).
- An increase in the Kubelet pod eviction rate, tracked via the
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
The following manual verification plan will validate the behavior during development and testing:
- Upgrade Path Test:
- Deploy a cluster with the
InPlacePodVerticalScalingMemoryBackedVolumesfeature gate disabled. - Create a Pod with a memory-backed
emptyDirvolume configured with a staticsizeLimit: 128Mi. - Enable the feature gate and restart the control plane and Kubelet components.
- Submit a dynamic resize request targeting
256Mi. Verify that the Kubelet successfully remounts thetmpfsfilesystem, the host mount reflects256Miviadf, and the Pod status reports256MiunderVolumeStatus.EmptyDir.SizeLimit.
- Deploy a cluster with the
- Downgrade/Rollback Path Test:
- With the feature gate active, create a Pod with a memory-backed volume of
128Miand dynamically resize it to256Mi. - Write
200Miof files to the volume to confirm active capacity utilization. - Disable the feature gate and restart the API servers and Kubelets.
- Verify that the running Pod’s memory volume is not disrupted, the files remain fully readable/writeable, and the mounted directory keeps its
256Micapacity. - Attempt a new dynamic volume resize update and verify that the API server returns a validation error rejecting the request.
- With the feature gate active, create a Pod with a memory-backed volume of
- Re-upgrade/Re-enablement Path Test:
- Re-enable the feature gate and restart the Kubelets and API servers.
- Verify that the API server resumes accepting mutations on the subresource and that the Kubelet resumes full dynamic actuation.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. This feature is strictly additive and introduces new dynamic capabilities for existing fields and volume types. No existing APIs, fields, flags, or default volume behaviors are deprecated or removed as part of this enhancement. Legacy static workloads remain fully compatible.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
Yes.
- API call type: One new
PATCH /statuscall on thePodresource per successful pod volume resize to update the actual volume status. - Estimated throughput: Extremely low. Resizing memory-backed volumes is an infrequent, on-demand event triggered manually by cluster administrators or periodically by vertical autoscalers. It is not a standard high-frequency workload operation.
- Originating component(s): Kubelet (via
PodStatusupdates). - Listing / Watching: No new listing or watching of resources is introduced.
Will enabling / using this feature result in introducing new API types?
No. No new API types are introduced. The feature purely extends existing fields and makes the existing Pod.Spec.Volumes[].EmptyDir.SizeLimit mutable for memory-backed volumes.
Will enabling / using this feature result in any new calls to the cloud provider?
No. Volume resizing is strictly local and does not require any external cloud provider storage controller interactions.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes.
- API Type:
Pod(Spec and Status). - Estimated increase in size:
Pod.Spec: None. The existingsizeLimitfield underEmptyDirVolumeSourceis simply made mutable.Pod.Status: One new*resource.Quantityfield (EmptyDir.SizeLimitinsideVolumeStatus) is added. This status field remains empty by default, until the Kubelet populates it for memory-backed volumes.- Net Overhead: These fields are only populated for memory-backed volumes (
medium: Memory), which represent a tiny fraction of volumes in a standard cluster. Pods without memory-backed volumes experience zero size increase.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
- Resizing memory-backed
emptyDirvolumes is performed synchronously in the Kubelet pod sync loop via lightweight Linux system calls (mount -o remount). - The duration of the remount operation is on the order of microseconds/milliseconds and does not block pod creation, container startup, or container termination.
- There is zero impact on scheduling latency, pod startup latency, or any other standard Kubernetes SLIs/SLOs.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
- Kubelet: The memory and CPU overhead for tracking the actual volume status is negligible. The local Kubelet checkpoint footprint increases by a negligible amount (a few bytes) to store the internal allocated state.
- kube-apiserver / etcd: The etcd storage footprint increases negligibly. No additional CPU or memory overhead is introduced.
- Node RAM/CPU: Physical memory consumption remains governed entirely by the container/pod memory cgroups and standard kernel-level
tmpfsquota enforcement.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
- Resizing acts on existing mount points by modifying their options. No new mount points, file descriptors, or socket connections are created.
- The inode limit on
tmpfsmounts remains governed by kernel defaults or specificsizeLimitattributes, ensuring no new paths for resource exhaustion are introduced.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- 2026-04-30: Initial alpha KEP submitted.
Drawbacks
No drawbacks have been identified.
Alternatives
Volume Reconciler detects changes (Async approach)
- Description: Relying on the Volume Manager’s periodic reconciler to detect changes in
emptyDir.sizeLimit(viaRequiresRemount) and trigger the remount asynchronously. - Why ruled out: It makes coordination with cgroup updates extremely difficult. Cgroup updates are handled synchronously in
kuberuntime’sSyncPodflow. Attempting to coordinate strict ordering (upsize: cgroups first; downsize: volume first) between a synchronous flow and an asynchronous polling loop introduces complex state management and potential race conditions.
Moving emptyDir handling entirely to KuberuntimeManager
- Description: Handling creation and resizing of memory-backed
emptyDirvolumes directly inKuberuntimeManager. - Why ruled out:
- Violation of Concerns:
KuberuntimeManagermanages containers and cgroups via CRI, not volumes. - Architectural Fragmentation: Special-cases
emptyDirand breaks the unified volume lifecycle model used for all other volumes. - Metrics Breakage: Bypassing Volume Manager breaks centralized metrics collection, requiring duplicated filesystem monitoring logic in
KuberuntimeManagerto feed the stats API.
- Violation of Concerns:
Restricting sizeLimit to not exceed memory limits
- Description: Not allowing users to adjust
sizeLimitto exceed pod or container level memory limits as part of this KEP. - Why ruled out: This was deemed out of scope for this KEP since this behavior is allowed during pod creation today. Enforcing this restriction only during resize updates would introduce inconsistency in API behavior.
Non-atomic handling of volume and resource resizes
- Description: Treating updates to volume
sizeLimitand container/pod resource limits as separate, non-atomic transactions. - Why ruled out: This introduces a risk of partial resizes. If a user updates both expecting the volume to maintain a certain proportion of the container’s memory limit, a failure or delay in applying the cgroup limit while the volume is expanded could lead to unexpected OOM kills if the application attempts to fill the volume.