KEP-5996: Support Default Pod Sysctls in Kubelet Configuration

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Add support for a new DefaultPodSysctls field in the Kubelet configuration. This field allows node administrators to define a set of namespaced kernel parameters (sysctls) that are applied by default to all pods running on the node. The defaults are applied during pod sandbox creation and only for namespaced sysctls (e.g., those within the network or IPC namespaces), provided the pod is not using the host namespace for the corresponding subsystem.

Motivation

Currently, Kubernetes allows users to specify sysctls for individual pods via the securityContext.sysctls field in the Pod spec. However, in many production environments, node administrators often need to enforce consistent kernel parameter tuning across all workloads on a node or within a node pool/group. For example, high-performance networking or messaging applications may require specific net.* or kernel.shm* values to be set globally for all containers on specialized nodes. Setting these parameters manually in every Pod spec is error-prone and redundant. Furthermore, cluster operators may wish to manage these defaults at the infrastructure level (e.g., via Kubelet configuration) to ensure performance and stability without requiring application developers to be aware of the underlying host kernel tuning needs.

Goals

Add a new DefaultPodSysctls field to the Kubelet configuration.
Allow node-level defaulting of namespaced kernel parameters for all pods, including static pods.
- Namespaced sysctls cover both safe and unsafe sysctls.
- Covered sysctls: kernel.shm*, kernel.msg*, kernel.sem, fs.mqueue.*, net.*, kernel.domainname, user.*.
Ensure that sysctls are only applied if the pod is namespaced for the corresponding subsystem (i.e., not using HostNetwork or HostIPC).

Non-Goals

Configuring non-namespaced sysctls in Pods is NOT supported.
Pod-level defaulting of sysctls is NOT supported, i.e., there is no additional filtering to select which Pods receive the default sysctls.
Configuration changes will only affect new pods; dynamic reconfiguration for existing pods is NOT supported.

Proposal

We propose adding a new DefaultPodSysctls field to the Kubelet configuration. This field allows operators to specify sysctls as key-value pairs. The Kubelet merges these defaults into the pod sandbox configuration during sandbox creation, so they are applied to all Pods on the node.

The precedence for sysctl values will be:

Pod-level SecurityContext: Values explicitly set in pod spec.securityContext.sysctls will always override the Kubelet’s default sysctls.
Kubelet Defaults: Values set in the new Kubelet field, which are applied to ALL pods unless overridden at the pod level.
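The precedence above can be illustrated with a minimal Go sketch. The Sysctl type and the mergeSysctls helper are hypothetical names introduced only for this illustration, and the sysctl values are arbitrary examples; the actual Kubelet implementation may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Sysctl is a hypothetical stand-in for a name/value entry of
// spec.securityContext.sysctls in the Pod spec.
type Sysctl struct {
	Name  string
	Value string
}

// mergeSysctls is a hypothetical helper. It copies the Kubelet defaults and
// then overlays the pod-level sysctls, so pod values win on conflicts.
func mergeSysctls(defaults map[string]string, podSysctls []Sysctl) map[string]string {
	merged := make(map[string]string, len(defaults)+len(podSysctls))
	for name, value := range defaults {
		merged[name] = value
	}
	for _, s := range podSysctls {
		merged[s.Name] = s.Value
	}
	return merged
}

func main() {
	// Example values only; they are not recommendations.
	defaults := map[string]string{
		"net.ipv4.tcp_rmem": "4096 87380 6291456",
		"kernel.shmmax":     "68719476736",
	}
	pod := []Sysctl{{Name: "kernel.shmmax", Value: "1073741824"}}
	merged := mergeSysctls(defaults, pod)

	// Sort the names so the output order is stable.
	names := make([]string, 0, len(merged))
	for name := range merged {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s=%s\n", name, merged[name])
	}
}
```

In this sketch the pod keeps its own kernel.shmmax value, while net.ipv4.tcp_rmem falls back to the node default.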

The new Kubelet field will support all namespaced sysctls (kernel.shm*, kernel.msg*, kernel.sem, fs.mqueue.*, net.*, kernel.domainname, user.*), covering both safe and unsafe sysctls. Note that net.* is a special case: some networking sysctls are not namespaced and cannot be set in pods.

In addition, sysctls inside pods are treated as individual values. A group of related sysctls must therefore be listed explicitly, one key-value pair per sysctl, in the Kubelet field map to take effect.

Because the Kubelet applies the sysctls through the existing pod sandbox configuration, no changes to the CRI API or to container runtimes are needed.

User Stories

Story 1

As a cluster admin, I would like to set namespaced sysctls for all pods so that they share the same kernel environment, while each pod keeps the ability to customize specific sysctls in its own namespace.

For example, I plan to run a large-scale data processing pipeline or a distributed machine learning training workload, which involves significant inter-pod network communication, transferring large volumes of data. To improve data transfer speed and reduce overall job completion time, I need to increase net.ipv4.tcp_rmem and net.ipv4.tcp_wmem for all pods to increase the TCP buffer sizes.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Overwriting critical host settings if a global sysctl is accidentally specified.
- The application logic strictly enforces namespacing. Global (non-namespaced) sysctls specified in DefaultPodSysctls are ignored when they cannot be applied in a namespace context, consistent with the restrictions of the underlying OCI runtimes (e.g., runc and crun). However, some net.* sysctls are not namespaced, and setting them would still lead to a FailedCreatePodSandBox failure. Users need to ensure that the net.* sysctls they configure are namespaced.
The provided sysctl keys or values are invalid.
- The OCI runtimes (e.g., runc and crun) apply sysctl configurations directly by writing to the appropriate files under /proc/sys/. The Linux kernel validates the value type and range. Consequently, if a sysctl key is invalid or the provided value is unsupported, the write fails. This failure prevents the Kubelet from creating the pod sandbox and results in a FailedCreatePodSandBox error. Cluster admins therefore need to verify that all provided sysctl keys and values are valid to avoid pod failures.

Design Details

Kubelet Configuration Changes

The KubeletConfiguration struct will be updated to include a new field:

type KubeletConfiguration struct {
    // Existing field: a comma-separated allowlist of unsafe sysctls or sysctl patterns. Users are allowed to set them in pod securityContext.sysctls.
    AllowedUnsafeSysctls []string

    // New field: DefaultPodSysctls maps sysctl names to the values that will be applied by default inside all pods running on this node.
    // +optional
    DefaultPodSysctls map[string]string
}

Validation

The field will be guarded by a new feature gate called DefaultPodSysctls, added in kube_features.go.

Namespace-level verification will be performed when the sysctls are applied to pods (see Application Logic below), since it depends on the pod's namespace information. Unnamespaced sysctls will be ignored to avoid failures. Log messages will list all applied sysctls.

We will NOT perform exhaustive validation of the specific sysctl keys or values within the field, similar to AllowedUnsafeSysctls, given the large number of parameters. The Linux kernel defines handlers for each category of sysctls and validates parameter types and ranges. For example, net.ipv4.* defines a handler that checks the integer type and min/max values. The cluster admin therefore needs to ensure that the provided keys and values are valid.

Application Logic

The field will be read in the generatePodSandboxLinuxConfig function in kuberuntime before securityContext.sysctls is applied, and copied into the Linux sandbox config (shown below) used to start the PodSandbox.

// PodSandboxConfig holds all the required and optional fields for creating a sandbox.
message PodSandboxConfig {
    PodSandboxMetadata metadata = 1;
    // Hostname of the sandbox. Hostname could only be empty when the pod
    // network namespace is NODE.
    string hostname = 2;
    ...
    LinuxPodSandboxConfig linux = 8;
    // Optional configurations specific to Windows hosts.
    WindowsPodSandboxConfig windows = 9;
}

// LinuxPodSandboxConfig holds platform-specific configurations for Linux
// host platforms and Linux-based containers.
message LinuxPodSandboxConfig {
    // Parent cgroup of the PodSandbox.
    // The cgroupfs style syntax will be used, but the container runtime can
    // convert it to systemd semantics if needed.
    string cgroup_parent = 1;
    // LinuxSandboxSecurityContext holds sandbox security attributes.
    LinuxSandboxSecurityContext security_context = 2;
    // Sysctls holds linux sysctls config for the sandbox.
    map<string, string> sysctls = 3;
    ...
}

Containerd does not validate sysctl settings, while the underlying OCI runtimes (e.g., runc and crun) do. A filtering function consistent with runc will be applied in the Kubelet to ensure that the following sysctls are applied only in the required namespaces:

kernel.shm* => IPC namespace, not HostIPC
kernel.msg* => IPC
kernel.sem => IPC
fs.mqueue.* => IPC
kernel.domainname => UTS
- kernel.hostname is explicitly denied even in UTS ns
net.* => Net, not HostNetwork
user.* => User (support added recently in runc)
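A minimal Go sketch of such a filter, following the mapping listed above. The sandboxNamespaces type and both function names are hypothetical and exist only for this illustration; how the Kubelet actually derives the namespace flags from the pod spec is not shown here, and the sysctl values are arbitrary examples.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sandboxNamespaces is a hypothetical summary of which namespaces are
// private to the pod sandbox (true means not shared with the host).
type sandboxNamespaces struct {
	IPC, UTS, Net, User bool
}

// namespaceAllows reports whether a sysctl belongs to a namespace that is
// private to the sandbox, following the mapping listed above.
func namespaceAllows(name string, ns sandboxNamespaces) bool {
	switch {
	case strings.HasPrefix(name, "kernel.shm"),
		strings.HasPrefix(name, "kernel.msg"),
		name == "kernel.sem",
		strings.HasPrefix(name, "fs.mqueue."):
		return ns.IPC
	case name == "kernel.domainname":
		return ns.UTS
	case strings.HasPrefix(name, "net."):
		return ns.Net
	case strings.HasPrefix(name, "user."):
		return ns.User
	default:
		// Everything else, including kernel.hostname, is not applied.
		return false
	}
}

// filterDefaultSysctls splits the defaults into those to apply and those to skip.
func filterDefaultSysctls(defaults map[string]string, ns sandboxNamespaces) (map[string]string, []string) {
	applied := make(map[string]string)
	var skipped []string
	for name, value := range defaults {
		if namespaceAllows(name, ns) {
			applied[name] = value
		} else {
			skipped = append(skipped, name)
		}
	}
	sort.Strings(skipped) // deterministic order, e.g. for log messages
	return applied, skipped
}

func main() {
	defaults := map[string]string{
		"net.ipv4.tcp_rmem": "4096 87380 6291456",
		"kernel.shmmax":     "68719476736",
		"kernel.hostname":   "example",
	}
	// Example: a pod that shares the host IPC namespace but has its own
	// network and UTS namespaces.
	applied, skipped := filterDefaultSysctls(defaults, sandboxNamespaces{Net: true, UTS: true})
	fmt.Println("applied:", applied)
	fmt.Println("skipped:", skipped)
}
```

Note that a prefix check like this cannot tell namespaced net.* sysctls from unnamespaced ones, which is why those can still fail at the runtime level.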

OCI Runtimes

Containerd uses runc by default to spawn and run containers on Linux. runc is a low-level, lightweight container runtime that acts as a bridge between high-level container tooling (like Containerd) and the underlying operating system. The Kubelet applies the same namespace logic as runc to avoid FailedCreatePodSandBox failures caused by namespace checks. However, the failure can still happen in the WriteSysctls function if the sysctl keys or values are invalid (denied by the kernel).

Containerd can also use crun for the same scenario. crun applies namespace checks similar to runc's in its validate_sysctl function. The only difference is that crun does not support user.* params, which were recently added in runc.

An alternative to Containerd is CRI-O, which applies similar validation of sysctl namespaces in its own code when creating the pod sandbox. The only difference is that it does not support kernel.domainname and user.*.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/kubelet/kuberuntime: 2026-04-09 - 70.9%
pkg/kubelet/apis/config/validation/: 2026-04-09 - 91.7%

These data were obtained by running go test -cover at commit f5c7b422749303542baa17f1322d95250df9b0b5.

Integration tests

N/A. This feature requires e2e tests.

e2e tests

Test that sysctls set in the DefaultPodSysctls field are applied to ALL pods, including static pods.
Test that pod-level sysctls set in spec.securityContext.sysctls override sysctls set in DefaultPodSysctls.
Test that unnamespaced sysctls are ignored instead of causing failures.

Graduation Criteria

Alpha

Implement the DefaultPodSysctls field in the Kubelet configuration behind the feature gate.
Implement the validating and merging logic in kuberuntime.
Add unit tests for merging precedence.

Beta

Enable the feature gate by default.
Add E2E tests covering various namespaced sysctls.

GA

Gather feedback from users running specialized workloads.
Documentation completed on kubernetes.io.

Deprecation

Upgrade / Downgrade Strategy

Upgrade:
- Upgrading the Kubelet does not change any default behavior, as the feature is disabled by default during Alpha.
- To make use of the feature after upgrade, operators must enable the DefaultPodSysctls feature gate and specify the default sysctls in the KubeletConfiguration (defaultPodSysctls).
- Pods that are already running prior to the upgrade will continue running unaffected. Only new pods created post-upgrade will have default sysctls applied.
Downgrade:
- Before downgrading the Kubelet to a version where the feature is not supported or when disabling the feature gate, any defaultPodSysctls field in the KubeletConfiguration must be removed to prevent validation failures during Kubelet startup.
- Active workloads constructed prior to the rollback/downgrade will keep running with their set sysctl values. Any new sandboxes created after the downgrade will not receive default sysctl configurations.

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: DefaultPodSysctls
- Components depending on the feature gate: Kubelet
Kubelet configuration parameter
- Field: defaultPodSysctls under KubeletConfiguration

Does enabling the feature change any default behavior?

No. Enabling the feature gate exposes the config field and does not alter pod behavior unless DefaultPodSysctls is populated in the Kubelet configuration.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. To disable the feature, turn off the DefaultPodSysctls feature gate, or remove the DefaultPodSysctls map from the Kubelet configuration, and restart the Kubelet. Workloads running on existing pod sandboxes will be unaffected.

What happens if we reenable the feature if it was previously rolled back?

New pods will have default sysctl settings applied based on the configured DefaultPodSysctls field. Existing running pods will continue to run with the sysctls configured at their sandbox creation time.

Are there any tests for feature enablement/disablement?

Yes. Unit tests will verify Kubelet configuration parsing and validation, ensuring configuration rejection/errors when the feature gate is disabled, and success when enabled. Unit tests will also verify the merge behavior and precedence on pod sandbox configuration generation.
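As a rough illustration of the gate check such unit tests would exercise, here is a minimal Go sketch. It assumes the configuration is rejected when the field is populated while the gate is off; validateDefaultPodSysctls and its error text are hypothetical and not taken from the Kubelet code base.

```go
package main

import (
	"errors"
	"fmt"
)

// validateDefaultPodSysctls is a hypothetical check: a populated field is
// rejected while the DefaultPodSysctls feature gate is disabled.
// Keys and values are deliberately not validated here.
func validateDefaultPodSysctls(gateEnabled bool, defaults map[string]string) error {
	if !gateEnabled && len(defaults) > 0 {
		return errors.New("defaultPodSysctls requires the DefaultPodSysctls feature gate to be enabled")
	}
	return nil
}

func main() {
	// Example value only.
	cfg := map[string]string{"net.ipv4.tcp_rmem": "4096 87380 6291456"}
	fmt.Println("gate off:", validateDefaultPodSysctls(false, cfg))
	fmt.Println("gate on: ", validateDefaultPodSysctls(true, cfg))
}
```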

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout Failure:
- Pod Sandbox Creation Failure: If a sysctl configured in DefaultPodSysctls is invalid (e.g., value out of bounds, key not recognized by the host kernel), OCI runtimes (like runc) will fail to write to /proc/sys/. This causes pod sandbox creation to fail with a FailedCreatePodSandBox warning event, and new pods will remain in the Pending state.
Rollback Failure:
- When disabling the feature or downgrading, failing to remove the defaultPodSysctls field from the KubeletConfiguration files will cause older, downgraded Kubelet binaries to fail at startup due to unrecognized fields.
Impact on Running Workloads:
- Already running workloads are not impacted by a rollout or rollback. Defaults are applied only when a new pod sandbox is created.
- If a running pod restarts (e.g., due to a container crash or rescheduling onto another node), the pod sandbox is recreated. The updated configuration is then applied, and the pod could fail to start if that configuration is invalid.

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check the Kubelet logs to see whether the feature gate is enabled and whether the Kubelet configuration field is set.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No, but a new Kubelet configuration field will be added.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Not likely.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

This feature is implemented in the Kubelet alone, so it does not depend on the API server or etcd.

What are other known failure modes?

Invalid sysctl keys or values lead to a WriteSysctl failure.
Unnamespaced sysctls (some net.* params) fail to be set in pods.

What steps should be taken if SLOs are not being met to determine the problem?

Kubelet configz endpoint

The new Kubelet field DefaultPodSysctls can be inspected through the Kubelet configz endpoint.

Check failure in events

When starting a new pod, examine the pod status and events to check for failures. If a sysctl fails to apply, the pod remains in Pending status and its events contain a FailedCreatePodSandBox warning. For example,

kubectl describe pod <pod-name> -n <namespace>
(Or "kubectl get events -n <namespace>")

Name:           my-sysctl-pod
Namespace:      default
Priority:       0
Node:           gke-my-cluster-node-1/10.128.0.1
Status:         Pending
...
Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               2m                 default-scheduler  Successfully assigned default/my-sysctl-pod to gke-my-cluster-node-1
  Warning  FailedCreatePodSandBox  30s (x5 over 2m)   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error applying sysctl options: failed to write net.ipv4.tcp_tw_reuse = "bad-value": Invalid argument

Check sysctl in pod

To check detailed sysctl settings inside a running Kubernetes pod, you can use kubectl exec to run commands within one of the pod’s containers. For example,

# start an interactive shell session within a container in the pod.
kubectl exec --stdin --tty <pod-name> -n <namespace> -- /bin/sh 

# Inside the pod's container shell:
/ # sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
/ # sysctl -a
... (all sysctls)

Implementation History

Drawbacks

Alternatives

Node Resource Interface (NRI)

An NRI plugin could be used to inject sysctls into pod sandboxes by adding native sysctl support to the NRI framework (https://github.com/containerd/nri/pull/248 ) or using OCI hooks. While NRI might offer a faster implementation path for new features, especially those with unique requirements, integrating directly into Kubernetes is generally the superior long-term approach for general-purpose improvements, like sysctl management enhancement here. The direct Kubernetes support, achieved through the upstream KEP process, results in a more streamlined architecture with reduced maintenance efforts.

Mutating Admission Webhooks

Injecting sysctls at admission time with a mutating webhook is possible but requires a running control plane component, which in turn requires significant maintenance effort. We may also need to duplicate validation logic in the webhook to enable “safe” sysctls. In addition, it does not allow node-specific or node-pool-specific defaults without complex logic based on node labels.

Mutating Admission Policy

Mutating admission policy (GA in 1.36) is better than mutating admission webhooks because it does not require an additional control plane component. However, the other restrictions still apply.