KEP-4885: Windows CPU and Memory Affinity

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP outlines how to add support for the CPU, Memory and Topology Managers in kubelet for Windows.
The Managers are already available and supported in kubelet on Linux, and there have been requests to sig-windows to add support on Windows to help with workloads that need to be co-located on the same node. The goal of the KEP is to add Windows support without significant changes to the Managers' logic while providing the same feature set available on Linux today.

The existing KEPs are:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager

Motivation

Currently, low latency workloads co-hosted on the same Windows Server node create noisy neighbor behaviors that prevent them from achieving their expected performance goals. The CPU, Memory and Topology Manager features are needed to add the isolation necessary to accomplish both high performance and co-hosting efficiency.
The features are enabled and available on Linux, and Windows users are asking for the same features on Windows.

Goals

Enable CPU manager for Windows allowing for CPU affinity for configured pods
Enable Memory Manager for Windows allowing for memory affinity for configured pods
Enable Topology Manager for Windows allowing for coordination of Memory and CPU affinity at the node level for scheduled pods

Non-Goals

Create new managers; we will instead re-use the existing logic provided
Modify or bypass any existing feature-gated features. Existing policy feature gates will still be used to progress specific policies related to the managers.

Proposal

The proposal requires very few changes to the code for the managers and instead maps the Windows concepts onto the cadvisor API to enable the topology structure in kubelet.

There are no plans to change the core logic for selecting CPUs and NUMA nodes in the CPU/Memory/Topology managers from the existing KEPs (memory-manager / cpu-manager / topology-manager). The logic currently lives in platform-agnostic structures, so the selection process does not require changes for adoption on Windows. The Windows-specific considerations for each of the managers are covered in separate sections of this document.

User Stories (Optional)

The User stories on Windows are similar to Linux:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#user-stories-optional
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#user-stories
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#user-stories-optional

Notes/Constraints/Caveats (Optional)

Windows does not have an API to constrain workloads to a specific NUMA node. This is addressed in the Memory Manager section below.

Risks and Mitigations

The technical risks are the same as those in the existing CPU, Memory and Topology Manager KEPs.

For sig-windows, we also see a risk in enabling a feature that is already Stable or fully featured on Linux. To mitigate this risk we have opted to create a separate KEP with a feature flag so we can communicate our status effectively.

Another risk is that the tests for these features live mostly in e2e_node, which doesn’t currently support Windows. As a mitigation, there was some exploration to see whether these tests could be enabled on Windows so we can progress this feature with confidence in the testing suite.

Design Details

Windows CPU Discovery

The Windows kubelet provides an implementation of the cadvisor API in order to provide Windows stats to other components without modification.
The cadvisorapi.MachineInfo API is already partially mapped in the Windows client. By mapping the Windows-specific topology APIs to the cadvisor API, no changes are required to the CPU Manager.

The Windows concepts are mapped to Linux concepts with the following:

Kubelet Term	Description	Cadvisor term	Windows term
CPU	logical CPU	thread	Logical processor
Core	physical CPU	Core	Core
Socket	socket	Socket	Physical Processor
NUMA Node	NUMA cell	Node	Numa node

The result of this mapping gives the following output from CPU manager after the conversion into kubelet’s memory structure:

"Detected CPU topology" 
topology={"NumCPUs":8,"NumCores":4,"NumSockets":1,"NumNUMANodes":1,"CPUDetails":{
"0":{"NUMANodeID":0,"SocketID":1,"CoreID":0},
"1":{"NUMANodeID":0,"SocketID":1,"CoreID":0},
"2":{"NUMANodeID":0,"SocketID":1,"CoreID":2},
"3":{"NUMANodeID":0,"SocketID":1,"CoreID":2},
"4":{"NUMANodeID":0,"SocketID":1,"CoreID":4},
"5":{"NUMANodeID":0,"SocketID":1,"CoreID":4},
"6":{"NUMANodeID":0,"SocketID":1,"CoreID":6},
"7":{"NUMANodeID":0,"SocketID":1,"CoreID":6}}}

The Windows API’s used will be

One difference between the Windows API and Linux is the concept of processor groups. On Windows systems with more than 64 logical processors, the CPUs are split into groups, and each processor is identified by its group number and its group-relative processor number.

We will add the following structure to WindowsContainerResources in CRI:

message WindowsCpuGroupAffinity {
    // CPU mask relative to this CPU group.
    uint64 cpu_mask = 1;
    // CPU group that this CPU belongs to.
    uint32 cpu_group = 2;
}

Since the kubelet APIs expect a distinct processor ID, the processor IDs will be calculated by looping through the mask and computing each ID as (group * 64) + processor ID, resulting in unique processor IDs 0-63 for group 0, 64-127 for group 1, and so on. This translation will be done only in kubelet; the cpu_mask will be used when communicating with the container runtime.

// Expand a group-relative mask into unique kubelet processor IDs.
for i := 0; i < 64; i++ {
	if groupaffinity.Mask&(uint64(1)<<i) != 0 {
		processors = append(processors, i+(int(groupaffinity.Group)*64))
	}
}

Using this logic, a CPU bit mask of 0000111 (leading zeros removed) would result in the following CPUs:

0,1,2 in group 0
64,65,66 in group 1.

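When converting back to the Windows group affinity, we divide the CPU number by 64 to get the group number, then take the CPU number modulo 64 to find the CPU's bit position in that group's mask:

group := cpu / 64               // processor group that owns this CPU
mask := uint64(1) << (cpu % 64) // bit for this CPU within the group's mask
groupaffinity.Mask |= mask      // groupaffinity is the affinity entry for group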

There are some scenarios where the total CPU count is greater than 64 but each group holds fewer than 64 processors. For instance, you could have 2 CPU groups with 35 processors each. The unique IDs using the strategy above would be:

CPU group 0: 0 to 34
CPU group 1: 64 to 98

Windows Memory considerations

NUMA nodes cannot be directly assigned or guaranteed via the Windows API, but the Windows subsystem attempts to use the memory associated with the assigned CPUs to improve performance.
It is possible to indicate to a process which NUMA node is preferred, but a limitation of the Windows APIs is that PROC_THREAD_ATTRIBUTE_PREFERRED_NODE does not support setting multiple NUMA nodes for a single Job object (i.e. container), so it is not usable in the context of Windows containers, which have multiple processes.

The existing Memory Manager Static policy on Linux has a semantic meaning: it ensures that only memory from the selected NUMA nodes is used. We cannot re-use this policy on Windows, because there is no way to ensure that only memory on the NUMA nodes selected by the Memory Manager is used. For this reason, if the Static policy is chosen on Windows, kubelet will fail to start with an error message stating that it cannot use the Static policy. Instead, we will create a new policy called BestEffort, which will initially be implemented only on Windows; kubelet on Linux will fail to start if this policy is set.
We do not have any use cases for this policy to be implemented on Linux at this time, so we will avoid adding a feature that isn’t applicable to that platform.

The main purpose of the BestEffort policy on Windows is to ensure that, at the time of pod startup, there is enough memory on a given NUMA node to meet the memory requests of the pod. The intent is to make sure that, if CPUs are selected, there is also enough memory to support the request, avoiding cross CPU/NUMA node processing. On Windows, even though we cannot guarantee NUMA node selection, the Windows scheduler will do the right thing in most cases. By using kubelet’s existing memory mapping strategy, we can ensure NUMA nodes have enough memory at the time of scheduling. It is important to note that this does not mean the placement is guaranteed (hence the policy name change).

Since Windows does not have an API to directly assign NUMA nodes, the kubelet will query the OS to get the affinity masks associated with each of the Numa nodes selected by the memory manager and update the CPU Group affinity accordingly in the CRI field. This will result in the memory from the Numa node being used. There are a couple scenarios that need to be considered:

Memory manager is enabled, CPU manager is not: kubelet will look up all the CPUs associated with the selected NUMA nodes and assign the CPU group affinity. For example, if NUMA node 0 is selected by the memory manager, and NUMA node 0 has the first four CPUs in Windows CPU group 0, the result would be cpu affinity: 0000001111, group 0.
Memory manager is enabled, CPU manager is enabled
- cpu manager selects fewer CPUs than NUMA nodes and the CPUs fall within the NUMA node: kubelet will set only the CPUs selected by the cpu-manager, as the memory from the memory manager will be used by default.
- cpu manager selects more CPUs than NUMA nodes and the CPUs fall within or outside the NUMA node: kubelet will set only the CPUs selected by the cpu-manager.
- cpu manager selects fewer CPUs than the CPUs associated with the NUMA nodes selected by the memory manager: kubelet will set the CPUs selected by the cpu-manager plus all the CPUs associated with the NUMA node.

Using the Memory Manager’s internal mapping, this should provide the desired behavior in most cases. Since memory affinity isn’t guaranteed, it is possible that a CPU could access memory from a different NUMA node than the one it belongs to, resulting in decreased performance. For this reason, we will add documentation, a log warning message in kubelet, and a warning event to help raise awareness of this possibility. If access from CPUs outside the assigned NUMA node is undesirable, then single-numa-node should be configured as the Topology Manager policy together with the CPU Manager, which forces kubelet to select a NUMA node only if it has enough memory and CPUs available. In the future, for workloads that span multiple NUMA nodes, it may be desirable for the Topology Manager to have a new policy specific to Windows. This would require a separate KEP.

Kubelet memory management

Windows support for kubelet’s memory eviction was enabled in 1.31 and follows the same pattern as Mechanism I. Windows does not have an OOM killer, so Mechanisms II and III in the section on Kubernetes Node Memory Management are out of scope.

Windows Topology manager considerations

Topology Manager is already enabled on Windows in order to support the device manager. Enabling the CPU and Memory Managers as hint providers will be behind a feature flag. The CPU Manager and Memory Manager can be enabled or disabled independently to support cases where a feature needs to be shut off.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

The testing plan is to enable basic tests in the Windows testing folder in Alpha. This will get us to a state in Alpha that allows our end users to test and give feedback in real-world scenarios.

We will also work to enable the e2e_node test suite to run on Windows and enable the applicable CPU / Memory / Topology Manager tests for Beta. The goal is to enable as many of those tests as possible while recognizing that some may not be applicable to Windows. Where we find gaps, we will fill them with Windows-specific tests.

Prerequisite testing updates

Unit tests

pkg/kubelet/cm/container_manager_windows.go
pkg/kubelet/cm/internal_container_lifecycle_windows.go
pkg/kubelet/winstats/cpu_topology_test.go

Integration tests

Integration tests do not run on Windows. Functionality will be covered by unit and e2e tests.

e2e tests

e2e_node will need to be enabled for Windows to add coverage. We plan to enable just e2e tests that relate to memory/cpu/topology manager, not the full suite.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial basic e2e tests in Windows e2e suite are added
unit tests for Windows specific components are added

Beta

Gather feedback from developers
e2e_node tests are in Testgrid and linked in KEP

GA

2 examples of real-world usage
Allowing time for feedback

Note: Generally we also wait at least two releases between beta and GA/stable, because there’s no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests .

Deprecation

N/A

Upgrade / Downgrade Strategy

There is no interaction with other Kubernetes components, and the upgrade / downgrade strategy is the same as for the existing CPU/Memory/Topology managers:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#upgrade--downgrade-strategy
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#upgrade--downgrade-strategy
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#upgrade--downgrade-strategy

Version Skew Strategy

This feature requires updates to the CRI API (see above) and to containerd in order to set CPU affinity when running Windows containers. If the kubelet requests CPU affinity for a container and the container runtime does not support it, the container will be started without CPU affinity. This follows the same behavior as other kubelet enhancements that require container runtime support. Cluster operators who wish to use this feature are responsible for ensuring they have a container runtime that respects the CPU affinity settings, since the kubelet doesn’t perform minimum version checks for the container runtime or query the container runtime for its capabilities. Once the required functionality has been implemented in containerd, this KEP will be updated with the minimum version required to support this feature.

Production Readiness Review Questionnaire

This KEP discusses the changes required to enable the various managers on Windows. This means many of the PRR questions for these features have already been covered and implemented as part of the existing KEPs. We try to give details relevant to Windows but do not plan to change any of the details of feature enablement in those KEPs unless it is required because of a difference in Windows.

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#production-readiness-review-questionnaire
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#production-readiness-review-questionnaire
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#production-readiness-review-questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: WindowsCPUAndMemoryAffinity
- Components depending on the feature gate: Kubelet
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? This behavior is the same as for the features implemented today in the existing KEPs:
  https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#troubleshooting
  https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#feature-enablement-and-rollback
  Yes, it uses a feature gate. The Memory and CPU managers have a state file that requires cleanup. After changing the CPU manager policy from none to static or the other way around, and before starting the kubelet again, you must remove the CPU manager state file (/var/lib/kubelet/cpu_manager_state); otherwise the kubelet will fail to start. Startup failures for this reason will be logged in the kubelet log.
  Details of the steps to reset a state file are in https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#changing-the-cpu-manager-policy . The Memory manager has the same steps for resetting.

Does enabling the feature change any default behavior?

No, additional settings are required to enable the features. The default policies for the CPU/Memory managers will be None, meaning that they will not interact with the running of pods. The cluster administrator will need to set specific CPU/Memory/Topology manager policies to enable any of the features described here.

See feature details in:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#feature-enablement-and-rollback
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#feature-enablement-and-rollback
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/693-topology-manager#feature-enablement-and-rollback

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. A rolling restart (delete or delete and redeploy) of the pods will be required to remove the CPU/Memory affinity from running pods. Restarting kubelet after changing the feature will not affect any running pods but new pods created will be affected by the changes.

What happens if we reenable the feature if it was previously rolled back?

The Memory Manager and CPU Manager use a state file to track assignments. If the state file is not valid, it must be removed and kubelet restarted. For example, the state file might become invalid when kube/system reserved values have changed (increased), which may lead to a situation where some containers cannot be started.

Are there any tests for feature enablement/disablement?

Yes, there are a number of unit tests designated for state file validation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Impact is node local, and doesn’t affect rest of the cluster.

It is possible that the state file from the memory/CPU manager will have inconsistent data during the rollout because of the kubelet restart, but you can easily fix this by removing the memory/CPU manager state file and restarting kubelet. It should not affect any running workloads.

What specific metrics should inform a rollback?

The pod may fail with the admission error because the kubelet can not provide all resources. You can see the error messages under the pod events.

There are existing metrics provided by Managers that can be monitored:

// Metrics to track the CPU manager behavior
CPUManagerPinningRequestsTotalKey         = "cpu_manager_pinning_requests_total"
CPUManagerPinningErrorsTotalKey           = "cpu_manager_pinning_errors_total"
CPUManagerSharedPoolSizeMilliCoresKey     = "cpu_manager_shared_pool_size_millicores"
CPUManagerExclusiveCPUsAllocationCountKey = "cpu_manager_exclusive_cpu_allocation_count"

// Metrics to track the Memory manager behavior
MemoryManagerPinningRequestsTotalKey = "memory_manager_pinning_requests_total"
MemoryManagerPinningErrorsTotalKey   = "memory_manager_pinning_errors_total"

// Metrics to track the Topology manager behavior
TopologyManagerAdmissionRequestsTotalKey = "topology_manager_admission_requests_total"
TopologyManagerAdmissionErrorsTotalKey   = "topology_manager_admission_errors_total"
TopologyManagerAdmissionDurationKey      = "topology_manager_admission_duration_ms"

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

We will use the existing Metrics provided by CPU/Memory Manager.

https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager#monitoring-requirements
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager#monitoring-requirements

How can an operator determine if the feature is in use by workloads?

The memory/CPU manager assignments will be available through the pod resources API. There are also proposed metrics to improve this in kubernetes/kubernetes#127155.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details: check the kubelet metric cpu_manager_pinning_requests_total
- check the kubelet metric memory_manager_pinning_requests_total

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

n/a

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

These will be the same as cpu/memory/topology manager.

Are there any missing metrics that would be useful to have to improve observability of this feature?

Since the CPU/Memory/Topology managers are already implemented, most of the metrics already exist. If we find missing metrics on Windows, we will address them as we move to Beta/Stable.

Dependencies

Does this feature depend on any specific services running in the cluster?

This will require changes to CRI and containerd Windows agents.

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

We will monitor the CPU cost of querying the CPU topology. If required, we may implement a caching strategy while also allowing for any future support for dynamic node resizing.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Memory and CPUs could be exhausted, resulting in pods not being scheduled.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/a

What are other known failure modes?

The failure modes for pods on the node are the same as in CPU/Memory/topology Manager
