KEP-6012: CompositePodGroup API
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- API overview
- Changes to the Workload API
- Changes to the PodGroup API
- CompositePodGroup API
- API consumption model
- API validation
- Changes in kube-scheduler
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP describes the evolution in the workload-aware scheduling architecture that is necessary to support more complex, hierarchical scheduling requirements of modern high-performance distributed workloads. We focus on the API, framework and the basic building blocks - performance optimizations of the underlying algorithms can come as follow-ups.
To achieve this, the KEP builds on the Workload and PodGroup APIs from
KEP-4671 and introduces a new core API called CompositePodGroup. This API
allows expressing multi-level topology constraints, gang scheduling and
preemption policies for heterogeneous groups of Pods and facilitates extending
the Kubernetes scheduler with more policies in the future.
Motivation
Kubernetes 1.36 has made great strides in evolving scheduling from a
Pod-centric approach towards a workload-centric one. Thanks to these efforts,
we are now able to provide simple forms of gang scheduling and gang preemption
policies using the PodGroup API introduced in KEP-4671. This release also
added support for single-level topology-aware scheduling using the Node
label-based topology constraints baked into the PodGroup and Workload APIs
(KEP-5732). These features already cover the use cases of simple batch
workloads that are characterized by a flat structure. KEP-5547 is an example
of a successful integration of the new APIs with the Job controller for a fully
parallel static indexed Job.
Many modern distributed workloads (especially AI ones) demand scheduling capabilities that cannot be expressed using today’s flat APIs. The primary gap is the ability to model complex, heterogeneous workloads composed of distinct groups with multi-level dependencies.
Multi-level topology-aware scheduling (TAS) is a prominent example. In hardware architectures like TPU slices, a multi-level topology layout is critical to reflecting hardware layouts and obtaining desired performance. Conversely, workloads like disaggregated serving (prefill and decode) rely on single-level network domains but require a multi-level structure to enforce complex lifecycle dependencies (e.g. requiring at least $N$ Prefill and $M$ Decode groups).
These workloads often require multi-level gang scheduling. In this model, a parent group dictates that it cannot be scheduled until a specified minimum number of its child groups are schedulable. Essentially, this extends traditional gang scheduling by treating entire child groups, rather than individual Pods, as members of a gang.
In addition, a multi-level workload might tolerate partial disruptions. There should be a way for such a workload to express different disruption policies for different portions of the workload.
All of these gaps come from the fact that current scheduling APIs do not allow
expressing the kind of multi-level hierarchy that characterizes many
out-of-tree Kubernetes APIs. JobSet [1] and LeaderWorkerSet [2] are probably
the most popular instances of a higher-order API where such a hierarchy exists,
and that hierarchy often brings about matching scheduling requirements like the
ones mentioned above. To close these gaps, we need to extend the foundational
scheduling APIs so that true workload controllers can express their multi-level
scheduling requirements in a form that kube-scheduler can understand and act
upon.
Goals
- Define a new API that facilitates describing the hierarchy of a workload.
- Extend scheduling capabilities to support hierarchical scheduling
  requirements, including:
  - Multi-level gang scheduling.
  - Multi-level preemption policies.
  - Multi-level topology scheduling constraints.
- Ensure future extensibility of the API with new scheduling and disruption policies.
Non-Goals
- Extend topology-aware scheduling with the notion of preferred constraints.
- Define how to express multi-level scheduling requirements in true
  workload APIs.
  - This will be addressed in KEP-6089.
- Add support for associating ResourceClaims with instances of the new API.
  - We will continue supporting sharing ResourceClaims among Pods within an individual PodGroup, however.
- Guarantee an optimal result of multi-level scheduling algorithms.
  - Bin packing is inherently an NP-hard problem and it becomes even more complex for multi-level structures. While we aim to design efficient heuristics, guaranteeing an optimal placement is out of scope.
Proposal
The proposal introduces the design of a new API called CompositePodGroup and
describes which hierarchical scheduling requirements this API addresses and
how. The design outlines the API shape, its lifecycle and validation, and how
true workload controllers can integrate with it. We also discuss the
adjustments inside kube-scheduler that are needed to support the scheduling
requirements that can be expressed through this API.
This proposal builds on and depends heavily on the enhancements recently introduced in the workload-aware scheduling space. We assume that the reader is already acquainted with the following KEPs:
- KEP-4671: Gang Scheduling using Workload Object
- KEP-5710: Workload-aware preemption
- KEP-5732: Topology-aware workload scheduling
Rather than revolutionize the core concepts that these KEPs introduced, the proposal generalizes them and leaves the door open for further extensions.
Backward compatibility
The proposal adjusts the structure of the Workload and PodGroup APIs so that
they can be conveniently used in conjunction with the CompositePodGroup API.
That said, for flat homogeneous workloads there is no need to use the
CompositePodGroup API. True workload controllers can continue using the
PodGroup and Workload APIs exclusively, in the same way as before. This
consumption pattern will continue to be supported.
User Stories
AI training on TPUs
As an AI researcher running AI training jobs on newer generation TPUs, I want to schedule a distributed training job such that individual shards run within specific 4x4x4 cubes, while the entire workload is guaranteed to live within a single superslice (e.g., 8x8x16). This allows me to leverage the specific hierarchical network topology of TPU clusters for optimal training performance.
Disaggregated serving under LeaderWorkerSet
As a machine learning engineer deploying disaggregated serving (prefill and
decode stages) under LeaderWorkerSet, I want to express complex dependencies across
heterogeneous worker groups. Both stages require single-level high-bandwidth topology
co-location, but rely on a hierarchy to enforce holistic workload lifecycle
policies: requiring at least $N$ Prefill and $M$ Decode active groups to serve, and
ensuring that non-topological components (like frontend pods) share the preemption
fate of the core execution engines.
Replicated training jobs under JobSet
As an infrastructure operator running complex training pipelines, I want to schedule
a multi-stage TrainJob under JobSet containing replicated sub-jobs with varied
scheduling requirements. For instance, the pre-training data and model initialization
stage can use a basic scheduling policy (starting as soon as some data-downloaders are
ready), while the subsequent core Trainer stage (MPI Launcher and workers) requires
strict gang scheduling. Both stages belong to the same parent CompositePodGroup
to ensure a coordinated start and collective preemption fate-sharing.
Notes/Constraints/Caveats (Optional)
To ensure cluster stability, control-plane reliability, and to prevent excessive scheduling complexity under nested hierarchies, we introduce explicit structural limits on the workload group-template hierarchy:
- Maximum Nesting Depth: The group-template hierarchy supports a maximum depth of 4 levels.
- List Caps: The new CompositePodGroupTemplates list is strictly capped at 8 items (aligning with the pre-existing cap on the PodGroupTemplates list).
These constraints are introduced upfront, starting from the Alpha phase, for API safety. While analysis of current and planned distributed use cases suggests that a depth of 4 levels and a branching factor of 8 are more than sufficient, these limits can easily be increased in future releases if new requirements emerge.
Conversely, shrinking a limit or introducing one retroactively is a breaking API change that can severely disrupt existing workloads. By establishing conservative limits from the very beginning, we safeguard the API and scheduler performance while preserving the flexibility to safely scale up limits in future iterations based on real-world profiling.
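Both limits can be checked with a single recursive walk over the template tree. The following is a minimal sketch rather than the proposed validation code: the `Template` type, `validateLimits` and `chain` are hypothetical names, and the sketch assumes that top-level templates count as level 1.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxDepth    = 4 // maximum nesting depth stated above
	maxChildren = 8 // cap on each template list stated above
)

// Template is a hypothetical, simplified stand-in for a group template.
type Template struct {
	Name     string
	Children []Template
}

// validateLimits walks the tree and rejects lists longer than maxChildren
// and nesting deeper than maxDepth.
// Assumption: the top-level templates are level 1.
func validateLimits(list []Template, level int) error {
	if len(list) == 0 {
		return nil
	}
	if level > maxDepth {
		return errors.New("template hierarchy exceeds maximum depth")
	}
	if len(list) > maxChildren {
		return errors.New("template list exceeds maximum length")
	}
	for _, t := range list {
		if err := validateLimits(t.Children, level+1); err != nil {
			return fmt.Errorf("%s: %w", t.Name, err)
		}
	}
	return nil
}

// chain builds a hypothetical linear hierarchy with the given number of levels.
func chain(levels int) []Template {
	if levels == 0 {
		return nil
	}
	return []Template{{Name: fmt.Sprintf("level-%d", levels), Children: chain(levels - 1)}}
}

func main() {
	fmt.Println(validateLimits(chain(4), 1)) // 4 levels: accepted
	fmt.Println(validateLimits(chain(5), 1)) // 5 levels: rejected
}
```

Because the limits are plain constants in this sketch, raising them later only widens the set of accepted objects, which matches the reasoning about conservative initial limits.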
Risks and Mitigations
Suboptimal Placement Decisions due to NP-Hardness of Multi-level Scheduling
While the greedy scheduling heuristic of kube-scheduler already introduces suboptimal
placements for single-level gangs and topology constraints, these inefficiencies can be
significantly amplified when scheduling the much larger, hierarchical workload trees enabled
by the CompositePodGroup API.
Mitigation: This is a fundamental limitation of solving an NP-hard problem within a heuristic-based scheduling loop. In Beta and future releases, we will use real-world user feedback to incrementally refine the scheduling heuristics for specifically reported use cases.
Consistency and Validity Across Decoupled Hierarchy Objects
Because the scheduling hierarchy is represented using separate, decoupled runtime objects
(CompositePodGroup and PodGroup), there is a risk of declaring conflicting, malformed,
or cyclic configurations (such as cyclic parent references, excessive nesting depth, or
diverging priorities) that cannot be reliably prevented by API admission.
Mitigation: The static template definition within Workload will enforce unique names and
a depth limit of 4 levels at admission time. In the runtime group hierarchy,
kube-scheduler will detect invalid states (such as cycles, excessive depth, or priority
divergence) during the scheduling cycle, immediately mark the affected groups as invalid via
status Conditions, and skip scheduling their constituent Pods to maintain cluster stability
and raise operator visibility.
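The runtime checks described in this mitigation can be pictured as a walk along parent references. The sketch below is illustrative only: the `Group` type, the `checkGroup` function and the returned reason strings are hypothetical, and the actual scheduler data structures and condition reasons are not defined by this example.

```go
package main

import "fmt"

const maxDepth = 4 // depth limit stated in this KEP

// Group is a hypothetical, simplified view of a PodGroup or CompositePodGroup.
type Group struct {
	Name     string
	Parent   string // empty for a root group
	Priority int32
}

// checkGroup follows parent references from the named group towards the root
// and returns a (hypothetical) reason for the first problem found, or "" if
// the path to the root is valid.
func checkGroup(groups map[string]Group, name string) string {
	g, ok := groups[name]
	if !ok {
		return "NotFound"
	}
	seen := map[string]bool{name: true}
	depth := 1
	cur := g
	for cur.Parent != "" {
		if seen[cur.Parent] {
			return "Cycle"
		}
		parent, ok := groups[cur.Parent]
		if !ok {
			return "ParentNotFound"
		}
		if parent.Priority != g.Priority {
			return "PriorityDivergence"
		}
		seen[cur.Parent] = true
		cur = parent
		depth++
	}
	if depth > maxDepth {
		return "TooDeep"
	}
	return ""
}

func main() {
	groups := map[string]Group{
		"job-root":  {Name: "job-root"},
		"replica-0": {Name: "replica-0", Parent: "job-root"},
		"a":         {Name: "a", Parent: "b"},
		"b":         {Name: "b", Parent: "a"},
	}
	fmt.Printf("%q\n", checkGroup(groups, "replica-0")) // valid path to the root
	fmt.Printf("%q\n", checkGroup(groups, "a"))         // cyclic parent references
}
```

The `seen` set bounds the walk, so a cyclic configuration terminates after at most one pass over the groups involved.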
API Coverage and Extensibility Gaps
The newly introduced CompositePodGroup API might fail to cover the scheduling needs of
complex, fast-evolving AI and distributed workload classes (such as disaggregated serving,
complex leader-worker arrangements, or novel hardware topologies).
Mitigation: We are mitigating this by performing extensive upfront research on key
state-of-the-art use cases, specifically including JobSet (for bulk training) and
LeaderWorkerSet (for serving/inference). The CompositePodGroup API is designed using the
composite pattern, ensuring that it is open for future extensions with new scheduling and
disruption policies without requiring API schema redesigns.
Design Details
API overview
We introduce the CompositePodGroup API as the main building block for
representing multi-level, hierarchical workloads. As the naming suggests, this
API acts as a composition of one or more PodGroup and CompositePodGroup
objects. In other words, hierarchical workloads can now be expressed as a tree
of groups where CompositePodGroup objects correspond to non-leaf nodes and
PodGroup objects correspond to leaf nodes. To maintain the tree structure,
groups will have an optional reference to the parent group which will be empty
for the root group. It is worth noting that in this model, only a
CompositePodGroup can be a parent to other groups.
Every CompositePodGroup object defines scheduling policies and constraints
that apply to the workload portion enclosed in the subtree that has this
CompositePodGroup object as its root. We will discuss the precise meaning of
those policies and constraints in the following subsections.
The Workload API, which continues to represent the static policy configuration
of a true workload, now also contains the definition of templates for the
CompositePodGroup objects, similar to how it already does for the PodGroup
objects. To clearly reflect the hierarchical nature of a workload, the
templates themselves evolve into a tree-like structure.
For illustration, here is a diagram depicting a sample three-level group
hierarchy consisting of CompositePodGroup and PodGroup objects with the
references to the templates within the matching Workload object:
flowchart TD
subgraph Instances ["<b>Runtime groups</b>"]
RootCPG["CompositePodGroup:<br/>job-root"]
subgraph Branch1 [" "]
ChildCPG1["CompositePodGroup: replica-0"]
PG1["PodGroup: workers-0"]
PG2["PodGroup: driver-0"]
ChildCPG1 <--> PG1
ChildCPG1 <--> PG2
end
subgraph Branch2 [" "]
ChildCPG2["CompositePodGroup: replica-1"]
PG3["PodGroup: workers-1"]
PG4["PodGroup: driver-1"]
ChildCPG2 <--> PG3
ChildCPG2 <--> PG4
end
RootCPG <--> ChildCPG1
RootCPG <--> ChildCPG2
end
subgraph Templates ["<b>Workload templates</b>"]
RootCPGT["CompositePodGroupTemplate: Root"]
ChildCPGT["CompositePodGroupTemplate: Replica"]
PGT1["PodGroupTemplate:<br/>Workers"]
PGT2["PodGroupTemplate:<br/>Driver"]
RootCPGT --> ChildCPGT
ChildCPGT --> PGT1
ChildCPGT --> PGT2
end
RootCPG -. "WorkloadRef" .-> RootCPGT
ChildCPG1 -. "WorkloadRef" .-> ChildCPGT
ChildCPG2 -. "WorkloadRef" .-> ChildCPGT
PG1 -. "WorkloadRef" .-> PGT1
PG2 -. "WorkloadRef" .-> PGT2
PG3 -. "WorkloadRef" .-> PGT1
PG4 -. "WorkloadRef" .-> PGT2
classDef composite stroke-width:2px;
classDef podgroup stroke-width:1px;
classDef template stroke-width:1px,stroke-dasharray: 5 5;
classDef hiddenBranch fill:none,stroke:none;
class RootCPG,ChildCPG1,ChildCPG2 composite;
class PG1,PG2,PG3,PG4 podgroup;
class RootCPGT,ChildCPGT,PGT1,PGT2 template;
class Branch1,Branch2 hiddenBranch;
Changes to the Workload API
Workload spec gets extended with a field called CompositePodGroupTemplates.
This field contains definitions of templates for the top-level
CompositePodGroup objects. In addition, this field is a union member together
with the PodGroupTemplates field, so the Workload API can continue to be used
to represent scheduling requirements using just the PodGroupTemplates field.
// WorkloadSpec defines the desired state of a Workload.
type WorkloadSpec struct {
// ... existing fields ...
// CompositePodGroupTemplates is the list of CompositePodGroup templates that make up the Workload.
// The maximum number of templates is 8. This field is immutable.
// Exactly one of CompositePodGroupTemplates and PodGroupTemplates must be set.
//
// This field is used only when the CompositePodGroup feature gate is enabled.
//
// +featureGate=CompositePodGroup
// +optional
// +listType=map
// +listMapKey=name
// +k8s:ifDisabled("CompositePodGroup")=+k8s:forbidden
// +k8s:ifEnabled("CompositePodGroup")=+k8s:optional
// +k8s:ifEnabled("CompositePodGroup")=+k8s:unionMember
// +k8s:ifEnabled("CompositePodGroup")=+k8s:listType=map
// +k8s:ifEnabled("CompositePodGroup")=+k8s:listMapKey=name
// +k8s:ifEnabled("CompositePodGroup")=+k8s:maxItems=8
// +k8s:ifEnabled("CompositePodGroup")=+k8s:immutable
CompositePodGroupTemplates []CompositePodGroupTemplate
}
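The "exactly one of" rule in the field comment can be illustrated with a small check. This is a sketch under assumptions, not the generated validation: the trimmed-down `WorkloadSpec` below holds only template names, `validateUnion` is a hypothetical helper, and "set" is interpreted here as "non-empty list".

```go
package main

import (
	"errors"
	"fmt"
)

// WorkloadSpec is a hypothetical, trimmed-down copy that keeps only the two
// union member lists, represented by template names.
type WorkloadSpec struct {
	PodGroupTemplates          []string
	CompositePodGroupTemplates []string
}

// validateUnion enforces that exactly one of the two lists is populated.
// Assumption: a list counts as "set" when it is non-empty.
func validateUnion(spec WorkloadSpec) error {
	hasFlat := len(spec.PodGroupTemplates) > 0
	hasComposite := len(spec.CompositePodGroupTemplates) > 0
	switch {
	case hasFlat && hasComposite:
		return errors.New("podGroupTemplates and compositePodGroupTemplates are mutually exclusive")
	case !hasFlat && !hasComposite:
		return errors.New("one of podGroupTemplates or compositePodGroupTemplates must be set")
	}
	return nil
}

func main() {
	flat := WorkloadSpec{PodGroupTemplates: []string{"workers"}}
	both := WorkloadSpec{
		PodGroupTemplates:          []string{"workers"},
		CompositePodGroupTemplates: []string{"root"},
	}
	fmt.Println(validateUnion(flat)) // flat workloads remain valid
	fmt.Println(validateUnion(both)) // mixing both lists is rejected
}
```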
Similarly to PodGroupTemplate, the CompositePodGroupTemplate data structure
contains all the information necessary to construct a corresponding
CompositePodGroup object. In addition, CompositePodGroupTemplate contains
template definitions for the children groups - which can be either
CompositePodGroup or PodGroup objects:
// CompositePodGroupTemplate represents a template for a CompositePodGroup with a scheduling policy.
type CompositePodGroupTemplate struct {
// Name is a unique identifier for the CompositePodGroupTemplate within the Workload.
// It must be a DNS label. This field is required.
// This field is immutable.
//
// +required
// +k8s:required
// +k8s:format=k8s-short-name
Name string
// ...
// ... scheduling policy, disruption and constraints-related fields ...
// ...
// CompositePodGroupTemplates is the list of templates for children CompositePodGroups.
// The maximum number of templates is 8. This field is immutable.
//
// +optional
// +listType=map
// +listMapKey=name
// +k8s:optional
// +k8s:listType=map
// +k8s:listMapKey=name
// +k8s:maxItems=8
// +k8s:immutable
CompositePodGroupTemplates []CompositePodGroupTemplate
// PodGroupTemplates is the list of templates for children PodGroups.
// The maximum number of templates is 8. This field is immutable.
//
// +optional
// +listType=map
// +listMapKey=name
// +k8s:optional
// +k8s:listType=map
// +k8s:listMapKey=name
// +k8s:maxItems=8
// +k8s:immutable
PodGroupTemplates []PodGroupTemplate
}
Policy- and constraints-related fields were omitted from the template definition
for brevity and clarity. We discuss them in detail in the deep dive section
about the CompositePodGroup below. The corresponding CompositePodGroup fields
have the same structure and semantics as those in CompositePodGroupTemplate,
and their values are expected to be copied from the template when the
CompositePodGroup is created.
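To make the copy-on-creation step concrete, the sketch below shows how a controller might derive a runtime spec from a template. All types here are simplified, hypothetical stand-ins (a single `MinGroupCount` field represents the omitted policy fields), and `newSpec` is not part of the proposed API.

```go
package main

import "fmt"

// WorkloadReference mirrors the two fields described in this KEP.
type WorkloadReference struct {
	WorkloadName string
	TemplateName string
}

// Template is a hypothetical, simplified CompositePodGroupTemplate.
type Template struct {
	Name          string
	MinGroupCount int32 // stands in for the omitted policy fields
}

// Spec is a hypothetical, simplified CompositePodGroupSpec.
type Spec struct {
	ParentCompositePodGroupName *string
	WorkloadRef                 WorkloadReference
	MinGroupCount               int32
}

// newSpec copies the policy from the template and records which Workload and
// template the runtime object was created from.
func newSpec(workloadName string, tmpl Template, parent *string) Spec {
	return Spec{
		ParentCompositePodGroupName: parent,
		WorkloadRef: WorkloadReference{
			WorkloadName: workloadName,
			TemplateName: tmpl.Name,
		},
		MinGroupCount: tmpl.MinGroupCount,
	}
}

func main() {
	replica := Template{Name: "replica", MinGroupCount: 2}
	parent := "job-root"
	// Two runtime instances created from the same template.
	for i := 0; i < 2; i++ {
		spec := newSpec("my-workload", replica, &parent)
		fmt.Printf("replica-%d -> template %s, parent %s\n",
			i, spec.WorkloadRef.TemplateName, *spec.ParentCompositePodGroupName)
	}
}
```

Several runtime objects can reference the same template, as the `replica-0` and `replica-1` groups do in the diagram earlier in this section.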
Changes to the PodGroup API
There are two changes to the PodGroup spec:
- The PodGroupTemplateRef field is replaced with an optional WorkloadRef that contains a reference to the Workload together with the name of a template within that Workload object.
- A new field called ParentCompositePodGroupName is added, which denotes the name of an optional parent CompositePodGroup object.
// PodGroupSpec defines the desired state of a PodGroup.
type PodGroupSpec struct {
// ... existing fields ...
// WorkloadRef references an optional PodGroup template within the Workload
// object that was used to create the PodGroup.
// This field is immutable.
//
// +optional
// +k8s:optional
// +k8s:immutable
WorkloadRef *WorkloadReference `json:"workloadRef,omitempty"`
// ParentCompositePodGroupName contains the name of the parent composite pod group
// within the same namespace as this pod group.
// If it's nil, then this pod group is a root of a workload's hierarchy.
// This field is used only when the CompositePodGroup feature gate is enabled.
// This field is immutable.
//
// +featureGate=CompositePodGroup
// +optional
// +k8s:ifDisabled(CompositePodGroup)=+k8s:forbidden
// +k8s:ifEnabled(CompositePodGroup)=+k8s:optional
// +k8s:ifEnabled(CompositePodGroup)=+k8s:immutable
// +k8s:ifEnabled(CompositePodGroup)=+k8s:format=k8s-long-name
// +k8s:ifEnabled(CompositePodGroup)=+k8s:dependentRequired("workloadRef")
ParentCompositePodGroupName *string `json:"parentCompositePodGroupName,omitempty"`
}
WorkloadReference
WorkloadReference contains information about the referred Workload and the
reference to the template definition embedded in that Workload object that was
used to create that particular PodGroup.
// WorkloadReference references the Workload object together with the template
// that was used to create a particular PodGroup or CompositePodGroup.
type WorkloadReference struct {
// WorkloadName is the name of the Workload object that contains a template
// that was used when creating a pod group or a composite pod group. It must
// be a DNS name.
// This field is immutable.
// This field is required.
//
// +required
// +k8s:required
// +k8s:immutable
// +k8s:format=k8s-long-name
WorkloadName string
// TemplateName is the name of a template within the Workload object that
// was used to create a pod group or a composite pod group. It must be a DNS label.
// This field is immutable.
// This field is required.
//
// +required
// +k8s:required
// +k8s:immutable
// +k8s:format=k8s-short-name
TemplateName string
}
Standalone PodGroup objects
In KEP-4671, we introduced the notion of standalone PodGroups: PodGroup
objects that can be created without a matching Workload object and whose
workload reference is hence nil.
This proposal preserves this possibility but limits it to flat workloads
exclusively. In other words, PodGroup objects with a non-nil parent reference
must have a workload reference.
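This rule corresponds to the `dependentRequired("workloadRef")` marker on ParentCompositePodGroupName shown earlier. A minimal sketch of the check follows, using a trimmed-down `PodGroupSpec` and a hypothetical `validateParent` helper rather than the generated validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkloadReference mirrors the two fields described in this KEP.
type WorkloadReference struct {
	WorkloadName string
	TemplateName string
}

// PodGroupSpec is a trimmed-down copy with only the two fields relevant here.
type PodGroupSpec struct {
	WorkloadRef                 *WorkloadReference
	ParentCompositePodGroupName *string
}

// validateParent rejects a PodGroup that names a parent but has no workload
// reference. Standalone PodGroups (both fields nil) remain valid.
func validateParent(spec PodGroupSpec) error {
	if spec.ParentCompositePodGroupName != nil && spec.WorkloadRef == nil {
		return errors.New("workloadRef is required when parentCompositePodGroupName is set")
	}
	return nil
}

func main() {
	parent := "replica-0"
	fmt.Println(validateParent(PodGroupSpec{}))                                    // standalone: valid
	fmt.Println(validateParent(PodGroupSpec{ParentCompositePodGroupName: &parent})) // rejected
}
```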
CompositePodGroup API
This is the main API change in this proposal. CompositePodGroup is a new API
resource, hence we need to generate a client for it. In addition, this API
supports the status subresource that will be updated with the runtime status
information.
// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// +k8s:supportsSubresource="/status"
// CompositePodGroup represents a runtime instance of pod groups grouped together.
// CompositePodGroups are created by workload controllers (LWS, JobSet, etc...) from
// Workload.compositePodGroupTemplates.
// CompositePodGroup API enablement is toggled by the CompositePodGroup feature gate.
// FOR API REVIEW: Alternative Names: CompositeGroup
type CompositePodGroup struct {
metav1.TypeMeta
// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
//
// +optional
metav1.ObjectMeta
// Spec defines the desired state of the CompositePodGroup.
//
// +required
Spec CompositePodGroupSpec
// Status represents the current observed state of the CompositePodGroup.
//
// +optional
Status CompositePodGroupStatus
}
Spec
CompositePodGroup API spec will have a very similar structure to the spec of
the PodGroup API.
type CompositePodGroupSpec struct {
// ParentCompositePodGroupName contains the name of the parent composite pod group
// within the same namespace as this composite pod group. It must be a DNS name.
// If it's nil, then this composite pod group is a root of a workload's hierarchy.
// This field is used only when the CompositePodGroup feature gate is enabled.
// This field is immutable.
//
// +optional
// +k8s:optional
// +k8s:immutable
// +k8s:format=k8s-long-name
ParentCompositePodGroupName *string
// WorkloadRef references the CompositePodGroup template within the
// Workload object that was used to create the CompositePodGroup.
// This field is required.
// This field is immutable.
//
// +required
// +k8s:required
// +k8s:immutable
WorkloadRef *WorkloadReference
// SchedulingPolicy defines the scheduling policy for this instance of the CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// This field is immutable.
//
// +required
// +k8s:required
// +k8s:immutable
SchedulingPolicy CompositePodGroupSchedulingPolicy
// SchedulingConstraints defines optional scheduling constraints (e.g. topology) for this
// CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// This field is immutable.
// This field is only available when the TopologyAwareWorkloadScheduling feature gate is enabled.
//
// +featureGate=TopologyAwareWorkloadScheduling
// +optional
// +k8s:ifDisabled(TopologyAwareWorkloadScheduling)=+k8s:forbidden
// +k8s:ifEnabled(TopologyAwareWorkloadScheduling)=+k8s:optional
// +k8s:ifEnabled(TopologyAwareWorkloadScheduling)=+k8s:immutable
SchedulingConstraints *CompositePodGroupSchedulingConstraints
// DisruptionMode defines the mode in which a given CompositePodGroup can be disrupted.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// One of Single, All. Defaults to Single if unset. This field is immutable.
//
// +optional
// +k8s:optional
// +k8s:immutable
// +default={"single": {}}
DisruptionMode *CompositeDisruptionMode
// PriorityClassName defines the priority that should be considered when scheduling this CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// If left unspecified, it is validated and resolved similarly to the PriorityClassName field in Pods
// (i.e. if no priority class is specified, admission control can set this to the global default
// priority class if it exists. Otherwise, the composite pod group's priority will be zero).
// This field is immutable.
//
// +optional
// +k8s:optional
// +k8s:format=k8s-long-name
// +k8s:immutable
PriorityClassName string
// Priority is the value of priority of this composite pod group. Various system components
// use this field to find the priority of the composite pod group. When Priority Admission
// Controller is enabled, it prevents users from setting this field. The admission
// controller populates this field from PriorityClassName.
// The higher the value, the higher the priority.
// This field is immutable.
//
// +optional
// +k8s:optional
// +k8s:immutable
// +k8s:maximum=1000000000 # HighestUserDefinablePriority
Priority *int32
}
Workload reference
The WorkloadRef field has the same semantics as the corresponding field in the
PodGroup API, except that it refers to a CompositePodGroupTemplate entry
within the Workload object rather than to a PodGroupTemplate entry.
Another difference is that the WorkloadRef is required here. Contrary to the
PodGroup API, we do not support the notion of standalone groups in the
CompositePodGroup API.
Scheduling policy
Analogous to the scheduling policy defined at the PodGroup level for Pods, the
CompositePodGroupSchedulingPolicy specifies the policy for scheduling child groups
belonging to a CompositePodGroup. Specifically, this determines whether the nested child
groups are admitted and scheduled independently (Basic) or treated as an all-or-nothing
scheduling unit (Gang).
// CompositePodGroupSchedulingPolicy defines the scheduling configuration for a CompositePodGroup.
// Exactly one policy must be set.
// +union
type CompositePodGroupSchedulingPolicy struct {
// Basic specifies that the groups of this composite group should be scheduled independently.
//
// +optional
// +k8s:optional
// +k8s:unionMember
Basic *BasicGroupSchedulingPolicy
// Gang specifies that the groups of this composite group should be scheduled using
// all-or-nothing semantics.
//
// +optional
// +k8s:optional
// +k8s:unionMember
Gang *GangGroupSchedulingPolicy
}
// BasicGroupSchedulingPolicy indicates that the groups belonging to the composite group
// should be scheduled independently.
type BasicGroupSchedulingPolicy struct {
// This is intentionally empty. Its presence indicates that the basic
// scheduling policy should be applied. In the future, new fields may appear,
// describing such constraints on a composite pod group level without
// "all or nothing" (gang) scheduling.
}
// GangGroupSchedulingPolicy indicates that the groups belonging to the composite group
// should be scheduled using all-or-nothing semantics.
type GangGroupSchedulingPolicy struct {
// MinGroupCount is the minimum number of child groups that must be schedulable
// or scheduled at the same time for the scheduler to admit the entire group.
// It must be a positive integer.
//
// +required
// +k8s:required
// +k8s:minimum=1
// +k8s:immutable
MinGroupCount int32
}
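The difference between the two policies can be illustrated with a recursive check over a group tree. This is a deliberately simplified sketch, not the scheduler algorithm: the `Node` type and `schedulable` function are hypothetical, leaf feasibility is given as an input instead of being computed, children are evaluated in isolation (ignoring contention for shared capacity), and a Basic group is assumed to be satisfiable as soon as any child is.

```go
package main

import "fmt"

// Node is a hypothetical in-memory view of one group in the hierarchy.
type Node struct {
	Name          string
	MinGroupCount int32   // > 0: Gang policy; 0: Basic policy (sketch convention)
	Children      []*Node // empty for a leaf PodGroup
	Fits          bool    // leaves only: assumed outcome of scheduling the PodGroup
}

// schedulable reports whether the subtree rooted at n satisfies its policy.
func schedulable(n *Node) bool {
	if len(n.Children) == 0 {
		return n.Fits
	}
	var ready int32
	for _, c := range n.Children {
		if schedulable(c) {
			ready++
		}
	}
	if n.MinGroupCount > 0 {
		// Gang: at least MinGroupCount child groups must be schedulable.
		return ready >= n.MinGroupCount
	}
	// Basic: children are independent; assume one schedulable child suffices.
	return ready > 0
}

func main() {
	leaf := func(name string, fits bool) *Node { return &Node{Name: name, Fits: fits} }
	root := &Node{
		Name:          "job-root",
		MinGroupCount: 2,
		Children: []*Node{
			leaf("replica-0", true),
			leaf("replica-1", true),
			leaf("replica-2", false),
		},
	}
	fmt.Println(schedulable(root)) // true: 2 of 3 children are schedulable
	root.MinGroupCount = 3
	fmt.Println(schedulable(root)) // false: only 2 of the required 3
}
```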
Scheduling constraints
Analogously to PodGroup, we can specify topology constraints that need to be
taken into account when scheduling a CompositePodGroup.
// CompositePodGroupSchedulingConstraints defines scheduling constraints (e.g. topology)
// for a CompositePodGroup.
type CompositePodGroupSchedulingConstraints struct {
// Topology defines the topology constraints for the composite pod group.
// Currently only a single topology constraint can be specified. This may change in the future.
//
// +optional
// +listType=atomic
// +k8s:optional
// +k8s:maxItems=1
// +k8s:listType=atomic
Topology []TopologyConstraint
}
Despite having a separate structure storing the constraints for the
CompositePodGroup API, we will reuse the TopologyConstraint struct that is
already used in the PodGroupSchedulingConstraints type.
When the scheduler attempts to schedule a hierarchy of groups that specifies topology constraints on multiple levels, these constraints are resolved in a top-down manner. This means that such constraints should be ordered from the least restrictive ones at the root to the ones defining the smallest topology domains at the leaves.
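The ordering expectation can be expressed as a check along one root-to-leaf path. The sketch below is only an illustration: the `levelRank` table is an assumed ordering of topology domains, the `example.com/...` label keys are made up for this example, and `orderedTopDown` is a hypothetical helper, not part of the proposal.

```go
package main

import "fmt"

// levelRank is an assumed ordering of topology domains, from largest (0)
// to smallest. The example.com keys are hypothetical node labels.
var levelRank = map[string]int{
	"example.com/superslice": 0,
	"example.com/cube":       1,
	"kubernetes.io/hostname": 2,
}

// orderedTopDown reports whether the topology keys collected along a
// root-to-leaf path go from larger to smaller (or equal) domains.
// Unknown keys are treated as invalid in this sketch.
func orderedTopDown(keys []string) bool {
	prev := -1
	for _, k := range keys {
		rank, ok := levelRank[k]
		if !ok || rank < prev {
			return false
		}
		prev = rank
	}
	return true
}

func main() {
	// Root constrained to a superslice, leaf groups constrained to cubes.
	fmt.Println(orderedTopDown([]string{"example.com/superslice", "example.com/cube"})) // true
	// A child asking for a larger domain than its parent is not top-down ordered.
	fmt.Println(orderedTopDown([]string{"example.com/cube", "example.com/superslice"})) // false
}
```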
Disruption mode, priority class name and priority
The idea of disruption mode generalizes naturally to the CompositePodGroup
API:
// CompositeDisruptionMode defines how individual entities within a composite pod group can be disrupted.
// Exactly one mode must be set.
// +union
// FOR API REVIEW: Alternative Names: GroupDisruptionMode
type CompositeDisruptionMode struct {
// Single specifies that children can be disrupted independently from each other.
//
// +optional
// +k8s:optional
// +k8s:unionMember
Single *SingleCompositeDisruptionMode
// All specifies that all children can only be disrupted together.
//
// +optional
// +k8s:optional
// +k8s:unionMember
All *AllCompositeDisruptionMode
}
// SingleCompositeDisruptionMode means that individual children of a CompositePodGroup
// can be disrupted or preempted independently.
// FOR API REVIEW: Alternative Names: SingleGroupDisruptionMode
type SingleCompositeDisruptionMode struct {
// This is intentionally empty.
}
// AllCompositeDisruptionMode means that children of a CompositePodGroup can only be
// disrupted or preempted together.
// FOR API REVIEW: Alternative Names: AllGroupDisruptionMode
type AllCompositeDisruptionMode struct {
// This is intentionally empty.
}
The nesting of scheduling groups with potentially differing DisruptionModes at separate
levels of the hierarchy introduces support for complex disruption semantics.
However, not all hierarchical disruption configurations represent semantically clear runtime
states. For example, if a parent CompositePodGroup is configured with the All disruption
mode (requiring the entire subtree to be preempted or disrupted as a single atomic unit) but
contains child groups configured with the Single disruption mode (allowing their
individual elements to be preempted independently), the expected behavior is highly
ambiguous.
To ensure deterministic preemption and eviction behavior, the API will enforce the following structural restrictions on the Workload level API for the Alpha release:
- A CompositePodGroupTemplate configured with the All disruption mode can only have child groups (nested CompositePodGroupTemplates or leaf PodGroupTemplates) that are also configured with the All disruption mode.
- A CompositePodGroupTemplate configured with the Single disruption mode can have child groups configured with either the Single or All disruption modes.
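These two restrictions amount to one rule: an All parent never has a Single child. A minimal sketch of the recursive check over a template tree follows, using a hypothetical `Template` type that carries only a name, a mode and children.

```go
package main

import "fmt"

// Mode is a simplified representation of the disruption mode union.
type Mode string

const (
	Single Mode = "Single"
	All    Mode = "All"
)

// Template is a hypothetical, simplified group template (composite or leaf).
type Template struct {
	Name     string
	Mode     Mode
	Children []Template
}

// validateDisruption rejects any All-mode template that has a child whose
// mode is not All, and recurses into every child.
func validateDisruption(t Template) error {
	for _, c := range t.Children {
		if t.Mode == All && c.Mode != All {
			return fmt.Errorf("%s has disruption mode All but child %s has mode %s", t.Name, c.Name, c.Mode)
		}
		if err := validateDisruption(c); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := Template{Name: "root", Mode: Single, Children: []Template{
		{Name: "replica", Mode: All, Children: []Template{{Name: "workers", Mode: All}}},
	}}
	bad := Template{Name: "root", Mode: All, Children: []Template{
		{Name: "workers", Mode: Single},
	}}
	fmt.Println(validateDisruption(ok))  // Single parent may mix modes
	fmt.Println(validateDisruption(bad)) // All parent with a Single child is rejected
}
```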
Runtime structure validation and more complex configurations will be considered for Beta and future releases once concrete production use-cases and community feedback are established.
The Priority and the PriorityClassName fields are resolved in the exact same
way as they already are for Pods and PodGroups - specifically, the Priority
admission controller gets extended to additionally support the
CompositePodGroup API.
For the Alpha release, we enforce a strict single-priority constraint: all member groups and pods within a single group hierarchy tree must share the exact same priority and PriorityClassName. Support for differing group-level priorities under basic scheduling policies will be explored for the Beta release.
The value of the Priority field is being used in the following two contexts:
- CompositePodGroup objects without a parent reference are put in the scheduling queue. Their priority is taken into account by the PrioritySort plugin when determining the importance of a scheduling unit.
- When running preemption to fit a CompositePodGroup in the cluster, only preemption units (individual Pods, PodGroups, or CompositePodGroups) with a lower priority than the preemptor can be selected as prospective victims.
Status
Analogous to PodGroupStatus, CompositePodGroupStatus represents the observed state of a CompositePodGroup.
// CompositePodGroupStatus represents information about the status of a composite pod group.
type CompositePodGroupStatus struct {
// Conditions represent the latest observations of the CompositePodGroup's state.
//
// Known condition types:
// - "CompositePodGroupInitiallyScheduled": Indicates whether the overall scheduling requirement
// for the subtree under this CompositePodGroup has been satisfied. Once this condition
// transitions to True, it serves as a terminal state and will never revert to False,
// even if pods are subsequently deleted and group constraints are no longer met.
// - "DisruptionTarget": Indicates whether the CompositePodGroup is about to be terminated
// due to disruption such as preemption.
//
// Known reasons for the CompositePodGroupInitiallyScheduled condition:
// - "Unschedulable": The CompositePodGroup's subtree could not be placed due to resource constraints,
// affinity/anti-affinity, or topological constraints.
// - "SchedulerError": The CompositePodGroup cannot be scheduled due to some internal error
// that occurred during scheduling.
// - "Invalid": Set to True when kube-scheduler detects an invalid group layout during
// runtime validation. The `message` field details the specific layout violation (such as
// a detected cycle, exceeding the maximum depth of 4, or referencing multiple distinct Workloads).
//
// Known reasons for the DisruptionTarget condition:
// - "PreemptionByScheduler": The CompositePodGroup was targeted by the scheduler's preemption loop
// to free up capacity for higher-priority preemptors.
//
// +optional
// +patchMergeKey=type
// +patchStrategy=merge
// +listType=map
// +listMapKey=type
Conditions []metav1.Condition
}
API consumption model
The CompositePodGroup API is intended to be used in a similar way to how the
PodGroup API is supposed to be used according to KEP-4671.
The following sequence of events describes the lifecycle and responsibilities of various actors in the cluster in a happy path:
- User creates a true workload (e.g. JobSet),
- Controller (e.g. JobSet controller) creates the Workload object,
- Controller creates all groups in the scheduling hierarchy, from root (CompositePodGroup) to leaves (PodGroups),
- Workload’s Pods are created (e.g. by the Job controller),
- kube-scheduler schedules the Pods,
- User deletes the true workload,
- Pods are deleted by the GC controller in kube-controller-manager,
- Groups in the scheduling hierarchy are deleted by the GC controller, from leaves to the root.
Object ownership and garbage collection
Workload and PodGroup objects continue to be owned by true workloads. The same
approach is applied to the CompositePodGroup objects.
To ensure “bottom-up” garbage collection of the scheduling groups hierarchy, we
extend the idea introduced in KEP-4671 that leverages finalizers to
additionally take CompositePodGroups into account. Specifically:
- the PodGroupProtection admission plugin adds a dedicated finalizer to newly created CompositePodGroups,
- the PodGroup protection controller removes that finalizer from a CompositePodGroup when it has a deletion timestamp and no child groups exist for that CompositePodGroup anymore.
API validation
This section covers the more complex validations that need to be executed when the new API is used. Simple checks that are easily covered by declarative validation are intentionally left out, since they are already embedded in the API snippets in the paragraphs above.
Workload
The Workload object now contains a hierarchy of templates that could have a
large depth. While some workloads might have a convoluted hierarchy, we do not
want to allow arbitrarily large tree structures. We start by supporting a group
template hierarchy of up to 4 levels. This should suffice for all use cases
that we are aware of today. If the future proves otherwise, we can revisit this
limit and raise it.
Apart from that, we also need to validate uniqueness of template names within
the whole template hierarchy in a single Workload object - otherwise, template
references would be ambiguous.
To verify both of these conditions, we will add a new hand-written validation that
targets new Workload objects and performs both of these checks.
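As an illustration of what such a hand-written check might look like, the sketch below walks a template tree once and enforces both the depth limit and name uniqueness. The Template type, the helper names, and the convention that a root template counts as level 1 are assumptions made for this example, not part of the proposed API.

```go
package main

import "fmt"

// maxTemplateDepth mirrors the 4-level limit described in this KEP.
const maxTemplateDepth = 4

// Template is a hypothetical, simplified group template node.
type Template struct {
	Name     string
	Children []Template
}

// validateTemplates checks that no template is nested deeper than
// maxTemplateDepth (assumption: roots are level 1) and that every template
// name is unique across the whole hierarchy.
func validateTemplates(roots []Template) error {
	seen := map[string]bool{}
	var walk func(t Template, depth int) error
	walk = func(t Template, depth int) error {
		if depth > maxTemplateDepth {
			return fmt.Errorf("template %q exceeds maximum depth %d", t.Name, maxTemplateDepth)
		}
		if seen[t.Name] {
			return fmt.Errorf("duplicate template name %q", t.Name)
		}
		seen[t.Name] = true
		for _, c := range t.Children {
			if err := walk(c, depth+1); err != nil {
				return err
			}
		}
		return nil
	}
	for _, r := range roots {
		if err := walk(r, 1); err != nil {
			return err
		}
	}
	return nil
}

// chain builds a single-branch hierarchy with n levels, named level-1..level-n.
func chain(n int) Template {
	t := Template{Name: fmt.Sprintf("level-%d", n)}
	for i := n - 1; i >= 1; i-- {
		t = Template{Name: fmt.Sprintf("level-%d", i), Children: []Template{t}}
	}
	return t
}

func main() {
	fmt.Println(validateTemplates([]Template{chain(4)}))
	fmt.Println(validateTemplates([]Template{chain(5)}))
	dup := Template{Name: "root", Children: []Template{{Name: "x"}, {Name: "x"}}}
	fmt.Println(validateTemplates([]Template{dup}))
}
```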
Group hierarchy
Because Workload API embeds the whole template hierarchy, we can statically
verify its depth in kube-apiserver. Unfortunately, we cannot perform analogous
checks for the group hierarchy in a way that completely eliminates race
conditions - due to the eventually consistent nature of Kubernetes, cross-object
validation can be performed only in a best-effort manner.
That said, a misbehaving controller might create a Workload object and a set
of group objects that form a hierarchy which is not reflected in that
Workload. In such a case, the controller can create a group hierarchy that:
- Is deeper than allowed,
- Contains a cyclical parent reference relationship,
- References more than a single Workload.
Each of these should be treated as a failure mode, since it is essentially a manifestation of API misuse. Because of that, we will make kube-scheduler responsible for discovering them at runtime. Specifically, if the scheduler notices any of these failure modes, it will update the status of all the groups within the group hierarchy accordingly (i.e. deeming those groups invalid) and will not attempt to schedule the hierarchy at all.
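The three failure modes can be detected by following parent references from every observed group. The sketch below assumes a simplified in-memory view in which each group stores only its parent name and Workload name. The Group type and validateHierarchy function are illustrative and do not reflect the scheduler's actual cache layout.

```go
package main

import (
	"errors"
	"fmt"
)

// maxDepth mirrors the maximum hierarchy depth of 4 described in this KEP.
const maxDepth = 4

// Group is a hypothetical in-memory view of a PodGroup or CompositePodGroup.
type Group struct {
	Parent   string // empty for a root group
	Workload string // name of the referenced Workload
}

// validateHierarchy reports an error when the observed groups contain a
// parent-reference cycle, a chain deeper than maxDepth, or references to more
// than one Workload. When several violations exist, any one may be reported.
func validateHierarchy(groups map[string]Group) error {
	workloads := map[string]bool{}
	for name, g := range groups {
		workloads[g.Workload] = true
		onPath := map[string]bool{name: true}
		depth := 1
		for cur := g; cur.Parent != ""; {
			if onPath[cur.Parent] {
				return fmt.Errorf("cycle detected through group %q", cur.Parent)
			}
			parent, ok := groups[cur.Parent]
			if !ok {
				break // parent not observed yet; nothing more to check here
			}
			onPath[cur.Parent] = true
			depth++
			if depth > maxDepth {
				return fmt.Errorf("group %q is nested deeper than %d levels", name, maxDepth)
			}
			cur = parent
		}
	}
	if len(workloads) > 1 {
		return errors.New("groups reference more than one Workload")
	}
	return nil
}

func main() {
	ok := map[string]Group{
		"root": {Workload: "w"},
		"cpg":  {Parent: "root", Workload: "w"},
		"pg":   {Parent: "cpg", Workload: "w"},
	}
	cycle := map[string]Group{
		"a": {Parent: "b", Workload: "w"},
		"b": {Parent: "a", Workload: "w"},
	}
	fmt.Println(validateHierarchy(ok))
	fmt.Println(validateHierarchy(cycle))
}
```

In the design described here, such a finding would be surfaced through the Invalid reason on the group status rather than returned as an error to a caller.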
Runtime validation in Beta
A complete implementation of runtime hierarchy validation — including verifying that the group
tree matches the definition in the corresponding Workload object — will be tackled in the
Beta release. While the high-level goals are established, the detailed design requires
careful consideration to guarantee consistent decisions and prevent race conditions.
The initial outline and key challenges to address for the Beta design include:
- Watching Workload objects: To validate whether the runtime group tree matches the actual Workload-defined tree, the scheduler will potentially need to list and watch Workload objects. Their total number is strictly bounded by the number of active pods (and is typically much lower). Furthermore, because Workload objects are generally static, their update churn is extremely low. Introducing this watch is therefore expected to have a negligible impact on the scheduler’s scalability.
- Validation Loop in the Scheduling Queue: Validating group hierarchies against Workload definitions can be computationally heavier than simple single-pod validations. To ensure this does not affect the scheduler’s main loop performance, this validation will happen in the scheduling queue as a separate loop. It will target stalled or long-standing hierarchies (e.g., those residing in the queue for too long and failing to become schedulable) that have spent excessive time in the unschedulable cache.
- Race Conditions & Atomic Transitions: A central challenge is the thread-safe transition of group hierarchies. We must ensure that moving hierarchies between the queue’s “unschedulable hierarchies” (represented in memory by pendingPodGroups) and the active queue (activeQ) is done atomically under a single lock to prevent race conditions during concurrent updates and scheduling attempts.
Changes in kube-scheduler
Multi-level gang scheduling
Below we describe the high-level changes in kube-scheduler required to
support multi-level gang scheduling.
Prerequisites
To enable multi-level gang scheduling, we must generalize internal data structures, extend the core scheduling queue, and adapt plugin extension points:
Polymorphic PodGroupInfo Generalization: In the internal scheduler implementation, the existing PodGroupInfo struct (which represents a scheduling group in queue memory and cache) is generalized to polymorphically represent both leaf PodGroups and CompositePodGroups. This unified representation significantly reduces code and interface duplication, allowing scheduling plugins to process all hierarchy levels uniformly.
Scheduling Queue Support for CPGs: The core scheduling queue is extended to natively support root CompositePodGroups (CPGs without a parent reference) and standalone PodGroups as the sole root scheduling units. To support this, the QueuedEntityInfo wrapper struct (introduced in PR #138567) is generalized polymorphically to wrap either a standalone Pod, a standalone PodGroupInfo, or a nested parent CompositePodGroup hierarchy, allowing the queue to sort and pop them uniformly. To preserve this root-only queue property in the presence of unsynchronized object arrivals:
- Unobserved groups are tracked in a dedicated pendingPodGroups structure and only moved into the active scheduling queue when the root of their hierarchy (having no parent reference) is successfully observed and cached.
- Member pods belonging to any nested child groups inside an unobserved hierarchy are placed and held inside the queue’s pendingPodGroupMembers cache. These pods are blocked from active scheduling passes and only promoted when the root of their hierarchy (a PG or CPG without a parent reference) is successfully observed and enqueued.
PreEnqueue Extension Point: Currently, this extension point is defined strictly at the individual Pod level. Under KEP-6012, this prerequisite remains unchanged: PreEnqueue will operate strictly at the Pod level (where the Pod-level plugin check recursively resolves parent CPG tree admissibility for the member pod’s hierarchy).
(Note: Alternatively, one could introduce a group-level PreEnqueue extension point. However, this would require adding support for group-level queuing hints in the scheduler queue backend to react only to relevant events. This adds substantial complexity, making it too risky to deliver in the v1.37 milestone. We therefore select the Pod-level PreEnqueue as the preferred choice for Alpha, and will explore group-level abstractions in Beta).
PlacementFeasible Extension Point: Currently, this extension point exists at the PodGroup level (introduced in PR #138643). Under KEP-6012, we extend this extension point under our polymorphic PodGroupInfo representation to support the validation of hierarchical constraints at the CompositePodGroup level.
To support this cleanly, we refactor the return statuses of PlacementFeasible to be more semantically precise and aligned with scheduling framework conventions. This refactoring is highly beneficial for both flat PodGroup and hierarchical CompositePodGroup workloads. In the existing flat gang scheduling design, the PlacementFeasible check merges preemptable and unpreemptable scheduling failures under a single Unschedulable status. As a result, the scheduler cannot distinguish between a soft failure (e.g., a PodGroup or CompositePodGroup can be scheduled if we preempt other workloads in the cluster) and a hard failure (e.g., the group is mathematically impossible to schedule even if we preempt all other workloads). This leads to unnecessary resource simulation and costly preemption sweeps that are mathematically guaranteed to fail.
By introducing the refactored status space, the scheduler immediately aborts the evaluation of a nested PodGroupInfo subtree as soon as its PlacementFeasible check returns Unschedulable or UnschedulableAndUnresolvable. The parent CPG then receives this status, which may or may not trigger a further cascading abort up the hierarchy stack, potentially terminating the entire active scheduling cycle early and saving significant CPU cycles.
Additionally, these statuses explicitly dictate preemption behavior:
- Success: Constraints are fully satisfied (simulated scheduled count $\ge$ minCount or minGroupCount).
- Wait: Currently unsatisfied, but possible to satisfy purely with free capacity as remaining members are simulated.
- Unschedulable: Unsatisfied with free capacity, but resolvable via preemption. This indicates that the group should be actively considered during the workload preemption phase.
- UnschedulableAndUnresolvable: Irreversibly unsatisfied; preemption cannot help. This indicates that the group should not be considered for preemption, allowing the scheduler to completely skip preemption evaluation.
For the exact algorithm determining how these statuses are returned and evaluated during recursive scheduling, see GangScheduling Plugin Changes.
Permit Extension Point: Currently, the Permit extension point is defined strictly at the Pod level. Under KEP-6012, this remains unchanged for Alpha: we do not introduce any group-level Permit extension points at the framework level. Instead, we reuse the existing Pod-level Permit extension point to implement hierarchical checks within the GangScheduling plugin. We recognize that this Pod-level approach is suboptimal for nested hierarchies since it requires recursively validating the entire parent CPG tree structure for every individual member pod. Furthermore, following the introduction of the PlacementFeasible check, it is no longer clear whether a Permit-stage verification is strictly necessary at all. During the Beta phase, we will re-evaluate this requirement; if a Permit-stage check remains necessary, we will explore introducing a dedicated, framework-level group/hierarchy Permit extension point (operating at the PodGroup or CompositePodGroup level) to optimize and deduplicate these validations.
GangScheduling Plugin Changes
PreEnqueue: Executed at the Pod level during the enqueue stage for each member pod of a popped hierarchy unit. The plugin climbs parent references up to the root CPG ancestor, traversing the tree to recursively verify that the subtree contains the required minimum quantities:
- For a leaf PodGroup: Verifies if the group is admissible (total pending or running member pods in the cluster $\ge$ minCount).
- For a CompositePodGroup: Verifies that the number of admissible child groups in its subtree $\ge$ minGroupCount.
- If the admissibility check for the root CPG fails, the individual pod’s enqueuing is rejected, and it remains inside the scheduling queue.
PlacementFeasible: Executed for each PodGroupInfo node in the popped hierarchy tree as a part of the groupRecursiveSchedulingDefaultAlgorithm routine during in-memory simulation. Under this KEP, we extend PlacementFeasible to support both flat PodGroups and hierarchical CompositePodGroups using a unified status evaluation model.
To define the status transition logic uniformly for both PodGroup and CompositePodGroup, we introduce the following variables evaluated during the in-memory scheduling iteration:
- M: The required minimum count. For a flat PodGroup, $M =$ minCount. For a CompositePodGroup, $M =$ minGroupCount.
- S: The count of child elements successfully scheduled in memory with the Success status.
- R: The count of remaining, untried child elements that are potentially admissible.
- U: The count of child elements that returned an Unschedulable status.
- UU: The count of child elements that returned UnschedulableAndUnresolvable.
[!NOTE] Historically, pod-level UnschedulableAndUnresolvable may have been used to denote pods that cannot be scheduled even if other pods in the cluster are deleted (preempted). In the context of pod groups, preemption may also cause some pods to be assigned to different nodes, which may break plugin assumptions. Until that contract is explicitly defined, at the leaf level, U will be the count of pods that returned Unschedulable or UnschedulableAndUnresolvable, and UU will be 0.
Using these variables, the PlacementFeasible status is resolved as follows:
- Success: $S \ge M$. The constraints are fully satisfied.
- Wait: $S < M$, but $S + R \ge M$. Currently unsatisfied, but satisfying the constraints purely with free capacity remains possible.
- Unschedulable: $S + R < M$, but $S + R + U \ge M$. The constraints cannot be satisfied purely with free capacity, but triggering preemption on behalf of the Unschedulable child elements can resolve the constraints.
- UnschedulableAndUnresolvable: $S + R + U < M$. Even with maximum preemption of all Unschedulable elements, it is mathematically impossible to satisfy the minimum constraints because there are not enough child elements to schedule or too many child elements failed with the UnschedulableAndUnresolvable status.
This status logic is applied identically at all levels of the tree:
- For a leaf PodGroup: The child elements are the individual member pods. The status of each pod is checked (whether it was successfully placed, failed due to soft resource constraints, or failed due to hard selector/topology mismatches).
- For a CompositePodGroup: The child elements are its nested child groups, and their status is checked recursively using the returned PlacementFeasible values.
Permit: Executed at the Permit stage of the scheduling cycle strictly at the individual Pod level. We extend the implementation of the existing Permit plugin to climb the parent group references and traverse the tree structure to verify constraints before releasing member pods for final binding:
- If the Pod belongs to a standalone group (not part of a hierarchy): We do not introduce any changes. The plugin relies on the pre-existing behavior to hold member pods in a waiting state and release them once the group’s minCount member pods are successfully scheduled.
- If the Pod belongs to a group hierarchy tree (nested under a parent CPG): We override the flat group-level checks. When a member pod is evaluated, the plugin climbs parent references up to the root CPG ancestor. It traverses the entire hierarchy tree structure in the permit cache (ensuring all nested child groups satisfy parent minGroupCount and child minCount thresholds) starting at this root level, releasing the waiting member pods for final binding only when the entire tree’s constraints are satisfied.
EventsToRegister: Currently, the flat GangScheduling plugin’s EventsToRegister method registers a subscription for PodGroup ADD events to promote blocked units. To support multi-level hierarchies, we extend this method to additionally subscribe to CompositePodGroup ADD events.
Even though the scheduling specs (such as minGroupCount) are immutable under Alpha to limit design complexity, subscribing to CompositePodGroup ADD events remains strictly required: as a controller dynamically creates and adds new nested child CompositePodGroup objects to the API server, their arrival modifies the runtime hierarchy tree structure, which can satisfy the parent CPG’s minGroupCount threshold and promote blocked root CPGs from the unschedulable queue.
Recursive Scheduling Cycle Execution
In schedule_one_podgroup.go, the scheduler processes a popped root unit (a root
PodGroupInfo) by running groupRecursiveSchedulingDefaultAlgorithm, which is
the recursive version of podGroupSchedulingDefaultAlgorithm routine.
Throughout this recursive simulation phase, all pod-to-node assignments
are tracked strictly in memory in the nodeInfoSnapshot as temporary state
before final binding:
- If the active node is a leaf PodGroup: Runs the same logic as in the standard, flat, single-level podGroupSchedulingDefaultAlgorithm routine to simulate member pod placements in memory, but uses the refactored PlacementFeasible status interpretation logic. The scheduling loop reacts to the Wait, Success, Unschedulable, and UnschedulableAndUnresolvable statuses in the same way as the recursive child-group simulation described below for CompositePodGroups.
- If the active node is a CompositePodGroup: It iterates through its nested child groups in their pre-sorted order, executing the recursive groupRecursiveSchedulingDefaultAlgorithm sequentially. After in-memory scheduling of each child group, the scheduler invokes the extended PlacementFeasible checker under the parent CPG’s PodGroupInfo to evaluate its overall state:
  - Success: The CPG’s nested minimum constraints (minGroupCount) are met. Under the greedy Alpha phase, the scheduler continues simulating subsequent sibling groups to maximize cluster utilization.
  - Wait: The minGroupCount is not yet satisfied, but satisfying the constraint purely with free capacity remains possible. The scheduler continues processing.
  - Unschedulable: The parent CPG constraints cannot be met with free capacity, but are resolvable via preemption. The scheduler immediately aborts its child-group evaluation loop, reverts all of its in-memory changes, and returns Unschedulable up the stack.
  - UnschedulableAndUnresolvable: The parent CPG is mathematically impossible to satisfy. The scheduler immediately aborts its child-group evaluation loop, reverts all of its in-memory changes, and returns UnschedulableAndUnresolvable up the stack.
- Commit Bindings: If the root-level recursion resolves and returns Success, the scheduler commits and writes the entire tree’s resolved pod bindings from memory to the API server.
[!NOTE] No-Backtracking: Sibling child groups under a CompositePodGroup are simulated sequentially in their pre-sorted order without backtracking. If a child group placement (e.g. PG-1) consumes resources in a way that subsequently blocks its sibling (e.g. PG-2) from meeting its minCount minimum requirement, which in turn prevents the parent CPG from satisfying its minGroupCount threshold, the scheduler does not retroactively evaluate alternative placements for the earlier child group.
Enforcing a greedy recursive choice without backtracking prevents exponential scheduling complexity at the cost of sub-optimal decisions that may trigger preemption. While sufficient for the Alpha phase, these greedy scheduling algorithm trade-offs will be re-evaluated for the Beta release to explore bounded backtracking heuristics (such as restricted search depth or bounded branches) that optimize overall scheduling success rates.
In-memory simulation state revert across the recursion stack
In the existing flat gang scheduling implementation,
podGroupSchedulingDefaultAlgorithm is fully self-contained. When a member
pod is assumed, a revertFn is registered and executed via defer upon
function exit, restoring the nodeInfoSnapshot to its pre-execution state.
Under a nested CompositePodGroup hierarchy, deferred local reverts on
function exit would prematurely clear assumed pod allocations of a
successfully simulated child group (e.g. PG-1) before its sibling (e.g.
PG-2) is evaluated. Sibling groups would fail to see the consumed capacity
in the memory snapshot, leading to resource over-commitments and deadlocks.
To resolve this, the recursive algorithm does not defer execution of the
revert closures locally. Instead, as each child group runs its in-memory
simulation, the registered revertFn closures are returned and accumulated
([]revertFn) up the recursion stack to the root CPG. Upon exit from the
root-level groupRecursiveSchedulingDefaultAlgorithm execution pass, the
accumulated revert closures are always executed all-at-once, cleanly
restoring the shared nodeInfoSnapshot to its pre-execution state before the
separate, asynchronous binding cycle triggers. To preserve the cache’s
transactional integrity, these accumulated reverts must be executed in the
exact reverse order of their registration (matching how native deferred
execution operates), ensuring the last registered revert is called first to
cleanly roll back node allocations.
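The accumulate-then-revert pattern described above can be sketched as follows. The snapshot type is a toy stand-in for nodeInfoSnapshot, and assume, simulateGroup, and runReverts are hypothetical helper names used only for illustration.

```go
package main

import "fmt"

// revertFn undoes one in-memory assumption.
type revertFn func()

// snapshot is a toy stand-in for nodeInfoSnapshot: node name -> assumed pods.
type snapshot map[string]int

// assume places one pod on a node in memory and returns the closure that
// undoes it. The log records the order in which reverts actually run.
func assume(s snapshot, node string, log *[]string) revertFn {
	s[node]++
	return func() {
		s[node]--
		*log = append(*log, node)
	}
}

// simulateGroup assumes one pod per listed node and returns the revert
// closures to the caller instead of deferring them locally.
func simulateGroup(s snapshot, nodes []string, log *[]string) []revertFn {
	var reverts []revertFn
	for _, n := range nodes {
		reverts = append(reverts, assume(s, n, log))
	}
	return reverts
}

// runReverts executes the accumulated closures in reverse registration
// order, matching the ordering of deferred calls.
func runReverts(reverts []revertFn) {
	for i := len(reverts) - 1; i >= 0; i-- {
		reverts[i]()
	}
}

func main() {
	s := snapshot{}
	var log []string
	var all []revertFn
	// Two sibling groups are simulated; the second one still sees the
	// capacity consumed by the first because nothing was reverted yet.
	all = append(all, simulateGroup(s, []string{"n1", "n2"}, &log)...)
	all = append(all, simulateGroup(s, []string{"n2", "n3"}, &log)...)
	fmt.Println("assumed on n2:", s["n2"])
	runReverts(all)
	fmt.Println("after revert:", s, "order:", log)
}
```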
Scheduling sequence for PodGroups
To ensure a deterministic processing sequence, child groups under a
CompositePodGroup are sorted and cached inside the scheduling queue when the
group objects are added to queue memory. During the scheduling cycle, the
scheduler evaluates descendant child groups in this pre-sorted order.
For the Alpha release, we can start with something simple, such as sorting
PodGroups / CompositePodGroups by their creation timestamp, and change that
logic in Beta if necessary.
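A minimal sketch of such an ordering is shown below. The childGroup type and helper names are hypothetical, and the tie-break on name for equal timestamps is an assumption added here to keep the order deterministic; the KEP text does not specify one.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// childGroup is a hypothetical view of a PodGroup or CompositePodGroup
// that carries only the fields needed for ordering.
type childGroup struct {
	Name    string
	Created time.Time
}

// newChild builds a childGroup from a Unix timestamp in seconds.
func newChild(name string, unixSeconds int64) childGroup {
	return childGroup{Name: name, Created: time.Unix(unixSeconds, 0)}
}

// sortChildren orders child groups by creation timestamp, oldest first.
// Ties are broken by name (an assumption, not part of the KEP).
func sortChildren(children []childGroup) {
	sort.SliceStable(children, func(i, j int) bool {
		a, b := children[i], children[j]
		if !a.Created.Equal(b.Created) {
			return a.Created.Before(b.Created)
		}
		return a.Name < b.Name
	})
}

func main() {
	children := []childGroup{
		newChild("pg-b", 200),
		newChild("cpg-a", 100),
		newChild("pg-a", 200),
	}
	sortChildren(children)
	for _, c := range children {
		fmt.Println(c.Name)
	}
}
```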
Preemption triggering rules
Preemption is strictly evaluated and executed only at the root level of the group hierarchy, preventing isolated and competing preemption passes at intermediate levels.
Consistent with flat gang scheduling (KEP-4671), binding and preemption
never occur inside the same scheduling cycle. The preemption triggering
rules under this KEP are updated to align with the refactored PlacementFeasible
statuses:
- If the root-level PlacementFeasible returns Success: If the recursive in-memory simulation successfully satisfies the minimum scheduling requirements (at least minGroupCount child groups under a CPG tree, or minCount pods under a flat PodGroup), the scheduling cycle succeeds. The scheduler does not trigger preemption. Instead, it commits bindings for all successfully placed member pods (comprising the minimal gang and any extra pods that were placed under Alpha’s greedy pass). Any remaining member pods that failed scheduling are returned to the scheduling queue, which subsequently places the entire hierarchy back into the queue for re-evaluation.
- If the root-level PlacementFeasible returns Unschedulable: If the simulation fails to satisfy the minimum requirements purely with free capacity, but the failure is resolvable (i.e. Unschedulable), no pod bindings are committed. The scheduler triggers the workload preemption engine at the root level of the hierarchy to release cluster capacity for the failed member pods.
- If the root-level PlacementFeasible returns UnschedulableAndUnresolvable: If the simulation reveals that the constraints are mathematically impossible to satisfy (even if maximum preemption is used, e.g. due to hard node selectors or lack of admissible member pods), the scheduler does not commit any bindings and does NOT trigger preemption. The scheduling cycle aborts early, avoiding wasted CPU processing, and the hierarchy is sent back to the queue.
- During subsequent cycles: In subsequent scheduling cycles, when the root CPG with pending extra or newly scaled member pods pops from the queue, the standard recursive scheduling algorithm is executed for the root CPG hierarchy. If the recursive algorithm fails to place some (for gang policy) or all (for basic policy) of the pending member pods, the scheduler triggers the preemption engine at the root CPG level to release capacity for the remaining unschedulable pods, even if some other pending pods were successfully placed in the current cycle.
In summary, workload preemption is triggered at the root level if and only if:
the root-level PlacementFeasible returns Unschedulable (i.e., the scheduling
policy is not satisfied but is resolvable via preemption), OR the scheduling
policy is satisfied, the hierarchy was already scheduled in a previous cycle,
and the scheduler failed to place some (for gang policy) or all (for basic policy)
pending member pods in the current cycle.
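The summary rule can be expressed as a single predicate. The sketch below is one possible reading of it: the Status and Policy types, the shouldPreempt function, and its parameters are hypothetical, and "failed to place some" is interpreted as placed < pending while "failed to place all" is interpreted as placed == 0.

```go
package main

import "fmt"

// Status mirrors the four refactored PlacementFeasible outcomes.
type Status int

const (
	Success Status = iota
	Wait
	Unschedulable
	UnschedulableAndUnresolvable
)

// Policy distinguishes the gang and basic scheduling policies.
type Policy int

const (
	Gang Policy = iota
	Basic
)

// shouldPreempt sketches the root-level preemption trigger:
//   - root status Unschedulable always triggers preemption;
//   - otherwise preemption is triggered only if the policy is satisfied, the
//     hierarchy was scheduled in a previous cycle, and pending pods were left
//     unplaced (some for gang, all for basic).
func shouldPreempt(root Status, alreadyScheduled bool, policy Policy, pending, placed int) bool {
	if root == Unschedulable {
		return true
	}
	if root != Success || !alreadyScheduled || pending == 0 {
		return false
	}
	if policy == Gang {
		return placed < pending
	}
	return placed == 0
}

func main() {
	fmt.Println(shouldPreempt(Unschedulable, false, Gang, 0, 0))
	fmt.Println(shouldPreempt(UnschedulableAndUnresolvable, false, Gang, 0, 0))
	fmt.Println(shouldPreempt(Success, true, Gang, 3, 2))
	fmt.Println(shouldPreempt(Success, true, Basic, 3, 2))
}
```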
Inadmissible child groups
A root CompositePodGroup might successfully pass the PreEnqueue queue filter,
yet contain child groups that are currently inadmissible (e.g., they do not
have enough active member pods in the cluster queue to reach their minCount).
For example, consider a root CPG (minGroupCount=2) containing three nested child
groups: CPG-1, PG-2, and PG-3. Both PG-2 and PG-3 are fully admissible and
schedulable. However, CPG-1 has minGroupCount=2 and contains only one active child
group in the cluster: PG-11 (minCount=100) with 100 pending member pods.
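The admissibility of this example hierarchy can be reproduced with a small recursive check, following the PreEnqueue rules (pods $\ge$ minCount for a leaf, admissible children $\ge$ minGroupCount for a CPG). The Group type and function names are hypothetical, and the sizes chosen for PG-2 and PG-3 (10 pods, minCount 10) are assumptions, since the text only states that they are admissible.

```go
package main

import "fmt"

// Group is a hypothetical simplified node of the scheduling hierarchy.
type Group struct {
	Name     string
	Leaf     bool     // true for a PodGroup, false for a CompositePodGroup
	Min      int      // minCount for a leaf, minGroupCount otherwise
	Pods     int      // pending or running member pods (leaf only)
	Children []*Group // child groups (CompositePodGroup only)
}

// admissible applies the PreEnqueue-style check recursively.
func admissible(g *Group) bool {
	if g.Leaf {
		return g.Pods >= g.Min
	}
	count := 0
	for _, c := range g.Children {
		if admissible(c) {
			count++
		}
	}
	return count >= g.Min
}

// exampleHierarchy builds the root CPG from the example: CPG-1 needs two
// child groups but only PG-11 exists, while PG-2 and PG-3 are admissible.
func exampleHierarchy() *Group {
	pg11 := &Group{Name: "PG-11", Leaf: true, Min: 100, Pods: 100}
	cpg1 := &Group{Name: "CPG-1", Min: 2, Children: []*Group{pg11}}
	pg2 := &Group{Name: "PG-2", Leaf: true, Min: 10, Pods: 10} // assumed sizes
	pg3 := &Group{Name: "PG-3", Leaf: true, Min: 10, Pods: 10} // assumed sizes
	return &Group{Name: "root", Min: 2, Children: []*Group{cpg1, pg2, pg3}}
}

func main() {
	root := exampleHierarchy()
	fmt.Println("root admissible:", admissible(root))
	fmt.Println("CPG-1 admissible:", admissible(root.Children[0]))
}
```

The root passes the check because two of its three children are admissible, even though CPG-1 is not.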
Without a pre-simulation check, the scheduling algorithm would execute as follows:
- The scheduler starts simulating the root CPG’s children, starting with CPG-1.
- To simulate CPG-1, the scheduler sequentially places all 100 member pods of its first child, PG-11, in memory.
- After PG-11 finishes, the scheduler invokes PlacementFeasible on CPG-1. Since CPG-1 only scheduled 1 child group ($S=1$) but requires $M=2$, and has no more remaining children ($R=0$), PlacementFeasible returns UnschedulableAndUnresolvable.
- The simulation of CPG-1 aborts, and the status is returned up to the root CPG.
- The scheduler invokes PlacementFeasible on the root CPG. One child failed with $UU=1$, but two remain untried ($R=2$). Since $S+R \ge M$ ($0+2 \ge 2$), the root status is Wait. The scheduler continues processing sibling groups.
- The scheduler successfully simulates PG-2 and PG-3. The root CPG satisfies its minGroupCount and the cycle succeeds, committing bindings for PG-2 and PG-3. However, significant CPU time was wasted simulating the 100 pods of PG-11.
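The status values in the walk-through above follow directly from the PlacementFeasible resolution rules. The sketch below encodes those rules in a hypothetical resolve function and replays the numbers from the example. Note that UU does not appear in the formula itself: elements counted in UU are simply excluded from S, R, and U.

```go
package main

import "fmt"

// Status mirrors the four refactored PlacementFeasible outcomes.
type Status int

const (
	Success Status = iota
	Wait
	Unschedulable
	UnschedulableAndUnresolvable
)

func (s Status) String() string {
	return [...]string{"Success", "Wait", "Unschedulable", "UnschedulableAndUnresolvable"}[s]
}

// resolve applies the status rules for a required minimum m, given the counts
// of scheduled (s), remaining untried (r), and Unschedulable (u) children.
func resolve(m, s, r, u int) Status {
	switch {
	case s >= m:
		return Success
	case s+r >= m:
		return Wait
	case s+r+u >= m:
		return Unschedulable
	default:
		return UnschedulableAndUnresolvable
	}
}

func main() {
	// CPG-1 after PG-11: M=2, S=1, R=0, U=0.
	fmt.Println("CPG-1:", resolve(2, 1, 0, 0))
	// Root after CPG-1 failed: M=2, S=0, R=2, U=0 (UU=1 is not counted).
	fmt.Println("root after CPG-1:", resolve(2, 0, 2, 0))
	// Root after PG-2 and PG-3 succeeded: M=2, S=2, R=0, U=0.
	fmt.Println("root at end:", resolve(2, 2, 0, 0))
}
```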
Under the refactored PlacementFeasible status model, preemption triggering is
already handled correctly in these scenarios: the status evaluation will naturally
resolve to UnschedulableAndUnresolvable once it determines that the minimum
threshold cannot be satisfied, preventing futile preemption loops entirely.
However, as a performance optimization, we can skip evaluating the simulation of inadmissible branches entirely:
- Alpha: Pre-simulation feasibility is not executed in the Alpha phase. The scheduler performs the sequential child group simulations, relying on the post-evaluation PlacementFeasible check to trigger early aborts.
- Beta: We will implement an optimized pre-simulation check. The scheduler will invoke the PlacementFeasible checker before starting the recursive in-memory simulation of child groups. If the pre-simulation check determines that a subtree is inadmissible (e.g., a nested child group is missing too many member pods to ever satisfy its minCount), it immediately returns UnschedulableAndUnresolvable, bypassing all child pod placements and saving costly CPU cycles.
Resource stealing under greedy evaluation
When a CPG is evaluated, child groups are processed sequentially in their
pre-sorted order. Under the greedy evaluation, child groups try to
schedule as many member pods as possible (potentially exceeding their minCount
requirements).
This can lead to resource stealing in capacity-constrained clusters: an
early, greedy child group consumes all available slots, preventing a sibling
child group from reaching its minCount and causing the entire root CPG gang
to fail scheduling.
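A toy model makes the effect concrete. The sketch below treats cluster capacity as a single pool of identical slots and assumes every sibling group is required (minGroupCount equals the number of children). The PodGroup type, the simulate function, and the greedy flag are hypothetical simplifications and do not model the real scheduler.

```go
package main

import "fmt"

// PodGroup is a hypothetical leaf group with a size and a minimum.
type PodGroup struct {
	Name     string
	Pods     int // member pods that exist
	MinCount int // pods required for the group to be satisfied
}

// simulate places sibling groups sequentially into a pool of identical slots.
// In greedy mode each group tries to place all of its pods. Otherwise it
// places only minCount pods. It returns true if every group reaches minCount.
func simulate(groups []PodGroup, capacity int, greedy bool) bool {
	for _, g := range groups {
		want := g.MinCount
		if greedy {
			want = g.Pods
		}
		placed := want
		if placed > capacity {
			placed = capacity
		}
		capacity -= placed
		if placed < g.MinCount {
			return false
		}
	}
	return true
}

func main() {
	// PG-1 has more pods than its minimum, so a greedy pass takes all 4 slots.
	groups := []PodGroup{
		{Name: "PG-1", Pods: 4, MinCount: 2},
		{Name: "PG-2", Pods: 2, MinCount: 2},
	}
	fmt.Println("greedy:", simulate(groups, 4, true))
	fmt.Println("minimum only:", simulate(groups, 4, false))
}
```

With 4 slots, the greedy pass lets PG-1 consume everything and PG-2 misses its minCount, while placing only the minimum for each group succeeds.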
For the Alpha phase, we do not optimize or solve this resource stealing
challenge for complex, constrained layouts. Instead, we focus on ensuring that
the most common and typical use-cases work out-of-the-box: scenarios
where the minimum constraints equal the actual size (minCount = actualCount
and minGroupCount equals the total child group count). Under this standard
baseline, this problem doesn’t exist.
For the Beta release, we will evaluate two distinct alternatives to solve resource stealing for arbitrary, multi-level layouts:
- Double-run in a single cycle: The scheduler runs the recursive algorithm twice within a single scheduling cycle: a non-greedy pass first to place the minimal gang, followed by a greedy pass for extra pods. This approach is highly responsive; if active member pods are deleted and cause a scheduled CPG to fall below its minimum thresholds, a single active cycle can immediately recover the gang via a non-greedy pass. In terms of complexity, running two passes in a single cycle does not significantly degrade latency, since the CPU cost is strictly dominated by individual pod schedulings (each pod is still evaluated exactly once end-to-end in both models).
- Distinct scheduling cycles: The scheduler runs a single non-greedy pass in the
active cycle, commits the minimal gang, and lets extra pending member pods
trigger separate, subsequent scheduling cycles (running in greedy mode) to
place the remaining pods. This approach keeps each individual pass short
and mitigates Head-of-Line (HoL) queue blocking by allowing
higher-priority workloads to interleave and schedule in between the separate
greedy passes. A potential drawback is that the cycle mode is not a one-off
transition from non-greedy to greedy. If active member pods are deleted (e.g. due to
node failures) and cause the CPG tree to drop below its
minGroupCount threshold, the scheduling policy is no longer satisfied. The scheduler must then switch the cycle mode back to non-greedy to re-secure the minimal gang. Coordinating this behavior across separate, decoupled scheduling passes in the queue adds code complexity.
Under topology-aware scheduling, a multi-pass approach also introduces a key trade-off: a non-greedy first pass may select and lock in an optimal topology placement for the minimal gang that subsequently prevents the second greedy pass from placing optional pods (whereas a single pass could identify a globally feasible topology for all pods). This trade-off is unavoidable and would occur even in single-pass models when optional pods are added later, representing a fundamental trade-off between securing placements for the minimal gang and finding the globally optimal topology configuration. We will evaluate these topological trade-offs in detail during the Beta phase.
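A minimal sketch of the double-run alternative, under simplifying assumptions: capacity is counted in abstract slots, every child group is required (minGroupCount equals the number of children), and topology is ignored. The `schedule_two_pass` function and its tuple layout are hypothetical and only illustrate the ordering of the two passes.

```python
def schedule_two_pass(children, free_slots):
    """Toy model of a non-greedy pass followed by a greedy pass.

    children: list of (name, min_count, pending_pods) tuples in pre-sorted order.
    Returns pods placed per child, or None if the minimal gang does not fit.
    """
    placed = {name: 0 for name, _, _ in children}

    # Pass 1 (non-greedy): reserve exactly min_count pods per child.
    for name, min_count, pending in children:
        if min_count > min(pending, free_slots):
            return None  # minimal gang cannot be placed
        placed[name] = min_count
        free_slots -= min_count

    # Pass 2 (greedy): hand leftover slots to the remaining pending pods.
    for name, min_count, pending in children:
        extra = min(pending - min_count, free_slots)
        placed[name] += extra
        free_slots -= extra
    return placed


print(schedule_two_pass([("pg-a", 4, 8), ("pg-b", 4, 4)], 8))
# {'pg-a': 4, 'pg-b': 4}
```

In this model the first child can no longer starve its sibling: extra pods are considered only after every child has secured its minimum.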
Handling new pods for scheduled hierarchies
If a controller scales up a scheduled CompositePodGroup hierarchy by
creating new member pods, the queue backend places the root CPG back into the
scheduling queue. When this root CPG pops from the queue, the scheduler
executes the standard recursive scheduling algorithm starting at this root
level. Sibling child groups are evaluated in greedy mode, meaning that
successfully scheduled child groups are allowed to exceed their minCount
limits, accommodating the new member pods.
If the recursive algorithm succeeds in placing these new pods, they are bound to nodes. However, if the recursive algorithm fails to schedule them (due to saturated capacity):
- The root CPG scheduling policy remains satisfied (since the minimal gang was already successfully scheduled and remains active in the cluster).
- The individual member pods fail scheduling and trigger the preemption engine at the root CPG level.
Suboptimal scheduling decisions
As already mentioned, optimal multi-level scheduling is an NP-hard computational
problem underneath, so solving it at a large scale is infeasible.
kube-scheduler mitigates this through operating on a single Pod at a time
while making scheduling decisions for a whole group hierarchy. This is
essentially a heuristic that relaxes the requirement for global optimum in
exchange for drastically reduced computational complexity.
In particular, this implies that kube-scheduler might fail to find a placement
for a multi-level gang even when the cluster has sufficient resources.
This can be a side effect of suboptimal placement decisions made for
individual Pods, e.g. due to the order in which placements for individual Pods
of a gang were established. This problem already exists when scheduling a
heterogeneous PodGroup gang, but it might manifest itself to a larger degree here.
For Beta, we will decide whether we need additional heuristics to increase the chance of getting a group hierarchy scheduled. Heuristics will not overcome the problem’s NP-hardness, however. Whichever path we take, we will document these scheduling limitations so that users are aware of potential suboptimal placement scenarios.
Integration with workload-aware preemption
If a root PodGroupInfo (representing a root CPG or standalone PG) is
unschedulable and triggers preemption (according to the conditions defined in
Preemption triggering rules), the scheduler
performs the standard preemption steps to calculate victims and spawn
evictions. To support multi-level workloads under this unified preemption
framework:
- Group running pods into collective preemption victims: When evaluating
  preemption costs, the preemption algorithm groups victim pods into collective
  victim objects. For each victim pod candidate, the scheduler traverses up its
  parent group hierarchy (following parent references under the
  PodGroupInfo cache) to resolve the highest ancestor configured with the All disruption mode. This highest ancestor CPG node defines the indivisible preemption unit.
- Adapt PreEnqueue and plugin lifecycles: The
  PreEnqueue method of the DefaultPreemption plugin and internal queue backoff structures are extended to track pending root PodGroupInfo preemptors while they wait in the queue for their calculated victim evictions to successfully delete and free up cluster capacity.
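The ancestor resolution in the first item can be sketched as a walk over parent references. The `GroupNode` class and the `preemption_unit` function below are hypothetical names, not scheduler types. The sketch also assumes that the pod's own group counts as a candidate and that the walk continues past ancestors with a different disruption mode; the text above leaves both details open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroupNode:
    """Hypothetical node in the group hierarchy (a PG or a CPG)."""
    name: str
    disruption_mode: str  # "All" or "Single", the modes named in this KEP
    parent: Optional["GroupNode"] = None


def preemption_unit(group: GroupNode) -> Optional[GroupNode]:
    """Return the highest node on the path to the root whose mode is "All".

    Returns None when no node on the path uses the "All" mode, in which case
    the pod would be treated as an individual victim in this toy model.
    """
    highest = None
    node = group
    while node is not None:
        if node.disruption_mode == "All":
            highest = node  # keep overwriting: the last match is the highest
        node = node.parent
    return highest


root = GroupNode("cpg-root", "Single")
mid = GroupNode("cpg-mid", "All", parent=root)
leaf = GroupNode("pg-leaf", "Single", parent=mid)
print(preemption_unit(leaf).name)  # cpg-mid
```

Victim pods that resolve to the same node would then be grouped into one collective victim.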
Multi-level topology-aware scheduling
In single-level topology-aware scheduling (KEP-5732), the scheduler generates a flat list
of candidate placements for a leaf PodGroup, runs in-memory simulations across these
placements, and scores them using PlacementScorerPlugins to select the optimal placement.
For multi-level scheduling using CompositePodGroups (which do not own pods directly but
instead act as parents for nested CompositePodGroups or leaf PodGroups), the scheduler
resolves placements recursively down the hierarchy tree:
CompositePodGroup Scheduling Algorithm
The multi-level topology-aware scheduling (TAS) algorithm
(groupRecursiveSchedulingPlacementAlgorithm) is a direct extension of the recursive
groupRecursiveSchedulingDefaultAlgorithm execution cycle defined in
Multi-level gang scheduling, augmented to
determine appropriate topology placements.
The algorithm recursively evaluates and identifies feasible placements (topology domains satisfying the CPG’s topology constraints), scoring and selecting the best resolved configuration:
- Placement Generation: The scheduler generates candidate topology domains
  matching the CPG’s topology constraint. If a parent CPG has already been
  assumed in a specific topology domain (e.g., net-block-A), candidate placement generation for its descendants is strictly restricted to domains located within that parent domain (e.g., racks within net-block-A).
- Candidate Placement Evaluation & Filtering: For each candidate parent
  domain, the scheduler:
  - Temporarily assumes the candidate parent domain in the nodeInfoSnapshot as the active scheduling context.
  - Recursive Child Group Resolution: Sequentially invokes the recursive groupRecursiveSchedulingPlacementAlgorithm cycle for each child group, confining their placement candidates strictly to nodes within the assumed parent domain scope. Sibling child groups are evaluated in their pre-sorted order without sibling backtracking, taking into account the assumed pod assignments of already simulated siblings inside the nodeInfoSnapshot. These sibling assignments are automatically reverted when the parent domain assumption is reverted.
  - Group Constraint Verification: Invokes the extended PlacementFeasible checker under PodGroupInfo. If it returns Success after processing all children, the parent placement is marked as feasible and stored in the list of feasible placements.
  - Reverts the temporary parent domain assumption in the nodeInfoSnapshot.
- Best Placement Selection: The scheduler runs the registered PodGroupInfo placement scorer plugins, which are extended to support CompositePodGroups alongside PodGroups, over the entire list of saved feasible placements and selects the one with the highest score. The method returns pod-to-node assignments from the best placement together with the Success status to the parent group.
At the root level, the scheduler commits and writes the entire tree’s resolved pod bindings to the API server. Analogous to multi-level gang scheduling, if direct scheduling fails for the root CPG, the scheduler invokes the preemption algorithm strictly at the root CPG level (if needed).
[!NOTE] Backtracking Search Space in TAS: Compared to flat capacity evaluations under gang scheduling, the topology placement search space under TAS is multidimensional and exponentially larger, as each tree level generates and evaluates multiple physical topology domains.
To manage the scheduling latency trade-off, the scheduling cycle avoids backtracking when evaluating child groups under a target parent candidate domain. At any single level in the tree, this reduces the search complexity for placing $C$ child groups, each with $D$ placement options, from exponential ($\mathcal{O}(D^C)$) to linear ($\mathcal{O}(C \cdot D)$).
While this greedy search trade-off helps prevent severe scheduling latency degradation in the Alpha phase, it increases placement failure rates in capacity-constrained environments. Bounded backtracking heuristics and their latency trade-offs will be thoroughly evaluated for Beta.
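To give a sense of scale, the two bounds from the note can be evaluated for concrete values. The `search_sizes` helper and the sample values of $C$ and $D$ are illustrative assumptions, not measurements.

```python
def search_sizes(children, domains):
    """Return (exhaustive, sequential) candidate counts at a single tree level.

    exhaustive: every combination of child placements, D ** C
    sequential: each child evaluated once per domain without backtracking, C * D
    """
    return domains ** children, children * domains


print(search_sizes(4, 8))  # (4096, 32)
```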
Example
Consider a workload consisting of a root CompositePodGroup (CPG-root)
configured with a gang scheduling policy (minGroupCount=2), containing two
child PodGroups (PG-1 and PG-2). Both child groups are gangs requiring a
minimum pod count (minCount=5) of homogeneous pods. CPG-root defines a
topology constraint of block (demanding that all groups land inside a single
net-block), while both PG-1 and PG-2 require a rack topology constraint.
The cluster physical topology is configured as follows:
- block-A contains rack-A1 (has 3 free slots) and rack-A2 (has 5 free slots).
- block-B contains rack-B1 (has 5 free slots) and rack-B2 (has 5 free slots).
The scheduling algorithm resolves this hierarchy recursively:
- CPG-root Evaluation:
  - Generates block placements for CPG-root: block-A and block-B.
  - Evaluate Candidate block-A:
    - Temporarily assumes block-A in nodeInfoSnapshot.
    - Resolve child PG-1 under block-A:
      - Restricted to block-A. Invokes standard flat PG scheduling cycle.
      - PG-1 generates rack candidate placements in block-A: rack-A1 and rack-A2.
      - The scheduler simulates PG-1 under candidates in nodeInfoSnapshot:
        - Simulates PG-1 under rack-A1 $\to$ fails (only 3 slots free).
        - Simulates PG-1 under rack-A2 $\to$ succeeds (5 slots free).
      - PG-level scorer plugins evaluate feasible candidates: only rack-A2 is feasible, and is selected.
      - The cycle returns PG-1’s resolved 5 pod-to-node assignments under rack-A2 along with Success status to CPG-root.
      - CPG-root temporarily reserves PG-1’s returned assignments under rack-A2 in memory snapshot.
    - Resolve child PG-2 under block-A:
      - Restricted to block-A (under the active memory snapshot containing PG-1 in rack-A2).
      - PG-2 generates rack candidates in block-A: rack-A1 and rack-A2.
      - The scheduler simulates PG-2 under candidates in nodeInfoSnapshot:
        - Simulates PG-2 under rack-A1 $\to$ fails (only 3 slots free).
        - Simulates PG-2 under rack-A2 $\to$ fails (0 slots left, greedily assumed by sibling PG-1 in memory snapshot).
      - Since PG-2 cannot find any feasible placement, its scheduling cycle returns a failure.
      - No Backtracking: Sibling child groups are simulated sequentially without backtracking. The scheduler does not evaluate alternative rack configurations for PG-1 (e.g. attempting to schedule PG-1 on rack-A1 to see if PG-2 could fit on rack-A2).
    - CPG Constraint Verification: The scheduler invokes the CPG-level PlacementFeasible checker on candidate block-A. Since only PG-1 succeeded, the total feasible child group count is 1. Because minGroupCount=2, PlacementFeasible returns failure and candidate block-A is marked as infeasible.
    - Reverts block-A assumption in nodeInfoSnapshot.
  - Evaluate Candidate block-B:
    - Temporarily assumes block-B in nodeInfoSnapshot.
    - Resolve child PG-1 under block-B:
      - Restricted to block-B. Invokes standard flat PG scheduling.
      - PG-1 generates rack candidates in block-B: rack-B1 and rack-B2.
      - The scheduler simulates PG-1 under candidates in nodeInfoSnapshot:
        - Simulates PG-1 under rack-B1 $\to$ succeeds (5 slots free).
        - Simulates PG-1 under rack-B2 $\to$ succeeds (5 slots free).
      - PG-level scorer plugins score both feasible candidates:
        - rack-B1 scores 90.
        - rack-B2 scores 50.
      - The cycle selects rack-B1 (highest score), and returns PG-1’s resolved assignments under rack-B1 along with Success status.
      - CPG-root temporarily reserves PG-1’s returned assignments under rack-B1 in memory snapshot.
    - Resolve child PG-2 under block-B:
      - Restricted to block-B (under active memory snapshot containing PG-1 in rack-B1).
      - PG-2 generates rack candidates in block-B: rack-B1 and rack-B2.
      - The scheduler simulates PG-2 under candidates in nodeInfoSnapshot:
        - Simulates PG-2 under rack-B1 $\to$ fails (0 slots left, greedily assumed by sibling PG-1).
        - Simulates PG-2 under rack-B2 $\to$ succeeds (5 slots free).
      - PG-level scorer plugins evaluate feasible candidates: only rack-B2 is feasible, and is selected.
      - The cycle returns PG-2’s resolved assignments under rack-B2 along with Success status.
      - CPG-root temporarily reserves PG-2’s returned assignments under rack-B2 in memory snapshot.
    - CPG Constraint Verification: The scheduler invokes the CPG-level PlacementFeasible checker on candidate block-B. Both child groups (PG-1 and PG-2) simulated successfully, so feasible child count is 2. Since minGroupCount=2, PlacementFeasible returns success and candidate block-B is saved as a feasible placement.
    - Reverts block-B assumption in nodeInfoSnapshot.
- Feasible Placement Scoring: The scheduler runs CPG-root placement scorer plugins on the saved feasible placement (block-B). The scorer plugins evaluate the overall resolved assignments layout (both child groups placed in their respective racks under block-B) and return a score (e.g., 95).
- Final Selection:
  - The scheduler completes CPG evaluations and selects the feasible placement with the highest score (block-B, score: 95).
  - It locks in the resolved placement path: [CPG-root: block-B], [PG-1: rack-B1], [PG-2: rack-B2].
  - The scheduler commits the resolved layout and proceeds to bind the pods of PG-1 and PG-2 to their physical target nodes inside block-B.
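The walkthrough above can be replayed with a small model. This is a sketch under stated assumptions, not the scheduler implementation: `schedule_pg` and `schedule_cpg` are hypothetical helpers, capacity is a per-rack slot count, each PodGroup places exactly its minCount pods, and the scorers are plain functions that return the example scores from the text (racks without a stated score default to 0).

```python
def schedule_pg(min_count, racks, free, rack_score):
    """Pick the best-scoring rack that can hold the whole PodGroup, or None."""
    feasible = [rack for rack in racks if free[rack] >= min_count]
    if not feasible:
        return None
    return max(feasible, key=rack_score)


def schedule_cpg(children, min_group_count, topology, free, rack_score, block_score):
    """Evaluate every block; resolve children in order without backtracking."""
    feasible = []
    for block, racks in topology.items():
        snapshot = dict(free)  # temporary assumption, discarded after this block
        assignment = {}
        for name, min_count in children:  # pre-sorted sibling order
            rack = schedule_pg(min_count, racks, snapshot, rack_score)
            if rack is not None:
                snapshot[rack] -= min_count  # reserve for later siblings
                assignment[name] = rack
        if len(assignment) >= min_group_count:  # PlacementFeasible analogue
            feasible.append((block, assignment))
    if not feasible:
        return None
    return max(feasible, key=lambda p: block_score(p[0], p[1]))


topology = {"block-A": ["rack-A1", "rack-A2"], "block-B": ["rack-B1", "rack-B2"]}
free = {"rack-A1": 3, "rack-A2": 5, "rack-B1": 5, "rack-B2": 5}
rack_scores = {"rack-B1": 90, "rack-B2": 50}  # scores taken from the example

result = schedule_cpg(
    children=[("PG-1", 5), ("PG-2", 5)],
    min_group_count=2,
    topology=topology,
    free=free,
    rack_score=lambda rack: rack_scores.get(rack, 0),
    block_score=lambda block, assignment: 95,
)
print(result)  # ('block-B', {'PG-1': 'rack-B1', 'PG-2': 'rack-B2'})
```

Copying `free` into `snapshot` for each block plays the role of assuming and reverting the parent domain in nodeInfoSnapshot: reservations made while evaluating block-A never leak into the evaluation of block-B.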
Preemption in topology-aware scheduling
Workload preemption under topology constraints is the domain of KEP-5710 (Workload-Aware Preemption) and KEP-5732 (Topology-Aware Workload Scheduling).
Under KEP-6012, this topology-aware preemption behavior works for CompositePodGroups
out-of-the-box without major changes, based on the following architectural factors:
- Decoupled Simulation Framework: The preemption algorithm evaluates victim selection
  by running in-memory simulations and invoking the workload’s scheduling callback. The
  algorithm is completely decoupled from the scheduling internals: it does not care whether
  the callback is placing a single flat PodGroup or recursively resolving a CompositePodGroup hierarchy.
- Acceptable Complexity Trade-offs: Resolving a multi-level hierarchical group
  with nested topology constraints (e.g., parent and child groups requiring specific
  physical placements) is inherently more computationally complex than scheduling a flat
  PodGroup of the same size, as the scheduler must evaluate parent-child placement combinations. Because preemption relies on the recursive scheduling callback, this increased placement complexity could potentially impact preemption throughput at scale. For Alpha, we believe that no preemption-specific architectural changes are required as the simulation model is fully decoupled. However, if scale testing and performance feedback during the Alpha phase reveal bottlenecks due to recursive CPG checks during preemption, we will address necessary optimizations in the Beta phase.
Consequently, since topology-aware preemption is designed and implemented to work for a flat
PodGroup as a part of KEP-5710, it should automatically work for a hierarchical
CompositePodGroup with no significant architectural modifications or new preemption algorithms.
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
N/A
Unit tests
- k8s.io/kubernetes/pkg/apis/scheduling/validation: 2026-05-20 - 90.6%
- k8s.io/kubernetes/pkg/registry/scheduling/workload: 2026-05-20 - 95.1%
- k8s.io/kubernetes/pkg/registry/scheduling/podgroup: 2026-05-20 - 90.9%
- k8s.io/kubernetes/pkg/scheduler: 2026-05-20 - 76.8%
- k8s.io/kubernetes/pkg/scheduler/backend/queue: 2026-05-20 - 92.1%
- k8s.io/kubernetes/pkg/scheduler/backend/cache: 2026-05-20 - 84.9%
- k8s.io/kubernetes/pkg/scheduler/framework: 2026-05-20 - 73.0%
- k8s.io/kubernetes/pkg/scheduler/framework/preemption: 2026-05-20 - 76.5%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultpreemption: 2026-05-20 - 89.6%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/topologyaware: 2026-05-20 - 91.5%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/queuesort: 2026-05-20 - 60.0%
- k8s.io/kubernetes/pkg/scheduler/framework/runtime: 2026-05-20 - 82.8%
Integration tests
We will create new integration tests (and extend the existing PodGroup integration test
suite in test/integration/scheduler/) to cover the hierarchical and multi-level aspects of
the CPG API and the recursive scheduling resolutions:
- CPG Queueing and Requeueing:
  - Verify that CPG hierarchies with unobserved parents are buffered inside pendingPodGroups and not promoted to the active scheduling queue until the root CPG object is observed.
  - Verify that the arrival of cluster events successfully triggers queueing
    hints to move blocked CPG hierarchies from the unschedulable queue back
    to the active queue (activeQ) or backoff queue (backoffQ).
- Multi-level Gang Scheduling:
  - Verify that CPG hierarchies satisfying their nested child group minCount and minGroupCount requirements are enqueued, and those failing are rejected and remain in the unschedulable queue.
  - Verify that CPG parent nodes satisfy simulation checks and successfully
    schedule when the simulated child group count $\ge$ minGroupCount, and fail when they fall below this threshold.
- Workload-Aware Preemption for Multi-level Workloads:
  - Verify that the preemption victim selection logic correctly respects
    disruption boundaries across different hierarchical layouts (under various
    configurations of All, Single, and mixed nested combinations), ensuring correct cascading subtree evictions or allowing partial on-demand preemption.
  - Verify that preemption is evaluated and triggered strictly at the root CPG
    level when direct scheduling fails, and that the PreEnqueue method of DefaultPreemption successfully backs off and queues CPG preemptors awaiting evictions.
  - Verify that if the root CPG is feasible (PlacementFeasible = Success) but extra member pods require preemption (NeedsPreemption = True), the scheduler successfully commits and writes the minimal gang bindings even if preemption fails to clear space for the extra members.
- Multi-level TAS:
  - Verify that the scheduler successfully schedules a hierarchical CPG workload’s pods strictly on nodes satisfying the nested combination of topology constraints when valid placement paths exist.
  - Verify that scheduling fails for the CPG workload when the cluster state cannot satisfy the nested topology constraints.
We will also extend the existing scheduler performance benchmarks in test/integration/scheduler_perf/ with new cases that measure the scheduling throughput of multi-level workload
scheduling, including:
- Multi-level gang and basic policies
- Multi-level preemptions
- Multi-level TAS
e2e tests
We will add basic API tests for the new CompositePodGroup API, that will later
be promoted to conformance. These tests will cover CompositePodGroup creation,
validation, status updates and lifecycle management.
More tests will be added for beta release.
Graduation Criteria
Alpha
- New CompositePodGroup API is introduced behind the CompositePodGroup feature gate.
- New fields in Workload and PodGroup APIs are introduced behind the CompositePodGroup feature gate.
- Multi-level gang scheduling is supported.
- Multi-level gang disruption mode is supported.
- Multi-level topology-aware scheduling is supported.
- Initial e2e tests are implemented and enabled.
Beta
- CompositePodGroup object is protected against deletion if any group refers to it.
- At least one true workload controller (e.g. JobSet) is integrated with the CompositePodGroup API.
- Scheduler detects invalid runtime group hierarchies (i.e. hierarchies which are too deep, have a cycle, refer to two or more Workloads, or have an invalid combination of scheduling policies or disruption modes at different levels of the hierarchy).
- The recursive greedy scheduling search trade-offs are re-evaluated, and a decision on incorporating advanced backtracking heuristics (such as restricted search depth or bounded branches) is made to optimize scheduling success rates for multi-level gangs.
- Scheduler bypasses futile scheduling cycles for inadmissible nested child
  groups during recursion by extending PlacementFeasible to execute checks prior to in-memory scheduling (to protect performance and avoid redundant preemption passes).
- A non-greedy CompositePodGroup scheduling cycle mode is introduced and re-evaluated to mitigate resource stealing and gang deadlock occurrences among sibling child groups in capacity-constrained environments.
- Support for differing group-level priorities across a single hierarchy tree under basic scheduling policies and separating queueing priority from preemption priority is re-evaluated.
- The minGroupCount field of the CompositePodGroup objects becomes mutable at runtime (aligning with the pre-existing mutable minCount field in PodGroup objects).
- Scheduler diagnostics and recommendations with regards to the scheduling order are re-evaluated to improve troubleshooting and scheduling success rates.
- GangScheduling’s PlacementFeasible extension point is changed to propagate pod-level UnschedulableAndUnresolvable, making the logic identical across all levels.
- The logic for triggering preemption for subsequent scheduling is re-evaluated in case
  the scheduling policy is not initially satisfied (e.g., PG has been disrupted or minCount has changed).
GA
- All e2e tests for the CompositePodGroup API are added and graduated to conformance tests.
- TBD for the Beta release.
Upgrade / Downgrade Strategy
Standard procedures for features introducing new APIs and API fields should be used. The components involved in this feature are:
- Alpha: kube-apiserver and kube-scheduler.
- Beta: kube-controller-manager (KCM) will also be involved (e.g., for adding and deleting CompositePodGroup finalizers).
For details about the required feature gates and their dependencies, see the Feature Enablement and Rollback section.
Upgrade sequence:
- kube-apiserver must be upgraded first before any other components that use the new API (such as kube-scheduler and, in Beta, kube-controller-manager).
Downgrade sequence:
- On downgrade, kube-scheduler (and in Beta, kube-controller-manager) must be downgraded first (to stop processing the new fields and objects) before kube-apiserver is downgraded.
Upon downgrading kube-apiserver or disabling the CompositePodGroup feature gate:
- Existing CompositePodGroup objects will remain in etcd but will be ignored by the control plane components.
- Newly introduced fields within PodGroup and Workload objects will remain in etcd but will not be processed.
Version Skew Strategy
The feature is limited to the control plane, so the version skew with nodes (kubelets) doesn’t matter.
For the API changes (introduction of CompositePodGroup API and the field
changes in the PodGroup and the Workload APIs), the old version of
components (in particular kube-apiserver) may not handle those. Thus, users
should not set those fields before confirming all control plane instances were
upgraded to the version supporting those.
For the multi-level scheduling features themselves, the version skew across
multiple kube-scheduler instances (e.g., during a rolling upgrade where the active
leader might run the old version while kube-apiservers are already upgraded and
the feature is in use) behaves as follows:
- The old version of kube-scheduler does not recognize the new CompositePodGroup objects or fields. It will ignore the parent references and fall back to scheduling member pods strictly at the flat, individual or standalone PodGroup level.
- While the old scheduler will continue to run safely and will not crash, it will
  not satisfy the multi-level topology, gang, or preemption constraints. For
  topology constraints specifically, this will likely lead to invalid, flat
  placement decisions (e.g., placing member pods across different racks instead of
  satisfying CPG-level topology constraints). Crucially, these invalid topology
  placements are irreversible by the scheduler itself; once pods are bound to
  nodes, the scheduler cannot reschedule them on its own, even after a new
  scheduler leader is upgraded to the new version. Only newly scheduled pods (or
  pods recreated after eviction) will be correctly resolved (if possible) under the
  hierarchical constraints once the upgrade is complete.
Note that this is identical to the pre-existing version skew behavior for flat
PodGroup features (e.g., flat topology-aware scheduling) during control plane rolling upgrades. Therefore, the standard recommendation applies: users should not use the new APIs/fields until the rolling upgrade of kube-scheduler is fully completed.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: CompositePodGroup
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager (starting from Beta)
    - kube-scheduler
  - Dependencies:
    - The CompositePodGroup API relies directly on both the GenericWorkload and TopologyAwareWorkloadScheduling feature gates being enabled. All three feature gates must be enabled in order for the API and multi-level scheduling features to be fully functional.
    - This dependency is programmatically verified during component initialization (the
      components will log a configuration error and disable CompositePodGroup processing if any required dependency gate is missing).
    - We will re-evaluate this simplified feature gate dependency model in Beta if needed.
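The dependency rule can be modeled as follows. This is an illustration of the described behavior only: the real components use the Kubernetes feature gate machinery, and the `DEPENDENCIES` table and `resolve_gates` function are hypothetical names introduced for this sketch.

```python
import logging

# Gate names come from this KEP; the lookup table itself is hypothetical.
DEPENDENCIES = {
    "CompositePodGroup": ("GenericWorkload", "TopologyAwareWorkloadScheduling"),
}


def resolve_gates(gates):
    """Return a copy of `gates` with a gate disabled when a dependency is off."""
    resolved = dict(gates)
    for gate, required in DEPENDENCIES.items():
        missing = [dep for dep in required if not resolved.get(dep, False)]
        if resolved.get(gate, False) and missing:
            # Mirrors "log a configuration error and disable processing".
            logging.error("%s requires %s; disabling %s",
                          gate, ", ".join(missing), gate)
            resolved[gate] = False
    return resolved


partial = resolve_gates({"CompositePodGroup": True, "GenericWorkload": True})
print(partial["CompositePodGroup"])  # False
```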
Does enabling the feature change any default behavior?
No. Any scheduling behavior changes that this KEP introduces require creating
a CompositePodGroup object in the first place or using non-default values in
the new fields in the Workload or PodGroup APIs and no core Kubernetes
component will do that.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes - behavior changes in the workload scheduling algorithm can be disabled by simply disabling the feature gate in kube-scheduler.
The new API changes can also be disabled by disabling the feature gate in kube-apiserver. That doesn’t result in clearing out the new fields in PodGroups or Workloads that already have them set in the storage, however. Similarly, CompositePodGroup objects would be preserved in storage as well.
What happens if we reenable the feature if it was previously rolled back?
The feature starts working again.
Are there any tests for feature enablement/disablement?
The scheduler algorithm changes are purely in-memory and don’t require any dedicated enablement/disablement tests - the logic will be covered by regular feature tests.
For the newly introduced API fields, dedicated enablement/disablement tests at the kube-apiserver registry layer will be added in Alpha.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
Yes.
Watching for CompositePodGroups:
- API call type: LIST+WATCH CompositePodGroups
- estimated throughput: < XX/s
- originating component: kube-scheduler, kube-controller-manager (GC controller, PodGroup protection controller)
Status updates (potentially not in Alpha):
- API call type: PUT/PATCH CompositePodGroups status
- estimated throughput < XX/s
- originating component: kube-scheduler
Watching for Workloads for validation (Beta):
- API call type: LIST+WATCH Workloads
- estimated throughput < XX/s
- originating component: kube-scheduler
Will enabling / using this feature result in introducing new API types?
Yes:
- API type: CompositePodGroup
- Supported number of objects per cluster: XX,000
- Supported number of objects per namespace: XX,000
The above numbers will eventually depend on the numbers for out-of-tree workload APIs that will integrate with the new API (e.g. JobSets, LeaderWorkerSets, …).
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes - new fields are added to the Workload and PodGroup APIs.
The exact size increase will be small, however:
- PodGroup is extended with a single string field,
- Templates definition in the Workload object is evolved into a tree-like structure and we enforce an explicit limit on the depth (4) and width (branching factor of 8) of that tree.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Although the recursive greedy scheduling algorithm was designed with performance in mind, the scheduling latency and Pod Startup SLO may potentially increase, especially for large clusters, complex multi-level workloads, and fine-grained topology constraints.
Due to the recursive nature of multi-level scheduling, the latency impact may be slightly higher than in the flat scheduling model. We will measure the exact impact using performance benchmarks and scalability tests, and update this section accordingly.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
For large clusters and fine-grained topology constraints we may observe some increase in CPU and RAM usage for kube-scheduler. The exact scale of this increase will be explored in the scalability tests.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- 2026-04: Initial KEP-6012 proposal.
Drawbacks
Alternatives
API shape
Numerous discussions took place within the community about the API and how it should evolve in the future to support hierarchical workloads. Some of them took place in the documents linked below [3][4][5]. The WAS Design Summit [6], which took place right before KubeCon Europe 2026, helped reach consensus on the design - this proposal is essentially a realization of the design that was agreed on during the summit.
For completeness, we summarize below the main ideas considered in those discussions and explain why they were eventually abandoned.
PodGroup as a recursive API type
An alternative approach to model hierarchical workloads would be to evolve the
PodGroup API into a recursive type itself.
However, the primary drawback of this approach is that a group of Pods and a group of nested groups represent semantically distinct concepts. The core issue is not necessarily about them having different policies, but that they represent fundamentally different concepts at the API level:
- PodGroup represents a group of pods that we should treat as a single entity.
- CompositePodGroup no longer represents a simple group of pods; instead, it represents a complex structure of potentially nested groups.
Grouping the lowest-level primitives (Pods) is a conceptually different
operation than grouping more complex structures. While it is true that the
underlying scheduling algorithm might collapse these hierarchies into a common
abstraction to process them, this does not mean we should model them with the
same abstraction at the API level. Introducing a dedicated CompositePodGroup
type preserves this qualitative difference and provides a much clearer semantic
boundary for users defining complex hierarchical workloads.
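The distinction between separate API-level types and a common internal abstraction can be sketched as follows. All names here (`schedulable`, `podGroup`, `compositePodGroup`, `podCount`) are hypothetical and chosen for illustration only; they are not the types used by the API or by kube-scheduler.

```go
package main

import "fmt"

// schedulable is a hypothetical internal abstraction that an algorithm could
// use to treat leaves and inner nodes of the hierarchy uniformly.
type schedulable interface {
	podCount() int
}

// podGroup models a leaf: a flat group of pods treated as a single entity.
type podGroup struct {
	name string
	pods int
}

func (g podGroup) podCount() int { return g.pods }

// compositePodGroup models an inner node: it groups other groups,
// never pods directly.
type compositePodGroup struct {
	name     string
	children []schedulable
}

// podCount sums the pods of all nested groups, recursively.
func (c compositePodGroup) podCount() int {
	total := 0
	for _, child := range c.children {
		total += child.podCount()
	}
	return total
}

func main() {
	leader := podGroup{name: "leader", pods: 1}
	workers := podGroup{name: "workers", pods: 8}
	replica := compositePodGroup{name: "replica", children: []schedulable{leader, workers}}
	root := compositePodGroup{name: "root", children: []schedulable{replica, replica}}
	fmt.Println(root.podCount())
}
```

The two struct types stay distinct, mirroring the separate API types, while the shared interface shows how an algorithm could still process both through one abstraction.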
New API type per hierarchy level
Conceptually, this design idea is the opposite of the one described above. The main advantage of this approach is a strongly typed API with validation per hierarchy level.
This approach was eventually abandoned due to the high complexity and cost of implementation required to add support for every new level in the scheduling hierarchy.
In principle, our proposal is a tradeoff between this approach and the one that
proposes just extending the PodGroup API.
PodSubGroup and PodSet
The initial idea discussed in the community [3][5] was to make the PodGroup the
root of the scheduling group hierarchy and create additional APIs called
PodSubGroup and PodSet. PodGroup would be a grouping entity for either
Pods or PodSubGroup objects, PodSubGroup would be a grouping entity for
either Pods or PodSet objects and the PodSet would be a group of homogeneous
Pods.
Unfortunately, this approach shares the drawbacks of both ideas described above.
Naming of the new API
Aside from the CompositePodGroup name, there were a couple of different naming
ideas in the past for the API this KEP introduces:
- PodGroupSet
- NestedPodGroup
- PodGroupCollection
- PodGroupAggregate
The “set” word might suggest that it contains objects of the same type, similar
to how StatefulSet, DaemonSet and ReplicaSet own the homogeneous replicas.
PodGroupSet could be a grouping entity not just for the PodGroup objects but
also for further PodGroupSet objects, so it violates this unwritten rule.
NestedPodGroup would make more sense as the name of the bottom-level entity in
the group hierarchy. That said, even with such a renaming, the name would not
fit flat workloads that use a one-level hierarchy, because nothing is nested
there.
PodGroupCollection does not convey the hierarchy, which is the essence of the
workloads this proposal aims to support.
PodGroupAggregate was the runner-up among the naming candidates. In the end,
CompositePodGroup was selected instead because we are essentially following
the composite design pattern here - and that name expresses this more explicitly
than PodGroupAggregate.
Validation of CompositePodGroup
In 1.36, we introduced an admission plugin called PodGroupWorkloadExists. That
plugin targeted PodGroup creations and checked the following two conditions
for any incoming object:
- If the PodGroup has a reference to a Workload object, check if this Workload actually exists - and if not, reject the PodGroup,
- If the referenced Workload exists, check if that Workload actually defines the template that the PodGroup object refers to - and if not, reject the PodGroup.
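The two checks can be sketched roughly as below. This is a simplified illustration, not the plugin's code: the `workload` and `podGroupRef` types and the `checkPodGroup` function are hypothetical, and an in-memory map stands in for whatever lookup the real plugin performed.

```go
package main

import "fmt"

// workload is a hypothetical stand-in that only records the names of the
// templates a Workload defines.
type workload struct {
	templates map[string]bool
}

// podGroupRef is a hypothetical stand-in for a PodGroup's reference to a
// Workload and to one of its templates.
type podGroupRef struct {
	workloadName string
	templateName string
}

// checkPodGroup applies the two conditions in order. A nil ref means the
// PodGroup does not reference any Workload, so there is nothing to check.
func checkPodGroup(ref *podGroupRef, workloads map[string]workload) error {
	if ref == nil {
		return nil
	}
	// Check 1: the referenced Workload must exist.
	w, ok := workloads[ref.workloadName]
	if !ok {
		return fmt.Errorf("workload %q does not exist", ref.workloadName)
	}
	// Check 2: the Workload must define the referenced template.
	if !w.templates[ref.templateName] {
		return fmt.Errorf("workload %q does not define template %q", ref.workloadName, ref.templateName)
	}
	return nil
}

func main() {
	workloads := map[string]workload{
		"training": {templates: map[string]bool{"workers": true}},
	}
	fmt.Println(checkPodGroup(&podGroupRef{workloadName: "training", templateName: "workers"}, workloads))
	fmt.Println(checkPodGroup(&podGroupRef{workloadName: "training", templateName: "drivers"}, workloads))
	fmt.Println(checkPodGroup(&podGroupRef{workloadName: "missing", templateName: "workers"}, workloads))
}
```

Note that the lookup and the object creation are not atomic in such a scheme: the Workload can change or disappear right after the check passes, so a check of this kind can only ever be best effort.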
Initially, we planned to extend the scope of that admission plugin to perform
analogous checks for incoming CompositePodGroups. However, this plugin was
removed early in the 1.37 release cycle [7] because of performance-related
concerns and the fact that cross-object admission enforcement is always best
effort.
Infrastructure Needed (Optional)
1. JobSet API documentation: https://jobset.sigs.k8s.io/docs/overview/
2. LeaderWorkerSet API documentation: https://lws.sigs.k8s.io/docs/overview/
3. See the “Part 2: Future Evolution & Compatibility Study” tab for the relevant discussion in PodGroup as top-level object.