KEP-4671: Gang Scheduling using Workload Object

Status: Alpha, implementable
Created: 2025-09-17
Latest milestone: v1.36
Milestones: Alpha v1.35, Beta v1.37, Stable v1.38

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

In this KEP, kube-scheduler is modified to support gang scheduling[1]. We focus on framework support and building blocks rather than the ideal gang-scheduling algorithm, which can come as a follow-up. We start with a simpler implementation of gang scheduling: kube-scheduler identifies pods that are in a group and waits until all pods reach the same stage of the scheduling/binding cycle before allowing any pods from the group to advance past that point. If not all pods can reach that point before a timeout expires, the scheduler stops trying to schedule that group, and all of its pods release their resources. This allows other workloads to try to allocate those resources.
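
The waiting behavior described above can be illustrated with a minimal sketch. It is not the scheduler's implementation: the gangBarrier type, the decision values, and the use of a quorum parameter (minCount) are all hypothetical names chosen for this example, and timeouts are driven by explicit timestamps so the logic stays deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// decision is what the barrier tells a pod that has just arrived.
type decision string

const (
	decisionWait   decision = "Wait"   // quorum not reached yet, keep holding resources
	decisionAllow  decision = "Allow"  // quorum reached, the whole group may proceed
	decisionReject decision = "Reject" // timeout expired, every pod releases its resources
)

// gangBarrier is a hypothetical, simplified tracker for one pod group.
type gangBarrier struct {
	minCount int
	deadline time.Time
	waiting  map[string]struct{}
}

func newGangBarrier(minCount int, start time.Time, timeout time.Duration) *gangBarrier {
	return &gangBarrier{
		minCount: minCount,
		deadline: start.Add(timeout),
		waiting:  make(map[string]struct{}),
	}
}

// arrive records that pod reached the barrier at time now and reports
// what should happen next.
func (b *gangBarrier) arrive(pod string, now time.Time) decision {
	if now.After(b.deadline) {
		// Give up: drop every parked pod so that its resources are released.
		b.waiting = make(map[string]struct{})
		return decisionReject
	}
	b.waiting[pod] = struct{}{}
	if len(b.waiting) >= b.minCount {
		return decisionAllow
	}
	return decisionWait
}

// runExample drives two barriers: one that fills in time and one that times out.
func runExample() []decision {
	start := time.Unix(0, 0)
	onTime := newGangBarrier(2, start, time.Minute)
	late := newGangBarrier(2, start, time.Minute)
	return []decision{
		onTime.arrive("pod-a", start),
		onTime.arrive("pod-b", start.Add(10*time.Second)),
		late.arrive("pod-a", start),
		late.arrive("pod-b", start.Add(2*time.Minute)),
	}
}

func main() {
	fmt.Println(runExample()) // prints [Wait Allow Wait Reject]
}
```

The first group reaches its quorum within the timeout and is allowed to proceed as a whole; the second group misses the deadline, so its parked pod is dropped and the resources become available to other workloads.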

New core types called Workload and PodGroup are introduced to tell the kube-scheduler that a group of pods should be scheduled together and to define policy options related to gang scheduling. Pods may have an object reference in their spec to the PodGroup they belong to. The Workload and PodGroup objects are intended to evolve[2] via future KEPs to support additional kube-scheduler improvements, such as topology-aware scheduling.

In 1.36, we redesigned the API in order to clearly decouple the Workload API from the runtime PodGroup API[3]. In the updated design:

  • Workload represents a static template defining the scheduling hierarchy and scheduling policy definition that specifies what workload behavior should be applied.
  • PodGroup becomes a standalone, self-contained runtime scheduling unit for a group of pods that encapsulates both the scheduling policy and status. True workload[4] owners are responsible for creating PodGroup objects (together with Workload objects). PodGroups are expected to be created based on the podGroupTemplates defined in the Workload.
  • Pods reference a PodGroup, which is their immediate execution context.

Motivation

Parallel applications can require communication between every pod in order to begin execution, and then ongoing communication between all pods (such as barrier or all-reduce operations) in order to make progress. Starting all pods as close to the same time as possible is necessary to run these workloads. Otherwise, either expensive compute resources sit idle, or the application may fail due to an application-level communication timeout.

Gang scheduling has been implemented outside of kube-scheduler at least 4 times[5]. Some controllers are starting to support multiple gang schedulers in order to be portable across different clusters. Moving support into kube-scheduler makes gang scheduling support available in all Kubernetes distributions and eventually may allow workload controllers to rely on a standard interface to request gang scheduling from the standard or custom schedulers. A standard API may also allow other components to understand workload needs better (such as cluster autoscalers).

Workloads that require gang scheduling often also need all members of the gang to be as topologically “close” to one another as possible, in order to perform adequately. Existing Pod affinity rules influence pod placement, but they do not consider the gang as a unit of scheduling and they do not cause the scheduler to efficiently try multiple mutually exclusive placement options for a set of pods. The design of the Workload object introduced in this KEP anticipates how Gang Scheduling support can evolve over subsequent KEPs into full Topology-aware scheduling support in kube-scheduler.

The original design embedded PodGroups within the Workload spec, which creates several architectural challenges:

  • Workload represents long-lived configuration-intent, whereas PodGroups represent transient units of scheduling. Tying runtime execution units to the persistent definition object violates separation of concerns.
  • Lifecycle coupling prevents standalone PodGroup objects from owning other resources (e.g., ResourceClaims) for garbage collection with specific scheduling units, rather than the entire Workload or individual Pods.
  • Extending the Workload object to track runtime status for all PodGroups leads to significant scalability issues:
    • Size Limit: Large Workloads (i.e., large number of PodGroups) may easily hit the 1.5MB etcd object limit.
    • Contention: Updating the status of a single PodGroup would require read-modify-write on the central massive Workload object.

By decoupling PodGroup as a standalone runtime object:

  • Workload becomes a scheduling policy object that defines scheduling constraints and requirements.
  • PodGroupTemplate provides the blueprint for runtime PodGroup creation.
  • PodGroup is a standalone runtime object with its own lifecycle, typically managed by a controller, that represents a single scheduling unit.

The PodGroup object will reflect the intended Workload internal structure and allow kube-scheduler to schedule workload pods accordingly. Those workloads include builtins like Job (KEP-5547) and StatefulSet, and custom workloads, like JobSet, LeaderWorkerSet, MPIJob and TrainJob. All of these workload types are used for AI training and inference use cases.

Goals

  • Introduce the concept of a Workload as a primary building block for the workload-aware scheduling vision
  • Implement the first version of the Workload API as a mechanism for defining scheduling policies
  • Introduce the concept of a PodGroup as the runtime counterpart of the Workload
  • Ensure that the decoupled model of Workload and PodGroup provides a clear split of responsibilities, improved scalability and simplified lifecycle management
  • Enhance status ownership by making PodGroup status track podGroup-level runtime state
  • Enable automatic lifecycle management and resource cleanup for PodGroup objects through integration with Kubernetes garbage collection
  • Ensure that we can extend the Workload API in a backward-compatible way toward the north-star API
  • Ensure the Workload API provides a clear integration path for true workload[4] controllers and APIs, both built-in and third-party
  • Implement the first version of gang scheduling in kube-scheduler, supporting (potentially in a non-optimal way) all existing scheduling features.
  • Provide full backward compatibility for all existing scheduling features

Non-Goals

  • Take away responsibility to create pods from controllers.
  • Bring fairness or multiple workload queues into kube-scheduler. Kueue and Volcano.sh will continue to provide this.
  • Map all the declarative state and behaviors into Workload object. It is focused only on scheduling-related parts.
  • Graduate the old model of using Workload API (without decoupled PodGroup object) to Beta.

The following are non-goals for this KEP but will probably become goals for follow-up KEPs soon:

  • Integrate cluster autoscaling with gang scheduling.
  • Introduce a concept of Reservation that can be later consumed by pods.
  • Workload-level preemption.
  • Address resource contention between different schedulers (including possible deadlocks).
  • Address the problem of premature preemptions in case the higher-priority workloads do not eventually schedule.

See Future plans for more details.

Proposal

This KEP introduces both the Workload and PodGroup APIs in scheduling.k8s.io/v1alpha2. The 1.36 revision, as detailed in the original design[3], addresses feedback regarding the ambiguity of the original “monolithic” Workload. By decoupling policy (Workload) from runtime grouping (PodGroup), we improve clarity, scalability and lifecycle management. This approach also provides a more readable and intuitive structure for complex workloads like JobSet and LeaderWorkerSet.

The Workload API defines the scheduling policy and references one or more podGroupTemplates. Each PodGroup is a standalone runtime object created from those templates, representing a self-contained scheduling unit that encapsulates the runtime state.

To maintain long-term API quality and ensure a clean path to GA, the Workload API remains in Alpha for the 1.36 cycle. We have decided to completely abandon the original v1alpha1 version of the API in favor of v1alpha2, where only the new decoupled model is supported. This allows for the finalization of the architectural changes and ensures that the graduated Beta API will be clean and free of transitional technical debt or legacy semantics.

The spec.workloadRef field was introduced in v1.35 to identify the scheduling context. In the v1.36 revision, we are replacing this with a new field, spec.schedulingGroup, to align with the decoupled architecture:

  • Immediate Execution Context: The Pod now points strictly to the immediate execution context (the runtime PodGroup). We are removing the direct Workload reference because it is not strictly necessary for the scheduler’s operation. We may re-introduce a direct workload reference in the future (most probably in status, not spec) if concrete use cases (e.g., enhanced debuggability, UX) emerge.
  • Future Extensibility: We anticipate the API will evolve to include concepts like PodSubGroup[6]. Therefore, we structure this reference as a PodSchedulingGroup object. This allows us to easily extend the API in the future (e.g., via a oneOf pattern) to support hierarchical scheduling.

A sample pod with these new fields looks like this:

apiVersion: v1
kind: Pod
spec:
  ...
  # In 1.36 schedulingGroup replaces workloadRef.
  schedulingGroup:
    podGroupName: pg1  # Points to the standalone PodGroup object
  ...

The above pod might be one of several pods created by a Job like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: job-1
spec:
  completions: 100
  parallelism: 100
  completionMode: Indexed
  template:
    spec:
      schedulingGroup:
        podGroupName: pg1
      restartPolicy: OnFailure
      containers:
      - name: ml-worker
        image: awesome-training-program:v1 
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath:
               "metadata.annotations['batch.kubernetes.io/job-completion-index']"

The Workload resource is a new core resource that provides scheduling policy templates. It does not manage pod lifecycles or interfere with the pod creation logic of controllers like Job, JobSet, or StatefulSet. Instead, it serves as a policy template, containing the PodGroupTemplates with their corresponding scheduling policies (e.g., gang scheduling) that should be applied to the resulting PodGroups.

The Workload object defines these templates:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  namespace: ns-1
  name: job-1
spec:
  # In 1.36 (v1alpha2) renamed from podGroups to podGroupTemplates.
  podGroupTemplates:
    - name: "worker"
      # In 1.36 (v1alpha2) renamed from policy to schedulingPolicy.
      schedulingPolicy:
        gang:
          minCount: 100

A sample PodGroup instantiated from the above template would look like this:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: training-worker-0
spec:
  podGroupTemplateRef:
    workload:
      workloadName: job-1
      podGroupTemplateName: worker
  schedulingPolicy:
    gang:
      minCount: 100
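
How a controller might derive such a PodGroup from a Workload template can be sketched as follows. This is an illustration under stated assumptions, not the controller code: the types are local, flattened stand-ins for the API objects in this KEP, and newPodGroupFromTemplate is a hypothetical helper name. The point it shows is that the scheduling policy is copied into the PodGroup so that the runtime object is self-contained.

```go
package main

import "fmt"

// Local stand-ins for the KEP's API types, reduced to the fields used here.
type GangSchedulingPolicy struct{ MinCount int32 }

type BasicSchedulingPolicy struct{}

type PodGroupSchedulingPolicy struct {
	Basic *BasicSchedulingPolicy
	Gang  *GangSchedulingPolicy
}

type PodGroupTemplate struct {
	Name             string
	SchedulingPolicy PodGroupSchedulingPolicy
}

type Workload struct {
	Namespace, Name   string
	PodGroupTemplates []PodGroupTemplate
}

// PodGroup flattens spec.podGroupTemplateRef.workload into two string fields.
type PodGroup struct {
	Namespace, Name      string
	WorkloadName         string
	PodGroupTemplateName string
	SchedulingPolicy     PodGroupSchedulingPolicy
}

// newPodGroupFromTemplate (hypothetical helper) looks up a template by name
// and returns a PodGroup carrying its own copy of the scheduling policy.
func newPodGroupFromTemplate(w Workload, templateName, podGroupName string) (PodGroup, error) {
	for _, t := range w.PodGroupTemplates {
		if t.Name != templateName {
			continue
		}
		pg := PodGroup{
			Namespace:            w.Namespace,
			Name:                 podGroupName,
			WorkloadName:         w.Name,
			PodGroupTemplateName: t.Name,
		}
		// Copy the policy by value so later template edits cannot leak in.
		if t.SchedulingPolicy.Basic != nil {
			pg.SchedulingPolicy.Basic = &BasicSchedulingPolicy{}
		}
		if t.SchedulingPolicy.Gang != nil {
			gang := *t.SchedulingPolicy.Gang
			pg.SchedulingPolicy.Gang = &gang
		}
		return pg, nil
	}
	return PodGroup{}, fmt.Errorf("workload %s/%s has no podGroupTemplate %q", w.Namespace, w.Name, templateName)
}

func main() {
	w := Workload{
		Namespace: "ns-1",
		Name:      "job-1",
		PodGroupTemplates: []PodGroupTemplate{{
			Name:             "worker",
			SchedulingPolicy: PodGroupSchedulingPolicy{Gang: &GangSchedulingPolicy{MinCount: 100}},
		}},
	}
	pg, err := newPodGroupFromTemplate(w, "worker", "training-worker-0")
	if err != nil {
		panic(err)
	}
	fmt.Println(pg.Name, pg.WorkloadName, pg.PodGroupTemplateName, pg.SchedulingPolicy.Gang.MinCount)
}
```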

User Stories (Optional)

Story 1: Gang-scheduling of a Job

I have a tightly-coupled job and I want its pods to be scheduled and run only when the resources for all of them can be found in the cluster.

Story 2: Gang-scheduling of a custom workload

I have my own workload definition (CRD) and controller managing its lifecycle. I would like to easily benefit from the gang-scheduling feature supported by core Kubernetes without extensive changes to my custom controller.

Story 3: Independent PodGroup Lifecycle

As a user running LWS (LeaderWorkerSet), I want to observe and manage a leader pod and its associated worker pods as a single unit.

Story 4: PodGroup-Level Status

I have a large-scale training job with multiple replicas, and want to observe the scheduling status of each PodGroup independently, so I can identify which specific replica is having scheduling issues.

Story 5: Controller Scalability

As a workload controller author, I want PodGroup status to be stored in a separate object, so that per-replica scheduling updates do not require read-modify-write operations on a large, shared Workload object, which would otherwise create scalability and contention issues at scale.

Risks and Mitigations

The API needs to be extended in an unpredictable way

We try to mitigate this through an extensive analysis of use cases and by sketching the direction in which we expect the API to evolve to support further use cases. You can read more about it in the extended proposal document.

Exacerbating the race window by proceeding directly to binding

Since the entire Workload Scheduling Cycle operates on a single cluster snapshot, a long-running cycle means decisions are based on snapshotted state that may become stale. This implies that if the cluster state changes in the meantime (e.g., a Node suffers a hardware failure or is deleted), the binding phase could fail for some pods in the workload, potentially causing the entire gang to fail.

However, assuming all scheduling decisions go through kube-scheduler, the primary source of race conditions is external infrastructure events (e.g., Node health changes). While this is a valid concern, this race window exists in the standard scheduling cycle as well. Although the Workload Scheduling Cycle extends this window, the propagation latency of Node status updates or deletions is typically non-trivial, meaning the marginal increase in risk is acceptable compared to the benefits of atomic scheduling.

Increased API call volume

More objects means more API calls for creation, updates, and watches. The mitigation is to split the responsibility: the Workload object is rarely updated (as a template object) while PodGroup handles runtime state. In addition, PodGroups allow per-replica sharding of status updates.

Consistency across multiple objects

State is spread across multiple objects (Workload and PodGroup). The mitigation is that the PodGroup inlines all runtime state making it self-contained.

Race conditions during object creation

While the design requires controllers to create objects in order (Workload -> PodGroup -> Pods), there is still a possibility of race conditions. The mitigation is to introduce an admission controller to validate the object creation order. In addition, the UnschedulableAndUnresolvable status will be set as a last line of defense if Pods are created before the PodGroup is created or if the PodGroup object was deleted in the meantime.
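
The last-line-of-defense check can be sketched as below. This is a simplified illustration, not the scheduler plugin: the code type and its constants are local stand-ins for the scheduler framework's status codes, and podGroupIndex and checkPodGroup are hypothetical names introduced for this example.

```go
package main

import "fmt"

// code is a local stand-in for a scheduler framework status code.
type code string

const (
	success                      code = "Success"
	unschedulableAndUnresolvable code = "UnschedulableAndUnresolvable"
)

// pod carries only the fields this check needs.
type pod struct {
	namespace    string
	name         string
	podGroupName string // empty when the pod has no schedulingGroup
}

// podGroupIndex records which PodGroups currently exist, keyed by "namespace/name".
type podGroupIndex map[string]bool

// checkPodGroup (hypothetical) rejects a pod whose referenced PodGroup is
// missing, either because it was never created or because it was deleted.
func checkPodGroup(p pod, known podGroupIndex) (code, string) {
	if p.podGroupName == "" {
		// Pods without a scheduling group follow the regular scheduling path.
		return success, ""
	}
	key := p.namespace + "/" + p.podGroupName
	if !known[key] {
		return unschedulableAndUnresolvable, fmt.Sprintf("PodGroup %q not found", key)
	}
	return success, ""
}

func main() {
	known := podGroupIndex{"ns-1/pg1": true}
	for _, p := range []pod{
		{namespace: "ns-1", name: "worker-0", podGroupName: "pg1"},
		{namespace: "ns-1", name: "worker-1", podGroupName: "pg2"},
		{namespace: "ns-1", name: "standalone"},
	} {
		c, msg := checkPodGroup(p, known)
		fmt.Println(p.name, c, msg)
	}
}
```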

Increased etcd object count

A new object per replica means more objects in etcd. The mitigation is that PodGroups are owned by controllers with ownerReferences, so they are automatically garbage collected when the replica is deleted. Also, each PodGroup object is small (~1KB) compared to a potentially large Workload object (~1.5MB) with the embedded PodGroup design.

Design Details

Naming

  • Workload and PodGroup are the resource Kinds.
  • scheduling.k8s.io is the API group.
  • spec.schedulingGroup is the name of the new field in pod.
  • Within a Workload there is a list of groups of pods. Each group represents a top-level division of pods within a Workload. Each group can be independently gang scheduled (or not use gang scheduling). This group is named PodGroup and represented by the PodGroup API resource.
  • In the future, we expect that this group can optionally specify further subdivision into subgroups. Each subgroup can have an index. The indexes go from 0 to N, without repeats or gaps. These subgroups are called PodSubGroup.
  • In subsequent KEPs, we expect that a sub-group can optionally specify further subdivision into pod equivalence classes. All pods in a pod equivalence class have the same values for all fields that affect scheduling feasibility. These pod equivalence classes are called PodSet.

PodGroup Naming Conventions

  • PodGroup names must be unique within the namespace.
  • The name must be a valid DNS subdomain[7].
  • The controller that creates the PodGroup is responsible for generating the name based on the above conventions.
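
As a rough illustration of the DNS subdomain rule, the sketch below checks a candidate name against the usual DNS-1123 subdomain constraints: lowercase alphanumerics, '-' and '.', starting and ending with an alphanumeric character, and at most 253 characters. The function name isValidPodGroupName is hypothetical; real validation would go through the API server's existing validation helpers rather than a hand-written regular expression.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsSubdomainMaxLen is the maximum length of a DNS-1123 subdomain.
const dnsSubdomainMaxLen = 253

// dnsSubdomainRE matches one or more dot-separated DNS-1123 labels.
var dnsSubdomainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// isValidPodGroupName (hypothetical) reports whether name is a valid DNS subdomain.
func isValidPodGroupName(name string) bool {
	return len(name) > 0 && len(name) <= dnsSubdomainMaxLen && dnsSubdomainRE.MatchString(name)
}

func main() {
	for _, name := range []string{"training-worker-0", "job.instance.worker-0", "Training_Worker", "-bad"} {
		fmt.Printf("%q valid=%v\n", name, isValidPodGroupName(name))
	}
}
```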

Associating Pods with PodGroups

We propose introducing a SchedulingGroup field in PodSpec (replacing the previous WorkloadReference) to link the Pod to its scheduling context.

type PodSpec struct {
	...
	
	// WorkloadRef is tombstoned: it was replaced by SchedulingGroup in 1.36.
	// WorkloadRef *WorkloadReference
	
	// SchedulingGroup provides a reference to the immediate scheduling runtime grouping object that this Pod 
	// belongs to. In the current implementation, this is always a PodGroup, but it may evolve in the future to support
	// other concepts like PodSubGroups.
	// This field is used by the scheduler to identify the PodGroup and apply the
	// correct group scheduling policies. The PodGroup object referenced
	// by this field may not exist at the time the Pod is created.
	// This field is immutable, but a PodGroup object with the same name
	// may be recreated with different policies. Doing this during pod scheduling
	// may result in the placement not conforming to the expected policies.
	//
	// +featureGate=GenericWorkload
	// +optional
	SchedulingGroup *PodSchedulingGroup
}

// PodSchedulingGroup identifies the runtime scheduling group instance that a Pod belongs to. 
// The scheduler uses this information to apply workload-aware scheduling semantics.
type PodSchedulingGroup struct {
    // PodGroupName specifies the name of the standalone PodGroup object 
    // that represents the runtime instance of this group.
    // +optional
    // +oneOf=GroupSelection
    PodGroupName *string `json:"podGroupName,omitempty"`
}

At least for Alpha, we start with PodSchedulingGroup as an immutable field in the Pod. In further phases, we may decide to relax validation and allow some of the fields to be set later. Moreover, the visibility into issues (debuggability) will depend on #5501, but we don’t treat it as a blocker.

Why is podGroupName an explicit field in PodSpec rather than using ownerReferences or labels? This decision was mainly based on the immutability requirement for this field. So far, we don’t see any use case where Pods would need to move between PodGroups. Therefore, the decision was to make PodGroupName an immutable field. If we allow for mutations, we need to handle many corner cases (e.g., scheduling a gang, finding nodes for all pods, but suddenly one of the pods was removed from the PodGroup).
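
The immutability rule could be enforced on pod updates along the following lines. This is a sketch with local stand-in types; validateSchedulingGroupUpdate, sameSchedulingGroup and strPtr are hypothetical helper names, and the real check would live in the Pod update validation of the API server.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the PodSpec fields discussed above.
type PodSchedulingGroup struct {
	PodGroupName *string
}

type PodSpec struct {
	SchedulingGroup *PodSchedulingGroup
}

var errSchedulingGroupImmutable = errors.New("spec.schedulingGroup is immutable")

// sameSchedulingGroup compares two optional references by value.
func sameSchedulingGroup(a, b *PodSchedulingGroup) bool {
	if a == nil || b == nil {
		return a == b
	}
	if a.PodGroupName == nil || b.PodGroupName == nil {
		return a.PodGroupName == b.PodGroupName
	}
	return *a.PodGroupName == *b.PodGroupName
}

// validateSchedulingGroupUpdate (hypothetical) rejects any change to the
// scheduling group, including setting or clearing it after creation.
func validateSchedulingGroupUpdate(oldSpec, newSpec PodSpec) error {
	if !sameSchedulingGroup(oldSpec.SchedulingGroup, newSpec.SchedulingGroup) {
		return errSchedulingGroupImmutable
	}
	return nil
}

func strPtr(s string) *string { return &s }

func main() {
	oldSpec := PodSpec{SchedulingGroup: &PodSchedulingGroup{PodGroupName: strPtr("pg1")}}
	unchanged := PodSpec{SchedulingGroup: &PodSchedulingGroup{PodGroupName: strPtr("pg1")}}
	moved := PodSpec{SchedulingGroup: &PodSchedulingGroup{PodGroupName: strPtr("pg2")}}

	fmt.Println(validateSchedulingGroupUpdate(oldSpec, unchanged))
	fmt.Println(validateSchedulingGroupUpdate(oldSpec, moved))
}
```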

If PodTemplate is immutable in the true workload object, how should controllers set PodGroupName per-pod? There are two main cases:

(a) Controller-managed PodGroups: when a controller creates a Pod, it determines the creation context that allows it to define the PodGroup this Pod should belong to. This is similar to the pattern in the DaemonSet controller, where during pod creation we explicitly set the NodeAffinity for each pod. For hierarchical controllers (e.g., JobSet), when there’s a 1:1 mapping between lower-level workload and PodGroup, the higher-level controller can manage PodGroups and set podGroupName in the PodTemplate of the child workloads.

(b) User-managed PodGroups: users can manage PodGroup themselves by setting podGroupName directly in the PodTemplate. Note this is distinct from “bring your own Workload” where a user might reference a custom Workload (to change scheduling policy, gang configuration, TAS constraints, etc.) but still expect the controller to create PodGroups based on that Workload’s template. User-managed PodGroups is specifically for cases where the user wants to control PodGroup creation.

The example below shows how this could look with the decoupled architecture for a simple job-like workload.

A Workload object defines the static PodGroup template:

apiVersion: scheduling.k8s.io/v1alpha2
kind: Workload
metadata:
  name: jobset
spec:
  podGroupTemplates:
    - name: "job-1"
      schedulingPolicy:
        gang:
          minCount: 100

A standalone PodGroup object is created to define the scheduling policy and track a specific runtime instance:

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: job-instance-worker-0
spec:
  podGroupTemplateRef:
    workload:
      workloadName: jobset
      podGroupTemplateName: job-1
  # schedulingPolicy is copied from template on PodGroup creation.
  schedulingPolicy:
    gang:
      minCount: 100

And finally, the Pod references the immediate scheduling group (PodGroup):

apiVersion: v1
kind: Pod
metadata:
  name: jobset-job-1-abc123
spec:
  ...
  schedulingGroup:
    podGroupName: job-instance-worker-0
  ...

We chose this option because it is more succinct and makes the role of a pod clear just from inspecting the pod (and makes grouping simple and efficient). We acknowledge that this option may require additional minor changes in the controllers to adopt this pattern (e.g., for LeaderWorkerSet we will need to populate the pod template similarly to how we currently populate the labels).
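
To illustrate why an explicit field makes grouping simple, the sketch below buckets pods by their podGroupName in a single pass, with no selector evaluation. The pod type and groupByPodGroup are hypothetical and assume all pods live in one namespace.

```go
package main

import (
	"fmt"
	"sort"
)

// pod carries only the fields needed for grouping.
type pod struct {
	name         string
	podGroupName string // empty when the pod has no schedulingGroup
}

// groupByPodGroup (hypothetical) returns pod names keyed by PodGroup name.
// Pods without a scheduling group are skipped.
func groupByPodGroup(pods []pod) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pods {
		if p.podGroupName == "" {
			continue
		}
		groups[p.podGroupName] = append(groups[p.podGroupName], p.name)
	}
	// Sort each bucket so the result does not depend on input order.
	for _, names := range groups {
		sort.Strings(names)
	}
	return groups
}

func main() {
	groups := groupByPodGroup([]pod{
		{name: "worker-1", podGroupName: "job-instance-worker-0"},
		{name: "worker-0", podGroupName: "job-instance-worker-0"},
		{name: "leader-0", podGroupName: "job-instance-leader-0"},
		{name: "standalone"},
	})
	fmt.Println(groups)
}
```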

The primary alternative we considered was to introduce a PodGroupSelector on each PodGroup to identify the pods belonging to it. However, with this pattern:

  • there are additional corner cases (e.g. a pod links to a workload but none of its PodGroups match that pod)
  • for replicated gang, we can’t use the full label selector, but rather support specifying only the label key, similar to MatchLabelKeys in pod affinity

Decoupling Workload from PodGroup (in 1.36) clearly separates the role of a PodGroup (runtime grouping, status and scheduling policy) from its template (Workload). We decided on this approach because it improves etcd scalability (sharding status updates across PodGroup objects) and clarifies object lifecycle management as described in the original design[3].

API

The Workload resource is defined as a collection of pod group templates. This ensures that the policy definition remains static and decoupled from individual runtime instances.

// Workload allows for expressing scheduling constraints that should be used
// when managing the lifecycle of workloads from the scheduling perspective,
// including scheduling, preemption, eviction and other phases.
// Workload API enablement is toggled by the GenericWorkload feature gate.
type Workload struct {
	metav1.TypeMeta
	// Standard object's metadata.
	//
	// +optional
	metav1.ObjectMeta

	// Spec defines the desired behavior of a Workload.
	//
	// +required
	Spec WorkloadSpec
}

// WorkloadMaxPodGroups is the maximum number of pod groups per Workload.
const WorkloadMaxPodGroups = 8

// WorkloadSpec defines the templates for pod groups within a workload.
type WorkloadSpec struct {
    // ControllerRef is an optional reference to the controlling object, such as a
    // Deployment or Job. This field is intended for use by tools like CLIs
    // to provide a link back to the original workload definition.
    // When set, it cannot be changed.
    //
    // +optional
    ControllerRef *TypedLocalObjectReference
    
    // PodGroupTemplates is the list of templates that make up the Workload.
    // The maximum number of podGroupTemplates is 8. This field is immutable.
    //
    // +optional
    // +listType=map
    // +listMapKey=name
    PodGroupTemplates []PodGroupTemplate
}

// TypedLocalObjectReference allows referencing a typed object inside the same namespace.
type TypedLocalObjectReference struct {
	// APIGroup is the group for the resource being referenced.
	// If APIGroup is empty, the specified Kind must be in the core API group.
	// For any other third-party types, setting APIGroup is required.
	// It must be a DNS subdomain.
	//
	// +optional
	APIGroup string
	// Kind is the type of resource being referenced.
	// It must be a path segment name.
	//
	// +required
	Kind string
	// Name is the name of resource being referenced.
	// It must be a path segment name.
	//
	// +required
	Name string
}

// PodGroupTemplate represents a template for a set of pods with a scheduling policy.
type PodGroupTemplate struct {
    // Name is a unique identifier for the PodGroupTemplate within the Workload.
    // It must be a DNS label. This field is immutable.
    //
    // +required
    Name string
    
    // SchedulingPolicy defines the scheduling policy for this PodGroupTemplate.
    //
    // +required
    SchedulingPolicy PodGroupSchedulingPolicy
}

// PodGroupSchedulingPolicy defines the scheduling configuration for a PodGroup.
// Exactly one policy must be set.
type PodGroupSchedulingPolicy struct {
    // Basic specifies that the pods in this group should be scheduled using
    // standard Kubernetes scheduling behavior.
    //
    // +optional
    // +oneOf=PolicySelection
    Basic *BasicSchedulingPolicy
    
    // Gang specifies that the pods in this group should be scheduled using
    // all-or-nothing semantics.
    //
    // +optional
    // +oneOf=PolicySelection
    Gang *GangSchedulingPolicy
}

// BasicSchedulingPolicy indicates that standard Kubernetes
// scheduling behavior should be used.
type BasicSchedulingPolicy struct {
	// This is intentionally empty. Its presence indicates that the basic
	// scheduling policy should be applied. In the future, new fields may appear,
	// describing such constraints on a pod group level without "all or nothing"
	// (gang) scheduling.
}

// GangSchedulingPolicy defines the parameters for gang scheduling.
type GangSchedulingPolicy struct {
	// MinCount is the minimum number of pods that must be schedulable or scheduled
	// at the same time for the scheduler to admit the entire group.
	// It must be a positive integer.
	//
	// +required
	MinCount int32
}
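
The constraints stated in the comments above (at most 8 templates, unique template names, exactly one policy set, positive minCount) could be checked roughly as follows. This is a sketch over local copies of the types, not the actual API validation; validatePolicy and validateTemplates are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkloadMaxPodGroups mirrors the limit declared in the API above.
const WorkloadMaxPodGroups = 8

type BasicSchedulingPolicy struct{}

type GangSchedulingPolicy struct{ MinCount int32 }

type PodGroupSchedulingPolicy struct {
	Basic *BasicSchedulingPolicy
	Gang  *GangSchedulingPolicy
}

type PodGroupTemplate struct {
	Name             string
	SchedulingPolicy PodGroupSchedulingPolicy
}

// validatePolicy enforces the oneOf rule and the positive minCount rule.
func validatePolicy(p PodGroupSchedulingPolicy) error {
	set := 0
	if p.Basic != nil {
		set++
	}
	if p.Gang != nil {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of basic or gang must be set")
	}
	if p.Gang != nil && p.Gang.MinCount <= 0 {
		return errors.New("gang.minCount must be a positive integer")
	}
	return nil
}

// validateTemplates checks the list-level limit, name uniqueness, and each policy.
func validateTemplates(templates []PodGroupTemplate) error {
	if len(templates) > WorkloadMaxPodGroups {
		return fmt.Errorf("at most %d podGroupTemplates are allowed, got %d", WorkloadMaxPodGroups, len(templates))
	}
	seen := make(map[string]bool, len(templates))
	for _, t := range templates {
		if seen[t.Name] {
			return fmt.Errorf("duplicate podGroupTemplate name %q", t.Name)
		}
		seen[t.Name] = true
		if err := validatePolicy(t.SchedulingPolicy); err != nil {
			return fmt.Errorf("podGroupTemplate %q: %w", t.Name, err)
		}
	}
	return nil
}

func main() {
	ok := []PodGroupTemplate{{Name: "worker", SchedulingPolicy: PodGroupSchedulingPolicy{Gang: &GangSchedulingPolicy{MinCount: 100}}}}
	bad := []PodGroupTemplate{{Name: "worker", SchedulingPolicy: PodGroupSchedulingPolicy{}}}
	fmt.Println(validateTemplates(ok))
	fmt.Println(validateTemplates(bad))
}
```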

The PodGroup resource is a separate API object in scheduling.k8s.io/v1alpha2:

// API Group: scheduling.k8s.io/v1alpha2

// PodGroup represents a runtime instance of pods grouped together.
// PodGroups are created by workload controllers (Job, LWS, JobSet, etc...) from
// Workload.podGroupTemplates.
// PodGroup API enablement is toggled by the GenericWorkload feature gate.
type PodGroup struct {
    metav1.TypeMeta
    
    // Standard object's metadata.
    // More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata
    //
    // +optional
    metav1.ObjectMeta
    
    // Spec defines the desired state of the PodGroup.
    // +required
    Spec PodGroupSpec
    
    // Status represents the current observed state of the PodGroup.
    // +optional
    Status PodGroupStatus
}

// PodGroupSpec defines the desired state of a PodGroup.
type PodGroupSpec struct {
    // PodGroupTemplateRef references the PodGroupTemplate within the Workload object that was used to create
    // the PodGroup.
    //
    // +optional
    PodGroupTemplateRef *PodGroupTemplateReference
    
    // SchedulingPolicy defines the scheduling policy for this instance of the PodGroup.
    // Controllers are expected to fill this field by copying it from a PodGroupTemplate.
    // This field is immutable.
    //
    // +required
    SchedulingPolicy *PodGroupSchedulingPolicy
}

// PodGroupStatus represents information about the status of a pod group.
type PodGroupStatus struct {
    // Conditions represent the latest observations of the PodGroup's state.
    //
    // Known condition types:
    // - "PodGroupScheduled": Indicates whether the scheduling requirement has been satisfied.
    // - "DisruptionTarget": Indicates whether the PodGroup is about to be terminated
    //   due to disruption such as preemption.
    //
    // Known reasons for the PodGroupScheduled condition:
    // - "Unschedulable": The PodGroup cannot be scheduled due to resource constraints,
    //   affinity/anti-affinity rules, or insufficient capacity for the gang.
    // - "SchedulerError": The PodGroup cannot be scheduled due to some internal error
    //   that happened during scheduling, for example due to nodeAffinity parsing errors.
    //
    // Known reasons for the DisruptionTarget condition:
    // - "PreemptionByScheduler": The PodGroup was preempted by the scheduler to make room for
    //   higher-priority PodGroups or Pods.
    //
    // +optional
    Conditions []metav1.Condition
}

// PodGroupTemplateReference references a PodGroup template defined in some object (e.g. Workload).
// Exactly one reference must be set.
type PodGroupTemplateReference struct {
    // Workload references the PodGroupTemplate within the Workload object that was used to create
    // the PodGroup.
    // +optional
    Workload *WorkloadPodGroupTemplateReference
}

// WorkloadPodGroupTemplateReference references the PodGroupTemplate within the Workload object.
type WorkloadPodGroupTemplateReference struct {
    // WorkloadName defines the name of the Workload object.
    // +required
    WorkloadName string

    // PodGroupTemplateName defines the PodGroupTemplate name within the Workload object.
    // +required
    PodGroupTemplateName string
}

Individual PodGroup objects are treated as independent scheduling units. If a Workload defines multiple templates or if multiple PodGroup objects are created referencing the same template, each PodGroup instance is scheduled independently. A LeaderWorkerSet is a good example of this, where a controller creates a standalone PodGroup instance for each replica (consisting of a leader and its workers) to form an atomic scheduling and runtime unit. If the underlying user intention is to have multiple groups run together, they should use the future hierarchical model.

Note: Similarly to PodSchedulingGroup, all fields in PodGroupTemplateReference, and the PodGroupTemplateRef field itself, are intentionally made optional. Validation that these fields are set will be implemented in code, which allows this structure to be extended in the future if needed.

PodGroup Status Lifecycle

The PodGroup.Status is managed by kube-scheduler to reflect the scheduling status. For alpha, we introduce a Conditions field with the PodGroupScheduled condition type, but more fields may be added for beta and GA.

PodGroup status mirrors Pod status semantics rather than defining PodGroup-specific reasons:

  • If pods are unschedulable (e.g., due to a timeout, insufficient resources, or affinity rules), the scheduler updates the PodGroupScheduled condition to False and sets the reason and message fields accordingly.
  • If pods are scheduled, the scheduler updates the PodGroupScheduled condition to True after the group is accepted by the Permit phase.

For basic scheduling policy, when the pod related to the PodGroup gets scheduled, the scheduler updates the PodGroupScheduled condition to True.

Alpha Status Transition Rules

Once a PodGroup transitions to PodGroupScheduled=True, it is treated as a terminal scheduling state and does not revert to False. Specifically, once the group’s scheduling constraint (minCount) has been satisfied, subsequent failed scheduling cycles for additional pods beyond minCount do not regress the condition. On same-status transitions (e.g., True → True), the condition message may be updated, but LastTransitionTime remains unchanged.

This is a deliberate simplification for alpha. In practice, pods may later become unschedulable (for example, due to node failures or evictions), but the PodGroup status will not reflect that. A more robust status lifecycle that can capture post-scheduling state changes must be designed for beta graduation.
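The transition rules above can be sketched in a few lines of Go. This is a minimal illustration, not the scheduler's implementation: the Condition type is a simplified stand-in for metav1.Condition, nextCondition is a hypothetical helper, and letting the reason change together with the message on a same-status update is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition (assumption: only
// the fields needed to show the transition rules are modeled).
type Condition struct {
	Type               string
	Status             string // "True" or "False"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// nextCondition is a hypothetical helper that applies the alpha transition
// rules to the current PodGroupScheduled condition (nil if not set yet) and
// returns the condition that should be stored.
func nextCondition(cur *Condition, desired Condition, now time.Time) Condition {
	if cur == nil {
		// First observation: record the transition time.
		desired.LastTransitionTime = now
		return desired
	}
	if cur.Status == "True" && desired.Status != "True" {
		// True is terminal for alpha: never regress to False.
		return *cur
	}
	if cur.Status == desired.Status {
		// Same-status update: the message may change, the timestamp may not.
		desired.LastTransitionTime = cur.LastTransitionTime
		return desired
	}
	// Real transition (False -> True).
	desired.LastTransitionTime = now
	return desired
}

func main() {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	t1 := t0.Add(time.Minute)

	scheduled := nextCondition(nil, Condition{Type: "PodGroupScheduled", Status: "True"}, t0)
	// A later failed cycle for an extra pod must not regress the condition.
	after := nextCondition(&scheduled, Condition{Type: "PodGroupScheduled", Status: "False", Reason: "Unschedulable"}, t1)
	fmt.Println(after.Status, after.LastTransitionTime.Equal(t0)) // prints "True true"
}
```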

Implementation Notes (Alpha)

  • Synchronous status updates: Status updates are performed synchronously within the scheduling cycle. Asynchronous updates will be explored once the AsyncAPICalls scheduler feature is available.
  • Strategic merge patch: Status updates use StrategicMergePatch (not Server-Side Apply) to match the approach used for pod status updates in the scheduler and avoid the performance overhead of SSA in core controllers.
  • Informer cache staleness: The scheduler reads the existing PodGroup condition from the informer cache before deciding whether to skip an update. There is a small race window where the cache may not yet reflect a recent status write. This is acceptable for alpha (similar to pod status updates) but may need to be addressed if a PodGroup “assume” mechanism is introduced later.

PodGroup Deletion Protection

The PodGroup lifecycle needs to ensure that a PodGroup will not be deleted while any pod that references it is in a non-terminal phase (i.e. not Succeeded or Failed).

PodGroup objects are created with a dedicated finalizer, and a dedicated PodGroup controller removes it only when the deletion-safe condition is met. The mechanism for this is:

  • Each PodGroup is created with a dedicated finalizer. If PodGroup objects exist without this finalizer (i.e., created before the feature), the controller adds it when processing them.
  • The controller watches PodGroup and Pod objects. For a PodGroup that has deletionTimestamp set and still has the finalizer (a deletion candidate), it checks whether all pods that reference this PodGroup have reached a terminal phase (Succeeded or Failed).
  • If all referencing pods are terminal, only then the controller removes the finalizer, allowing the PodGroup to be deleted.
  • If any referencing pod is non-terminal, the controller leaves the finalizer in place and re-enqueues the PodGroup (e.g., on pod updates).
  • To find the referencing pods, we can use an index keyed by schedulingGroup.podGroupName (and optionally namespace) so the controller can efficiently list pods that reference a given PodGroup.

Deletion protection is not required for alpha (nice-to-have), however it is required for beta graduation.
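The finalizer check described above could look roughly like the following sketch. The Pod type, the "namespace/podGroupName" index key, and the helper names are assumptions made for illustration; a real controller would use an informer indexer rather than a plain map.

```go
package main

import "fmt"

// Pod models only the fields this sketch needs (assumption: the real Pod
// carries the reference under schedulingGroup.podGroupName).
type Pod struct {
	Namespace    string
	PodGroupName string
	Phase        string // "Pending", "Running", "Succeeded", "Failed"
}

// indexKey builds the hypothetical index key "<namespace>/<podGroupName>".
func indexKey(namespace, podGroupName string) string {
	return namespace + "/" + podGroupName
}

// buildIndex groups pods by the PodGroup they reference.
func buildIndex(pods []Pod) map[string][]Pod {
	idx := make(map[string][]Pod)
	for _, p := range pods {
		if p.PodGroupName == "" {
			continue // pod does not belong to any PodGroup
		}
		k := indexKey(p.Namespace, p.PodGroupName)
		idx[k] = append(idx[k], p)
	}
	return idx
}

// canRemoveFinalizer reports whether every pod referencing the PodGroup has
// reached a terminal phase (Succeeded or Failed).
func canRemoveFinalizer(idx map[string][]Pod, namespace, podGroupName string) bool {
	for _, p := range idx[indexKey(namespace, podGroupName)] {
		if p.Phase != "Succeeded" && p.Phase != "Failed" {
			return false
		}
	}
	return true
}

func main() {
	idx := buildIndex([]Pod{
		{Namespace: "ns", PodGroupName: "pg-a", Phase: "Succeeded"},
		{Namespace: "ns", PodGroupName: "pg-a", Phase: "Running"},
		{Namespace: "ns", PodGroupName: "pg-b", Phase: "Failed"},
	})
	fmt.Println(canRemoveFinalizer(idx, "ns", "pg-a")) // prints "false"
	fmt.Println(canRemoveFinalizer(idx, "ns", "pg-b")) // prints "true"
}
```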

SchedulingPolicy Reference vs. Copy/Inline in PodGroup

We evaluated two architectural approaches for linking PodGroup to its scheduling policy:

  • Reference: where PodGroup points to Workload.PodGroupTemplates[x]
  • Copy/Inline: where PodGroup contains an inline copy of the policy (snapshot on creation)

The Reference model offers a single source of truth and lower write amplification, but introduces “action at a distance” semantics where modifying a Workload can break all existing PodGroups.

The Copy/Inline model makes PodGroup a self-contained object, matching the familiar ReplicaSet.spec.template -> Pod pattern. It reduces blast radius (Workload changes only affect newly created PodGroups) and simplifies debugging.

We propose adopting Copy/Inline for Alpha. If scalability concerns emerge, the model can be extended by adding an optional reference field alongside the inline policy (with validation ensuring exactly one is set), preserving a mitigation path.

While this argument works both ways, stability and extensibility are concrete risks we should address from the start, whereas performance concerns remain theoretical.

PodGroup Creation Ordering

Since PodGroup is a runtime object created by true workload4 controllers, strict creation ordering (PodGroup must exist before Pods) is required to ensure the consistency of the scheduling policy.

Semantics:

  • Pods with schedulingGroup.podGroupName set to a non-existent PodGroup are marked as UnschedulableAndUnresolvable.
  • The scheduler re-enqueues these pods when the PodGroup is created (via informer Add event).

This allows controllers to handle transient race conditions during object creation.

Controller Responsibility: True workload4 controllers are responsible for creating PodGroup and Workload objects before creating Pods. The required order is:

  1. Create Workload object
  2. Create PodGroup runtime object
  3. Create Pods with schedulingGroup.podGroupName set to the name of the newly created PodGroup

Ownership and Object Relationship

The PodGroup API introduces an ownership hierarchy among Workload, PodGroup, and Pod objects.

graph TB
    TW["Job / JobSet / LWS"]
    subgraph Objects
      W[Workload API]
    end
    
    PG[PodGroup]
    P[Pods]

    P -.->|ref| PG

    TW ==>|"1. creates and owns"| W
    TW ==>|"2. creates and owns"| PG
    TW ==>|"3. creates and owns"| P
   

    PG -.->|ref| W  

The PodGroup object is created and owned by the true workload controller4. When the controller needs to create pods that require gang scheduling, it first creates the Workload object if it does not exist yet and then creates a PodGroup based on the podGroupTemplate that is defined in this Workload. This ensures automatic garbage collection when the parent object is deleted.

Pods reference their PodGroup via schedulingGroup.podGroupName, which allows the scheduler to look up the PodGroup object. The scheduler requires the PodGroup object to exist before scheduling pods that reference it.

Scheduler Changes

The kube-scheduler will add a new informer to watch PodGroup objects. If the PodGroup is missing, the pod remains unschedulable until the PodGroup is created and observed by the scheduler.

In the initial implementation, we expect users to create the Workload and PodGroup objects. In the next steps, controllers will be updated (e.g. the Job controller in KEP-5547) to create appropriate Workload and PodGroup objects themselves whenever they can infer the intention from the desired state. Note that because scheduling policies are stored in the PodGroup object, pods linked to a PodGroup object will not be scheduled until that PodGroup object is created and observed by the kube-scheduler.

North Star Vision

The north star vision for gang scheduling implementation should satisfy the following requirements:

  1. Ensure that pods being part of a gang are not bound if all pods belonging to it can’t be scheduled.
  2. Provide the “optimal enough” placement by considering all pods from a gang together.
  3. Avoid deadlock and livelock scenarios when multiple workloads are being scheduled at the same time by kube-scheduler.
  4. Avoid deadlock and livelock scenarios when multiple workloads are being scheduled at the same time by different schedulers.
  5. Avoid premature preemptions of already running pods in case a higher priority gang will be rejected.
  6. Support gang-level (or workload-level in general) preemption (if pods form a gang also from a runtime perspective, they can’t be preempted individually).
  7. Update workload status and trigger rescheduling when a gang fails binding, in an all-or-nothing fashion.
  8. Support gang-scheduling even if part of the infrastructure needs to be provisioned (by Cluster Autoscaler, Karpenter or other solutions).

Addressing all these requirements in a single shot would be a huge change, so as part of this KEP we will only focus on a subset of those. However, we very briefly sketch the path towards the vision to ensure that this KEP is moving in the right direction.

GangScheduling Plugin

For Alpha, we are focusing on introducing the concept of the PodGroup and plumbing it into kube-scheduler in the simplest possible way. The GangScheduling plugin will maintain a lister for PodGroup and check if the PodGroup object exists. We will implement a new plugin implementing the following hooks:

- PreEnqueue: used as a barrier to wait for the PodGroup object and minimum number of pods to be observed by the scheduler before even considering them for actual scheduling. The extension will check if the PodGroup object exists. If not, it will return UnschedulableAndUnresolvable status. Then it verifies that at least minCount pods have been observed, ensuring there are enough pods to consider before enqueuing them.

- WaitOnPermit: used as a barrier to wait for the pods to be assigned to the nodes before initiating potential preemptions and their bindings. The extension waits for all pods in the PodGroup to reach permit stage by using each pod’s schedulingGroup.podGroupName to identify the PodGroup that the pod belongs to.

- EventsToRegister (Enqueue): The extension will register events for when a PodGroup object is created and when an unscheduled pod is added.

This seems to be the simplest possible implementation to address the requirement (1). We are consciously ignoring the rest of the requirements for Alpha phase.
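The PreEnqueue barrier described above can be illustrated with the following sketch. The types and the preEnqueue function are hypothetical stand-ins rather than the scheduler framework API, and the NotEnoughPods code is an assumed placeholder, since the text does not name the status returned while fewer than minCount pods have been observed.

```go
package main

import "fmt"

// Code is a simplified stand-in for a scheduler framework status code.
type Code int

const (
	Success Code = iota
	// NotEnoughPods is a hypothetical placeholder for the "keep waiting"
	// result; the KEP text does not name the concrete status.
	NotEnoughPods
	UnschedulableAndUnresolvable
)

// PodGroup models only the fields needed by the barrier.
type PodGroup struct {
	Name     string
	MinCount int
}

// preEnqueue sketches the barrier: first the PodGroup must exist, then at
// least MinCount member pods must have been observed by the scheduler.
func preEnqueue(groups map[string]PodGroup, observedPods map[string]int, podGroupName string) Code {
	pg, ok := groups[podGroupName]
	if !ok {
		return UnschedulableAndUnresolvable
	}
	if observedPods[podGroupName] < pg.MinCount {
		return NotEnoughPods
	}
	return Success
}

func main() {
	groups := map[string]PodGroup{"pg-a": {Name: "pg-a", MinCount: 3}}
	observed := map[string]int{"pg-a": 2}

	fmt.Println(preEnqueue(groups, observed, "pg-missing") == UnschedulableAndUnresolvable) // prints "true"
	fmt.Println(preEnqueue(groups, observed, "pg-a") == NotEnoughPods)                      // prints "true"
	observed["pg-a"] = 3
	fmt.Println(preEnqueue(groups, observed, "pg-a") == Success) // prints "true"
}
```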

Future plans

We will continue with further improvements on top of it with follow-up KEPs. We are planning to introduce the concept of a Reservation that will allow treating a distributed subset of resources as a single unit from the scheduling perspective. With that, the proposed placement resulting from the scheduling decision of the PodGroup phase will become a Reservation. This will become the coordination point and a mechanism for multiple schedulers to share the underlying infrastructure, addressing requirement (4). This will also be a critical building block for workload-level preemption, addressing requirement (6). It will also allow addressing the few remaining corner cases around unnecessary preemption (requirement (5)), such as blocking DRA resources (which we can’t solve with NominatedNodeName). Further extensions to Reservation with different states (e.g., one that does not yet block resources) will help improve scheduling accuracy. Finally, making the binding process aware of gangs will ensure that the process either succeeds or triggers workload rescheduling, satisfying requirement (7).

Addressing requirement (8) is the biggest effort as it requires much closer integration between scheduler and autoscaling components. So in the initial steps we will only focus on mitigating this problem with existing mechanisms (e.g. reserving resources via NominatedNodeName).

However, approval for this KEP is NOT an approval for this vision. We only sketch it to show that we see a viable path forward from the proposed design that will not require significant rework.

We plan to extend PodGroup and Workload APIs to support hierarchical PodGroups structure for advanced batch workloads. Potential features include:

  • Allow PodGroup objects to reference parent PodGroup for hierarchical scheduling structures.
  • Design hierarchical PodGroup lifecycle management and status tracking.

Scheduler Changes for v1.36

For the Alpha phase in v1.35, we focused on plumbing the Workload API and implementing the GangScheduling plugin using simple barriers (PreEnqueue and Permit). While this satisfied the correctness requirement for “all-or-nothing” scheduling, it did not address performance or efficiency at scale, scheduling livelocks, nor did it solve the problem of partial preemption application.

For v1.36, we propose introducing a Workload Scheduling Cycle. This mechanism processes all Pods belonging to a single PodGroup in one batch, rather than attempting to schedule them individually in isolation using the traditional pod-by-pod approach. While introduction of this phase itself won’t fully address the “optimal enough” part of requirement (2), it provides the necessary foundation for applying workload scheduling algorithms to process the entire gang together. The single scheduling cycle, together with blocking resources using nomination, will address requirement (3).

We will also introduce Delayed Preemption. Together with the introduction of a dedicated Workload Scheduling Cycle, this will address requirement (5).

The Workload Scheduling Cycle

We introduce a new phase in the main scheduling loop (scheduleOne). This phase replaces the standard pod-by-pod scheduling cycle for all Pods belonging to a PodGroup. This means that these individual Pods do not enter the standard scheduling queue for independent processing. Instead, when the loop pops a PodGroup from the active queue, it initiates the Workload Scheduling Cycle.

Since the PodGroup instance (defined by the group name and replica key) is the effective scheduling unit, the Workload Scheduling Cycle will operate at the PodGroup instance level, i.e., each instance will be scheduled separately in its own cycle.

If new Pods belonging to an already scheduled PodGroup instance (i.e., one that already passed WaitOnPermit) appear, they are also processed via the Workload Scheduling Cycle, which takes the previously scheduled Pods into consideration. This is done for safety reasons to ensure the PodGroup-level constraints are still satisfied. However, if the PodGroup is currently being processed, these new Pods must wait for the ongoing pod group scheduling to finish (pass WaitOnPermit) before being considered. This simplifies preemption, because we can be sure that the decision won’t change while the previous attempt hasn’t finished yet.

The cycle proceeds as follows:

  1. The scheduler takes a pod group from the scheduling queue. The retrieved object contains the list of all pending pods belonging to this group. The order of processing is determined by the queueing mechanism (see Queuing and Ordering below).

  2. A single cluster state snapshot is taken for the entire group operation to ensure consistency during the cycle.

  3. The scheduler runs a specialized algorithm (detailed below) to find placements for the group.

  4. Outcome:

    • If the group (i.e., at least minCount Pods) can be placed, these Pods proceed directly to the binding cycle with their selected nodes.
    • In case preemption is required, the PodGroup is moved back to the scheduling queue to wait for the preemption to take effect. This requires a subsequent Workload Scheduling Cycle to verify that the released resources make the placement feasible.
    • If minCount cannot be met (even after calculating potential preemptions), the scheduler considers the PodGroup unschedulable. Standard backoff logic applies (see Failure Handling), and Pods are returned to the scheduling queue.

Queuing and Ordering

Workload-aware preemption (an Alpha effort in KEP-5710) will introduce a specific scheduling priority for a workload. With that in mind, it is beneficial to design a queueing mechanism that is open to taking a workload’s scheduling priority into account. However, as we need to support ordering before that feature can be enabled, we also need to derive the priority from the pod group’s pods. One such formula is to set it to the lowest priority found within the pod group. That pod is effectively the weakest link that determines whether the whole pod group is schedulable, so using its priority reduces unnecessary preemption attempts.
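As a small illustration of the "lowest priority in the group" formula, consider the sketch below. The function name and the plain int32 slice are assumptions; the real queue would read priorities from the member Pod objects.

```go
package main

import "fmt"

// groupPriority returns the lowest priority among the pod group's pods, which
// this sketch uses as the queueing priority of the whole group. The boolean is
// false when the group has no pods yet.
func groupPriority(podPriorities []int32) (int32, bool) {
	if len(podPriorities) == 0 {
		return 0, false
	}
	lowest := podPriorities[0]
	for _, p := range podPriorities[1:] {
		if p < lowest {
			lowest = p
		}
	}
	return lowest, true
}

func main() {
	prio, ok := groupPriority([]int32{1000, 200, 500})
	fmt.Println(prio, ok) // prints "200 true"
}
```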

To ensure that we process the PodGroup instance at an appropriate time and don’t starve other pods from being scheduled, we need to have a good queueing mechanism for pod groups.

We have decided to make the scheduling queue explicitly workload-aware. The queue will support queuing PodGroup instances alongside individual Pods.

  1. When Pods belonging to a PodGroup are added to the scheduler, if a corresponding QueuedPodGroupInfo is not yet present in the scheduling queue, it is created and enqueued. This object will have an aggregated PreEnqueue check, evaluating conditions for all its members. Crucially, the individual Pods themselves are not stored in any standard scheduling queue data structure (active, backoff, or unschedulable), but they are effectively managed via the QueuedPodGroupInfo.

  2. Once the number of accumulated Pods meets the scheduling requirements (e.g., minCount), a QueuedPodGroupInfo object is moved to the activeQ, following the logic similar to individual pods.

  3. The scheduleOne loop will pop the highest-priority item from the queue, which may now be either a single Pod (triggering the standard cycle) or a PodGroup (triggering the Workload Scheduling Cycle).

  4. During a Workload Scheduling Cycle, all member Pods are retrieved from the QueuedPodGroupInfo. Based on the cycle’s outcome:

    • Success: Pods are moved directly to the binding cycle.
    • Failure/Preemption: The QueuedPodGroupInfo (containing the unschedulable pods) is returned to the unschedulablePodInfos structure. The PodGroup enters a backoff state and is eligible for retry only when a relevant cluster event wakes up at least one of its member pods.

While this represents a significant architectural change to the scheduling queue and scheduleOne loop, it provides a clean separation of concerns and establishes a necessary foundation for future Workload Aware Scheduling features.

Scheduling Algorithm

Note: The algorithm described below is a simplified default version based on baseline scheduling logic. It is expected to evolve to more effectively handle complex scenarios and specific features in future iterations.

The internal algorithm for placing the group utilizes the optimization defined in Opportunistic Batching (KEP-5598) for improved performance. The approach described below allows mitigating some restrictions of that feature, e.g., by sorting the Pods appropriately by their signatures. In case Opportunistic Batching is disabled or not applicable, this falls back to non-optimized filtering and scoring for each Pod. The list and configuration of plugins used by this algorithm will be the same as in the pod-by-pod cycle.

  1. The scheduler iterates through the retrieved Pods and groups them into homogeneous sub-groups (using the signatures defined in KEP-5598). This aggregation can be done in the scheduler’s cache earlier to optimize performance.

  2. These sub-groups are sorted. Initially, we sort by the highest priority of the sub-group (assuming homogeneity enforces uniform sub-group priority). In the future, sorting may use the size of the sub-group (larger groups first) to tackle the hardest placement problems early. Crucially, the ordering should be deterministic and stable if the pod group state doesn’t change. This sorting can be done in the scheduler’s cache earlier to optimize performance.

  3. The scheduler iterates through the sorted sub-groups. It finds a feasible node for each pod from a sub-group using standard filtering and scoring phases. It also utilizes the Opportunistic Batching feature where possible, reducing overall scheduling time.

    • If a pod fits, it is temporarily assumed and reserved on the selected node.

    • If a pod cannot fit, the scheduler tries preemption by running the PostFilter extension point. Note: With workload-aware preemption this phase will be replaced by a workload-level algorithm that will be run after trying to schedule all pod group’s pods.

      • If the calculated preemption is successful, the pod is temporarily assumed and reserved on the selected node. Victim pods are not preempted yet, but just marked as nominated for removal. Subsequent pods from this group won’t see victims on the nodes in this workload cycle. The Delayed Preemption feature is used to delay the actuation until after all of the group’s pods are considered.

      • If preemption fails, the pod is considered unscheduled for this cycle. However, the scheduling of subsequent pods continues as long as the minCount constraint remains satisfiable. The processing can also be optimized by rejecting all subsequent pods from the same homogeneous sub-group, as their failed scheduling outcome will be the same.

    The phase can effectively stop once minCount pods have a placement, though attempting to schedule the full group is preferred to maximize utilization.

  4. The scheduler checks if the number of schedulable (including those after delayed preemption) Pods meets the minCount.

    • If schedulableCount >= minCount, the cycle succeeds.

      • If preemptions are needed: The removal of all nominated victims is actuated as described in Delayed Preemption. The pods are nominated to their chosen nodes but are moved to the unschedulable queue, waiting for victim removal to complete. They can be moved back to the active queue and retried even before the victims are fully removed, but they must pass through the Workload Scheduling Cycle again. Crucially, initiating new preemptions will be forbidden during this retry. This ensures that the pod group can be scheduled in a different location if resources become available earlier, but cannot cause additional disruption to do so.

      • If preemptions are not needed: Pods proceed directly to their binding cycles using the nodes selected during the Workload Scheduling Cycle.

      The WaitOnPermit gate is retained to ensure that the minCount pods are successfully admitted before binding occurs. Additionally, the minCount check can consider the number of pods that have passed the Workload Scheduling Cycle to ensure that Pods do not wait unnecessarily if some have been rejected while new pods have been added to the cluster.

    • If schedulableCount < minCount, the cycle fails. Preemptions computed but not actuated during this cycle are discarded. Pods go through traditional failure handlers, and their nominations are cleared so that other workloads (pod groups) can be attempted on those nodes. See Failure Handling.

    Gang Scheduling is currently implemented as a plugin, meaning the minCount constraint is enforced at the plugin level. The proposed Workload Scheduling Cycle algorithm needs to know if this constraint is met to decide whether to commit the results. To achieve this, we will reuse the existing Permit extension point, but without the suspension phase (WaitOnPermit). Crucially, this check has to support two modes:

    • Validation: Check whether the currently scheduled pods meet the requirements, e.g., whether minCount pods from a pod group were successfully scheduled.

    • Feasibility: Given the number of pods that have already failed scheduling in this cycle, check whether it is still possible to meet the constraint. If not, the cycle should abort early to save time.

While this algorithm might be suboptimal, it is a solid first step for ensuring we have a single-cycle workload scheduling phase. As long as PodGroups consist of homogeneous pods, opportunistic batching itself will provide significant improvements. Future features like Topology Aware Scheduling can further improve other subsets of use cases.
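The two check modes and the early abort can be sketched as follows. This is a simplified model under stated assumptions: validate, feasible, and runCycle are hypothetical names, each pod's filtering and preemption result is reduced to a single boolean, and no plugins, node selection, or batching are modeled.

```go
package main

import "fmt"

// validate is the "Validation" mode: do the pods placed so far satisfy minCount?
func validate(placed, minCount int) bool {
	return placed >= minCount
}

// feasible is the "Feasibility" mode: given the pods that already failed in
// this cycle, can minCount still be reached?
func feasible(total, failed, minCount int) bool {
	return total-failed >= minCount
}

// runCycle walks the group's pods in order. fits[i] stands for "pod i found a
// placement, possibly via delayed preemption". It returns how many pods were
// placed and whether the cycle succeeded.
func runCycle(fits []bool, minCount int) (placed int, ok bool) {
	failed := 0
	for _, fit := range fits {
		if fit {
			placed++
			continue
		}
		failed++
		if !feasible(len(fits), failed, minCount) {
			// minCount is out of reach: abort early to save time.
			return placed, false
		}
	}
	return placed, validate(placed, minCount)
}

func main() {
	fmt.Println(runCycle([]bool{true, false, true}, 2))  // prints "2 true"
	fmt.Println(runCycle([]bool{false, false, true}, 2)) // prints "0 false"
}
```

In the second call the cycle stops after the second failure, because only one pod remains and minCount is 2; the third pod is never attempted.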

Algorithm Limitations

The default algorithm proposed above relies on specific sorting and may fail to find a valid placement that could have been discovered by processing the group’s pods in a different order. While resolving this limitation could be desirable, implementing a generalized solver for arbitrary constraints would introduce excessive complexity for the default implementation. The current proposal addresses the vast majority of standard use cases (specifically homogeneous workloads). Future improvements for this should be delivered via specialized algorithms based on specific pod group constraints, such as Topology Aware Scheduling (TAS).

Since the scheduler cannot exhaustively analyze all possible placement permutations, we will advise users via documentation regarding which pod group types are well-supported and which scenarios are handled on a best-effort basis (where a successful placement is not guaranteed, even if one theoretically exists).

In particular:

  • For basic homogeneous pod groups without inter-pod dependencies, this algorithm is expected to find a placement whenever one exists.
  • For heterogeneous pod groups, finding a valid placement is not guaranteed.
  • For pod groups with inter-pod dependencies (e.g., affinity/anti-affinity or topology spreading rules), finding a valid placement is not guaranteed.

Moreover, if a pod using these features is rejected by the Workload Scheduling Cycle, its rejection message (exposed via Pod status) will explicitly indicate that the rejection may be due to the use of features for which finding an existing placement cannot be guaranteed. This will be accompanied by a specific failure reason, distinguishing it from a generic Unschedulable reason. This distinction is particularly relevant for Cluster Autoscaler or Karpenter, which can act differently based on the new reason.

In addition to the above, for cases involving intra-group dependencies (e.g., when the schedulability of one pod depends on another group member via inter-pod affinity), this algorithm may fail to find a placement regardless of cluster state, due to the deterministic processing order.

Users will be advised that such dependencies are discouraged. However, they could mitigate this by assigning a lower priority to the dependent pods. Since the algorithm processes higher-priority pods first, this ensures that the required pods are scheduled earlier, to satisfy the affinity rules of the subsequent dependent pods.
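The priority-first processing order that this mitigation relies on can be sketched with a stable sort. The SubGroup type and the tie-break on signature are assumptions added here to make the order deterministic; the KEP only requires that the ordering is deterministic and stable.

```go
package main

import (
	"fmt"
	"sort"
)

// SubGroup is a hypothetical model of a homogeneous sub-group of pods.
type SubGroup struct {
	Signature string // stand-in for the pod signature shared by its pods
	Priority  int32  // uniform priority of the sub-group's pods
}

// sortSubGroups orders sub-groups by priority, highest first. Ties are broken
// by signature (an assumption) so the result does not depend on input order.
func sortSubGroups(groups []SubGroup) {
	sort.SliceStable(groups, func(i, j int) bool {
		if groups[i].Priority != groups[j].Priority {
			return groups[i].Priority > groups[j].Priority
		}
		return groups[i].Signature < groups[j].Signature
	})
}

func main() {
	groups := []SubGroup{
		{Signature: "worker", Priority: 100},  // depends on the leader via affinity
		{Signature: "leader", Priority: 200},  // required pod, scheduled first
		{Signature: "sidecar", Priority: 100}, // same priority as worker
	}
	sortSubGroups(groups)
	for _, g := range groups {
		fmt.Println(g.Signature, g.Priority)
	}
	// Order: leader, sidecar, worker.
}
```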

All pods belonging to a single pod group must share the same .spec.schedulerName. Divergent scheduler names would complicate reasoning about placement decisions and make future pod group-based constraints more difficult to manage. The scheduler will validate this condition: if a mismatch is detected, all pod group’s pods will be rejected as unschedulable.

Interaction with Basic Policy

For pod groups using the Basic policy, the Workload Scheduling Cycle is optional. In the v1.36 timeframe, this cycle will be applied to Basic pod groups to leverage the batching performance benefits, but the “all-or-nothing” (minCount) checks will be skipped; i.e., we will try to schedule as many pods from such PodGroup as possible.

Delayed Preemption

A critical requirement for moving Gang Scheduling to Beta is the integration with Delayed Preemption, which allows the scheduler to avoid unnecessary preemptions. The current model of preemption, in which preemption is triggered immediately after the victims are decided (in PostFilter), doesn’t achieve this goal. The reason is that the proposed placement (nomination) can turn out to be invalid and not proceed. In such cases, we will not even proceed to binding, and the preemption will be a completely unnecessary disruption.

Note that this problem already exists in the current gang scheduling implementation. A given gang may not proceed with binding if the minCount pods from it can’t be scheduled. But, the preemptions are currently triggered immediately after choosing a place for individual pods. So similarly as above, we may end up with completely unnecessary disruptions.

We will address it with what we call the delayed preemption mechanism, as follows:

  1. We will modify the DefaultPreemption plugin to just compute preemptions, without actuating them. We advise maintainers of custom PostFilter implementations to do the same.

  2. We will extend the PostFilterResult to include a set of victims (in addition to the existing NominationInfo). This will allow us to clearly decouple the computation from actuation.

    We believe that while custom plugins may want to provide their custom preemption logic, the actuation logic can actually be standardized and implemented directly as part of the framework. If that proves incorrect, we will introduce a new plugin extension point (tentatively called Preempt) that will be responsible for actuation. However, for now we don’t see evidence for this being needed.

    Relying on the actuation logic is optional for plugins. For example, the DynamicResources plugin can still actuate its decision (claim deallocation) in the PostFilter phase. However, any pod-based removals in other plugins should be delegated to the delayed actuation phase.

  3. For individual pods (not being part of a workload), we will adjust the scheduling framework implementation of schedulingCycle to actuate preemptions of returned victims if calling PostFilter plugins resulted in finding a feasible placement.

  4. For pods being part of a workload, we will rely on the Workload Scheduling Cycle. We still have two subcases here:

    1. In the legacy case (without workload-aware preemption), we call PostFilter individually for every pod from a PodGroup. However, the victims computed for the already processed pods may affect placement decisions for the next pods. To accommodate that, if a set of victims is returned from PostFilter, then in addition to keeping them for later actuation, we will store them in CycleState. More precisely, the CycleState will store a new entry containing a map from a nodeName to a list of victims that were already chosen. With that, the DefaultPreemption plugin will be extended to remove all already chosen victims from a given node before processing that node.

    2. In the target case (with workload-aware preemption), we will no longer be processing pods individually, so the additional mutations of CycleState should not be needed.

  5. In both of the above cases, we will introduce an additional step at the end of the scheduling algorithm. If we managed to find a feasible placement for the PodGroup, we will simply take all the victims and actuate their preemption. If a feasible placement was not found, the victims will be dropped. In both cases, the whole PodGroup (all its pods) will be marked as unschedulable and returned to the scheduling queue.

  6. To reduce the number of unnecessary preemptions, if a preemption has already been triggered and the already nominated placement remains valid, no new preemptions can be triggered. In other words, a different placement can be chosen in subsequent (workload) scheduling cycles only if it doesn’t require additional preemptions or the previously chosen placement is no longer feasible (e.g. because higher priority pods were scheduled in the meantime). This can be done by ignoring the pods with deletionTimestamp set in these preemption attempts (when the previous preemption is ongoing for the preemptor).

The rationale behind the above design is to maintain the current scheduling property where preemption doesn’t result in a commitment for a particular placement. If a different possible placement appears in the meantime (e.g. due to other pods terminating or new nodes appearing), subsequent scheduling attempts may pick it up, improving the end-to-end scheduling latency. Returning pods to scheduling queue if these need to wait for preemption to become schedulable maintains that property.
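The bookkeeping for the legacy case can be sketched as follows. The victimsByNode type and its methods are hypothetical; they model the per-cycle map from node name to already chosen victims and the final "actuate or drop" step, without any real CycleState or plugin API.

```go
package main

import (
	"fmt"
	"sort"
)

// victimsByNode models the per-cycle entry described above: a map from node
// name to the victims already chosen on that node during this cycle.
type victimsByNode map[string][]string

// record stores victims computed (but not yet actuated) for a node.
func (v victimsByNode) record(node string, victims ...string) {
	v[node] = append(v[node], victims...)
}

// remaining returns the pods on a node that later pods of the same group
// should still see, i.e. with the already chosen victims removed.
func (v victimsByNode) remaining(node string, podsOnNode []string) []string {
	chosen := make(map[string]bool, len(v[node]))
	for _, name := range v[node] {
		chosen[name] = true
	}
	var out []string
	for _, name := range podsOnNode {
		if !chosen[name] {
			out = append(out, name)
		}
	}
	return out
}

// toActuate implements the final step: if a feasible placement was found for
// the whole group, all victims are returned for preemption; otherwise they
// are dropped. The result is sorted only to keep the output deterministic.
func (v victimsByNode) toActuate(placementFeasible bool) []string {
	if !placementFeasible {
		return nil
	}
	var all []string
	for _, victims := range v {
		all = append(all, victims...)
	}
	sort.Strings(all)
	return all
}

func main() {
	v := victimsByNode{}
	v.record("node-1", "low-a")
	v.record("node-2", "low-c")

	fmt.Println(v.remaining("node-1", []string{"low-a", "low-b"})) // prints "[low-b]"
	fmt.Println(v.toActuate(true))                                 // prints "[low-a low-c]"
	fmt.Println(len(v.toActuate(false)))                           // prints "0"
}
```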

Workload-aware Preemption

Workload-aware preemption (KEP-5710) aims to enable preemption for a whole pod group at once. In the context of this cycle, it means that if the cycle determines that preemption for a single pod is necessary, it won’t run the PostFilter phase, but will defer that to the end of the workload scheduling phase, running a new, single workload-aware preemption step.

Read more about the proposal in KEP-5710: Workload Aware Preemption PR.

Failure Handling

If a Workload Scheduling Cycle fails (e.g., minCount is not met, preemption fails, or a timeout occurs), the scheduler must handle the failure efficiently.

  1. Rejection

When the cycle fails, the scheduler rejects the entire group.

  • All Pods in the group are moved back to the scheduling queue (stored in the unschedulablePodGroups data structure). Their status is updated and an event with the failure reason is sent.
  • Crucially, any .status.nominatedNodeName entries set during the failed attempt (or from previous cycles) must be cleared. This ensures that the resources tentatively reserved for this gang are immediately released for other workloads.
  2. Backoff strategy

A backoff mechanism has to be applied to a pod group, similarly to how it is applied to individual pods. Initially, we will apply the standard Pod backoff logic to the group.

At the same time, we should consider increasing the maximum backoff duration for pod groups or potentially scaling it based on the number of pods within the group. The current default of 10 seconds has proven insufficient in large clusters, so the same may hold for workloads. Crucially, because the Workload Scheduling Cycle can be computationally expensive, retrying it too frequently risks starving individual pods. Moreover, retries triggered by the Delayed Preemption feature may further exacerbate the problem.

  3. Retries

We rely on the existing Queueing Hints mechanism to determine when to retry the gang. It is considered for a retry when at least one member Pod receives a Queue hint (indicating a relevant cluster event, such as a Node addition or Pod deletion, has made that specific Pod potentially schedulable).

While checking a single Pod does not guarantee the whole gang can fit, calculating gang-level schedulability inside the event handler can be difficult at the moment. Therefore, we optimistically retry the Workload Scheduling Cycle if any member’s condition improves.
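The optimistic retry rule can be sketched as below. The `QueueingHint` type and its two values are local stand-ins modeled on the scheduler framework's hint outcomes, and `shouldRetryGroup` is a hypothetical helper, not an existing scheduler function.

```go
package main

import "fmt"

// QueueingHint is a local stand-in for the outcome of a queueing hint
// function evaluated for one pod and one cluster event.
type QueueingHint int

const (
	// QueueSkip means the event does not make the pod schedulable.
	QueueSkip QueueingHint = iota
	// Queue means the event may make the pod schedulable.
	Queue
)

// shouldRetryGroup reports whether the Workload Scheduling Cycle should be
// retried: a single member receiving Queue is enough, even though that does
// not guarantee that the whole gang fits.
func shouldRetryGroup(memberHints []QueueingHint) bool {
	for _, h := range memberHints {
		if h == Queue {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRetryGroup([]QueueingHint{QueueSkip, QueueSkip}))
	fmt.Println(shouldRetryGroup([]QueueingHint{QueueSkip, Queue, QueueSkip}))
}
```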

It might be beneficial to retry the pod group without being triggered by any cluster event, because a single Workload Scheduling Cycle cannot determine that a placement doesn’t really exist, especially for heterogeneous workloads or inter-pod dependencies. To avoid introducing subtle errors in the initial implementation, we can start by skipping the Queueing Hints mechanism and relying solely on the backoff time.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary

Prerequisite testing updates

N/A

Unit tests
  • k8s.io/kubernetes/pkg/apis/scheduling/v1alpha1: 2025-10-02 - 62.7%
  • k8s.io/kubernetes/pkg/apis/scheduling/validation: 2025-10-02 - 97.8%
  • k8s.io/kubernetes/pkg/scheduler: 2025-10-02 - 81.7%
  • k8s.io/kubernetes/pkg/scheduler/backend/queue: 2025-10-02 - 91.4%
  • k8s.io/kubernetes/pkg/scheduler/framework: 2025-10-02 - 81.7%
  • k8s.io/kubernetes/pkg/scheduler/framework/preemption: 2025-10-02 - 64.2%
  • k8s.io/kubernetes/pkg/scheduler/framework/util/assumecache: 2025-10-02 - 86.2%
Integration tests

Initially, we created integration tests to ensure the basic functionalities of gang scheduling including:

  • Pods linked to a non-existent PodGroup are not scheduled
  • Pods get unblocked when the PodGroup is created and observed by the scheduler
  • Pods are not scheduled if there is no space for the whole gang
  • PodGroup status is updated correctly
  • PodGroup is garbage collected when the replica is deleted

With Workload Scheduling Cycle and Delayed Preemption features, we will significantly expand test coverage to verify:

  • Pods referencing a PodGroup (both gang and basic policies) are correctly processed via the Workload Scheduling Cycle.
  • PodGroup queuing ensures that all available members are retrieved and processed correctly.
  • Deadlocks and livelocks do not occur when multiple gangs compete for resources or interleave with standard pods.
  • Delayed Preemption feature doesn’t break pod-by-pod (non-workload) scheduling.
  • Delayed Preemption ensures atomicity, i.e., victims are deleted only if the scheduler determines the entire gang can fit, otherwise, the cycle aborts with zero disruption.
  • Failed pod groups are requeued correctly and retry successfully when resources become available.

We will also benchmark the performance impact of these changes to measure:

  • The scheduling throughput of the workload scheduling, including gang and basic policies, and preemptions.
e2e tests

We will add basic API tests for the new Workload and PodGroup APIs that will later be promoted to conformance. These tests will cover PodGroup creation, validation, status updates, and lifecycle management. More tests will be added for the beta release.

Graduation Criteria

Alpha

  • Workload API is introduced behind GenericWorkload feature flag
  • API tests for Workload API (that will be promoted to conformance in GA release)
  • kube-scheduler implements first version of gang-scheduling based on groups defined in the Workload object

In 1.36:

  • Introduction of the decoupled Workload API (Templates) and PodGroup API (Instances) in v1alpha2
  • PodGroup API added with validation
  • kube-scheduler implementation switched to be based on PodGroup API
  • e2e tests for PodGroup are added and passing

Beta

  • Providing “optimal enough” placement by considering all pods from a gang together
  • Avoiding livelock scenario when multiple workloads are being scheduled at the same time by kube-scheduler
  • Implementing delayed preemption to avoid premature preemptions
  • Workload-aware preemption design to ensure we won’t break backward compatibility with it.
  • Both Workload and PodGroup APIs are integrated (alpha) with at least one true workload4 controller.
  • A deletion protection mechanism is implemented for PodGroup objects and finalizer is added to the API.
  • All e2e tests for PodGroup are added and graduate to conformance tests.
  • Performance tests are created and are being run in CI to protect against regressions.

GA

  • TBD for the Beta release

Upgrade / Downgrade Strategy

This KEP is completely additive and can safely fall back to the original behavior on downgrade.

This KEP effectively boils down to two separate functionalities:

  • the Workload API and new field in Pod API that allows linking Pods to Workloads
  • scheduler changes implementing the gang scheduling functionality

When a user upgrades the cluster to the version that supports these two features:

  • they can start using the new API by creating Workload objects and linking pods to them by explicitly setting the new spec.schedulingGroup field
  • scheduler automatically uses the new extensions and tries to schedule all pods from a given gang in a scheduling group based on the defined PodGroup objects

When a user downgrades the cluster to the version that no longer supports these two features:

  • the PodGroup objects can no longer be created (the existing ones are not removed though)
  • the spec.schedulingGroup field can no longer be set on the Pods (the already set fields continue to be set though)
  • scheduler reverts to the original behavior of scheduling one pod at a time ignoring existence of PodGroup objects and pods being linked to them
  • On downgrade, kube-scheduler should be downgraded first (to stop processing the new fields) before kube-apiserver is downgraded. Existing PodGroup objects remain in etcd but are ignored.

Version Skew Strategy

The feature is limited to the control plane, so the version skew with nodes (kubelets) doesn’t matter.

For the API changes (introduction of Workload API and the new field in Pod API), the old version of components (in particular kube-apiserver) may not handle those. Thus, users should not set those fields before confirming all control-plane instances were upgraded to the version supporting those.

For gang scheduling itself, this is purely a kube-scheduler in-memory feature, so the skew doesn’t really matter (as there is always only a single kube-scheduler instance acting as the leader).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: GenericWorkload (alternatives: NativeWorkload/Workload)
    • Components depending on the feature gate:
      • kube-apiserver
      • kube-scheduler
    • Feature gate name: GangScheduling
    • Components depending on the feature gate:
      • kube-scheduler
    • Feature gate name: DelayedPreemption
    • Components depending on the feature gate:
      • kube-scheduler
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

No. Gang scheduling is triggered purely via existence of Workload and PodGroup objects and those are not yet created automatically behind the scenes.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. The GangScheduling feature gate needs to be switched off to disable the gang scheduling functionality. If, additionally, the API changes and admission need to be disabled, the GenericWorkload feature gate also needs to be disabled. However, the content of spec.schedulingGroup fields in Pod objects will not be cleared, and the existing Workload objects will not be deleted.

What happens if we reenable the feature if it was previously rolled back?

The feature should start working again. However, users need to remember that some Workload objects could already be stored in etcd and may affect the behavior of some of the existing workloads.

Are there any tests for feature enablement/disablement?

No. The enablement/disablement for the new field in Pod API will be added similarly to this PR: https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282

Note that gang scheduling itself is a purely in-memory feature, so the feature gates themselves are enough.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name: PodGroupScheduled
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No dependencies other than the components where the feature is implemented (kube-apiserver and kube-scheduler).

Scalability

Will enabling / using this feature result in any new API calls?

Yes:

Watching for workloads:

  • API call type: LIST+WATCH Workloads
  • estimated throughput: < XX/s
  • originating component: kube-controller-manager (GC)

Watching for PodGroups:

  • API call type: LIST+WATCH PodGroups
  • estimated throughput: < XX/s
  • originating component: kube-scheduler

PodGroup status updates:

  • API call type: PUT/PATCH PodGroups
  • estimated throughput: < XX/s
  • originating component: kube-scheduler
Will enabling / using this feature result in introducing new API types?

Yes:

  • API type: Workload

    • Supported number of objects per cluster: XX,000
    • Supported number of objects per namespace: XX,000
  • API type: PodGroup

    • Supported number of objects per cluster: XX,000
    • Supported number of objects per namespace: XX,000

The above numbers should eventually match the numbers for built-in workload APIs (e.g. Deployments, Jobs, StatefulSets, …).

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes. New field (spec.schedulingGroup) is added to the Pod API:

  • API type: Pod
  • Estimated increase in size: XX-XXX bytes per object (depending on the final choice described in the Associating Pod into PodGroups section above).
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Pod startup SLI/SLO may be affected and should be adjusted appropriately. The reason is that scheduling a pod that is part of a gang will now be blocked until all pods from the gang are created and observed by the scheduler (which for large gangs can take a non-negligible amount of time).

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Since the scheduler adds a new informer for PodGroup objects, kube-scheduler and kube-apiserver load may grow with PodGroup cardinality. The increase is expected to remain reasonable under typical use but could be non-negligible on clusters with very large numbers of concurrent PodGroups.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2025-09: Initial KEP-4671 proposal.
  • 2026-01: KEP-5832 created for PodGroup API alpha release.
  • 2026-02: Structural revision for 1.36 to decouple Policy (Workload) and State (PodGroup). The API remains in Alpha to finalize the architecture.
  • 2026-02: KEP-5832 updated to sync with API decision of keeping Workload API in alpha release.
  • 2026-03: KEP-5832 merged into KEP-4671 as a single consolidated KEP.

Drawbacks

There are already multiple implementations of gang scheduling in the ecosystem. However:

  • the other implementations don’t address all the issues (e.g. different kinds of races/deadlocks) that this proposal paves the way for addressing
  • the introduced concepts are fundamental enough in the AI era that we believe our users shouldn’t need to install any extensions to have them addressed

Alternatives

API

The longer version of this design describing the whole thought process of choosing the above described approach can be found in the extended proposal document.

It is worth noting that we started the KEP with a different API definition of PodGroup, but based on the community discussions and feedback decided to change it. The original API definition for PodGroup was as follows:

type GangMode string
const (
	// GangModeOff means that all pods in this PodGroup do not need to be scheduled as a gang.
	GangModeOff GangMode = "Off"

	// GangModeSingle means that all pods in this PodGroup need to be scheduled as one gang.
	GangModeSingle GangMode = "Single"

	// GangModeReplicated means that there is a variable number of identical copies of this PodGroup,
	// as specified in Replicas, and each copy needs to be independently gang scheduled.
	GangModeReplicated GangMode = "Replicated"
)

// GangSchedulingPolicy holds options that affect how gang scheduling of one PodGroup is handled by the scheduler.
type GangSchedulingPolicy struct {
    // SchedulingTimeoutSeconds defines the timeout for the scheduling logic.
    // Namely, it is the timeout from the moment when the first pod shows up in
    // PreEnqueue, until those pods are observed in WaitOnPermit - for context
    // see https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces
    // If the timeout is hit, we reject all the waiting pods, free the resources
    // they were reserving and put all of them back to scheduling queue.
    //
    // We decided to drop the field for Alpha because:
    // 1) it won't be obvious for majority of users how to set it
    // 2) its usefulness after Beta is unclear - see:
    //   https://github.com/kubernetes/enhancements/pull/5558#discussion_r2400876903
    SchedulingTimeoutSeconds *int
    MinCount *int
}

// PodGroup is a group of pods that may contain multiple shapes (EqGroups) and may contain
// multiple dense indexes (RankedGroups) and which can optionally be replicated in a variable
// number of identical copies.
//
// TODO: Decide on the naming: PodGroup vs GangGroup.
type PodGroup struct {
    Name *string
    GangMode *GangMode // default is "Off"

    // Optional when GangMode = "Replicated".
    // Forbidden otherwise.
    Replicas int

    // GangSchedulingPolicy defines the options applying to all pods in this gang.
    // Forbidden if GangMode is set to "Off".
    GangSchedulingPolicy GangSchedulingPolicy
}
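The field constraints stated in the comments of this original definition can be illustrated with a small validation sketch. The `validatePodGroup` function is hypothetical and was never part of the proposal. The types are reduced copies of the ones above, and "forbidden" is interpreted here as "must be left at its zero value", which is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

type GangMode string

const (
	GangModeOff        GangMode = "Off"
	GangModeSingle     GangMode = "Single"
	GangModeReplicated GangMode = "Replicated"
)

// GangSchedulingPolicy is reduced to the single field kept for Alpha.
type GangSchedulingPolicy struct {
	MinCount *int
}

type PodGroup struct {
	Name                 *string
	GangMode             *GangMode // default is "Off"
	Replicas             int
	GangSchedulingPolicy GangSchedulingPolicy
}

// validatePodGroup checks the constraints documented in the field comments.
func validatePodGroup(pg PodGroup) error {
	mode := GangModeOff // documented default when GangMode is unset
	if pg.GangMode != nil {
		mode = *pg.GangMode
	}
	switch mode {
	case GangModeOff, GangModeSingle, GangModeReplicated:
	default:
		return fmt.Errorf("unknown gang mode %q", mode)
	}
	// Replicas is optional in Replicated mode and forbidden otherwise.
	if pg.Replicas != 0 && mode != GangModeReplicated {
		return errors.New("replicas is only allowed when gang mode is Replicated")
	}
	// The gang scheduling policy is forbidden when gang scheduling is off.
	if mode == GangModeOff && pg.GangSchedulingPolicy.MinCount != nil {
		return errors.New("gang scheduling policy is forbidden when gang mode is Off")
	}
	return nil
}

func main() {
	single := GangModeSingle
	fmt.Println(validatePodGroup(PodGroup{GangMode: &single, Replicas: 3}))

	replicated := GangModeReplicated
	fmt.Println(validatePodGroup(PodGroup{GangMode: &replicated, Replicas: 3}))
}
```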

Pod group queueing in scheduler

In selecting the optimal pod group queuing mechanism, we evaluated several alternatives:

Alternative 0 (Keep current queueing and ordering):

We can minimize changes by retaining the current queueing and ordering logic. When a Pod is popped, the scheduler can check if it belongs to a PodGroup requiring a Workload Scheduling Cycle. As we add scheduling priorities for pod groups later, this alternative naturally evolves into Alternative 1.

  • Pros: Fits the current architecture. Retains current reasoning about the scheduling queue. Minimizes implementation effort.
  • Cons: Might be problematic when some of the pod group’s pods are in the backoffQ or unschedulablePods and need to be retrieved efficiently. Makes it hard to further evolve the Workload Scheduling Cycle. Observability, currently suited for pod-by-pod scheduling, may not accurately reflect the state of the queue (e.g., pending gangs). Likely harder to support future extensions and won’t work well if PodGroup becomes a separate top-level resource. The pod group will likely be scheduled based on the highest priority member, meaning the latter pod-by-pod cycles might be visibly delayed for lower priority Pods.

Alternative 1 (Modify sorting logic):

Modify the sorting logic within the existing PriorityQueue to put all pods from a pod group one after another.

  • Pros: Fits the current architecture.
  • Cons: Might be problematic when some of the pod group’s pods are in the backoffQ or unschedulablePods and need to be retrieved efficiently. Makes it hard to further evolve the Workload Scheduling Cycle. Would need to inject the workload priority into each of the Pods or somehow apply the lowest pod’s priority to the rest of the group.

Alternative 2 (Store a PodGroup instance):

Modify the scheduling queue’s data structures to accept QueuedPodGroupInfo alongside QueuedPodInfo. This allows reusing existing queue logic while extending it to PodGroups. All queued members would be stored in a new data structure and retrieved for the Workload Cycle when the PodGroup is popped.

  • Pros: Makes it easier to obtain all pods in a group and reduces queue size. Reuses current logic for popping, enforcing backoff, and processing unschedulable entities.
  • Cons: Requires adapting the scheduling queue to handle PodGroups as queueable entities, which is non-trivial and might clutter the code.

Alternative 3 (Dedicated PodGroup queue):

Introduce a completely separate queue for PodGroups alongside the activeQ for Pods. The scheduler would pop the item (Pod or PodGroup) with the highest priority/earliest timestamp. Pods belonging to an enqueued PodGroup won’t be allowed in the activeQ.

  • Pros: Clean separation of concerns. Can easily use the Workload scheduling priority. Can report dedicated logs and metrics with less confusion to the user.
  • Cons: Significant and non-trivial architectural change to the scheduling queue and scheduleOne loop.

Ultimately, Alternative 3 (Dedicated PodGroup queue) was chosen as the best long-term solution.

Embedded PodGroups (Status Quo)

PodGroups remain embedded within the Workload object, with no standalone PodGroup API.

Pros:

  • Single object to learn and look up, synchronize, and manage mutations
  • No coordination required across API objects
  • Fastest time to market (graduate to beta)

Cons:

  • Lifecycle management is getting complex
  • DRA integration is difficult
  • Scalability is limited by Workload object size (1.5MB etcd limit)
  • Per-PodGroup status within a large Workload may be misleading to users and hit scalability limits

Support both embedded and standalone PodGroup

Support both embedded PodGroups inside Workload and external standalone PodGroups.

Pros:

  • Allows sharding when using external PodGroups
  • Decoupled lifecycle supported for external PodGroups

Cons:

  • Two top-level object types without clear responsibility split
  • Workload is an aggregating object but can also contain PodGroups
  • Users who created internal/embedded PodGroups are stuck if they need to change (requires workload recreation)
  • Exposed to all limitations of embedded option, combined with unintuitive additional external PodGroups
  • Most complex to reason about and maintain

For more details about the alternatives, please refer to the PodGroup as top-level object document.

Infrastructure Needed (Optional)


  1. The Kubernetes community uses the term “gang scheduling” to mean “all-or-nothing scheduling of a set of pods” [1,2,3,4,5,6,7,8,9,10,11,12,13]. In the Kubernetes context, it does not imply time-multiplexing (in contrast to prior academic work such as Feitelson and Rudolph, and in contrast to Slurm Gang Scheduling). ↩︎

  2. API Design for Gang and Workload-Aware Scheduling  ↩︎

  3. API Proposal: Decoupled PodGroup and Workload API  ↩︎ ↩︎ ↩︎

  4. The true workload controller refers to either in-tree or out-of-tree object controllers like Job, JobSet, LeaderWorkerSet, etc. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  5. Volcano.sh, Co-scheduling plugin, Preferred Networks Plugin, and Kueue all implement gang scheduling outside of kube-scheduler. Additionally, two previous proposals have been made on this KEP’s issue. These alternatives are compared in detail in the Background tab of the API Design for Gang Scheduling. ↩︎

  6. Evolution of the Runtime Object  ↩︎

  7. DNS subdomain is a naming convention defined in RFC 1123 that Kubernetes uses for most resource names. ↩︎