KEP-5547: Integrate Workload APIs with Job Controller
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP introduces native integration between the Job controller and the gang scheduling¹ APIs (Workload and PodGroup).
The Job controller will automatically create Workload and PodGroup objects before creating pods for parallel Jobs, enabling native gang scheduling support in Kubernetes.
Motivation
The Kubernetes Job controller currently creates pods independently, without workload-aware scheduling constraints. This creates challenges for parallel applications (e.g., AI/ML training workloads, MPI jobs) that require either all pods to be scheduled and run together or none at all (gang scheduling¹). Now that the Workload and PodGroup APIs provide a native mechanism to express gang scheduling requirements, this KEP brings gang scheduling to the Job controller by integrating these APIs directly into its lifecycle.
Goals
- Job controller automatically creates Workload and PodGroup objects for Jobs that require gang scheduling
- Support an opt-out mechanism to define when the Job controller skips creating Workload and PodGroup objects
- Use GangSchedulingPolicy with minCount = parallelism for Jobs with parallelism > 1, completionMode: Indexed, and parallelism = completions.
- Jobs that don’t qualify for gang scheduling will not have Workload and PodGroup objects created.
- Ensure proper ordering of Workload and PodGroup creation before pod creation
- Existing Jobs without gang scheduling continue to work normally
Non-Goals
- Supporting dynamic changes to minCount or gang membership at runtime
- Complex workload structures with multiple nested PodGroups (not supported in alpha)
- Scaling gang scheduling Jobs up or down (not supported in alpha)
Proposal
This KEP depends on:
The Job controller will be extended to create Workload and PodGroup objects as part of its pod management lifecycle.
This integration ensures that the scheduling policy (gang or basic) for a Job’s pods is in place before the pods are created. If Job.spec.template.spec.schedulingGroup is set, the Job controller does not create or update Workload/PodGroup; this acts as an opt-out for Jobs whose Workload/PodGroup already exist or are managed by a parent controller.
For the alpha release, this feature is optimized for static batch workloads with a flat API structure where minCount is immutable. The key design principles are:
- One Job creates one Workload with one PodGroup representing a single homogeneous group of pods.
- The automatic policy selection is based on the Job type:
  - Jobs with parallelism > 1, completionMode: Indexed and parallelism = completions use the gang scheduling policy, where minCount equals the Job’s parallelism.
  - All other Jobs will not have Workload and PodGroup objects created and will keep scheduling as is (pod-by-pod scheduling).
- Elastic Jobs (changing parallelism at runtime) are not supported when gang scheduling is active.
- The Job controller does not create Workload/PodGroup when Job.spec.template.spec.schedulingGroup is set. Higher-level controllers that want to own the Workload and PodGroup (e.g., JobSet) set this field when they create the Job.
- Jobs created by CronJob do not have schedulingGroup set, so the Job controller creates one Workload and one PodGroup per Job for them if they match the gang scheduling criteria.
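The selection rule in the principles above can be sketched as a small predicate. This is an illustrative Python sketch, not the controller's Go implementation; `JobSpec` and its field names are hypothetical stand-ins for the Job fields the KEP inspects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobSpec:
    """Hypothetical, simplified view of the Job fields this KEP inspects."""
    parallelism: int
    completions: Optional[int]
    completion_mode: str  # "Indexed" or "NonIndexed"
    # Stands in for Job.spec.template.spec.schedulingGroup (None means unset).
    scheduling_group: Optional[str] = None


def qualifies_for_gang(job: JobSpec) -> bool:
    """Alpha criteria: parallelism > 1, Indexed mode, parallelism == completions."""
    return (
        job.parallelism > 1
        and job.completion_mode == "Indexed"
        and job.completions == job.parallelism
    )


def controller_creates_workload(job: JobSpec) -> bool:
    """The controller skips creation when schedulingGroup is already set (opt-out)."""
    if job.scheduling_group is not None:
        return False
    return qualifies_for_gang(job)


def gang_min_count(job: JobSpec) -> Optional[int]:
    """minCount mirrors parallelism; None means pod-by-pod scheduling is kept."""
    return job.parallelism if controller_creates_workload(job) else None
```

For example, `gang_min_count(JobSpec(8, 8, "Indexed"))` yields 8, while the same Job with `scheduling_group` set yields `None` because a higher-level controller owns the objects.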
An example of the Job and the corresponding Workload/PodGroup creation:
apiVersion: batch/v1
kind: Job
metadata:
  name: <job-name>
  namespace: training
spec:
  parallelism: 8
  completions: 8
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: trainer
        image: training-image:latest
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: <job-name>-<hash>
  namespace: training
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: <job-name>
    uid: <job-uid>
    controller: true
spec:
  controllerRef:
    apiVersion: batch/v1
    kind: Job
    name: <job-name>
  podGroupTemplate:
  - name: pg-w-distributed-training
    schedulingPolicy:
      gang:
        minCount: 8 # Equal to Job.spec.parallelism
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: <workload-name>-<podGroup-template-name>-<hash>
  namespace: training
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: <job-name>
    uid: <job-uid>
    controller: true
  - apiVersion: scheduling.k8s.io/v1alpha1
    kind: Workload
    name: <workload-name>
    uid: <workload-uid>
spec:
  podGroupTemplateRef:
    workload:
      workloadName: <workload-name>
      podGroupTemplateName: <podGroup-template-name>
  schedulingPolicy:
    gang:
      minCount: 8 # Equal to Job.spec.parallelism
Then, the Job Controller will create the corresponding pods and set the schedulingGroup field:
apiVersion: v1
kind: Pod
metadata:
  name: <job-name>-<random-suffix>
  namespace: training
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: <job-name>
    uid: <job-uid>
    controller: true
  - apiVersion: scheduling.k8s.io/v1alpha1
    kind: PodGroup
    name: <podGroup-name>
    uid: <podGroup-uid>
spec:
  schedulingGroup:
    podGroupName: <workload-name>-<podGroup-template-name>-<hash>
  containers:
  - name: ...
User Stories
ML Training Job with Gang Scheduling
As a machine learning engineer, I want to run a distributed training job with 8 workers that must all be scheduled together. If only 7 workers can be scheduled, I don’t want any pods to start because partial training would waste resources.
Standard Batch Job with Workload Tracking
As a data engineer, I want to run a batch processing job that processes files sequentially without gang scheduling requirements.
Notes/Constraints/Caveats
Alpha Constraints
- The alpha release targets simple, static batch workloads where the workload requirements are known at creation time.
- Each Job maps to one PodGroup. All pods in the Job are identical from a scheduling policy perspective.
- The minCount field in the Workload’s GangSchedulingPolicy mirrors the Job’s parallelism.
- The opt-out mechanism is supported by setting Job.spec.template.spec.schedulingGroup to reference an existing Workload. In this case, the Job controller will not create Workload/PodGroup objects.
- When gang scheduling is active (GangSchedulingPolicy), changes to spec.parallelism (scaling up/down) are rejected via conditional validation that depends on feature enablement. Such a change would require changing minCount in the Workload object, which is immutable. This disables Elastic Indexed Jobs for those Jobs.
Risks and Mitigations
- If JobSet or other higher-level controllers create Workload/PodGroup objects for their workloads, the Job controller could duplicate or update those objects and create a conflict. We mitigate this by ensuring the Job controller does not create Workload/PodGroup objects if the Job has spec.template.spec.schedulingGroup set, which indicates it is managed by a higher-level controller.
- When the feature is enabled, Jobs identified as gang scheduling Jobs cannot have spec.parallelism changed. That effectively disables Elastic Indexed Jobs for those Jobs. This is accepted for alpha with clear documentation and a committed path to beta that does not break Elastic Indexed Jobs.
- Suspended Jobs and resource release rely only on GC, which does not address releasing resources (DRA) while a Job is suspended. This behavior is acceptable for alpha. Future work may require the controller to delete PodGroup/Workload on suspend and recreate them on resume.
Design Details
Job Controller Changes
The Job controller reconciliation loop that processes each Job will be extended to ensure Workload and PodGroup objects exist before creating pods.
Workload and PodGroup Discovery
Discovery of those objects is based on references (workload.spec.controllerRef and podGroup.spec.podGroupTemplateRef), not on ownership.
ownerReference is used only for controller-created objects so that they are garbage-collected when the Job is deleted. Workloads created by a user or a higher-level controller may not carry ownerReferences to the Job, so they are not deleted when the Job is deleted.
A Workload is considered the Workload for this Job object if:
- The Workload is in the Job’s namespace
- It has a workload.spec.controllerRef field that is associated with this Job
Similarly, a PodGroup is considered the PodGroup for this Job if:
- The PodGroup is in the Job’s namespace
- Its spec.podGroupTemplateRef.workload.workloadName equals the name of the Workload for this Job.
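The two matching rules above can be sketched as list filters. This is a hedged Python illustration over hypothetical in-memory records; the real controller would use informers and listers. It assumes a controllerRef is "associated with this Job" when its kind is Job and its name matches, since the KEP does not spell out the comparison.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    namespace: str
    name: str
    controller_ref_kind: str  # stands in for spec.controllerRef.kind
    controller_ref_name: str  # stands in for spec.controllerRef.name


@dataclass
class PodGroup:
    namespace: str
    name: str
    # Stands in for spec.podGroupTemplateRef.workload.workloadName.
    workload_name: str


def workloads_for_job(job_ns: str, job_name: str, workloads: List[Workload]) -> List[Workload]:
    """Reference-based discovery: same namespace and controllerRef pointing at the Job."""
    return [
        w for w in workloads
        if w.namespace == job_ns
        and w.controller_ref_kind == "Job"  # assumed matching rule
        and w.controller_ref_name == job_name
    ]


def pod_groups_for_workload(workload: Workload, pod_groups: List[PodGroup]) -> List[PodGroup]:
    """PodGroups in the same namespace whose template reference names the Workload."""
    return [
        pg for pg in pod_groups
        if pg.namespace == workload.namespace and pg.workload_name == workload.name
    ]
```

Note that neither filter looks at ownerReferences, which is what lets user-created or JobSet-created objects be discovered as well.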
Controller Workflow
The Job controller attempts to create Workload and PodGroup only when the Job has no pods associated with it
(no active or terminal pods owned by the Job). If the Job already has one or more pods, the controller only
discovers and uses an existing Workload/PodGroup, if any, and does not create new ones. This rule is important for
correctness when the controller restarts or is upgraded in the middle of the workflow (e.g., after creating
the Workload but before creating the PodGroup or pods). On the next sync, the controller will find the existing objects
via informers/listers and continue.
The controller discovers or creates Workload and PodGroup as follows:
- If the Job already has pods (pods owned by this Job), skip creation.
- Look up existing Workload(s) in the Job’s namespace whose spec.controllerRef points to this Job. If the Workload was created by the Job controller, it also has a controller ownerReference pointing to this Job (controller: true).
  - If none is found, create a Workload with ownerReference and spec.controllerRef pointing to this Job.
  - If more than one is found, treat this as ambiguous and fall back (update a condition or trigger an event).
  - If exactly one is found, that is the Workload for this Job. No changes are made to its ownerReference.
- When creating a new Workload, determine the appropriate scheduling policy from the Job configuration and create the Workload object with that policy:
  - GangSchedulingPolicy with minCount = parallelism: when parallelism > 1 and completionMode: Indexed and parallelism = completions
  - For alpha, all other cases will not have Workload and PodGroup objects created. This includes:
    - parallelism = 1
    - completions not equal to parallelism
    - completionMode is not Indexed (non-indexed Jobs)
- Look up PodGroup(s) in the Job’s namespace whose podGroup.spec.podGroupTemplateRef is associated with the Workload for this Job. If the PodGroup was created by the Job controller, it has two ownerReferences: one to the Job and one to the Workload object.
  - If none is found, create a PodGroup with an ownerReference to the Job with controller: true and another ownerReference to the Workload.
  - If exactly one is found, that is the PodGroup for this Job. No changes are made to its ownerReference.
  - If multiple PodGroups are found, fall back, as this is not supported in alpha.
- Execute the existing pod management logic to create pods, and include schedulingGroup.podGroupName in the pod spec to associate pods with the PodGroup.
Note that the controller will not update the Workload or PodGroup objects if they already exist.
The controller will require additional informers and listers for Workload and PodGroup objects. Both Workload and PodGroup are automatically garbage collected if they were created by the job-controller and the corresponding Job is deleted.
If the Workload was created by another actor (e.g., JobSet, or a user who creates a Workload with BasicSchedulingPolicy
to opt out of gang scheduling), the Job controller respects and uses it and its associated PodGroups, if any. It does
not treat that as an opt-out. The Job controller uses the discovered Workload and PodGroup (if any) when creating pods.
The Job controller falls back (ignores the discovered Workload and PodGroup) when the discovered Workload
has an unsupported structure (for alpha, when the number of PodGroupTemplates != 1). In that case, a condition
or event should be triggered to inform the user.
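The Workload part of this workflow reduces to a decision over how many Workloads were discovered. The Python sketch below is illustrative only; the `Action` names and the `resolve_workload` signature are hypothetical, and the discovered Workloads are represented just by their PodGroupTemplate counts.

```python
from enum import Enum
from typing import List


class Action(Enum):
    SKIP = "skip"                  # create nothing; pods are scheduled pod-by-pod
    CREATE = "create"              # create a Workload (and then a PodGroup)
    USE_EXISTING = "use_existing"  # reuse the single discovered Workload
    FALL_BACK = "fall_back"        # ambiguous or unsupported; report via condition/event


def resolve_workload(job_has_pods: bool, qualifies_for_gang: bool,
                     template_counts: List[int]) -> Action:
    """template_counts holds the PodGroupTemplate count of each discovered Workload."""
    if len(template_counts) > 1:
        return Action.FALL_BACK  # more than one Workload references the Job
    if len(template_counts) == 1:
        # Alpha supports exactly one PodGroupTemplate per Workload.
        return Action.USE_EXISTING if template_counts[0] == 1 else Action.FALL_BACK
    # Nothing discovered: create only for gang Jobs that have no pods yet.
    if job_has_pods or not qualifies_for_gang:
        return Action.SKIP
    return Action.CREATE
```

Because the function never returns `CREATE` once pods exist, a controller restart between Workload creation and pod creation resolves to `USE_EXISTING` on the next sync rather than producing a duplicate.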
OwnerReferences Relationship
The ownerReferences relationship between Job, Workload, PodGroup, and Pod is as follows:
flowchart BT
    Pod[Pod]
    PodGroup[PodGroup]
    Workload[Workload]
    Job[Job]
    Workload -->|ownerRef| Job
    PodGroup -->|ownerRef| Job
    Pod -->|ownerRef| Job
    PodGroup -->|ownerRef| Workload
    Pod -->|ownerRef| PodGroup

- The Workload object has an ownerReference to the Job object with controller: true in case it was created by the Job controller
- The PodGroup object has an ownerReference to the Job object with controller: true in case it was created by the Job controller and another ownerReference to the Workload object
- The Pod object has an ownerReference to the Job object with controller: true and another ownerReference to the PodGroup object
With this ownerReferences relationship, garbage collection removes dependent objects together with their owners, which avoids orphaned Pods with a stale PodGroup reference.
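The ownerReferences described above can be sketched as plain dictionaries. This Python sketch is illustrative; the helper names are hypothetical. It relies on the general Kubernetes rule that at most one ownerReference per object may have controller: true, which is why only the Job reference is marked as controller.

```python
from typing import Dict, List


def owner_ref(api_version: str, kind: str, name: str, uid: str,
              controller: bool = False) -> Dict:
    """Build one ownerReference entry; controller is only emitted when true."""
    ref = {"apiVersion": api_version, "kind": kind, "name": name, "uid": uid}
    if controller:
        ref["controller"] = True
    return ref


def workload_owner_refs(job_name: str, job_uid: str) -> List[Dict]:
    return [owner_ref("batch/v1", "Job", job_name, job_uid, controller=True)]


def pod_group_owner_refs(job_name: str, job_uid: str,
                         workload_name: str, workload_uid: str) -> List[Dict]:
    return [
        owner_ref("batch/v1", "Job", job_name, job_uid, controller=True),
        owner_ref("scheduling.k8s.io/v1alpha1", "Workload", workload_name, workload_uid),
    ]


def pod_owner_refs(job_name: str, job_uid: str,
                   pod_group_name: str, pod_group_uid: str) -> List[Dict]:
    return [
        owner_ref("batch/v1", "Job", job_name, job_uid, controller=True),
        owner_ref("scheduling.k8s.io/v1alpha1", "PodGroup", pod_group_name, pod_group_uid),
    ]


def controller_owner_count(refs: List[Dict]) -> int:
    """Number of references marked controller: true (should never exceed one)."""
    return sum(1 for r in refs if r.get("controller"))
```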
Object Creation Order
The Job controller creates objects in the following order so that references point to existing objects and any API validation requiring the Workload to exist before the PodGroup is created is satisfied:
1. Workload object, which references the Job
2. PodGroup object, which references the Workload and the Job
3. Pod objects, which reference the PodGroup
The kube-scheduler waits for the PodGroup when Pods have schedulingGroup set, so scheduling does not depend on this order. The order exists for consistency and API validity.
Validation for Parallelism Changes
The Job API validation rejects updates that change spec.parallelism when the feature gate is enabled and the Job uses gang scheduling, because changing this field would require changing minCount in the Workload object, which is immutable.
Job API validation uses the same conditions the Job controller uses to create a Workload with gang scheduling.
When the feature gate is enabled, validation rejects updates that change spec.parallelism if the Job, after the update,
satisfies spec.parallelism > 1, spec.completionMode == Indexed, and spec.parallelism == spec.completions. If the controller’s
criteria for applying gang scheduling change in the future, this validation logic must be updated to match.
This additional validation will be removed in beta since Elastic Indexed Jobs must be supported.
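The validation rule described above can be sketched as follows. This is a Python illustration of the stated rule (the gang criteria are evaluated against the Job after the update), not the actual apiserver validation code; the function name and the error message are hypothetical.

```python
from typing import List, Optional


def parallelism_update_errors(old_parallelism: int, new_parallelism: int,
                              new_completions: Optional[int], completion_mode: str,
                              gate_enabled: bool) -> List[str]:
    """Return validation errors for a Job update under the alpha rule."""
    if not gate_enabled or old_parallelism == new_parallelism:
        return []  # gate off, or spec.parallelism unchanged: nothing to reject
    gang_after_update = (
        new_parallelism > 1
        and completion_mode == "Indexed"
        and new_completions == new_parallelism
    )
    if gang_after_update:
        # Hypothetical message; the real wording is up to the implementation.
        return ["spec.parallelism: may not be changed for gang-scheduled Jobs"]
    return []
```

An elastic resize of an Indexed Job from 8/8 to 4/4 is rejected with the gate enabled and accepted with it disabled, which is the Elastic Indexed Jobs regression the KEP accepts for alpha.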
Naming Conventions
We will not use naming for discovery due to limitations related to naming. Naming is for human readability
and logical linking between Job, Workload, and PodGroup. Because discovery does not depend on it, the
naming pattern can be changed in later releases if needed.
Following prior art in Deployment, the naming convention can be as follows:
1. Workload
- Pattern: <(truncated-if-needed)job-name>-<hash>
- Truncation of the Job name is applied when necessary to respect object name length limits.
- The hash is used for collision avoidance (implementation may use a generated suffix or a hash of relevant identity).
- Object type (Workload vs PodGroup) is identified by other metadata (ownerReferences[].kind), not by the name pattern.
2. PodGroup
- Pattern: <(truncated-if-needed)workload-name>-<(truncated-if-needed)podGroup-template-name>-<hash>
- Truncation of the workload name and podGroup template name is applied when necessary to respect name length limits.
- The hash allows multiple PodGroups within a Workload and PodGroupTemplate to have distinct names. For alpha, the controller creates a single PodGroup per Job; however, the pattern still supports future multi-PodGroup cases.
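One way the two patterns could be realized is sketched below in Python. Everything specific here is an assumption made for illustration: the 63-character limit, the 5-character suffix, the use of SHA-256 over an identity string, and the way the truncation budget is split between the workload name and the template name. The KEP leaves all of these to the implementation.

```python
import hashlib

MAX_NAME_LEN = 63  # assumed limit; the actual limit depends on the API type
HASH_LEN = 5       # assumed suffix length


def short_hash(identity: str, length: int = HASH_LEN) -> str:
    """Deterministic suffix derived from some identity (for example a UID)."""
    return hashlib.sha256(identity.encode()).hexdigest()[:length]


def workload_name(job_name: str, job_uid: str, max_len: int = MAX_NAME_LEN) -> str:
    """<(truncated-if-needed)job-name>-<hash>"""
    suffix = short_hash(job_uid)
    prefix = job_name[: max_len - len(suffix) - 1]  # leave room for "-<hash>"
    return f"{prefix}-{suffix}"


def pod_group_name(wl_name: str, template_name: str, identity: str,
                   max_len: int = MAX_NAME_LEN) -> str:
    """<(truncated-if-needed)workload-name>-<(truncated-if-needed)template-name>-<hash>"""
    suffix = short_hash(identity)
    budget = max_len - len(suffix) - 2  # two "-" separators
    # Reserve up to half of the budget for the template name, the rest for the workload.
    wl_part = wl_name[: budget - min(len(template_name), budget // 2)]
    tpl_part = template_name[: budget - len(wl_part)]
    return f"{wl_part}-{tpl_part}-{suffix}"
```

The sketch only enforces the length bound. A real implementation would also need to keep the result a valid object name after truncation, for instance by avoiding a truncated part that ends in a character not allowed at that position.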
Deletion and Garbage Collection
The Job controller does not explicitly delete Workload or PodGroup. However, in the case of the controller
creating them, it sets ownerReferences so that garbage collection removes them when the Job is deleted.
No additional controller logic is required for deletion in the current design.
The Job controller does not add or adopt ownerReferences on objects it did not create (user-created or higher-level controller-created objects). Users or other controllers may create Workloads/PodGroups with the same ownerReferences as the Job controller would use.
To distinguish controller-created objects from user-created ones that may have the same ownerReferences,
the Job controller may set a managed-by annotation or equivalent metadata on the Workload and PodGroup objects it creates.
This allows the controller to know which objects it created and whose lifecycle it is responsible for,
including GC. This is especially important for PodGroups, as they may have multiple ownerReferences (Job and Workload).
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- k8s.io/kubernetes/pkg/controller/job_controller: 2026-01-29 - 89.1%
- k8s.io/kubernetes/pkg/apis/batch/validation:
- k8s.io/kubernetes/pkg/registry/batch/job:
- Add tests that verify:
  - SchedulingPolicy for various Job configurations
  - Workload and PodGroup creation for gang-scheduled Jobs only (for alpha, all other Jobs will not have Workload and PodGroup objects created)
  - Pod creation includes the correct schedulingGroup
  - Parallelism change is blocked for gang-scheduled Jobs and allowed for all other Jobs
  - Job deletion cascades to Workload and PodGroup deletion
  - Feature gate disabled: Jobs work without Workload/PodGroup creation
  - Jobs with ownerReferences (managed by higher-level controllers) do not create Workload/PodGroup
  - ownerReferences on controller-created Workload and PodGroup match the expected structure:
    - Workload has controller ownerRef to Job
    - PodGroup has controller ownerRef to Job and non-controller ownerRef to Workload
  - Test naming abbreviations for Workload and PodGroup
Integration tests
We will add the following integration tests to the Job controller https://github.com/kubernetes/kubernetes/blob/v1.35.0/test/integration/job/job_test.go:
- Gang Scheduling Lifecycle Test (create, update, delete Job; verify Workload and PodGroup creation; verify pods have schedulingGroup; verify Job deletion cascades to Workload and PodGroup deletion)
- Failure Recovery Test (create Job with Workload API unavailable, verify Job controller retries, verify Workload is eventually created)
- Feature gate disable/enable (Jobs work without Workload/PodGroup creation; Jobs with ownerReferences managed by higher-level controllers do not create Workload/PodGroup)
- Jobs created by CronJob get one Workload and one PodGroup per Job:
  - CronJob does not set schedulingGroup on the pod template
  - Verify Workload/PodGroup creation and GC when the CronJob-created Job completes or is deleted
- When a Job is suspended, pods are deleted but Workload and PodGroup remain. For alpha, check that on resume, the same Workload/PodGroup are used and pods are recreated with the correct schedulingGroup
- For a Job using gang scheduling, verify that a parallelism change is rejected and that it is still allowed for all other Jobs
- Verify controller-created Workload/PodGroup has the correct owner references
e2e tests
- End-to-end gang scheduling, all pods scheduled together or none
- Mixed workloads, gang and basic Jobs coexist
- Failure scenarios, e.g., insufficient resources for gang, partial failures
Graduation Criteria
Alpha (v1.36)
- Feature is implemented behind feature gate WorkloadWithJob (default: disabled)
- Job controller creates Workload and PodGroup objects for Jobs when feature gate is enabled
- Gang scheduling policy applied to indexed parallel Jobs (parallelism > 1, completions = parallelism, completionMode: Indexed)
- Non-gang scheduling Jobs will not have Workload and PodGroup objects created
- Jobs managed by higher-level controllers skip Workload/PodGroup creation
- API validation rejects updates that change spec.parallelism for gang scheduling Jobs
- Unit tests for all new Job controller logic
- Integration tests for Workload/PodGroup creation flow
- Documentation for enabling and using the feature
Beta
- Before beta, the Job API must clearly define:
  - How users opt in or opt out of gang scheduling. Disabling gang scheduling must not rely on turning off the feature gate
  - When and how Workload and PodGroup are created/updated (e.g., only at creation vs. also when scaling)
  - The scaling up/down mechanism for gang scheduling Jobs (elastic indexed jobs are not supported)
- Evaluate whether the Job controller’s current batched pod creation should be changed when gang scheduling is active (it slows down pod creation), and document the decision.
- Elastic Indexed Jobs must be supported (beta blocker)
- The controller needs a mechanism to delete PodGroup/Workload in the case of suspended Jobs and recreate them on resume.
- Create Workload and PodGroup objects for all non-gang scheduling Jobs with BasicSchedulingPolicy
- Feature gate WorkloadWithJob is enabled by default
- Address feedback from alpha
- E2e tests covering gang scheduling scenarios
- Metrics for monitoring Workload/PodGroup creation and scheduling outcomes
- Performance testing to validate no significant impact on Job creation latency
GA
TBD after beta release
Deprecation
N/A for alpha release
Upgrade / Downgrade Strategy
Upgrade:
- Upgrade kube-apiserver
- Enable feature gate and upgrade kube-controller-manager
- New Jobs automatically get Workload/PodGroup objects
- Existing Jobs continue to work (no Workload created for them)
Downgrade:
- Disable feature gate and downgrade kube-controller-manager
- New Jobs no longer get Workload/PodGroup objects
- Existing Workload and PodGroup objects remain
- Jobs with schedulingGroup.podGroupName on pods continue to run (field ignored)
- New Pods for Jobs will not have schedulingGroup.podGroupName set
Migration for Existing Jobs:
- Jobs that existed before the upgrade do not automatically get Workload objects
- To add gang scheduling to existing Jobs, delete and recreate them
- Jobs that do not have pods running yet get Workload/PodGroup objects created for them.
Controller restarts and upgrades:
- The Job controller only creates Workload/PodGroup when the Job has no pods
- If the controller restarts or is upgraded after creating the Workload but before creating the PodGroup or pods, on the next sync it will discover the existing Workload (and PodGroup if present) via informers/listers and continue without creating duplicates.
- No special handling is required for in-flight Jobs during controller upgrade or restart.
Version Skew Strategy
Gang scheduling only occurs when:
- The API server can serve the Workload and PodGroup APIs
- The Job controller is able to create Workload/PodGroup and sets schedulingGroup on pods
- The scheduler supports Workload/PodGroup
Therefore, for a safe rollout:
- kube-apiserver must be upgraded first so it can serve the Workload and PodGroup APIs.
- kube-controller-manager can be upgraded before or after the scheduler
- kube-scheduler should be upgraded before or together with kube-controller-manager. Gang semantics will not apply until the scheduler is upgraded.
There are different skew scenarios involving the kube-scheduler:
- kube-controller-manager new, kube-scheduler old: The controller creates Workload and PodGroup and sets schedulingGroup on pods, but the scheduler does not understand these objects and ignores schedulingGroup. In this case, pods are scheduled normally (pod-by-pod) with no gang scheduling benefit until the scheduler is upgraded.
- kube-controller-manager old, kube-scheduler new: The controller does not create Workload/PodGroup and does not set schedulingGroup on pods. The scheduler has no workload information for those Jobs. Pods are scheduled normally with no gang scheduling benefit.
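The skew scenarios above can be summarized as a small lookup over three booleans. This Python sketch is illustrative; the function name and the outcome labels are hypothetical. The "requeued" case reflects the rollout behavior this KEP describes for an API server that does not yet serve the Workload and PodGroup APIs.

```python
def rollout_outcome(apiserver_serves_apis: bool, controller_new: bool,
                    scheduler_new: bool) -> str:
    """Classify how a gang-eligible Job behaves for a given component mix."""
    if controller_new and not apiserver_serves_apis:
        # Workload creation fails, so the Job is retried until the APIs are served.
        return "requeued"
    if controller_new and scheduler_new:
        return "gang"  # all three components support the feature
    # Either the controller sets nothing, or the scheduler ignores schedulingGroup.
    return "pod-by-pod"
```

Only the combination in which all three components support the feature yields gang semantics; every other supported mix degrades to ordinary pod-by-pod scheduling rather than failing.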
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: WorkloadWithJob
  - Components depending on the feature gate:
    - kube-controller-manager
    - kube-apiserver
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
Yes. When the feature gate is enabled:
- The Job controller creates Workload and PodGroup objects for all qualified gang scheduling Jobs (parallelism > 1, completions = parallelism, completionMode: Indexed) before creating pods.
- Jobs that match the gang scheduling criteria (parallelism > 1, completions = parallelism, completionMode: Indexed) use gang scheduling, meaning all pods must be scheduled together or none are scheduled.
- Updates to Job.spec.parallelism are rejected for Jobs using gang scheduling.
- The binding of pods referencing a PodGroup is delayed until the PodGroup object exists.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. The feature can be disabled (WorkloadWithJob: false).
What happens if we reenable the feature if it was previously rolled back?
When the feature is re-enabled:
- New Jobs will have Workload and PodGroup objects created
- Existing Jobs that don’t have running pods will have Workload and PodGroup objects created on their next reconciliation cycle
- Jobs that were running without gang scheduling will be evaluated again. If they match the gang scheduling criteria and have no running pods, a Workload with gang policy will be created.
- Jobs that have running pods are not affected, since gang scheduling only applies to newly created Jobs/Pods
Are there any tests for feature enablement/disablement?
Yes. We will add unit tests and integration tests for feature enablement/disablement.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
- If the API server doesn’t support the Workload and PodGroup APIs, the Job controller will fail to create the Workload for a Job (error creating Workload). Affected Jobs will be requeued until the API server is upgraded.
- If the scheduler doesn’t have the gang scheduling feature enabled, pods are scheduled normally in a pod-by-pod manner.
- Already running Jobs are not affected by enabling the feature. Pods that are already scheduled and running continue to run. New Jobs or Jobs being reconciled will be affected.
- For the rollback, disabling the feature gate allows Jobs to work without Workload and PodGroup creation. Existing Workload and PodGroup objects don’t cause issues; they’re ignored when the feature is disabled
What specific metrics should inform a rollback?
The following metrics should be monitored:
- job_sync_duration_seconds: If Job sync duration increases significantly, it may indicate issues with Workload/PodGroup creation
- job_pods_creation_total: A drop in pod creation rate may indicate a problem in the Job controller’s Workload/PodGroup flow
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
This will be tested manually as part of alpha release.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
- kubectl get workloads -A will show Workload objects created by the Job controller
- kubectl get podgroups -A will show PodGroup objects created by the Job controller
How can someone using this feature know that it is working for their instance?
- Events
  - Event Reason: WorkloadCreated - Emitted when a Workload object is created for a Job
  - Event Reason: PodGroupCreated - Emitted when a PodGroup object is created for a Job
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
To be discussed after alpha release.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
To be discussed after alpha release.
- [] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
- scheduling.k8s.io/v1alpha1 for Workload and PodGroup APIs
- kube-scheduler with gang scheduling feature enabled
Scalability
Will enabling / using this feature result in any new API calls?
Yes. The Job controller uses informers and listers for Workload and PodGroup for lookups and watches. The following additional API calls are made when this feature is enabled:
- CREATE Workload - 1 per Job creation
- CREATE PodGroup - 1 per Job creation
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes. Jobs that match the gang scheduling criteria create 1 Workload (~500 bytes) and 1 PodGroup (~500 bytes) object. In addition, each Pod gains a schedulingGroup field (~100 bytes).
For a cluster with 10,000 gang scheduling Jobs, this adds approximately:
- 10,000 Workload objects
- 10,000 PodGroup objects
- ~10MB additional etcd storage
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
There is an expected increase in job sync duration due to creating Workload and PodGroup objects for each Job and for scheduler waiting time. We will measure the impact once we have an implementation.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Yes.
- kube-controller-manager: Additional memory for Workload and PodGroup informers. Estimated ~50MB for 10,000 objects.
- kube-scheduler: Additional memory for Workload and PodGroup caches. Estimated ~50MB for 10,000 objects.
- etcd: Additional storage for Workload and PodGroup objects. Estimated ~10MB for 10,000 Jobs.
- kube-apiserver: Additional watches for Workload and PodGroup resources. Minimal CPU impact.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. This feature is purely control-plane and does not affect node resources.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
- Job Controller cannot create Workloads/PodGroups
- Retry with exponential backoff when kube-apiserver recovers
- Existing Jobs with Workloads continue to run
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
- Verify WorkloadWithJob is enabled on all control plane components
- Check controller-manager logs for errors related to Workload/PodGroup creation
- Review existing metrics job_sync_duration_seconds, workload_creation_duration_seconds
- Check resource constraints, since gang scheduling may fail if the cluster doesn’t have sufficient resources
Implementation History
- 2026-01-29: KEP created
- 2026-02-10: KEP updated according to final API design for Workload and PodGroup
Drawbacks
Alternatives
Infrastructure Needed (Optional)
¹ The Kubernetes community uses the term “gang scheduling” to mean “all-or-nothing scheduling of a set of pods” [1,2,3,4,5,6,7,8,9,10,11,12,13]. In the Kubernetes context, it does not imply time-multiplexing (in contrast to prior academic work such as Feitelson and Rudolph, and in contrast to Slurm Gang Scheduling).