KEP-6089: Workload Aware Scheduling Controller APIs
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Core Principles & Assumptions
- Standardized Building Blocks Definitions (scheduling.k8s.io)
- Job Integration (batch/v1)
- Shared workloadbuilder Go Translation Library
- Reference Integration Examples: JobSet (Multi-Level)
- Recommendations for Multi-Level Composite Controllers
- Go Package Placement & Graduation Strategy
- Test Plan
- Graduation Criteria
- Upgrade / Downgrade Strategy
- Version Skew Strategy
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes a standardized set of reusable API building blocks (scheduling.k8s.io),
integration guidelines, and shared libraries to simplify how workload controllers (e.g., JobSet,
TrainJob, RayJob, LWS, as well as core workloads like Job) integrate with Workload-aware
Scheduling (WAS).
By providing common API primitives (such as topology constraints and disruption policies) and a shared library to handle boilerplate resource generation, we enable controller developers to easily expose WAS features natively within their APIs without reinventing the wheel, while ensuring a consistent user experience across the Kubernetes ecosystem.
Motivation
The Kubernetes ecosystem has steadily evolved its scheduling capabilities from a strictly pod-centric model towards a more robust, workload-centric approach. This transition successfully established foundational features in the recent v1.36 release, such as Gang Scheduling, Topology-aware Scheduling (TAS), and Workload-aware Preemption (WAP).
However, the Workload and PodGroup resources backing these features were designed primarily as
intermediate, scheduler-facing APIs. We have not yet addressed how end-users of higher-level
workload controllers (such as Job, LWS, JobSet, or RayJob) should express their scheduling
requirements to utilize these features.
For example, in the first alpha release of KEP-5547 (Job Integration), we intentionally bypassed the user-facing
API design challenge. Instead, the integration automatically creates a PodGroup with a hardcoded
Gang policy under specific conditions (e.g., for fully parallel static indexed Jobs). While this
unblocked initial adoption, it is fundamentally insufficient. Users have diverse use cases and
require the ability to express explicit intent—such as opting in or out of gang scheduling,
requesting specific topologies, or configuring disruption policies for their workloads.
Currently, there is no standardized way for workload controllers to expose these user intents, nor is there a standard mechanism for controllers to translate user intent into underlying scheduling objects. If every controller authors its own user-facing API structs and custom logic to manage scheduling objects, the ecosystem will suffer from inconsistent UX, duplicate effort, and varied levels of WAS support.
We need a standardized toolkit that provides common scheduling API structures, handles the boilerplate compilation, and establishes architectural guidelines to solve common integration challenges across the ecosystem. This proposal aims to fill these gaps, providing shared tooling and best practices while still allowing controller owners the flexibility to design their root APIs natively.
Goals
- Define reusable API primitives (e.g., Scheduling Policies, Topology Constraints, Disruption Modes) under scheduling.k8s.io to be consumed by real-workload controllers.
- Provide a shared library (workloadbuilder) to handle the boilerplate of constructing underlying scheduling objects (Workload, PodGroup, or CompositePodGroup) from controller-specific intents.
- Establish architectural guidelines for workload controllers to expose WAS features consistently.
- Integrate these building blocks and the translation library with the core Job API (batch/v1) to ensure we are not designing in a vacuum. Standard Job is the natural candidate to “blaze the path” for other workload controllers; it initially integrated with WAS in v1.36 in alpha, but intentionally bypassed the user-facing scheduling API aspect. Under this KEP, the core Job integration remains in Alpha in v1.37, but is enriched to give users the ability to express explicit scheduling intent, resolving usability gaps from the initial v1.36 alpha.
- Provide reference integration examples demonstrating how complex, multi-level composite controllers (such as JobSet) can adopt WAS Controller APIs. Since standard Job serves as the production single-level implementation, we focus our reference designs purely on demonstrating multi-level hierarchical patterns.
Non-Goals
- Define a single, mandatory and rigid scheduling API struct for all Kubernetes workload controllers.
- Implement the actual integration of these new API blocks into other complex composite controllers (such as JobSet, LeaderWorkerSet, or Kubeflow TrainJob) as part of this KEP. While this KEP establishes the design guidelines and shared library for their integration, the implementation PRs for these out-of-tree controllers will be pursued independently in their respective repositories.
- Create or manage the lifecycle of Workload, PodGroup, or CompositePodGroup.
Proposal
This proposal builds on the enhancements that have been recently introduced in the workload-aware scheduling space. We assume that the reader is already acquainted with the following KEPs:
- KEP-4671: Gang Scheduling using Workload Object
- KEP-5710: Workload-aware preemption
- KEP-5732: Topology-aware workload scheduling
- KEP-6012: CompositePodGroup API
- KEP-5547: Integrate Workload APIs with Job Controller
Reusable API Building Blocks
We propose introducing a set of standard, reusable structs in the scheduling.k8s.io API group.
Controller developers can embed these structs directly into their native APIs. This ensures that
when a user configures a TopologyConstraint on a RayJob, it uses the exact same schema and
semantics as a TopologyConstraint on a TrainJob.
Shared workloadbuilder Library
To prevent every controller from writing custom logic to translate these API blocks into underlying scheduling resources, we will provide a shared Go library. Controller developers will map their custom API surface to an intermediate representation, and the library will handle:
- Generating the correct Workload, PodGroup, or CompositePodGroup hierarchies.
- Applying sane scheduling defaults based on the controller’s semantic purpose (e.g., defaulting to standard pod-by-pod scheduling for a core Job to explicitly prevent breaking existing CI/CD pipelines).
- Handling standard validation logic.
Integration Recommendations & Controller Autonomy
Instead of forcing a one-size-fits-all API shape, we provide recommendations on how these building blocks can be exposed, leaving the final design decisions to the controller owners. This approach prioritizes local consistency over global uniformity. While this may introduce a degree of API fragmentation across the ecosystem, it is a necessary and acceptable trade-off to ensure each controller’s API remains idiomatic and intuitive for its specific users.
This autonomy is particularly crucial for complex, multi-level controllers that rely on resource
composition. If we mandated a strict, unified API shape that relied on downward API propagation,
we would introduce severe upstream dependency bottlenecks. For example, TrainJob relies on JobSet,
which in turn relies on the core Job API. Requiring bottom-up integration would block TrainJob
users for months while waiting for the underlying components to adopt the standard. By granting
controllers autonomy, they can implement workarounds native to their architecture—such as JobSet
using its established targetReplicatedJobs pattern to apply scheduling constraints to underlying
Jobs—delivering value to users immediately without waiting for the entire dependency chain to
resolve.
Job Integration - API Usage Examples
This KEP proposes enriching the core Job API to allow users to express their scheduling intents
through a composed scheduling configuration. The following examples show how this API represents
different Workload-aware Scheduling intents:
Example 1: Job with Gang Scheduling, Zone Topology, and Atomic Disruption
A batch ML training Job where all 4 pods must schedule together atomically (All-or-Nothing),
must co-locate within the same availability zone, and must be treated as a single unit for
disruptions (meaning if one pod is preempted, the entire group is disrupted together):
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 4
  completions: 4
  scheduling: # New API field - scheduling intent
    policy:
      gang: {} # MinCount is omitted: Job defaults MinCount = parallelism (4)
    constraints:
      topology:
      - level: "topology.kubernetes.io/zone"
    disruptionMode: # Key matches the disruptionMode JSON tag of JobSchedulingConfiguration
      all: {} # DisruptionMode resolves to All (entire group must be disrupted together)
  template:
    spec:
      containers:
      - name: train-node
        image: training-image:v1
Example 2: Backward Compatibility and Sane Defaulting (Implicit Opt-Out)
A standard Job manifest where the scheduling block is omitted entirely. This natively defaults
to standard Kubernetes pod-by-pod scheduling (Basic mode), ensuring 100% backward compatibility
and eliminating the need for an explicit opt-out mechanism:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  # The scheduling block is completely omitted (which defaults to Basic scheduling
  # and single disruption).
  # This effectively acts as an implicit opt-out from gang scheduling in the Job integration.
  template:
    spec:
      containers:
      - name: processor
        image: processor-image:v1
User Stories
Story 1: The End-User
As an ML engineer submitting distributed training workloads to a cluster, I want to explicitly
define my scheduling requirements — such as requesting that all worker Pods are scheduled together
(gang scheduling) and placed within the same network rack (topology constraint) — directly within
my workload’s YAML manifest. I expect these scheduling configurations to be intuitive,
well-documented, and to use a similar structure and vocabulary whether I am submitting a JobSet,
a LWS resource, or a company-internal batch job.
Story 2: The Controller Maintainer
As a maintainer of a single-level workload controller, such as the core Job API, I want to add
Workload-aware Scheduling capabilities to my API without having to design custom struct fields
from scratch or write reconciliation logic to manage scheduler-specific objects like PodGroup. By
importing standard API primitives from scheduling.k8s.io into my API schema and using a shared
builder library in my controller’s reconcile loop, I can easily expose features like gang
scheduling to my users while ensuring consistency with the rest of the ecosystem.
Story 3: The Multi-Level Controller Maintainer
As a maintainer of a multi-level composite controller (e.g., JobSet which creates Jobs, or a
custom training operator composing LWS), I want to integrate WAS features using the same standard
API primitives. Furthermore, because my controller relies on composing other Kubernetes resources,
I expect this KEP to provide clear architectural guidelines on how to handle nested scheduling
intent. For example, I need recommendations on whether my parent controller should generate the
PodGroup directly, or if it should delegate that creation to the underlying child controllers.
Risks and Mitigations
- API Fragmentation and Inconsistent UX: Because this proposal grants controller owners the autonomy to design and integrate their own API schemas to avoid upstream dependency bottlenecks, there is a risk that different controllers expose Workload-aware Scheduling (WAS) features differently, leading to a fragmented user experience across the ecosystem.
  - Mitigation: This is a conscious and deliberate trade-off: we prioritize rapid out-of-tree ecosystem adoption and native local consistency over delayed global uniformity (local consistency > global uniformity/fragmentation). To minimize fragmentation, we provide strongly-typed, reusable building blocks (like SchedulingConstraints, DisruptionMode, SchedulingMode) in the scheduling.k8s.io API group. By following our design recommendations and using these building blocks, controller owners ensure that the JSON/YAML schema shapes remain highly consistent and intuitive for users.
- Split-Brain Configurations: Because we preserve controller autonomy, a situation can arise where a composite wrapper controller (such as JobSet or TrainJob) implements its own custom wrapper-level fields or conventions to expose WAS features. In the meantime, the underlying child resource (such as the core Job API) officially integrates with WAS and introduces its own scheduling fields. This creates a “split-brain” configuration problem where a user of JobSet can configure scheduling directives in two parallel, potentially conflicting ways: at the wrapper level, or directly inside the child’s nested template (e.g., spec.replicatedJobs[*].template.spec.scheduling).
  - Mitigation: The composite controller remains in full control of its API and the translation/propagation of its templates. Since the parent controller is the sole “compiler” of the workload tree, it has several flexible options to resolve this duplication without breaking backward compatibility:
    - API Translation and Mapping: The parent controller can map its existing wrapper-level fields to the compiled Workload resource, while explicitly stripping or ignoring the child’s nested scheduling fields in the generated templates before applying them to prevent conflicts.
    - Gradual Deprecation: The parent controller can choose to gradually deprecate its custom duplicate wrapper-level fields over several minor releases in favor of the child’s native embedded fields, guiding users to a unified configuration path.
    - Conflict Validation: The parent controller’s validating webhooks can reject requests where a user attempts to populate both wrapper-level and child-template-level scheduling fields for the same workload, preventing ambiguous configurations.
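The Conflict Validation option can be sketched as follows. This is a minimal illustration, not part of the proposed library: WrapperSpec, ChildTemplate, SchedulingConfig, and ValidateNoSplitBrain are hypothetical names standing in for a composite controller's own API types and webhook logic.

```go
package main

import (
	"errors"
	"fmt"
)

// SchedulingConfig is a hypothetical stand-in for a scheduling block.
type SchedulingConfig struct {
	Gang bool
}

// ChildTemplate is a hypothetical nested child template, such as a Job
// template embedded in a wrapper resource.
type ChildTemplate struct {
	Name       string
	Scheduling *SchedulingConfig
}

// WrapperSpec is a hypothetical composite controller spec that carries its
// own wrapper-level scheduling field.
type WrapperSpec struct {
	Scheduling *SchedulingConfig
	Children   []ChildTemplate
}

// ErrSplitBrain marks a spec that configures scheduling at both levels.
var ErrSplitBrain = errors.New("scheduling is set at both the wrapper and child-template level")

// ValidateNoSplitBrain returns an error when the wrapper-level field and
// any child-template field are both populated.
func ValidateNoSplitBrain(spec WrapperSpec) error {
	if spec.Scheduling == nil {
		return nil // Only child-level config (or none): nothing to conflict with.
	}
	for _, child := range spec.Children {
		if child.Scheduling != nil {
			return fmt.Errorf("child %q: %w", child.Name, ErrSplitBrain)
		}
	}
	return nil
}

func main() {
	spec := WrapperSpec{
		Scheduling: &SchedulingConfig{Gang: true},
		Children:   []ChildTemplate{{Name: "workers", Scheduling: &SchedulingConfig{}}},
	}
	fmt.Println(ValidateNoSplitBrain(spec))
}
```

A controller that prefers the API Translation and Mapping option would strip the child-level field at the same point instead of returning an error.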
Design Details
Core Principles & Assumptions
Integration of Workload-aware Scheduling (WAS) into workload controllers is guided by the following design principles:
- The Root Controller as the Compiler: Regardless of whether a workload is a simple, single-level resource (like a core Job) or a complex, multi-level composite resource (like JobSet or TrainJob), the low-level scheduler-facing Workload resource is always compiled, created, and managed strictly by the root-most controller (the Root Controller):
  - Full Context Visibility: Only the root-most controller has the complete, high-level view of the entire workload structure and its logical orchestration (e.g., JobSet knows all its replicatedJobs and their parallelism, whereas a single child Job only knows its own pods).
  - Ownership & Skip Logic: Child controllers (like standard Job) observe their OwnerReference pointing to a registered parent workload and explicitly bypass creating any Workload objects. This prevents duplicate resource creation and guarantees a single source of truth. However, because PodGroup is the runtime representation of the Workload blueprint, child controllers may still be responsible for instantiating the corresponding PodGroup objects themselves (or delegating this to the root controller depending on the integration design).
- Separation of Structure and Policy: The integration strictly separates real-workload structure from scheduling policies:
  - The Controller API owns the Structure: The true workload API definition (e.g., JobSet or LWS schemas) fully defines its own shape, hierarchy, and replication mechanics. The user does not need to manually repeat this structure to the scheduler.
  - The User owns the Policy: The user knows how they want the workload to be scheduled based on their specific environment (e.g., “I want gang scheduling”, “I need these workers colocated on the same network rack”).
  - The Controller acts as a Translator: The real-workload controller consumes the user’s high-level policy intent, combines it with its own structural knowledge, and acts as a compiler to generate the low-level Workload objects for the scheduler.
- Universal Representation: Legacy, standard pod-by-pod scheduling is represented natively as a first-class citizen (Basic mode). Controllers always generate the underlying Workload objects, using basic scheduling as the backward-compatible default for true workloads.
- Sane Defaults and Escape Hatches: Controllers balance their native orchestration purpose with backward compatibility by providing sensible defaults (e.g. standard Job defaulting to Basic, LWS defaulting to a Set of Gangs). Integrated Controllers must provide explicit escape hatches allowing users to override these default templates (e.g., opting out of LWS’s default local gang back to Basic).
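The Ownership & Skip Logic principle can be illustrated with a small sketch. It assumes a simplified OwnerRef type in place of metav1.OwnerReference and a hypothetical registeredParents registry; how parent kinds are actually registered is not specified here.

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for metav1.OwnerReference.
type OwnerRef struct {
	APIVersion string
	Kind       string
	Controller bool
}

// registeredParents is a hypothetical registry of parent workload kinds
// that compile the Workload on behalf of their children.
var registeredParents = map[string]bool{
	"jobset.x-k8s.io/v1alpha2/JobSet": true,
}

// ShouldCreateWorkload reports whether a child controller should compile
// its own Workload. It returns false when a controlling owner is a
// registered parent, because that parent is then the root controller.
func ShouldCreateWorkload(owners []OwnerRef) bool {
	for _, o := range owners {
		if o.Controller && registeredParents[o.APIVersion+"/"+o.Kind] {
			return false
		}
	}
	return true
}

func main() {
	standalone := []OwnerRef{}
	owned := []OwnerRef{{APIVersion: "jobset.x-k8s.io/v1alpha2", Kind: "JobSet", Controller: true}}
	fmt.Println(ShouldCreateWorkload(standalone), ShouldCreateWorkload(owned))
}
```

The check only decides Workload creation; whether the child still instantiates PodGroup objects depends on the integration design, as noted above.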
Standardized Building Blocks Definitions (scheduling.k8s.io)
Following the structure of the PodGroup and CompositePodGroup APIs under development, the shared
building block primitives are categorized into distinct levels representing the layers of the
workload tree:
- Leaf Level (PodGroup): Prefixed with WorkloadPodGroup.... These primitives group pods directly and represent standard execution boundaries.
- Composite Level (CompositePodGroup): Prefixed with WorkloadCompositePodGroup.... These primitives coordinate groups of workloads.
This level-specific categorization allows independent API evolution. As a general design
philosophy, when a structure represents a concrete, physical “real-world” scheduling concept used
verbatim by the scheduling stack (such as TopologyConstraint from KEP-5732), we reuse it
directly across all levels. For higher-level policy abstractions introduced by this WAS layer, we
define distinct level-specific types (such as WorkloadPodGroupSchedulingPolicy) to ensure they
can evolve independently at each hierarchy level.
The WorkloadPodGroup and WorkloadCompositePodGroup prefixes are used to avoid name collisions
with other scheduling field structures defined directly in the scheduling.k8s.io group
(e.g., KEP-5732’s PodGroup structures).
To keep this specification concise and focused, we only define the detailed Go API structs for
the leaf-level PodGroup specific types. An analogous set of types prefixed with
WorkloadCompositePodGroup... is provided under the same API group.
The Go definitions are structured as follows:
// API Group: scheduling.k8s.io/v1alpha3

// WorkloadPodGroupSchedulingConstraints defines leaf-level scheduling constraints, such as topology.
type WorkloadPodGroupSchedulingConstraints struct {
	// Topology specifies desired topological placements for all pods
	// within the scheduling group.
	// +optional
	Topology []TopologyConstraint `json:"topology,omitempty"`
}

// WorkloadPodGroupDisruptionMode defines how individual pods within a group can be disrupted.
// Exactly one mode must be set.
type WorkloadPodGroupDisruptionMode struct {
	// Single specifies that pods can be disrupted independently from each other.
	// +optional
	Single *WorkloadPodGroupSingleDisruptionMode `json:"single,omitempty"`
	// All specifies that all pods in the group must be disrupted together.
	// +optional
	All *WorkloadPodGroupAllDisruptionMode `json:"all,omitempty"`
}

// WorkloadPodGroupSingleDisruptionMode indicates that individual pods can be disrupted independently.
type WorkloadPodGroupSingleDisruptionMode struct {
	// Intentionally empty for now.
}

// WorkloadPodGroupAllDisruptionMode indicates that all pods in the group must be disrupted together.
type WorkloadPodGroupAllDisruptionMode struct {
	// Intentionally empty for now.
}

// WorkloadPodGroupSchedulingPolicy defines the scheduling policy for a group of pods.
// Exactly one policy must be set.
type WorkloadPodGroupSchedulingPolicy struct {
	// Basic specifies that standard, pod-by-pod Kubernetes scheduling behavior should be used.
	// +optional
	Basic *WorkloadPodGroupBasicSchedulingPolicy `json:"basic,omitempty"`
	// Gang specifies all-or-nothing scheduling semantics.
	// +optional
	Gang *WorkloadPodGroupGangSchedulingPolicy `json:"gang,omitempty"`
}

// WorkloadPodGroupBasicSchedulingPolicy indicates standard Kubernetes scheduling behavior.
type WorkloadPodGroupBasicSchedulingPolicy struct {
	// Intentionally empty for now.
}

// WorkloadPodGroupGangSchedulingPolicy defines the parameters for gang (all-or-nothing) scheduling.
type WorkloadPodGroupGangSchedulingPolicy struct {
	// MinCount is the minimum number of pods that must be scheduled
	// at the same time for the scheduler to admit the entire group.
	// Must be >= 1 when set.
	// If omitted, the controller should inject a context-specific sane default.
	// +optional
	MinCount *int32 `json:"minCount,omitempty"`
}

// WorkloadPodGroupResourceClaim references dynamic resource claims for the group.
// Exactly one of ResourceClaimName or ResourceClaimTemplateName must be set.
type WorkloadPodGroupResourceClaim struct {
	// Name uniquely identifies this resource claim inside the group.
	Name string `json:"name"`
	// ResourceClaimName is the name of a ResourceClaim object in the same namespace.
	// +optional
	ResourceClaimName *string `json:"resourceClaimName,omitempty"`
	// ResourceClaimTemplateName is the name of a ResourceClaimTemplate object.
	// +optional
	ResourceClaimTemplateName *string `json:"resourceClaimTemplateName,omitempty"`
}
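Several of the types above are unions with an "exactly one must be set" rule, and the gang policy adds a lower bound on MinCount. A minimal sketch of such a check, using simplified local copies of the policy types and a hypothetical ValidatePolicy helper (not the library's validation API), could look like this:

```go
package main

import "fmt"

// BasicPolicy mirrors the empty basic scheduling policy.
type BasicPolicy struct{}

// GangPolicy mirrors the gang policy with its optional MinCount.
type GangPolicy struct {
	MinCount *int32
}

// SchedulingPolicy is a simplified copy of the union type above.
type SchedulingPolicy struct {
	Basic *BasicPolicy
	Gang  *GangPolicy
}

// ValidatePolicy enforces the two documented rules: exactly one union
// member is set, and MinCount is at least 1 when it is set.
func ValidatePolicy(p SchedulingPolicy) error {
	set := 0
	if p.Basic != nil {
		set++
	}
	if p.Gang != nil {
		set++
	}
	if set != 1 {
		return fmt.Errorf("exactly one of basic or gang must be set, found %d", set)
	}
	if p.Gang != nil && p.Gang.MinCount != nil && *p.Gang.MinCount < 1 {
		return fmt.Errorf("gang.minCount must be >= 1, got %d", *p.Gang.MinCount)
	}
	return nil
}

func main() {
	zero := int32(0)
	fmt.Println(ValidatePolicy(SchedulingPolicy{Gang: &GangPolicy{}}))
	fmt.Println(ValidatePolicy(SchedulingPolicy{}))
	fmt.Println(ValidatePolicy(SchedulingPolicy{Gang: &GangPolicy{MinCount: &zero}}))
}
```

The same counting pattern applies to the disruption mode union and to the ResourceClaimName / ResourceClaimTemplateName pair.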
Job Integration (batch/v1)
To deliver native, typed Workload-aware Scheduling support in core Kubernetes, we propose
integrating the standardized building blocks directly into the core Job API (batch/v1).
The new fields in the Job API follow the standard process to graduate to a stable type:
the new fields are gated behind a feature gate and progress through the usual Alpha → Beta → Stable
maturity levels, with the field cleared on write and ignored on read while the gate is disabled.
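The "cleared on write while the gate is disabled" behavior can be sketched as below. JobSpec and JobScheduling are reduced local stand-ins, and dropDisabledScheduling is a hypothetical helper name. Keeping the field when the stored object already uses it follows the usual Kubernetes convention for gated fields and is an assumption here, not something this KEP states.

```go
package main

import "fmt"

// JobScheduling is a reduced stand-in for the new scheduling block.
type JobScheduling struct {
	Gang bool
}

// JobSpec is a reduced stand-in for batch/v1 JobSpec.
type JobSpec struct {
	Parallelism int32
	Scheduling  *JobScheduling
}

// dropDisabledScheduling clears the gated field from newSpec when the
// feature gate is off. On update, oldSpec is the stored object; if it
// already uses the field, the field is kept (assumed convention) so that
// disabling the gate does not silently alter existing objects.
func dropDisabledScheduling(newSpec, oldSpec *JobSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && oldSpec.Scheduling != nil {
		return
	}
	newSpec.Scheduling = nil
}

func main() {
	created := &JobSpec{Parallelism: 4, Scheduling: &JobScheduling{Gang: true}}
	dropDisabledScheduling(created, nil, false)
	fmt.Println(created.Scheduling == nil)
}
```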
This integration serves as the foundational implementation (“blazing the path”) that demonstrates the viability of these building blocks before out-of-tree controllers adopt them. More design details are covered in KEP-5547.
API Changes
We will introduce a new Scheduling field inside JobSpec. This field embeds a curated, composed
structure consisting of the standardized building blocks:
// API Group: batch/v1

// JobSpec defines the desired state of a Job.
type JobSpec struct {
	// ... existing fields ...

	// Scheduling defines the Workload-aware Scheduling configuration for this Job.
	// This field is alpha-gated by the WorkloadWithJob feature gate.
	// +optional
	Scheduling *JobSchedulingConfiguration `json:"scheduling,omitempty"`
}

// JobSchedulingConfiguration composes the reusable WAS building blocks.
type JobSchedulingConfiguration struct {
	// Policy defines the gang or basic scheduling rules for this Job.
	// +optional
	Policy *schedulingv1alpha3.WorkloadPodGroupSchedulingPolicy `json:"policy,omitempty"`
	// Constraints defines topology co-location constraints for the Job's pods.
	// +optional
	Constraints *schedulingv1alpha3.WorkloadPodGroupSchedulingConstraints `json:"constraints,omitempty"`
	// DisruptionMode specifies how the pods in this Job should be disrupted (Single vs All).
	// +optional
	DisruptionMode *schedulingv1alpha3.WorkloadPodGroupDisruptionMode `json:"disruptionMode,omitempty"`
	// ResourceClaims specifies dynamic resource claims shared across the Job's pods.
	// +optional
	ResourceClaims []schedulingv1alpha3.WorkloadPodGroupResourceClaim `json:"resourceClaims,omitempty"`
}
Shared workloadbuilder Go Translation Library
To prevent every workload controller (both core and out-of-tree) from writing custom translation
and validation logic, we propose providing a shared Go library: workloadbuilder.
Package placement: The library ships from staging under
k8s.io/component-helpers/scheduling/schedulingv1. It is scoped as helpers shared by multiple
core binaries, keeps a minimal dependency surface (no external deps), and is meant for this
kind of scheduling-API translation. k8s.io/kube-scheduler was considered but
carries heavier dependencies and is a less natural import for out-of-tree controllers.
1. Design & Architecture
This library utilizes an Intermediate Representation (IR) tree pattern. The architecture adopts a
Polymorphic Bridge Pattern to reconcile the level-specific K8s API structures (leaf-level
PodGroup vs. composite-level CompositePodGroup) with a single, uniform tree definition inside
the library:
- Hierarchy-Agnostic Library IR: The library defines its own internal, polymorphic structures (workloadbuilder.SchedulingConfig, workloadbuilder.SchedulingPolicy, etc.) that represent scheduling configurations in a hierarchy-agnostic way.
- Standard Mapping Helpers: To prevent controllers from writing custom translation boilerplate to bridge K8s API types to the library IR, the library provides standard, built-in conversion functions (MapPodGroupConfig and MapCompositeGroupConfig). These helper adapters cleanly translate public, level-specific schemas into polymorphic IR models at runtime.
Controller authors construct a logical tree using WorkloadItem representing their workload
structure, populate DefaultConfig and the user’s UserConfig (using the standard mapping
helpers), and invoke the builder.
The library encapsulates the following logic:
- Policy Resolution: Merges default configurations with user-provided overrides (e.g., resolving escape hatches uniformly across the ecosystem) into each node’s ResolvedConfig, then applies that node’s Callbacks so controllers can post-process the resolved configuration (e.g. defaulting gang MinCount).
- Structural Resolution: Maps the logical tree hierarchy to the corresponding technical structures in the low-level scheduler Workload API, abstracting version variations (e.g. flat templates vs. nested sub-group templates).
- Centralized Validation: Rejects invalid configurations early (e.g. ensuring a nested leaf group does not declare a conflicting disruption mode not supported by its parent).
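The Policy Resolution step can be sketched as a merge followed by callbacks. This is an illustration under stated assumptions, not the library's implementation: Config is a flattened, hypothetical stand-in for the IR, and the merge is assumed to be field-by-field with user values winning. The actual merge granularity is not specified here.

```go
package main

import "fmt"

// Config is a flattened, hypothetical stand-in for a scheduling configuration.
type Config struct {
	Policy     string // "Basic" or "Gang"; empty means unset
	MinCount   *int32 // only meaningful for "Gang"; nil means unset
	Disruption string // "Single" or "All"; empty means unset
}

// Item is a reduced stand-in for one node of the workload tree.
type Item struct {
	Default   *Config
	User      *Config
	Callbacks []func(*Config)
}

// Resolve starts from the controller defaults, overlays every field the
// user set (assumed field-by-field), then runs the callbacks in order.
func Resolve(it Item) Config {
	var out Config
	if it.Default != nil {
		out = *it.Default
	}
	if u := it.User; u != nil {
		if u.Policy != "" {
			out.Policy = u.Policy
		}
		if u.MinCount != nil {
			out.MinCount = u.MinCount
		}
		if u.Disruption != "" {
			out.Disruption = u.Disruption
		}
	}
	for _, cb := range it.Callbacks {
		cb(&out)
	}
	return out
}

func main() {
	parallelism := int32(4)
	it := Item{
		Default: &Config{Policy: "Basic", Disruption: "Single"},
		User:    &Config{Policy: "Gang"},
		Callbacks: []func(*Config){func(c *Config) {
			// Job-style defaulting: gang MinCount falls back to parallelism.
			if c.Policy == "Gang" && c.MinCount == nil {
				c.MinCount = &parallelism
			}
		}},
	}
	got := Resolve(it)
	fmt.Println(got.Policy, *got.MinCount, got.Disruption)
}
```

Running the callbacks after the merge lets a controller default MinCount only when the resolved policy is actually gang, regardless of whether that came from the default or from the user.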
2. Controller Opt-In for New Scheduling Capabilities
Because the building-block types under scheduling.k8s.io are shared across all controllers, new
scheduling options may be added in future releases (e.g. a new scheduling policy or disruption mode)
that do not make sense for every controller. For example, a new policy added in v1.3x might
be valid for JobSet but not for Job.
To prevent new options from silently leaking into controllers that have not been updated to support
them, the workloadbuilder library adopts an allow-list (opt-in) validation approach rather
than a deny-list (opt-out). Controllers declare the specific set of policies and modes they support,
and the library’s validation helpers reject anything not explicitly allowed. This means new
additions to the building-block API are denied by default until a controller explicitly updates
its allow-list.
The library provides per-field validation helpers that accept the supported options as arguments:
// In Job's API validation (pkg/apis/batch/validation):
allErrs = append(allErrs,
	workloadbuilder.ValidateSchedulingPolicy(
		spec.Scheduling.Policy, fldPath.Child("policy"),
		workloadbuilder.BasicPolicy, workloadbuilder.GangPolicy))
allErrs = append(allErrs,
	workloadbuilder.ValidateDisruptionMode(
		spec.Scheduling.DisruptionMode, fldPath.Child("disruptionMode"),
		workloadbuilder.SingleMode, workloadbuilder.AllMode))
This gives controllers opt-in semantics: when a new policy is introduced in a future release,
existing controllers (including Job) will reject it until their validation is explicitly updated
to include the new option. Out-of-tree controllers get the same guarantee by updating their vendored
library version and extending their allow-list.
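The deny-by-default effect of an allow-list can be shown with a small standalone sketch. PolicyKind, FuturePolicy, and ValidateAllowed are hypothetical names for illustration; they do not describe the signatures of the library's validation helpers.

```go
package main

import "fmt"

// PolicyKind is a hypothetical enumeration of scheduling policy kinds.
type PolicyKind string

const (
	BasicPolicy  PolicyKind = "Basic"
	GangPolicy   PolicyKind = "Gang"
	FuturePolicy PolicyKind = "Future" // hypothetical policy added in a later release
)

// ValidateAllowed accepts a value only if the controller listed it
// explicitly, so anything new is rejected until the list is extended.
func ValidateAllowed(path string, got PolicyKind, allowed ...PolicyKind) error {
	for _, a := range allowed {
		if got == a {
			return nil
		}
	}
	return fmt.Errorf("%s: unsupported value %q; supported values: %v", path, got, allowed)
}

func main() {
	// A controller that only opted in to Basic and Gang.
	fmt.Println(ValidateAllowed("spec.scheduling.policy", GangPolicy, BasicPolicy, GangPolicy))
	fmt.Println(ValidateAllowed("spec.scheduling.policy", FuturePolicy, BasicPolicy, GangPolicy))
}
```

A deny-list would need every controller to learn about FuturePolicy before it could block it; with the allow-list, the rejection happens without any code change.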
Long-term, this pattern can migrate to Declarative Validation (DV) using +k8s:subfield markers,
eliminating the need for hand-written allow-list calls while preserving the same opt-in semantics:
type JobSpec struct {
	// ...
	// +k8s:subfield(disruptionMode)=+k8s:allowed=single,all
	Scheduling *JobSchedulingConfiguration
}
Until DV support is available, the library-provided validation helpers serve as a lightweight, defensive bridge that keeps the overhead minimal for controller integrators.
3. Library API Definition
package workloadbuilder
import (
"context"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
schedulingv1alpha3 "k8s.io/api/scheduling/v1alpha3"
)
// SchedulingConfig is the polymorphic, hierarchy-agnostic IR model of the PodGroup/CompositePodGroup.
type SchedulingConfig struct {
Constraints *SchedulingConstraints
DisruptionMode *DisruptionMode
Policy *SchedulingPolicy
ResourceClaims []ResourceClaim
}
type SchedulingConstraints struct {
Topology []schedulingv1alpha3.TopologyConstraint
}
type DisruptionMode struct {
Single *SingleDisruptionMode
All *AllDisruptionMode
}
type SingleDisruptionMode struct {
// Intentionally empty for now.
}
type AllDisruptionMode struct {
// Intentionally empty for now.
}
type SchedulingPolicy struct {
Basic *BasicSchedulingPolicy
Gang *GangSchedulingPolicy
}
type BasicSchedulingPolicy struct {
// Intentionally empty for now.
}
type GangSchedulingPolicy struct {
MinCount *int32
}
type ResourceClaim struct {
Name string
ResourceClaimName *string
ResourceClaimTemplateName *string
}
// WorkloadItemFunc mutates a single WorkloadItem during Build that is
// used for controller-specific defaulting.
type WorkloadItemFunc func(*WorkloadItem)
// WorkloadItem represents a logical component of a workload (e.g., the whole JobSet,
// a specific ReplicatedJob role, or a single standalone Job).
type WorkloadItem struct {
// Name is the logical identifier of this component (e.g., "jobset-root", "driver").
Name string
// DefaultConfig defines the complete set of "sane defaults" assigned by the controller
// based on its specific orchestration domain logic.
DefaultConfig *SchedulingConfig
// UserConfig is the exact policy intent configured by the user at this specific level.
// Can be nil if the user left the scheduling block unconfigured.
UserConfig *SchedulingConfig
// Callbacks is a list of controller-supplied mutator functions that the
// controller can attach to this item. Callbacks are primarily intended
// as defaulting functions (e.g. MinCount), but they are general-purpose
// and may perform any controller-specific adjustment.
Callbacks []WorkloadItemFunc
// Children contains the logical sub-components of this workload.
// - If len(Children) > 0, the node is inferred as a structural group
// (i.e., represents a CompositePodGroupTemplate).
// - If len(Children) == 0, the node is inferred as a leaf (i.e. represents a PodGroup)
Children []*WorkloadItem
}
// WorkloadBuilder translates the logical WorkloadItem tree into a scheduler Workload object.
type WorkloadBuilder interface {
// Build translates the tree, merges defaults, validates policies,
// and generates the Workload resource.
Build(
ctx context.Context,
name, namespace string,
owner *metav1.OwnerReference,
) (*schedulingv1alpha3.Workload, error)
}
// NewBuilder initializes a builder with a specific root node.
func NewBuilder(root *WorkloadItem) WorkloadBuilder {
return &builderImpl{root: root}
}
// MapPodGroupConfig translates standard leaf building blocks into the library's polymorphic IR.
func MapPodGroupConfig(
policy *schedulingv1alpha3.WorkloadPodGroupSchedulingPolicy,
constraints *schedulingv1alpha3.WorkloadPodGroupSchedulingConstraints,
disruption *schedulingv1alpha3.WorkloadPodGroupDisruptionMode,
claims []schedulingv1alpha3.WorkloadPodGroupResourceClaim,
) *SchedulingConfig
// MapCompositeGroupConfig translates standard composite building blocks into the library's polymorphic IR.
func MapCompositeGroupConfig(
policy *schedulingv1alpha3.WorkloadCompositePodGroupSchedulingPolicy,
constraints *schedulingv1alpha3.WorkloadCompositePodGroupSchedulingConstraints,
disruption *schedulingv1alpha3.WorkloadCompositePodGroupDisruptionMode,
) *SchedulingConfig
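The definition above does not spell out how Build combines DefaultConfig and UserConfig. The sketch below shows one plausible reading, assumed here for illustration: a field set by the user replaces the controller default wholesale, and unset fields fall back to the default. The function `mergeConfig` is hypothetical and the types are trimmed local stand-ins for the IR structs.

```go
package main

import "fmt"

// Trimmed local stand-ins for the library IR types.
type BasicSchedulingPolicy struct{}
type GangSchedulingPolicy struct{ MinCount *int32 }

type SchedulingPolicy struct {
	Basic *BasicSchedulingPolicy
	Gang  *GangSchedulingPolicy
}

type SchedulingConfig struct {
	Policy *SchedulingPolicy
}

// mergeConfig is a hypothetical field-wise merge: start from a shallow copy
// of the controller defaults, then let every field the user set win.
func mergeConfig(def, user *SchedulingConfig) *SchedulingConfig {
	out := &SchedulingConfig{}
	if def != nil {
		*out = *def
	}
	if user != nil && user.Policy != nil {
		// The whole union is replaced so Basic and Gang never end up set together.
		out.Policy = user.Policy
	}
	return out
}

// policyKind reports which union member is set on the resolved config.
func policyKind(c *SchedulingConfig) string {
	switch {
	case c == nil || c.Policy == nil:
		return "unset"
	case c.Policy.Gang != nil:
		return "gang"
	case c.Policy.Basic != nil:
		return "basic"
	}
	return "unset"
}

func main() {
	def := &SchedulingConfig{Policy: &SchedulingPolicy{Basic: &BasicSchedulingPolicy{}}}
	user := &SchedulingConfig{Policy: &SchedulingPolicy{Gang: &GangSchedulingPolicy{}}}
	fmt.Println(policyKind(mergeConfig(def, nil)))
	fmt.Println(policyKind(mergeConfig(def, user)))
}
```

Replacing the policy union as a unit, rather than merging its members, keeps the result a valid one-of even when the default and the user choose different members.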
4. Library Usage Example (Job)
This example demonstrates how the core Job controller integrates with the workloadbuilder
library to compile its flat Workload structure:
import (
"context"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
batchv1 "k8s.io/api/batch/v1"
schedulingv1alpha3 "k8s.io/api/scheduling/v1alpha3"
"k8s.io/utils/ptr"
)
func (r *JobReconciler) generateWorkload(
job *batchv1.Job,
) (*schedulingv1alpha3.Workload, error) {
// 1. A Job's context-aware sane default is Basic scheduling (standard Kubernetes pod-by-pod).
defaultConfig := &workloadbuilder.SchedulingConfig{
Policy: &workloadbuilder.SchedulingPolicy{
Basic: &workloadbuilder.BasicSchedulingPolicy{},
},
}
// 2. Map the public Job.Spec.Scheduling wrapper directly using the library helper
var userConfig *workloadbuilder.SchedulingConfig
if job.Spec.Scheduling != nil {
userConfig = workloadbuilder.MapPodGroupConfig(
job.Spec.Scheduling.Policy,
job.Spec.Scheduling.Constraints,
job.Spec.Scheduling.DisruptionMode,
job.Spec.Scheduling.ResourceClaims,
)
}
// 3. Create the flat logical tree for Job (root node representing a single PodGroup).
// A callback defaults gang MinCount to the Job's parallelism when the user opts
// into Gang but leaves MinCount unset (see defaultMinCountForJob below).
rootNode := &workloadbuilder.WorkloadItem{
Name: "job-root",
DefaultConfig: defaultConfig,
UserConfig: userConfig,
Callbacks: []workloadbuilder.WorkloadItemFunc{
defaultMinCountForJob(job),
},
}
// 4. Let the workloadbuilder compile and generate the Workload object
builder := workloadbuilder.NewBuilder(rootNode)
workloadObj, err := builder.Build(
context.Background(),
job.Name,
job.Namespace,
metav1.NewControllerRef(job, jobKind),
)
if err != nil {
return nil, err
}
return workloadObj, nil
}
The callbacks attached above are ordinary functions the controller can set on a node. Their most common job is defaulting, but because they receive the whole node they can also apply arbitrary, controller-specific adjustments:
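// defaultMinCountForJob fills in a sane default for gang MinCount (the Job's
// parallelism) when the resolved policy is Gang and MinCount was left unset.
// item.Policy is assumed here to expose the node's resolved (merged) policy.
func defaultMinCountForJob(job *batchv1.Job) workloadbuilder.WorkloadItemFunc {
return func(item *workloadbuilder.WorkloadItem) {
if item.Policy != nil &&
item.Policy.Gang != nil &&
item.Policy.Gang.MinCount == nil &&
job.Spec.Parallelism != nil {
// Spec.Parallelism is already a *int32; copy the value so the
// Workload does not alias the Job's own field.
item.Policy.Gang.MinCount = ptr.To(*job.Spec.Parallelism)
}
}
}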
// multiplyMinCountForAdjustedJob is an example of a non-defaulting adjustment: callbacks
// are free to implement arbitrary, controller-specific logic when needed.
func multiplyMinCountForAdjustedJob(job *batchv1.Job) workloadbuilder.WorkloadItemFunc {
return func(item *workloadbuilder.WorkloadItem) {
if job.Annotations["isAdjustedJob.example.com"] == "true" {
if item.Policy != nil && item.Policy.Gang != nil && item.Policy.Gang.MinCount != nil {
// MinCount is a *int32, so multiply the pointed-to value.
*item.Policy.Gang.MinCount *= 42
}
}
}
}
Reference Integration Examples: JobSet (Multi-Level)
This section provides non-normative reference examples demonstrating how a complex,
multi-level composite controller (such as JobSet) can integrate with the Workload-aware
Scheduling (WAS) building blocks and the workloadbuilder library.
These examples illustrate the viability and flexibility of the library for hierarchical workloads. The
final API design and integration details remain at the sole discretion of the JobSet project
maintainers.
We explore two different API representation options that JobSet could choose to adopt.
1. Option A: Template Delegation Model (Nested Configuration)
In this model, JobSet defines scheduling directives globally at the root
(JobSet.spec.scheduling) for policies that apply to the entire group. For leaf-level scheduling
(individual ReplicatedJobs), it directly leverages the nested scheduling fields already present
inside the embedded JobTemplateSpec (e.g., spec.replicatedJobs[*].template.spec.scheduling).
Example YAML Manifest
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
spec:
  scheduling: # Global policy: applies to the entire JobSet
    policy:
      basic: {} # ESCAPE HATCH: Disable global "gang of gangs" so components start independently
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        # Defaults to Basic (pod-by-pod) scheduling
        containers:
        - name: main
          image: driver-image
  - name: workers
    replicas: 16
    template:
      spec:
        scheduling: # Leaf-level policy declared inside the nested Job template
          constraints:
            topology:
            - level: "topology.kubernetes.io/rack" # Co-locate workers on same rack
        containers:
        - name: worker
          image: worker-image
2. Option B: Centralized ‘Targeted Policies’ Model (Root-only Configuration)
In this model, JobSet does not expose or use the nested child template fields. Instead, all
scheduling configurations (both global and local) are declared centrally inside a single root-level
spec.scheduling block. It uses a “shadow tree” pattern to map scheduling policies to specific
ReplicatedJobs by name (which follows the established targetReplicatedJobs convention
already used in JobSet features like FailurePolicyRule).
Example YAML Manifest
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
spec:
  scheduling: # All scheduling policies are defined here at the root
    policy:
      basic: {} # Global policy: components schedule independently
    replicatedJobPolicies:
    - targetReplicatedJob: "workers" # Policy target
      constraints:
        topology:
        - level: "topology.kubernetes.io/rack" # Co-locate workers on same rack
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        containers:
        - name: main
          image: driver-image
  - name: workers
    replicas: 16
    template:
      spec:
        # Templates remain completely clean of scheduling directives
        containers:
        - name: worker
          image: worker-image
3. Controller Integration and workloadbuilder Mapping Go Code
Regardless of which API model JobSet adopts, the controller can easily map its structural spec
into the workloadbuilder logical tree. For more details, see JobSet integration.
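As a rough sketch of that mapping, under either option the controller ends up with one root node for the JobSet and one child per ReplicatedJob. The example below uses a trimmed local stand-in for WorkloadItem, and the helpers `buildJobSetTree` and `inferredKind` are hypothetical names introduced for illustration. `inferredKind` mirrors the inference rule documented on WorkloadItem.Children.

```go
package main

import "fmt"

// WorkloadItem is a trimmed local stand-in for workloadbuilder.WorkloadItem.
type WorkloadItem struct {
	Name     string
	Children []*WorkloadItem
}

// buildJobSetTree is a hypothetical helper: it creates a root node for the
// JobSet and attaches one child node per ReplicatedJob name.
func buildJobSetTree(rootName string, replicatedJobs []string) *WorkloadItem {
	root := &WorkloadItem{Name: rootName}
	for _, rj := range replicatedJobs {
		root.Children = append(root.Children, &WorkloadItem{Name: rj})
	}
	return root
}

// inferredKind applies the documented rule: a node with children is a
// structural group, a node without children is a leaf.
func inferredKind(item *WorkloadItem) string {
	if len(item.Children) > 0 {
		return "CompositePodGroupTemplate"
	}
	return "PodGroup"
}

func main() {
	root := buildJobSetTree("jobset-root", []string{"driver", "workers"})
	fmt.Println(root.Name, "->", inferredKind(root))
	for _, child := range root.Children {
		fmt.Println(child.Name, "->", inferredKind(child))
	}
}
```

In a real integration each node would also carry DefaultConfig, UserConfig, and Callbacks; where UserConfig comes from (the nested template in Option A, the root-level policy list in Option B) is the only part that differs between the two models.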
Recommendations for Multi-Level Composite Controllers
Integrating Workload-aware Scheduling (WAS) into multi-level composite controllers (where
controllers orchestrate other controllers, such as JobSet creating core Jobs, or a Kubeflow
TrainJob composing a JobSet) introduces unique coordination challenges. Composite controllers
should adhere to the following guidelines:
1. Runtime PodGroup and CompositePodGroup Lifecycle Management
For single-level controllers (e.g., standard Job), the ownership boundaries are straightforward:
the Job controller manages both the static Workload resource and the corresponding runtime
PodGroup objects.
For multi-level composite controllers, two distinct lifecycle management strategies are available:
- Centralized Management: The root controller (e.g., JobSet) compiles the Workload and is also fully responsible for creating and managing all runtime PodGroup or CompositePodGroup objects.
- Delegated Management: The root controller only compiles and creates the n-level Workload resource, and delegates the creation and management of individual runtime PodGroup objects to its child execution controllers (e.g., delegating to standard Job controllers).
Alpha Phase Strategy: For this initial alpha phase, we intentionally do not mandate a single recommended lifecycle management strategy for multi-level controllers. Controller maintainers and ecosystem integrators are encouraged to experiment with both centralized and delegated management patterns. The authors of this KEP will observe these patterns in the wild, gather user and operator feedback, and generalize these best practices into a standardized, unified lifecycle convention in a subsequent phase.
2. Downward Template and Parent Mapping via Well-Known Annotations
If a composite controller delegates runtime PodGroup management to child execution controllers,
we must solve a crucial multi-level coordination problem. The child controller needs two distinct
pieces of information to construct and place its runtime scheduling objects correctly:
- Template Mapping: Which PodGroupTemplate or CompositePodGroupTemplate inside the parent’s compiled Workload corresponds to this child’s pods (enabling correct policy/constraint compilation).
- Parent Instance Linkage: Which specific runtime CompositePodGroup instance name in the namespace this newly created child must attach to (under its “parentRef”). This linkage is especially critical in multi-instantiated environments (such as LeaderWorkerSet / LWS), where a composite controller may instantiate multiple separate CompositePodGroup objects from the exact same template (one per replica).
The Solution: Downward Mapping Annotations
To resolve this template and hierarchy mapping without structural API schema changes, the root and
intermediate orchestrators must propagate these linkages downwards by injecting two well-known
metadata annotations directly into the created child objects (for example, the JobSet
controller sets these annotations on each standard Job resource it creates):
- Template Linkage Annotation:
  - Annotation Key: scheduling.k8s.io/group-template-name
  - Value: The unique name of the target PodGroupTemplate or CompositePodGroupTemplate defined inside the parent Workload resource (ensuring direct mapping, as all template names inside a Workload are guaranteed to be unique).
- Parent Instance Linkage Annotation:
  - Annotation Key: scheduling.k8s.io/parent-composite-podgroup
  - Value: The exact resource name of the parent CompositePodGroup object in the same namespace that the child’s newly created group must connect to.
We strictly use unstructured metadata annotations rather than introducing new structural fields in the child’s API schemas for this coordination. These mappings are transient, internal, and automatically managed by composite operators during compilation, not user-configurable scheduling intents.
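A minimal sketch of the parent side of this handshake is shown below. The two annotation keys are the ones defined above; the helper name `withDownwardMapping` and the example template and parent names are hypothetical. It operates on a plain annotations map so that it stays independent of any particular child type.

```go
package main

import "fmt"

const (
	// Well-known annotation keys defined in this section.
	groupTemplateNameAnnotation       = "scheduling.k8s.io/group-template-name"
	parentCompositePodGroupAnnotation = "scheduling.k8s.io/parent-composite-podgroup"
)

// withDownwardMapping is a hypothetical helper. It returns a copy of the
// child's annotations with both linkage annotations set, leaving the input
// map untouched so a shared template is never mutated.
func withDownwardMapping(annotations map[string]string, templateName, parentName string) map[string]string {
	out := make(map[string]string, len(annotations)+2)
	for k, v := range annotations {
		out[k] = v
	}
	out[groupTemplateNameAnnotation] = templateName
	out[parentCompositePodGroupAnnotation] = parentName
	return out
}

func main() {
	// Example values only; real names come from the compiled Workload and
	// from the runtime CompositePodGroup created for this replica.
	templateAnnotations := map[string]string{"example.com/team": "ml"}
	child := withDownwardMapping(templateAnnotations, "workers", "my-jobset-replica-0")
	fmt.Println(child[groupTemplateNameAnnotation])
	fmt.Println(child[parentCompositePodGroupAnnotation])
}
```

Copying the map matters in the multi-instantiated case: each replica needs its own parent instance name, so the per-replica value must not leak back into the shared template.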
Go Package Placement & Graduation Strategy
Embedding reusable building block Go structures (defined in a pre-stable package like
scheduling.k8s.io/v1alpha3) directly into a stable GA type (like batch/v1.JobSpec) during its
Alpha phase introduces package dependency and graduation challenges.
In the Go language, changing the import path of an embedded field inside a GA struct constitutes a breaking change in client libraries. To solve this graduation compatibility trap without forcing identical structure duplication across different apiGroups, we adopt the following approved transition pattern:
- Alpha Phase: The shared building blocks are defined in the pre-stable scheduling.k8s.io/v1alpha3 package. The standard Kubernetes import rules allow stable GA groups (batch/v1) to import pre-stable packages as long as the field itself remains gated in Alpha.
- Graduation to Beta/GA: When the composed field is promoted to default-enabled (Beta/GA in the v1 type), we bypass the intermediate v1beta1 package version entirely (since wire-format compatibility is already committed at the v1 resource level). We graduate the building block structs straight into the stable scheduling.k8s.io/v1 package and update the field inside batch/v1.JobSpec to reference the v1 type.
- Go Type Aliasing for Compatibility: To prevent breaking third-party Go controllers that still import the older alpha package, we replace the physical structures in v1alpha3 with Go Type Aliases (=) pointing to the new stable v1 types. This is a well-established, approved Kubernetes API pattern (previously used in the admissionregistration API group) that allows external codebases to compile seamlessly while gradually transitioning their imports over multiple releases.
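The aliasing step can be sketched in a single file as follows. The two type names stand in for the same struct as seen from the stable and the pre-stable package, and the `Level` field is assumed from the `level` key used in the YAML examples. In the real layout the alias declaration would live in the v1alpha3 package and point at the v1 package.

```go
package main

import "fmt"

// TopologyConstraintV1 stands in for the graduated type in the stable
// scheduling.k8s.io/v1 package.
type TopologyConstraintV1 struct {
	Level string
}

// TopologyConstraintAlpha stands in for the old v1alpha3 name. The "=" makes
// this an alias, not a new type: both names denote the identical type.
type TopologyConstraintAlpha = TopologyConstraintV1

// describe accepts the stable type only.
func describe(c TopologyConstraintV1) string {
	return "level=" + c.Level
}

func main() {
	// Code written against the alpha name keeps compiling and can be passed
	// to APIs that expect the stable type without any conversion.
	old := TopologyConstraintAlpha{Level: "topology.kubernetes.io/rack"}
	fmt.Println(describe(old))
}
```

Had the alpha name been declared without the "=", it would be a distinct defined type and the call to describe would need an explicit conversion, which is exactly the breakage the alias avoids.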
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Job-specific test plans are tracked in KEP-5547.
Unit tests
- Add tests that verify:
  - workloadbuilder compiles a Basic policy into the expected Workload/PodGroup
  - workloadbuilder compiles a Gang policy into the expected Workload/PodGroup
  - workloadbuilder correctly maps topology constraints, disruption mode, and resourceClaims
  - workloadbuilder merges controller defaults with user overrides (e.g. user Gang overrides controller default Basic)
  - workloadbuilder runs node Callbacks after merging config, and a defaulting callback fills gang.minCount (e.g. from a Job’s parallelism) when omitted
  - workloadbuilder Validate rejects semantically invalid configurations
  - Single-level WorkloadItem (flat, no children) produces a leaf PodGroup only
  - Multi-level WorkloadItem tree (with children) produces a CompositePodGroup with correct parent–child structure
  - MapPodGroupConfig and MapCompositeGroupConfig correctly translate API types into the library IR
- Reference integration tests for multi-level controllers (e.g. JobSet) verify that the workloadbuilder produces the expected CompositePodGroup and child PodGroup objects from a composite WorkloadItem tree.
Integration tests
- Verify that a single-level controller (Job) can create the correct Workload/PodGroup via workloadbuilder (covered in KEP-5547)
- Verify that a multi-level controller (e.g. JobSet) can produce a CompositePodGroup with multiple child PodGroups via the workloadbuilder library
- Verify that updating gang.minCount triggers recompilation of the Workload and re-sync of the PodGroup
e2e tests
- Gang scheduling end-to-end: all pods scheduled together or none via workloadbuilder-compiled Workload/PodGroup
- Mixed workloads: gang and basic Jobs coexist without interference
Graduation Criteria
Alpha
- Reusable scheduling API building blocks (SchedulingConstraints, DisruptionMode, SchedulingMode, ResourceClaim) introduced under the scheduling.k8s.io API group.
- The shared workloadbuilder Go translation library implemented in the k8s.io/component-helpers staging repository.
- Comprehensive unit and integration tests added for the workloadbuilder library to verify correct resource translation and default-overriding logic.
- Core Job API (batch/v1) integrated with the standardized WAS building blocks and validated in the alpha phase.
Beta
- At least one multi-level composite workload controller (such as JobSet, LeaderWorkerSet, or Kubeflow TrainJob) successfully integrated using the standardized building blocks and the workloadbuilder library.
- Clear recommendations on runtime PodGroup/CompositePodGroup creation and lifecycle management for multi-level composite controllers finalized and validated in practice.
- User feedback gathered on usability, confirming that the proposed approach provides a natural and cohesive UX.
GA
- TBD once the KEP is promoted to beta
Upgrade / Downgrade Strategy
The API building blocks are not top-level objects, and thus are not exposed directly
by kube-apiserver. Upgrade/downgrade strategies for top-level APIs (like Job) are
described in detail in the corresponding KEPs.
Version Skew Strategy
workloadbuilder adds no version-skew constraints of its own, because each component vendors a fixed
version at build time. Skew applies only to the runtime components of each integration.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: WorkloadWithJob
  - Components depending on the feature gate:
    - kube-controller-manager
    - kube-apiserver
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
This KEP itself is a code-level change and a building block for a few KEPs.
The API building blocks and libraries cannot be disabled on their own. Integration
with them, however, is controlled by feature gates dedicated to each integration (e.g.
WorkloadWithJob for the integration with the Job API and job-controller). These gates
can be disabled, as described in detail in the corresponding KEPs (e.g.
KEP-5547 for Job integration).
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The building blocks and libraries themselves aren’t gated, so there is nothing to roll back at this
level. Rollback is a property of each integration: disabling that integration’s dedicated gate (e.g.
WorkloadWithJob) is what stops the building-block fields from being served.
What happens if we reenable the feature if it was previously rolled back?
This is likewise governed by the integration rather than the building blocks. Reenabling an integration’s gate is handled according to each controller’s enablement/disablement strategy.
Are there any tests for feature enablement/disablement?
Since the building blocks aren’t gated, enablement/disablement tests live with each integration and are described in the corresponding KEPs. This KEP is covered by library and unit tests for the building blocks themselves.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
Enabling an integration’s gate adds no new API calls by itself; it only allows the integration’s
building-block fields to be persisted. The new calls (creating Workload/PodGroup objects) are
made by integrating controllers when a user opts in. Workloads that omit the scheduling block
generate no new API calls.
Will enabling / using this feature result in introducing new API types?
Yes, the building-block field types under scheduling.k8s.io/v1alpha3, embedded into integrating
APIs.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, but only when a user opts in. This KEP itself only adds optional building-block fields
to integrating workload objects. The larger effect — creating Workload and PodGroup (or
CompositePodGroup) objects (~500 bytes each, typically one per opted-in workload) — is performed
by integrating controllers and quantified per integration.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
This KEP itself adds only in-process API translation (negligible CPU) in controllers that vendor
the library. The user-visible effects come from the integrations and the scheduler and
apply only to opted-in workloads. Workloads that omit the scheduling block are unaffected.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
This KEP itself adds only the building-block fields and the build-time translation library (negligible CPU).
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. This feature operates entirely at the control-plane/API level and does not consume any node-level resources.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- 2026-06-03: KEP Created for alpha release
Drawbacks
- Reduced global uniformity / API fragmentation: Because each controller composes its own user-facing scheduling API from the shared building blocks rather than a single unified schema, the exact shape and vocabulary of the scheduling configuration can differ between controllers.
- Shared-library coupling and version skew: Out-of-tree controllers that adopt the workloadbuilder library take on a dependency whose translation/defaulting logic must stay compatible across controller and library versions. Skew between a controller’s vendored library version and the cluster’s scheduling.k8s.io API version can lead to subtle behavioral differences.
- Additional API surface to maintain: The standardized building blocks add new types under scheduling.k8s.io that must evolve carefully to remain backward-compatible across the many controllers that embed them.
Alternatives
During the design phase, we initially pursued a highly unified, top-down compiler vision outlined in the [Public] API Design for WAS Controller Integration.
However, as we analyzed the implementation details, we discovered two fatal architectural and logistical challenges documented in [Public] WAS Controller API - challenges and potential alternatives that made the original unified API vision unfeasible within a reasonable timeframe:
1. Implementation Complexity & The “Transitive Capability Leak”
As detailed in [Public] The “capability leak” in go/was-controller-api,
because composite workloads (such as JobSet or TrainJob) natively wrap child templates (like
standard JobTemplateSpec), any new scheduling field introduced at the child level transitively
propagates (“leaks”) up the schema stack. Handling these nested configurations requires massive,
complex boilerplate inside every intermediate controller (e.g., reconcilers dynamically checking
if they are the root compiler, managing owner references, and validating nested fields), making
the unified compiler pattern highly cumbersome and fragile.
2. The Upstream Dependency Bottleneck
The most critical issue with the original unified API design is the strict Controller
Integration Dependency chain. Under a monolithic, cascading rollout, integrating a new
scheduling feature into a top-level out-of-tree controller (such as TrainJob or RayJob) was
strictly blocked by the successful integration of all intermediate child controllers (waiting
first for core Job and then JobSet). This dependency chain would delay crucial Workload-aware
Scheduling features for quarters or years, which is completely unacceptable when the user demand
in the AI/ML space is immediate.
The Chosen Solution: Autonomous Composed Configurations & Conscious Trade-offs
Rather than delaying critical features, this KEP embraces Controller Autonomy. Sponsoring
out-of-tree controllers have full authority to design their own composed configurations using the
standard scheduling.k8s.io building blocks and the workloadbuilder library.
This represents a conscious and deliberate architectural trade-off:
- Local Consistency > Global Uniformity/Fragmentation: We prioritize native, idiomatic consistency within each controller’s local API over a globally unified, rigid schema. Enabling JobSet to utilize its established targetReplicatedJobs convention is far more intuitive for its users than forcing a single, shared structure across the entire ecosystem.
- Time-to-Market > Perfect API: In the fast-paced AI and machine learning landscape, workload requirements change from month to month. Users need working scheduling capabilities today, not an idealized but delayed API a year from now. A “prettier” global API structure is not an acceptable justification for blocking immediate ecosystem adoption.