KEP-4368: Job API managed-by label

KEP-4368: Support managedBy field for Jobs

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

We support the “managedBy” field as a lightweight mechanism to delegate the Job synchronization to an external controller.

Motivation

As a part of Kueue (an effort done by WG Batch, in cooperation with SIG Autoscaling, SIG Scheduling, SIG Apps and SIG Node) we are working on a multi-cluster job dispatcher project, called MultiKueue .

In the MultiKueue design, which follows manager-worker architecture, a user creates a Job in the management cluster, but a mirror-copy of the Job is created and executed in one of the worker clusters. The status updates of the mirror-Job are reflected by the Kueue controller in the management cluster, in the status of the Job created by the user.

In order to support this workflow we need a mechanism to disable the main Job controller, and delegate the status synchronization to the Kueue controller.

Goals

support delegation of Job synchronization to an external controller

Non-Goals

passing custom parameters to the external controller
Introduce a new concurrency policy for CronJobs (e.g. ForbidActive or SoftForbid) to replace a Job that is about to complete, but still has terminating pods.

Proposal

The proposal is to support the “managedBy” field in the Job spec to indicate the only controller responsible for the Job object synchronization.

User Stories (Optional)

Story 1

As a developer of Kueue I want to have Job API which allows me to implement the MultiKueue design. For this reason I need a way to disable the main Job controller on the management cluster.

The mechanism should be per-Job, because the management cluster might also be one of the worker clusters, for two reasons:

Disabling the Job controller per cluster requires access to the kube-controller-manager manifest. Such access is generally discouraged by cloud providers.
The management cluster may also be a worker. Supporting this scenario is important for smooth transition of Kueue users from a single-cluster to multi-cluster.

Ideally, the mechanism should be lightweight so that it is easy to be adopted by other Job CRDs supported by Kueue (see here ): MPIJob, RayJob, JobSet, multiple Kubeflow jobs.

It could be handy if the controller can be indicated by Kueue after the Job is created, but before starting it. In the scenario of role sharing (where the management cluster is also a worker), it would allow to avoid creation of a mirror Job within the cluster.

Notes/Constraints/Caveats (Optional)

Prior work

This approach of allowing another controller to mirror information between APIs is already supported with the “managedBy” label used by EndpointSlices (endpointslice.kubernetes.io/managed-by , see also in KEP ) and IPAddresses (ipaddress.kubernetes.io/managed-by ).

Note that, the reserved label values for the built-in controllers have the k8s.io suffix, i.e.: endpointslicemirroring-controller.k8s.io and ipallocator.k8s.io, for the EndpointSlices, and IPAddresses, respectively.

Can the field be mutable?

There is a potential risk of leaking pods if the value is changed. For example, assume there is a running Job, which is reconciled by the Job controller, and has some pods created. If the value is then switched to the mirroring Kueue controller (which by itself does not manage pods), the pods leak and remain running.

In order to avoid the risk of pods leaking between the controllers when changing value, we make it immutable (allow to be added on Job creation, but fail requests trying to update its value, see also mutability ).

However, the question remains whether we can make the field mutable when the job is stopped, similarly to what we do with the AllowMutableSchedulingDirectives flag, which guards mutability of the Job’s pod template labels.

It seems possible, and could be handy in Story 1 , but it is also not a blocker.

It would also complicate debuggability of the feature.

Also, we already observe the adoption of the mechanism in other batch projects, such as:

These projects for now follow the decision taken in the core k8s to make the field immutable to avoid complication of the support for mutability.

All together, we decide to keep the field immutable.

Use for MultiKueue

The “managedBy” field is going to be added by a dedicated MultiKueue webhook for Jobs created by users, and the Jobs remain suspended until ready to run. Once the job is ready to run its mirror copy is created on a selected worker cluster. Note that the mirror copy differs from the Job on the management cluster as it does not have the “managedBy” field (removed), and will have different UIDs.

When the job is running it is unsuspended (both on the management and the worker clusters). Until the job completes, the MultiKueue controller mirrors back the Job status, periodically, from the worker onto the Job object in the management cluster.

Note that for simplicity of the design, and thus debugging, there is one-to-one relationship between the Job on the management cluster and the Job on the worker cluster (it is non-goal to distribute the Job among multiple clusters).

Risks and Mitigations

Ecosystem fragmentation due to forks

The mechanism to disable the main Job controller opens the door for users to substitute it with a fork. This may create more fragmentation in the community as users may prefer to use their forked controllers rather than contribute upstream.

First, this risk, to some extent, exists even today, as admins with access to the control plane can disable the Job controller by passing --controllers=-job,* in the manifest for kube-controller-manager (see more info on the --controllers flag here).

Second, we believe that users who had the need to fork the Job controller already introduced dedicated Job CRDs for their needs.

Two controllers running when feature is disabled

It is possible that one creates jobs with “managedBy” field on kubernetes version which enables the feature, then downgrades to the alpha version which disables the feature gate. In that case the field remains present on the job and two controllers (the built-in, and external controller) might start running and compete with Job status updates at the same time.

Note that an analogous situation may happen when the version of Kubernetes already supports the field, but the feature gate is disabled in kube-controller-manager.

To mitigate this risk, we warn in the Kueue documentation that Jobs using this field should be removed before a downgrade or before disabling the feature gate.

Finally, this risk will fade away with time as the new versions of Kubernetes support it.

Debuggability

With this mechanism new failure modes can occur. For example, a user may make a typo in the field value, or the cluster administrator may not install the custom controller (like MultiKueue) on that cluster.

In such cases the user may not observe any progress by the job for a long time and may need to debug the Job.

In order to allow for debugging of situations like this the Job controller will put a log line indicating the synchronization is delegated to another controller (see implementation overview ).

Additionally, re-evaluate extending the kubectl command-line tool before GA . We could extend the command to provide useful debugging information with the following:

new MANAGED_BY column for kubectl get job -owide (possibly also without -owide)
a line in the kubectl describe job output, just before the list of events, providing a user readable information if the Job is synchronized by a custom controller.

Alternative ideas considered were a dedicated condition and events .

Based on beta feedback, kubectl extensions are deferred as follow-up enhancements. The “managedBy” field is already visible via standard kubectl commands (kubectl get job <name> -o yaml), and MultiKueue users have not reported difficulties debugging Jobs with this field. We will reconsider these enhancements based on user feedback post-GA.

Custom controllers not compatible with API assumptions by CronJob

Currently, the validation of the Job status API is rather relaxed, allowing for potential situations, when modified by custom controllers, which might be problematic for the API consumers, like the built-in CronJob controller.

In particular, the CronJob relies on the assumption that the CompletionTime is only set to successful jobs. Also, that a job does not flip its state from Complete to Failed (or the other way round). Also, a finished job should not flip back to non-finished.

In order to mitigate this risk, we strengthen the validation of the Job Status API to prevent faulty custom controllers from making unexpected changes to the status (see here ).

Additionally, we intend to strengthen the CronJob implementation to verify the Job has the Complete condition before using CompletionTime (see here ).

CronJob delaying start of a new Job in Forbid mode

As a consequence of fixing #123775, as proposed in Terminating pods and terminal Job conditions, we delay setting the Complete and Failed conditions until the Job pods are terminated.

This impacts CronJobs with the Forbid concurrency policy, resulting in delayed creation of the replacement Job (until all pods of the previous job are terminated). This might be particularly noticeable for Jobs using very long termination period (terminationGracePeriodSeconds).

However, we argue that this change improves the CronJob to better match the semantics of the Forbid concurrency policy (see here , and the comment ).

Users who expect two (or more) Jobs owned by a CronJob to run at the same time should use the Allow concurrency policy, which can be combined with the use of a quota management system, like Kueue , to control the maximal number of the Jobs running at the same time.

Design Details

API

type JobSpec struct {
	// ... (other JobSpec fields elided)

	// ManagedBy field indicates the controller that manages a Job. The k8s Job
	// controller reconciles jobs which don't have this field at all or the field
	// value is the reserved string `kubernetes.io/job-controller`, but skips
	// reconciling Jobs with a custom value for this field.
	// The value must be a valid domain-prefixed path (e.g. acme.io/foo) -
	// all characters before the first "/" must be a valid subdomain as defined
	// by RFC 1123. All characters trailing the first "/" must be valid HTTP Path
	// characters as defined by RFC 3986. The value cannot exceed 63 characters.
	// +optional
	ManagedBy *string
}

Implementation overview

We skip synchronization of Jobs whose “managedBy” field is set to any value other than kubernetes.io/job-controller. When the synchronization is skipped, the name of the controller managing the Job object is logged.

We skip the reconciliation inside the syncJob function (see here ).
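The skip decision described above can be sketched as follows. This is a minimal illustration, not the actual controller code: the simplified JobSpec type, the externalManager function, and the gateEnabled parameter (standing in for the JobManagedBy feature gate) are hypothetical names chosen for this example.

```go
package main

import "fmt"

// JobControllerName is the reserved value handled by the built-in Job controller.
const JobControllerName = "kubernetes.io/job-controller"

// JobSpec is a simplified stand-in for the real Job spec (assumption).
type JobSpec struct {
	ManagedBy *string
}

// externalManager reports whether the Job is delegated to an external
// controller and, if so, returns that controller's name for logging.
// gateEnabled models the JobManagedBy feature gate: when it is off, the
// field is ignored and the built-in controller reconciles every Job.
func externalManager(spec JobSpec, gateEnabled bool) (string, bool) {
	if !gateEnabled || spec.ManagedBy == nil || *spec.ManagedBy == JobControllerName {
		return "", false
	}
	return *spec.ManagedBy, true
}

func main() {
	custom := "kueue.x-k8s.io/multikueue"
	builtin := JobControllerName
	for _, spec := range []JobSpec{{}, {ManagedBy: &builtin}, {ManagedBy: &custom}} {
		if name, external := externalManager(spec, true); external {
			fmt.Printf("skipping sync, Job is managed by %q\n", name)
			continue
		}
		fmt.Println("reconciling with the built-in Job controller")
	}
}
```

Returning the controller name alongside the boolean keeps the log line mentioned above cheap to produce at the point where the sync is skipped.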

We will re-evaluate for GA to also skip the reconciliation within the enqueueSyncJobInternal for optimal performance. See discussion in the Skip reconciliation in the event handler .

Skipping reconciliation in event handlers is deferred as a performance optimization for post-GA. This is considered a premature optimization without evidence of performance issues from MultiKueue users on the management cluster.

There is no validation for a value of the field beyond its format as described in the API comment above.
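As a rough illustration of the format rules from the API comment, the sketch below checks a value with standard-library code only. It is an approximation under stated assumptions: validateManagedBy and both regular expressions are hypothetical, and the path expression in particular is a simplified reading of the RFC 3986 path characters rather than the validation Kubernetes actually runs.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// maxManagedByLength mirrors the 63-character limit used elsewhere in this KEP.
const maxManagedByLength = 63

var (
	// RFC 1123 subdomain: dot-separated lowercase alphanumeric labels,
	// where dashes are allowed only inside a label.
	subdomainRe = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	// Simplified approximation of RFC 3986 path characters (assumption).
	pathRe = regexp.MustCompile(`^[A-Za-z0-9/\-._~%!$&'()*+,;=:@]+$`)
)

// validateManagedBy checks the length and the "<subdomain>/<path>" shape.
func validateManagedBy(value string) error {
	if len(value) > maxManagedByLength {
		return fmt.Errorf("must be no more than %d characters", maxManagedByLength)
	}
	prefix, path, found := strings.Cut(value, "/")
	if !found || prefix == "" || path == "" {
		return errors.New("must be a domain-prefixed path such as acme.io/foo")
	}
	if !subdomainRe.MatchString(prefix) {
		return errors.New("prefix must be a valid RFC 1123 subdomain")
	}
	if !pathRe.MatchString(path) {
		return errors.New("path contains invalid characters")
	}
	return nil
}

func main() {
	for _, v := range []string{"kubernetes.io/job-controller", "acme.io/foo", "no-slash", "Acme.io/foo"} {
		fmt.Printf("%-30q -> %v\n", v, validateManagedBy(v))
	}
}
```

Note that the check is purely syntactic: a well-formed value naming a controller that is not installed still passes, which is the failure mode discussed under Debuggability.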

Job status validation

We will add the following validation rules, which are currently satisfied by the built-in Job controller, to mitigate the risk described in Custom controllers not compatible with API assumptions by CronJob :

the CompletionTime field is only set for jobs with Complete condition (per API comment ) and cannot be changed once set.
once the Failed=True or Complete=True conditions are added, their status cannot be changed. They are also mutually exclusive.

Additionally, we verify the following:

the completedIndexes and failedIndexes fields are non-nil only when .spec.completionMode: Indexed
the format of the completedIndexes and failedIndexes fields (but tolerate corrupted values if already present)

We may come up with more validation rules during the implementation phase. The API comments to the Job status API fields will be updated to make the contract clear.

We also plan to add a validation rule to make sure that the terminal conditions (Failed and Complete) are only added to a Job when all pods are terminated. For that we plan to follow the approach described below , which extends the scope of the interim FailureTarget and SuccessCriteriaMet conditions. We will also validate that the transition to the Failed or Complete condition is preceded by adding the FailureTarget or SuccessCriteriaMet condition, respectively.

Additionally, we are going to introduce a validation rule that the count of ready pods (status.ready) is lower than or equal to the count of active pods (status.active). In order to introduce this validation we first need to solve Job controller reports the count of ready pods with unnecessary delay , as well as merge Improve the Job API comment for ready field .

Terminating pods and terminal Job conditions

During the development process of Alpha in 1.30 we considered adding a validation rule enforcing that the Job terminal conditions (Failed or Complete) are only added when all pods are terminated (status.terminating=0 and status.ready=0). However, the rule turned out to be violated by the built-in Job controller (see issue #123775 ).

We are going to solve this issue by delaying the addition of the Job terminal conditions (Failed or Complete) until the pods are terminated (terminating=0 and ready=0).

One complication of this approach is that pod termination may take arbitrarily long if a non-default pod graceful termination period (terminationGracePeriodSeconds, 30s by default) is configured. In order to let API clients know the fate of a Job as soon as possible, we extend the scope of the following Job conditions:

FailureTarget - introduced in Pod Failure Policy KEP
SuccessCriteriaMet - introduced in Success Policy KEP

Note that, with the new conditions an API client can know the fate of a Job earlier than currently, since adding the conditions will not wait for emptying of the uncountedTerminatedPods structure, as shown in the experiment .
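The ordering of interim and terminal conditions described in this section can be sketched as a small decision function. The Outcome type and conditionsFor function are hypothetical names for illustration; the real controller tracks considerably more state.

```go
package main

import "fmt"

// Outcome is the fate of a Job as decided by the controller (illustrative type).
type Outcome int

const (
	Undecided Outcome = iota
	OutcomeSucceeded
	OutcomeFailed
)

// conditionsFor returns the conditions that may be present on the Job.
// The interim condition is added as soon as the outcome is known; the
// terminal condition is added only once terminating=0 and ready=0.
func conditionsFor(outcome Outcome, terminating, ready int32) []string {
	var interim, terminal string
	switch outcome {
	case OutcomeSucceeded:
		interim, terminal = "SuccessCriteriaMet", "Complete"
	case OutcomeFailed:
		interim, terminal = "FailureTarget", "Failed"
	default:
		return nil
	}
	conds := []string{interim}
	if terminating == 0 && ready == 0 {
		conds = append(conds, terminal)
	}
	return conds
}

func main() {
	// A failed Job with two pods still terminating gets only the interim condition.
	fmt.Println(conditionsFor(OutcomeFailed, 2, 0))
	// Once all pods are gone, the terminal condition follows.
	fmt.Println(conditionsFor(OutcomeFailed, 0, 0))
}
```

A client that only needs to know whether the Job will fail or succeed can watch for the interim condition, while consumers that depend on all pods being gone, such as a CronJob with the Forbid policy, wait for the terminal one.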

Since the fix is needed for this KEP and the Pod Replacement Policy KEP , we intend to protect the fix with the OR of the JobManagedBy and the JobPodReplacementPolicy feature gates.

Additionally, fixing the issue #123775 will also require fixing:

Note also that the fix impacts CronJob when using the Forbid concurrency, see more details in CronJob delaying start of a new Job in Forbid mode .

Mutability

We keep the field immutable. See also the discussion in Can the field be mutable? .
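A minimal sketch of the immutability check on update follows. The validateManagedByUpdate function is a hypothetical helper, and treating the addition of the field on update (nil to a value) as a rejected change is an assumption consistent with the decision that the field can only be set on Job creation.

```go
package main

import (
	"errors"
	"fmt"
)

// validateManagedByUpdate rejects any difference between the old and new
// values of spec.managedBy, including adding or removing the field.
func validateManagedByUpdate(oldValue, newValue *string) error {
	same := (oldValue == nil && newValue == nil) ||
		(oldValue != nil && newValue != nil && *oldValue == *newValue)
	if !same {
		return errors.New("spec.managedBy: field is immutable")
	}
	return nil
}

func main() {
	a, b := "acme.io/foo", "acme.io/bar"
	fmt.Println(validateManagedByUpdate(&a, &a))
	fmt.Println(validateManagedByUpdate(&a, &b))
	fmt.Println(validateManagedByUpdate(nil, &a))
}
```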

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

pkg/controller/job: 2023-12-20 - 91.5%
pkg/registry/batch/job: 2023-12-20 - 92.2%
pkg/apis/batch/validation: 2024-01-22 - 98.0%
pkg/apis/batch/v1: 2023-12-20 - 29.3% (mostly generated code)

The following scenarios are covered:

the Job controller reconciles jobs with the “managedBy” field equal to kubernetes.io/job-controller when the feature is enabled
the Job controller reconciles jobs without the “managedBy” field when the feature is enabled
the Job controller does not reconcile jobs with custom value of the “managedBy” field when the feature is enabled
the Job controller reconciles jobs with custom “managedBy” field when the feature gate is disabled
verify the field is immutable, both when the job is suspended or unsuspended; when the feature is enabled
enablement / disablement of the feature after the Job (with custom “managedBy” field) is created
verify the new Job Status API validation rules (see here ). In particular:
1. Failed and Complete conditions cannot be added when status.terminating!=0 or status.ready!=0
2. Failed and Complete conditions cannot be added if there are not corresponding FailureTarget or SuccessCriteriaMet conditions,
3. the counter for the “ready” pods is lower or equal to the counter for “active” pods

The following scenarios related to Terminating pods and terminal Job conditions are covered:

Failed or Complete conditions are not added while there are still terminating pods
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded
SuccessCriteriaMet is added when the completions are satisfied

Integration tests

The following scenarios are covered:

the Job controller reconciles jobs with the “managedBy” field equal to kubernetes.io/job-controller (link )
the Job controller reconciles jobs without the “managedBy” field (link )
the Job controller does not reconcile a job with any other value of the “managedBy” field. In particular:
- it does not reset the status for a Job with .spec.suspend=false (link ),
- it does not add the Suspended condition for a Job with .spec.suspend=true (link ).
the Job controller reconciles jobs with custom “managedBy” field when the feature gate is disabled (link )
the Job controller correctly handles re-enablement of the feature gate (link )
the jobs_by_external_controller_total metric is incremented when a new Job with custom “managedBy” is created (link )
the jobs_by_external_controller_total metric is not incremented for a new Job without “managedBy” or with default value (link )
the jobs_by_external_controller_total metric is not incremented for Job updates (regardless of the “managedBy”) (tested indirectly as here the Job controller updates the Job status)

The following scenarios related to Terminating pods and terminal Job conditions are covered:

Failed or Complete conditions are not added while there are still terminating pods (link )
FailureTarget is added when backoffLimitCount is exceeded, or activeDeadlineSeconds timeout is exceeded (link )
SuccessCriteriaMet is added when the completions are satisfied (link )

During the implementation more scenarios might be covered.

e2e tests

The feature does not depend on kubelet, so the functionality can be fully covered with unit & integration tests.

We propose a single e2e test for the following scenario:

the Job controller does not reconcile a job with any other value of the “managedBy” field. In particular, it does not reset the status for an unsuspended Job.

Job should allow to delegate reconciliation to external controller : SIG Apps , triage search

Graduation Criteria

Alpha

skip synchronization of jobs when the “managedBy” field is set to a value other than kubernetes.io/job-controller
unit and integration tests
implement the additional Job status validation (see here ); also update the comments to the API fields affected by the new validation rules
make CronJob more resilient by checking the Job condition is Complete when using CompletionTime (see here )
The feature flag disabled by default
implement the jobs_by_external_controller_total metric

Second Alpha (1.31):

preparatory fix to address all known inconsistencies between validation and the Job controller behavior, in particular: #123775 . The proposed approach is outlined here .
preparatory fixes to address the issue that the count of ready pods might currently be temporarily greater than the count of active pods. The fix will entail solving Job controller reports the count of ready pods with unnecessary delay , as well as merging Improve the Job API comment for ready field .
Add validation rule that Failed and Complete conditions are only added when terminating=0 and ready=0. Also, provide links to the relevant integration tests in the KEP.
Add validation rule that the count of ready pods is lower than or equal to the count of active pods

Beta

e2e tests
verify the validation passes during e2e tests for open-source projects (like Kueue and JobSet)
The feature flag enabled by default

GA

Address reviews and bug reports from Beta users
- So far, with the adoption in MultiKueue we have not got users requesting extra debuggability. With the “managedBy” field being immutable it seems the feature is not causing issues requiring extensive debug. The metric and immutable field should be sufficient as a source of truth.
Re-evaluate the ideas of improving debuggability (like extended kubectl , dedicated condition , or events )
- Keep these ideas as follow-up features and not blockers for the graduation.
Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
- This could be a follow-up performance improvement, but does not seem required. We have no record of MultiKueue users hitting performance issues on the management cluster due to Job controller.
Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
- See Ecosystem Assessment section below.
Lock the feature gate
- This will be done in follow-up implementation PR for 1.35.

Ecosystem Assessment

We already have a couple of implementations of the “managedBy” field in order to integrate with MultiKueue:

JobSet
Kubeflow Trainer v1
Kubeflow Trainer v2
KubeRay
Tekton Pipelines
AppWrapper

These implementations follow the k8s core design, particularly making the field immutable.

Conformance Analysis:

Aspect	k8s Job	JobSet	Kubeflow Trainer v1	Kubeflow Trainer v2	KubeRay	Tekton Pipelines	AppWrapper
Open set of allowed values	y	y	n	n	n	y	y
Max length 63 chars	y	y	y (by closed set)	y (by closed set)	y (by closed set)	n	n
Specific format validation	y	y	y (by closed set)	y (by closed set)	y (by closed set)	n	n
Immutability	y	y	y	y	y	y	y
Status validation	y	n	n	n	n	n	n
Skip inside Reconcile	y	y	y	y	y	y (also filtering)	y

Key Findings:

There is some inconsistency between the “open set of values” and “closed set of values” approaches
Kubeflow Trainer v1 puts the field under spec.runPolicy.managedBy. This is because the code for different controllers is deduplicated at the level of the runPolicy field
Kubeflow Trainer locks the set to only “known controllers” like MultiKueue: kueue.x-k8s.io/multikueue. This was done to be “on the safe side”, and avoid complex status validations as in the core
The main part is that all of them follow the key aspects: immutability and skip inside Reconcile
All projects consistently wanted to avoid the effort of validating the status which we took in the k8s Job

Validation inconsistencies:

Some controllers (Tekton, AppWrapper) don’t enforce the 63-character length limit. If they allow more than 63 characters, then nothing will break for them. The idea of restricting in the core k8s was to make it more constrained
Most controllers skip the status validation that k8s Job implements to avoid the extra work, since “managedBy” is immutable

Follow-up Work

We may reconsider dropping the code gracefully handling Jobs with invalid statuses after 1.39 behind a dedicated feature gate. All Jobs created in 1.32+ won’t require the graceful handling as the rules prevent entering invalid state. We will wait at least 4 releases before locking such a feature gate.

Metric Stability: The jobs_by_external_controller_total metric will be promoted to STABLE in 1.35.

Deprecation

Remove the feature-gate in GA+2.

Upgrade / Downgrade Strategy

Upgrade

An upgrade to a version which supports this feature does not require any additional configuration changes. This feature is opt-in at the Job-level, so to use it users need to add the “managedBy” field to their Jobs.

Downgrade

A downgrade to a version which does not support this feature (1.29 and below) does not require any additional configuration changes. All jobs, including those that specified a custom value for “managedBy”, will be handled in the default way by the Job controller. However, this introduces the risk of two controllers running at the same time .

In order to mitigate this risk, admins may want to make sure the custom controllers using the “managedBy” field are disabled before the downgrade.

Version Skew Strategy

This feature is limited to control plane, so the version skew with kubelet does not matter.

In case kube-apiserver is running in HA mode, and the versions are skewed, then the old version of kube-apiserver may let the field get mutated, if the feature is not supported on the old version.

In case the version of the kube-controller-manager leader is skewed (old), the built-in Job controller would reconcile the Jobs with custom “managedBy” field, running into the risk of two controllers running at the same time . It is recommended the users don’t create jobs with custom “managedBy” field during an ongoing upgrade.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: JobManagedBy
- Components depending on the feature gate: kube-apiserver, kube-controller-manager
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes.

However, when the feature is disabled and there are Jobs delegated to external controllers via the “managedBy” field, there is a risk of two controllers running at the same time . Thus, it is recommended that administrators make sure there are no Jobs using external controllers before rollback.

What happens if we reenable the feature if it was previously rolled back?

The feature behaves as if it was enabled for the first time.

Are there any tests for feature enablement/disablement?

Yes, we introduce the integration tests for the feature enablement / disablement after the Job is created (see here ).

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

The rollout will not impact already running workloads, unless they set the “managedBy” field to a custom value, but this would require a prior intentional action.

What specific metrics should inform a rollback?

A substantial increase in the apiserver_request_total[code=409, resource=job, group=batch], while there are jobs with the custom “managedBy” field, can be indicative of the built-in job controller stepping onto another controller, causing conflicts. This can be further investigated per-job by checking whether .metadata.managedFields.manager flips between two owners.

The feature is opt-in so in case of such problems the custom “managedBy” field should not be used.

Also, an admin could check if the value of the jobs_by_external_controller_total matches the expectations. For example, if the value of the metric does not increase when new jobs are being added with a custom “managedBy” field, it might be indicative that the feature is not working correctly.

A substantial increase in kube_cronjob_status_active after upgrade may suggest that the Jobs are not making progress. Additionally, if the non-progressing Jobs use custom “managedBy” field, then rollback of the feature might be justified to make the CronJobs run, by letting the built-in Job controller handle the Jobs.

A substantial drop in the job_sync_duration_seconds, while the number of jobs with the custom “managedBy” field is low, could be indicative of the Job controller skipping reconciliation of jobs it should reconcile. This could be further investigated per-job by looking at the timestamp of changes in .metadata.managedFields.time, and owners in .metadata.managedFields.manager.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

The Upgrade->downgrade->upgrade was tested manually using the 1.31 release (Alpha), with the following steps:

Start the cluster with the JobManagedBy enabled for kube-apiserver and kube-controller-manager.

Then, create two long-running Jobs:

job-managed with custom value of the “managedBy” field
job-regular without the “managedBy” field

Then, verify that:

the job-managed does not get status updates from built-in controller. Update the status manually and observe it is not reset by the built-in controller.
the job-regular starts making progress (creates pods and updates the status accordingly by the built-in controller)

Simulate downgrade by disabling the feature for kube-apiserver and kube-controller-manager.

Then, verify that:

the job-managed starts to make progress, the status is reset, and updated to some new values
the job-regular continues making progress

Simulate upgrade by re-enabling the feature for kube-apiserver and kube-controller-manager.

Then, verify that:

the job-managed stops getting status updates from the built-in controller. Update the status manually and observe it is not reset by the built-in controller.
the job-regular continues making progress

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check the jobs_by_external_controller_total metric. If the value is non-zero for any controller_name label, Jobs delegated to that custom controller have been created, so the feature is in use.

For a specific Job in question, check if the Job has the “managedBy” field.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .spec
- Condition name:
- Other field:
  - .spec.managedBy for Jobs
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature does not propose SLOs. We don’t expect any of the existing SLOs to be impacted negatively by the proposal.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - jobs_by_external_controller_total (new), with the controller_name label corresponding to the custom value of the “managedBy” field. The metric is incremented by the built-in Job controller on each ADDED Job event, corresponding to a Job with custom value of the “managedBy” field. This metric can be helpful to determine the health of a job and its controller in combination with already existing metrics (see below).
    - Components exposing the metric: kube-controller-manager
  - apiserver_request_total[code=409, resource=job, group=batch] (existing): substantial increase of this metric, when additionally jobs_by_external_controller_total>0 may be indicative of two controllers stepping onto each-other causing conflicts (see here ).
    - Components exposing the metric: kube-apiserver
  - kube_cronjob_status_active (existing): a substantial increase of this metric may suggest that non-progressing jobs controlled by a CronJob are accumulating. If additionally jobs_by_external_controller_total>0, it may suggest that the Jobs are getting stuck because they are not being synchronized by the custom controller.
    - Components exposing the metric: kube-state-metrics
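The increment rule for jobs_by_external_controller_total can be sketched as below. This is only an illustration under stated assumptions: externalJobsCounter is a hypothetical map-based stand-in for a counter with a controller_name label (a real controller would use a metrics library), onJobAdded is a hypothetical handler name, and example.com/custom-controller is a made-up value.

```go
package main

import "fmt"

// reservedControllerName denotes the built-in Job controller.
const reservedControllerName = "kubernetes.io/job-controller"

// externalJobsCounter is a stand-in for a counter metric keyed by the
// controller_name label.
type externalJobsCounter map[string]int

// onJobAdded models the handling of an ADDED Job event: the counter is
// incremented only for Jobs with a custom managedBy value.
func (c externalJobsCounter) onJobAdded(managedBy *string) {
	if managedBy == nil || *managedBy == reservedControllerName {
		return
	}
	c[*managedBy]++
}

func main() {
	counter := externalJobsCounter{}
	custom := "example.com/custom-controller" // hypothetical controller name
	reserved := reservedControllerName

	counter.onJobAdded(nil)       // field not set: not counted
	counter.onJobAdded(&reserved) // reserved value: not counted
	counter.onJobAdded(&custom)   // custom value: counted
	counter.onJobAdded(&custom)

	for name, n := range counter {
		fmt.Printf("jobs_by_external_controller_total{controller_name=%q} %d\n", name, n)
	}
}
```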

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No, unless a custom value of the “managedBy” field is set. In the worst case scenario this is 9 bytes for the field name and 63 for the value.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior of the Job controller.

What are other known failure modes?

None.

What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

2023-12-20 - First version of the KEP
2024-03-05 - Merged implementation PR Support for the Job managedBy field (alpha)
2024-03-07 - Merged Update Job conformance test for job status updates
2024-03-08 - Merged Follow up fix to the job status update test
2024-03-11 - Merged Adjust the Job field API comments and validation to the current state
2024-05-16 - Merged Fix the comment for the Job managedBy field
2024-06-11 - Merged Count terminating pods when deleting active pods for failed jobs
2024-06-21 - Merged Update the count of ready pods when deleting pods
2024-07-12 - Merged Delay setting terminal Job conditions until all pods are terminal
2024-07-30 - Merged Update the docs for JobManagedBy and JobPodReplacementPolicy related to pod termination
2024-10-17 - Merged Graduate JobManagedBy to Beta in 1.32

Drawbacks

Alternatives

Skip reconciliation in the event handler

We discussed skipping the reconciliation only within enqueueSyncJobInternal.

However, it was noted that it would cause race conditions when the Job with the same name and namespace is re-created, but with the managedBy field. The race condition was reproduced by the TestManagedBy_RecreatedJob integration test which demonstrated the issue with such an implementation.

Still, it is a potential improvement to skip the reconciliation inside syncJob and skip queuing within the enqueueSyncJobInternal function for optimal performance (by saving memory and off-loading the reconciliation queue).
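The following sketch shows why a check inside syncJob is robust to re-creation. It models one way the race could arise, which is an assumption for illustration: a key is enqueued while the Job has no managedBy field, and the Job is re-created with the field before the key is processed. The store type, the simplified syncJob signature, and the controller name are hypothetical.

```go
package main

import "fmt"

// reservedControllerName denotes the built-in Job controller.
const reservedControllerName = "kubernetes.io/job-controller"

// Job is a simplified stand-in for the batch/v1 Job.
type Job struct {
	ManagedBy *string
}

// store is a stand-in for the informer cache, keyed by namespace/name.
type store map[string]*Job

func isExternallyManaged(job *Job) bool {
	return job.ManagedBy != nil && *job.ManagedBy != reservedControllerName
}

// syncJob models the reconcile entry point: it re-reads the Job by key and
// decides whether to skip based on the object's current state.
func syncJob(s store, key string) (reconciled bool) {
	job, ok := s[key]
	if !ok {
		return false // Job was deleted; nothing to reconcile.
	}
	if isExternallyManaged(job) {
		return false // Skip: another controller owns this Job.
	}
	// ... built-in reconciliation would happen here ...
	return true
}

func main() {
	s := store{}
	key := "default/sample"
	var queue []string

	// 1. A Job is created without managedBy and its key is enqueued.
	s[key] = &Job{}
	queue = append(queue, key)

	// 2. Before the key is processed, the Job is deleted and re-created
	//    with the same name and namespace, now with managedBy set.
	custom := "example.com/custom-controller" // hypothetical controller name
	s[key] = &Job{ManagedBy: &custom}

	// 3. The worker processes the stale queue entry; the check in syncJob
	//    still sees the current object and skips it.
	for _, k := range queue {
		fmt.Printf("%s reconciled by built-in controller: %v\n", k, syncJob(s, k))
	}
}
```

A filter placed only at enqueue time would have admitted the key in step 1 and would have no opportunity to re-evaluate it in step 3.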

Reasons for discarding/deferring

Potentially a premature optimization which would complicate the code. We prefer to base the introduction of the optimization on users’ feedback.

Reserved controller name value

We could also use just job-controller for the reserved value of the field (without the kubernetes.io/ prefix).

Reasons for discarding/deferring

In the prior work the names end with k8s.io for the built-in kubernetes controllers.

Defaulting of the field for newly created jobs

We could default the field in the PrepareForCreate function in strategy.go for newly created jobs.

Reasons for discarding/deferring

In any case, the Job controller needs to keep synchronizing jobs without the field for many releases before we can ensure that all jobs have it.

An additional case for jobs without the field does not increase the complexity significantly.

Alternative names for field

Alternative names we considered:

controllerName
controlledBy

Reasons for discarding/deferring

The use of “controller” in the field name may be confused with the owning controller (indicated by the OwnerReference). For a batch Job this might be a CronJob.

By choosing “managedBy” as the name, we also stay closer to the “managed-by” label used in the prior work.

Managed-by label

We also considered the label batch.kubernetes.io/managed-by, which was planned originally for this KEP.

Reasons for discarding/deferring

There is no clear indication of whether the label is supported. On old versions of k8s, users would add the label and the external controller would likely try to sync the Job, and so would the built-in controller. With the field, old k8s versions will reject the request.
The risk of two controllers running at the same time is limited to disabling the feature or a downgrade, both of which require admin action. With the label, this was possible if one created the Job on an old k8s version.
Discoverability of this functionality would be worse compared to the field.

Users don’t know what the allowed values of the field are. The values are not validated anyway.

Alternative names for label (scopes)

Generic kubernetes.io/managed-by

The idea was suggested (see thread) that a similar mechanism might be useful for other k8s controllers.

Reasons for discarding/deferring

There is currently no clear use case to support the label across the stack for APIs like StatefulSets, Deployments, or DaemonSets.

A generic name without support across all k8s APIs might be confusing to users, and supporting it for all k8s APIs would be a much bigger effort than currently needed for the MultiKueue use case.

The “managed-by” label idea has significant risks, such as ecosystem fragmentation due to forks. It makes sense to start with a limited scope as a “pilot” and assess the impact.

Job-prefixed job.kubernetes.io/managed-by

The idea of a job-prefixed label was suggested (see here).

Reasons for discarding/deferring

We prefer the “batch” prefix for consistency with current labels used by the Job controller (see here).

The “batch” prefix fits well for the use case of supporting this label in other batch Job CRDs supported by Kueue, such as JobSet, MPIJob, RayJob, etc.

Alternative mechanisms to mirror the Job status

A couple of other approaches to allow mirroring of the Job status were considered. They share the same risk as the managed-by label approach: substituting the Job controller with a custom one implementing the Job API.

mirrored-by label

A similar idea to the managed-by label, trying to address the risk of replacing the controller. To mitigate this risk we would document the label as used for the purpose of mirroring only. No controllers with custom logic would be supported.

Reasons for discarding/deferring

This is wishful thinking; users would still be free to use other custom controllers for the Job API.

Class-based approach

The idea is that there is an interim object which also allows specifying parameters of the custom controllers.

Reasons for discarding/deferring

There is no need for the custom controllers in the job-mirroring use-case for MultiKueue, so it adds unnecessary complexity.

Also, the mechanism would be significantly harder to adopt by other Job CRD projects with which Kueue integrates, so effectively we would need to have multiple mechanisms in the ecosystem.

Annotation

Annotations have more relaxed validation for values.

Reasons for discarding/deferring

This would not be consistent with the prior work.

The ability to filter jobs by the label or field is likely useful for users to identify jobs using custom controllers, for example with kubectl get jobs -l batch.kubernetes.io/managed-by=custom-controller.

Custom wrapping CRD

To avoid the risk we could introduce a CRD that allows users to run and monitor the status of the k8s Jobs. In this case a user creates, say, a kueue.MulticlusterJob. The instance of the MulticlusterJob embeds the JobSpec and the JobStatus. Then, based on the MulticlusterJob, Kueue creates the k8s Job on the selected cluster. Kueue also mirrors the status of the running k8s Job as the status of the MulticlusterJob.

Reasons for discarding/deferring

Huge friction when transitioning from single cluster to multi cluster. The in-house frameworks and pipelines need to be updated to use (create and monitor) the MulticlusterJob. This requires all the pipelines and frameworks to be aware of the multi-cluster. On the contrary, the proposed approach is transparent to the ecosystem.

The approach isn’t easily transferable for other Job CRDs. Creating a wrapping Multicluster Job CRD per Job CRD type creates maintenance cost at the Kueue side.

Increases fragmentation in the ecosystem. We don’t need yet another Job CRD that uproots the k8s Job. We want to have fewer, more universal APIs. We believe that the community driving the development of other Job CRDs is likely to adopt the label-based mechanism for making their CRDs multicluster-ready. So, the situation in which we go with the wrapping CRD for the k8s Job, but the label-based mechanism for other Job CRDs, may result in decreased adoption of the k8s Job, relative to the alternative Job CRDs, for batch-related tasks.

It would not be compatible with CronJob. Using CronJob with MultiKueue is a valid use case we want to support.

Use the spec.suspend field

This approach is to keep spec.suspend=true on the management cluster, while allowing spec.suspend=false on the worker cluster and syncing the status.

Reasons for discarding/deferring

When .spec.suspend=true, the Job controller resets some of the status fields (like .status.active or .status.ready), while not resetting others (like .status.failed), so the mirrored fields would be inconsistent.

Frameworks or users observing the main Job would get wrong information that it is suspended, while some of its status fields would be updating.

Using field selectors

We considered using field selectors in the Job controller to identify the subset of jobs it should watch. This could result in smaller memory usage.

Reasons for discarding/deferring

First, we use shared informers (so that all core k8s controllers see all objects), so we cannot save memory this way.

Second, there is no “OR” logic in selectors, however, the built-in Job controller needs to sync jobs in two cases:

old jobs without the field
new jobs with the field equal to kubernetes.io/job-controller

This means we would need to go through a difficult process of ensuring all jobs have the field, or listen to events from two informers. In any case, the use of field selectors is significantly more complicated than the skip check inside syncJob, and does not allow for a big memory gain.
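The missing “OR” can be illustrated with a small sketch. It assumes a hypothetical equality selector on spec.managedBy (no such field selector is claimed to exist for Jobs) and assumes that an unset field compares as the empty string. shouldSync and matchesEquals are illustrative names.

```go
package main

import "fmt"

// reservedControllerName denotes the built-in Job controller.
const reservedControllerName = "kubernetes.io/job-controller"

// shouldSync is the predicate the built-in Job controller needs:
// the field is unset OR it equals the reserved value.
func shouldSync(managedBy *string) bool {
	return managedBy == nil || *managedBy == reservedControllerName
}

// matchesEquals models a single hypothetical `spec.managedBy=<want>` selector.
// Assumption: an unset field is compared as the empty string.
func matchesEquals(managedBy *string, want string) bool {
	got := ""
	if managedBy != nil {
		got = *managedBy
	}
	return got == want
}

func main() {
	reserved := reservedControllerName
	custom := "example.com/custom-controller" // hypothetical controller name
	cases := []struct {
		name      string
		managedBy *string
	}{
		{"old job without the field", nil},
		{"new job with the reserved value", &reserved},
		{"job with a custom controller", &custom},
	}
	for _, c := range cases {
		fmt.Printf("%-32s shouldSync=%-5v unsetSelector=%-5v reservedSelector=%v\n",
			c.name, shouldSync(c.managedBy),
			matchesEquals(c.managedBy, ""),
			matchesEquals(c.managedBy, reserved))
	}
}
```

Neither selector alone reproduces shouldSync; only their union does, which under these assumptions corresponds to running two informers.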

Alternative ideas to improve debuggability

Condition to indicate Job is skipped

In order to inform the user that a job is skipped from synchronization we could add a dedicated condition, say ManagedBy, indicating that the job is skipped by the built-in controller.

Reasons for discarding/deferring

Since the Job field is immutable, the usefulness of the condition is limited, because the timestamp and other condition fields will not bring extra debugging value.
Conceptually, we want to give full ownership of the Job object to the other Job controller; objects mutated by two controllers could actually make debugging more involved.
The MultiKueue controller would have to non-trivially reconcile the Job status. If it just blindly mirrored the status from the worker cluster (as currently planned), then it would remove the condition. Other controllers would need to be careful not to remove the condition either.
It requires an extra request per job, and risks conflicts for the status update requests.

Additionally, notice that the analogous situation takes place when spec.schedulerName does not match a custom scheduling profile. There is no condition indicating that.

Event indicating the Job is skipped

The Job controller could emit an event on Job creation indicating that the Job is synchronized by a custom controller. This would not run into the issue of controllers conflicting on status updates.

Reasons for discarding/deferring

Events have an expiration time, which is potentially cloud-provider dependent. This makes them not that useful for debugging situations where the Job didn’t make progress for a long time. So, they would not give a reliable signal for debugging based on playbooks.

Renewing the event on every Job update seems excessive from the performance perspective.