KEP-3998: Job success/completion policy
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Relax a validation for the "completions" of the indexed job
- Alternative API Name, "Criteria"
- Hold succeededIndexes as []int typed in successPolicy
- Acceptable percentage of total succeeded indexes in the succeededCount field
- Match succeededIndexes using CEL
- Use JobSet instead of Indexed Job
- Possibility for the lingering pods to continue running after the job meets the successPolicy
- Possibility for introducing a new CronJob concurrentPolicy, "ForbidUntilJobSuccessful"
- Possibility for the configurable reason for the "SuccessCriteriaMet" condition
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP extends the Job API to allow setting conditions under which an Indexed Job can be declared as succeeded.
Motivation
There are cases where a batch workload runs as an indexed job in which only the leader indexes should determine the success or failure of the Job, for example MPI and PyTorch workloads. This is currently not possible because an indexed job is marked as Complete only if all indexes succeed.
Some third-party frameworks have implemented success policy.
Goals
- Allow marking a job as succeeded according to a declared policy.
- Once the job meets the successPolicy, the lingering pods are terminated.
Non-Goals
- Change the existing behavior of Jobs without a SuccessPolicy.
- Support SuccessPolicy for jobs in NonIndexed mode: the SuccessPolicy could theoretically be supported for jobs in NonIndexed mode. However, we don’t work on NonIndexed jobs in the first iteration since there aren’t any effective use cases for them.
Proposal
We propose new policies under which a job can be declared as succeeded. Those policies can be modeled as follows:
- An indexed job completes if a set of [x, y, z…] indexes are successful.
- An indexed job completes if at least x indexes are successful.
Then, when the job meets one of the success policies, a new condition, SuccessCriteriaMet, is added.
User Stories (Optional)
Story 1
As a machine-learning researcher, I run an indexed job in which a leader is launched as index=0 and workers are launched as index=1+. I want only the leader to be considered when the result of the job is evaluated.
In addition, we want to terminate the lingering pods once the leader index (index=0) has succeeded, because the workers often have no way to terminate themselves since they launch daemon processes such as an ssh-server.
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: "0"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./sample"]
Story 2
As a simulation researcher/engineer for fluid dynamics/biochemistry, I want to mark the job as successful and terminate the lingering pods when the job meets one of the following conditions:
- The leader index (index=0) has succeeded
- Some worker indexes (index=1+) have succeeded
This is because a succeeded leader index means that the whole simulation has succeeded, and some succeeded worker indexes mean that the minimum required value is satisfied.
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: "0"
    - succeededCount: 5
      succeededIndexes: "1-9"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./sample"]
Notes/Constraints/Caveats (Optional)
No support for JobSuccessPolicy in NonIndexed Jobs
As described in Non-Goals, we don’t support the SuccessPolicy for jobs in NonIndexed mode.
Difference between “Complete” and “SuccessCriteriaMet”
The similar job conditions, Complete and SuccessCriteriaMet, are different in the following ways:
- Complete means that all pods completed and either all of them were successful or the Job already had SuccessCriteriaMet=true.
- SuccessCriteriaMet means that the job meets at least one of successPolicies.
So, the job could have both conditions, Complete and SuccessCriteriaMet.
The CronJob concurrentPolicy is not affected by JobSuccessPolicy
Even after introducing the JobSuccessPolicy, all CronJob concurrentPolicies work as before
since the JobSuccessPolicy doesn’t change the semantics of the existing Job Complete condition
and the Job declares succeeded by a new condition, SuccessCriteriaMet.
Specifically, a CronJob with the Forbid concurrentPolicy creates Jobs based on the Job’s Complete condition as before.
Status never switches from “SuccessCriteriaMet” to “Failed”
Switching the status from SuccessCriteriaMet to Failed would confuse the systems
that depend on the Job API.
So, the status can never switch from SuccessCriteriaMet to Failed.
Additionally, once the job has SuccessCriteriaMet=true condition, the job definitely ends with Complete=true condition
even if the lingering pods could potentially meet the failure policies.
The scope of the SuccessCriteriaMet condition
As part of this KEP we introduced the SuccessCriteriaMet condition scoped to
the success policy.
However, we are going to extend the scope of the condition to the scenario when
the Job completes by reaching the .spec.completions, as part of fixing
issue #123775 (https://github.com/kubernetes/kubernetes/issues/123775).
Additionally, we introduce a new CompletionsReached condition reason for the Complete and SuccessCriteriaMet conditions
so that we can represent the case where the SuccessCriteriaMet condition is added because the number of succeeded Job Pods reached .spec.completions.
See more details in the Job API managed-by mechanism.
Risks and Mitigations
- If the job object’s size reaches the etcd limit and the job controller can’t store a correct value in .status.completedIndexes, we probably cannot evaluate the SuccessPolicy correctly.
- If we allowed a value of unlimited size in .spec.successPolicy.rules.succeededIndexes, we would have a risk similar to KEP-3850: Backoff Limits Per Index For Indexed Jobs. So, we limit the size of succeededIndexes to 64KiB.
Design Details
Job API
We extend the Job API to set different policies by which a Job can be declared as succeeded.
type JobSpec struct {
...
// successPolicy specifies the policy when the Job can be declared as succeeded.
// If empty, the default behavior applies - the Job is declared as succeeded
// only when the number of succeeded pods equals to the completions.
// When the field is specified, it must be immutable and works only for the Indexed Jobs.
// Once the Job meets the SuccessPolicy, the lingering pods are terminated.
//
// This field is alpha-level. To use this field, you must enable the
// `JobSuccessPolicy` feature gate (disabled by default).
// +optional
SuccessPolicy *SuccessPolicy
}
// SuccessPolicy describes when a Job can be declared as succeeded based on the success of some indexes.
type SuccessPolicy struct {
// rules represents the list of alternative rules for declaring the Jobs
// as successful before `.status.succeeded >= .spec.completions`. Once any of the rules are met,
// the "SuccessCriteriaMet" condition is added, and the lingering pods are removed.
// The terminal state for such a Job has the "Complete" condition.
// Additionally, these rules are evaluated in order; Once the Job meets one of the rules,
// other rules are ignored. At most 20 elements are allowed.
// +listType=atomic
Rules []SuccessPolicyRule
}
// SuccessPolicyRule describes a rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
// succeededIndexes specifies the set of indexes
// which need to be contained in the actual set of the succeeded indexes for the Job.
// The list of indexes must be within 0 to ".spec.completions-1" and
// must not contain duplicates. At least one element is required.
// The indexes are represented as intervals separated by commas.
// The intervals can be a decimal integer or a pair of decimal integers separated by a hyphen.
// A series of consecutive numbers is represented by its first and last element,
// separated by a hyphen.
// For example, if the completed indexes are 1, 3, 4, 5 and 7, they are
// represented as "1,3-5,7".
// When this field is null, this field doesn't default to any value
// and is never evaluated at any time.
//
// +optional
SucceededIndexes *string
// succeededCount specifies the minimal required size of the actual set of the succeeded indexes
// for the Job. When succeededCount is used along with succeededIndexes, the check is
// constrained only to the set of indexes specified by succeededIndexes.
// For example, given that succeededIndexes is "1-4", succeededCount is 3,
// and completed indexes are "1", "3", and "5", the Job isn't declared as succeeded
// because only the "1" and "3" indexes are considered in that rule.
// When this field is null, this doesn't default to any value and
// is never evaluated at any time.
// When specified it needs to be a positive integer.
//
// +optional
SucceededCount *int32
}
...
// These are valid conditions of a job.
const (
// JobSuccessCriteriaMet means the job has met its success criteria.
JobSuccessCriteriaMet JobConditionType = "SuccessCriteriaMet"
...
)
...
const (
...
// JobReasonSuccessPolicy reason indicates SuccessCriteriaMet condition is added due to
// a Job met successPolicy.
// https://kep.k8s.io/3998
JobReasonSuccessPolicy string = "SuccessPolicy"
// JobReasonCompletionsReached reason indicates SuccessCriteriaMet condition is added due to
// a number of succeeded Job Pods met completions.
// https://kep.k8s.io/3998
JobReasonCompletionsReached string = "CompletionsReached"
)
...
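The succeededIndexes format described in the API comments above ("1,3-5,7") can be illustrated with a minimal sketch. The interval type and the parseIndexes helper are hypothetical names, not the job controller's parser, and the sketch assumes intervals are listed in ascending order, which also rules out duplicates.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// interval is an inclusive range of completion indexes (hypothetical type).
type interval struct{ First, Last int }

// parseIndexes parses a string such as "1,3-5,7" into ordered intervals.
// It rejects malformed numbers, reversed ranges, indexes >= completions,
// and (as an assumption of this sketch) unsorted or overlapping intervals.
func parseIndexes(s string, completions int) ([]interval, error) {
	if s == "" {
		return nil, fmt.Errorf("at least one index is required")
	}
	var out []interval
	prev := -1 // last index accepted so far
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		if !isRange {
			hi = lo // a single integer is an interval of length one
		}
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid index %q", lo)
		}
		last, err := strconv.Atoi(hi)
		if err != nil {
			return nil, fmt.Errorf("invalid index %q", hi)
		}
		if first > last || first <= prev || last >= completions {
			return nil, fmt.Errorf("invalid interval %q", part)
		}
		out = append(out, interval{first, last})
		prev = last
	}
	return out, nil
}

func main() {
	ivs, err := parseIndexes("1,3-5,7", 10)
	fmt.Println(ivs, err) // prints [{1 1} {3 5} {7 7}] <nil>
}
```

Keeping the parsed form as intervals rather than an expanded list keeps memory proportional to the string length, which matters when the value approaches the 64KiB limit.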
Moreover, we validate the following constraints for the rules and status.conditions:
- rules
  - whether each rule has at least one of succeededIndexes or succeededCount specified.
  - whether the specified indexes in succeededIndexes and the number of indexes in succeededCount don’t exceed the value of completions.
  - whether Indexed is specified in the completionMode field.
  - whether the size of succeededIndexes is under 64Ki.
  - whether the succeededIndexes field has a valid format.
  - whether the succeededCount field has an absolute number.
  - whether the rules haven’t changed.
  - whether the successPolicies meet succeededCount <= |succeededIndexes|, where |succeededIndexes| means the number of indexes in succeededIndexes.
- status.conditions
  - whether the SuccessCriteriaMet condition isn’t removed when the Job is updated.
  - whether the SuccessCriteriaMet condition isn’t added after the Job already has only the Complete condition.
  - whether the SuccessCriteriaMet condition isn’t added to a NonIndexed Job.
  - whether the Job doesn’t have both Failed and SuccessCriteriaMet conditions.
  - whether the Job doesn’t have both FailureTarget and SuccessCriteriaMet conditions.
  - whether the Job without SuccessPolicy doesn’t have the SuccessCriteriaMet condition.
  - whether the Job with SuccessPolicy doesn’t have only the Complete condition. The Job with SuccessPolicy needs to have both SuccessCriteriaMet and Complete conditions.
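A subset of the rules constraints above can be sketched as follows. This is an illustration under assumptions, not the apiserver's validation code: the rule type, validateRules, and ptr are hypothetical names, the indexes are assumed to be already expanded from the succeededIndexes string, and immutability and format checks are left out.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxRules        = 20        // "At most 20 elements are allowed"
	maxIndexesBytes = 64 * 1024 // size limit on the succeededIndexes string
)

// rule is a simplified stand-in for SuccessPolicyRule. Indexes holds the
// indexes already expanded from SucceededIndexes (parsing not shown).
type rule struct {
	SucceededIndexes *string
	Indexes          []int32
	SucceededCount   *int32
}

// validateRules checks a subset of the constraints listed above.
func validateRules(rules []rule, completionMode string, completions int32) error {
	if completionMode != "Indexed" {
		return errors.New("successPolicy requires completionMode=Indexed")
	}
	if len(rules) > maxRules {
		return fmt.Errorf("at most %d rules are allowed", maxRules)
	}
	for i, r := range rules {
		if r.SucceededIndexes == nil && r.SucceededCount == nil {
			return fmt.Errorf("rules[%d]: set succeededIndexes or succeededCount", i)
		}
		if r.SucceededIndexes != nil {
			if len(*r.SucceededIndexes) > maxIndexesBytes {
				return fmt.Errorf("rules[%d]: succeededIndexes exceeds 64KiB", i)
			}
			for _, idx := range r.Indexes {
				if idx < 0 || idx >= completions {
					return fmt.Errorf("rules[%d]: index %d is outside 0..completions-1", i, idx)
				}
			}
		}
		if c := r.SucceededCount; c != nil {
			if *c <= 0 || *c > completions {
				return fmt.Errorf("rules[%d]: succeededCount must be in 1..completions", i)
			}
			if r.SucceededIndexes != nil && int(*c) > len(r.Indexes) {
				return fmt.Errorf("rules[%d]: succeededCount exceeds |succeededIndexes|", i)
			}
		}
	}
	return nil
}

// ptr returns a pointer to v; a small helper for optional fields.
func ptr[T any](v T) *T { return &v }

func main() {
	ok := rule{SucceededIndexes: ptr("1-3"), Indexes: []int32{1, 2, 3}, SucceededCount: ptr(int32(2))}
	bad := rule{SucceededIndexes: ptr("1-3"), Indexes: []int32{1, 2, 3}, SucceededCount: ptr(int32(5))}
	fmt.Println(validateRules([]rule{ok}, "Indexed", 10))
	fmt.Println(validateRules([]rule{bad}, "Indexed", 10))
}
```

The first call is expected to pass, while the second is rejected because succeededCount (5) is larger than the three indexes in "1-3".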
Evaluation
Every time the pod conditions are updated, the job-controller evaluates the successPolicies, following the rules in order:
- succeededIndexes: the job-controller evaluates .status.completedIndexes to see if a set of indexes is there.
- succeededCount: the job-controller evaluates .status.succeeded to see if the value is succeededCount or more.
After that, the job-controller adds a SuccessCriteriaMet condition instead of a FailureTarget condition to .status.conditions
and the job-controller terminates the lingering pods. At that time, SuccessPolicy is set to the status.reason field.
Note that when the job meets one of successPolicies, other successPolicies are ignored.
Finally, once all pods have terminated, the job-controller adds a Complete condition to .status.conditions.
If no successPolicy is set, the job-controller adds only a Complete condition to the Job after the Job finishes.
Furthermore, the behavior of FailureTarget and SuccessCriteriaMet is similar in that the Job with this condition triggers the termination of lingering pods;
after all pods are terminated, the terminal condition (Failed or Complete) is added:
- FailureTarget is added to the Job matched with FailurePolicy with action=FailJob and triggers the termination of the lingering pods. Then, after the lingering pods are terminated, the Failed condition is added to the Job.
- SuccessCriteriaMet is added to the Job matched with SuccessPolicy and triggers the termination of lingering pods. Then, after the lingering pods are terminated, the Complete condition is added to the Job.
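The in-order rule evaluation described in this section can be sketched as below. This is a simplified model, not the job-controller's code: the rule type and firstMatchingRule are hypothetical names, succeededIndexes is assumed to be already expanded into a slice, and the set of succeeded indexes stands in for both .status.completedIndexes and .status.succeeded.

```go
package main

import "fmt"

// rule is a simplified stand-in for SuccessPolicyRule. Indexes holds the
// expanded succeededIndexes (nil when unset); Count is succeededCount.
type rule struct {
	Indexes []int
	Count   *int
}

// firstMatchingRule returns the position of the first rule satisfied by the
// set of succeeded indexes, or -1 when no rule is met. Later rules are not
// evaluated once a rule matches.
func firstMatchingRule(rules []rule, succeeded map[int]struct{}) int {
	for i, r := range rules {
		considered := len(succeeded) // succeededCount alone counts every succeeded index
		required := 0
		if r.Indexes != nil {
			// Only the indexes listed in the rule are considered.
			considered = 0
			for _, idx := range r.Indexes {
				if _, ok := succeeded[idx]; ok {
					considered++
				}
			}
			required = len(r.Indexes) // succeededIndexes alone requires all listed indexes
		}
		if r.Count != nil {
			required = *r.Count
		}
		if required > 0 && considered >= required {
			return i
		}
	}
	return -1
}

func main() {
	five := 5
	// The two rules from Story 2: leader index 0, or five of the workers 1-9.
	rules := []rule{
		{Indexes: []int{0}},
		{Indexes: []int{1, 2, 3, 4, 5, 6, 7, 8, 9}, Count: &five},
	}
	succeeded := map[int]struct{}{1: {}, 2: {}, 4: {}, 6: {}, 9: {}}
	fmt.Println(firstMatchingRule(rules, succeeded)) // prints 1: five workers succeeded
}
```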
Transition of “status.conditions”
After extending the scope of the SuccessCriteriaMet and FailureTarget conditions
as proposed in The scope of the SuccessCriteriaMet condition
the diagram of transitions looks like below:
stateDiagram-v2
[*] --> Running
Running --> FailureTarget: Exceeded backoffLimit
Running --> FailureTarget: Exceeded activeDeadlineSeconds
Running --> FailureTarget: Matched FailurePolicy with action=FailJob
FailureTarget --> Failed: All pods are terminated
Failed --> [*]
Running --> SuccessCriteriaMet: Matched SuccessPolicy
Running --> SuccessCriteriaMet: Achieved the expected completions
SuccessCriteriaMet --> Complete: All pods are terminated
Complete --> [*]
It means that the job’s .status.conditions follows the following rules:
- The job could have both SuccessCriteriaMet=true and Complete=true conditions.
- The job can’t have both Failed=true and SuccessCriteriaMet=true conditions.
- The job can’t have both FailureTarget=true and SuccessCriteriaMet=true conditions.
- The job can’t have both Failed=true and Complete=true conditions.
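The condition rules above amount to a small table of mutually exclusive pairs. A minimal sketch, assuming a hypothetical checkConditions helper that receives the condition types whose status is true:

```go
package main

import "fmt"

// checkConditions reports the first violated rule for a set of Job
// condition types whose status is true (hypothetical helper).
func checkConditions(trueConds map[string]bool) error {
	// Pairs of conditions that must never be true at the same time.
	conflicts := [][2]string{
		{"Failed", "SuccessCriteriaMet"},
		{"FailureTarget", "SuccessCriteriaMet"},
		{"Failed", "Complete"},
	}
	for _, c := range conflicts {
		if trueConds[c[0]] && trueConds[c[1]] {
			return fmt.Errorf("%s and %s must not both be true", c[0], c[1])
		}
	}
	return nil
}

func main() {
	// Allowed: a Job that met its success policy and then completed.
	fmt.Println(checkConditions(map[string]bool{"SuccessCriteriaMet": true, "Complete": true}))
	// Rejected: success and failure targets set together.
	fmt.Println(checkConditions(map[string]bool{"FailureTarget": true, "SuccessCriteriaMet": true}))
}
```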
The situations where successPolicy conflicts with other terminating policies
The successPolicy has potential conflicts with other terminating policies such as the pod failure policy, backoffLimit, and backoffLimitPerIndex in the following situations:
- when the job meets the successPolicy and some pod failure policies with the FailJob action.
- when the job meets the successPolicy and the number of failed pods exceeds backoffLimit.
- when the job meets the successPolicy and the number of failed pods per index exceeds backoffLimitPerIndex in all indexes.
To avoid the above conflicts, terminating policies are evaluated before successPolicies.
This means that the terminating policies take precedence over the successPolicies
if the Job doesn’t have the FailureTarget or SuccessCriteriaMet conditions yet.
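The precedence described here can be sketched as a single decision function. The nextCondition name and its boolean inputs are hypothetical; the real controller derives them from the pod failure policy, backoffLimit, backoffLimitPerIndex, and the successPolicy rules.

```go
package main

import "fmt"

// nextCondition picks the interim condition for one sync of a Job.
// current is the interim condition already on the Job ("" if none).
func nextCondition(current string, failureHit, successHit bool) string {
	switch {
	case current == "FailureTarget" || current == "SuccessCriteriaMet":
		return current // already decided; the outcome never switches
	case failureHit:
		return "FailureTarget" // terminating policies are evaluated first
	case successHit:
		return "SuccessCriteriaMet"
	}
	return current
}

func main() {
	// Both policies match in the same sync: the terminating policy wins.
	fmt.Println(nextCondition("", true, true)) // prints FailureTarget
	// The Job already met its success policy: a later failure is ignored.
	fmt.Println(nextCondition("SuccessCriteriaMet", true, false)) // prints SuccessCriteriaMet
}
```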
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
Test cases:
- tests for Defaulting and Validating
- verify whether a job has a SuccessCriteriaMet condition if the job meets the successPolicy and some indexes fail.
- verify whether a job has both Complete and SuccessCriteriaMet conditions if the job meets the successPolicy and all pods are terminated
- verify whether a job has a Failed condition if the job meets both the successPolicy and terminating policies in the same reconcile cycle
- k8s.io/kubernetes/pkg/controller/job: 2 February 2024 - 91.5%
- k8s.io/kubernetes/pkg/apis/batch/validation: 2 February 2024 - 98.0%
Integration tests
- Test scenarios:
  - enabling, disabling and re-enabling of the JobSuccessPolicy feature gate: result
  - handling of successPolicy when all indexes succeeded: result
  - jobs_finished_total metric with CompletionsReached is incremented when job does not have successPolicy: result
  - handling of the .spec.successPolicy.rules.succeededIndexes when some indexes remain pending: result
  - handling of the .spec.successPolicy.rules.succeededCount when some indexes remain pending: result
  - handling of successPolicy when some indexes of job with backoffLimitPerIndex fail: result
e2e tests
- Test scenarios:
Graduation Criteria
Alpha
- Feature implemented behind the JobSuccessPolicy feature gate.
- Unit and integration tests passed as designed in TestPlan.
Beta
- E2E tests passed as designed in TestPlan.
- Introduced new CompletionsReached and SuccessPolicy reason labels to the jobs_finished_total metric in Monitoring Requirements.
- Introduced a new CompletionsReached condition reason for the Complete and SuccessCriteriaMet condition types.
- Feature is enabled by default.
- Address all issues reported by users.
GA
- No negative feedback.
- Verify conditions[].reason=[CompletionsReached|SuccessPolicy] for the Job’s Complete condition in all e2e conformance tests (see example)
Upgrade / Downgrade Strategy
- Upgrade
  - JobSuccessPolicy can be used only if the feature gate is enabled.
  - If the feature gate is enabled and a Job doesn’t set a successPolicy, the default values will be applied to the job object.
  - Even if the feature gate is enabled, the Job controller doesn’t update .status.conditions in already finished jobs.
- Downgrade
  - Previously configured values will be ignored, and the job will be marked as completed only when all indexes succeed.
  - The Job controller doesn’t update .status.conditions in already finished jobs.
Version Skew Strategy
The apiserver’s version should be consistent with the kube-controller-manager version, or this feature will not work.
This feature is limited to control plane.
Note that, the kube-apiserver can be in the N+1 skew version relative to the kube-controller-manager as described here . If it’s enabled, jobs with SuccessPolicy set will have it respected. Otherwise, it will be ignored by the job controller.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: JobSuccessPolicy
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager
Does enabling the feature change any default behavior?
No, the default behavior of a job and cronJob stays the same. The newly added field is optional and has to be explicitly set by the user to use this new feature.
Regarding the CronJob, please see more details in The CronJob concurrentPolicy is not affected by JobSuccessPolicy.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, we can disable the JobSuccessPolicy feature gate.
When the feature is disabled, the job controller stops evaluating the successPolicy even if
the .spec.successPolicy is set.
What happens if we reenable the feature if it was previously rolled back?
The Job controller considers the .spec.successPolicy when it updates .status.conditions
only for running Jobs and doesn’t update .status.conditions for already finished jobs.
Are there any tests for feature enablement/disablement?
Yes, we added the “enablement -> disablement -> re-enablement” flow integration tests for the new APIs from the alpha stage here.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Even if the rollout or rollback of the kube-controller-manager fails, already running workloads aren’t impacted. The default behavior will be applied to running workloads.
What specific metrics should inform a rollback?
An increase in the job_sync_duration_seconds metrics can mean that finished jobs
are taking longer to evaluate.
The users should check whether the completed jobs have the appropriate condition, specifically the reason.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
In the alpha stage, the upgrade->downgrade->upgrade testing was added in the integration tests here.
For a manual test of the upgrade and rollback, we can use v1.30.
The upgrade->downgrade->upgrade testing was done manually using the alpha version in v1.30 with the following steps:
- Start the cluster with the JobSuccessPolicy feature gate enabled:
Create a KIND cluster with v1.30 and use the Cluster configuration below to turn this feature on.
kind create cluster --config config.yaml
using config.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobSuccessPolicy": true
nodes:
- role: control-plane
  image: kindest/node:v1.30.0
- role: worker
  image: kindest/node:v1.30.0
Then, create the job using .spec.successPolicy.rules[0].succeededCount=2:
kubectl create -f job.yaml
using job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-success-policy
spec:
  parallelism: 3
  completions: 3
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededCount: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: python:3.12
        command:
        - python3
        - -c
        - |
          import os, sys, time
          index = os.environ.get("JOB_COMPLETION_INDEX")
          sys.exit(0) if index == "0" else time.sleep(300)
          sys.exit(0) if index == "1" else time.sleep(3600)
        imagePullPolicy: IfNotPresent
Wait for the pods to be running and the pod with index=0 to succeed.
- Simulate downgrade by disabling the feature for the API server and control plane.
Then, wait for the pod with index=1 to succeed and
verify that the pod with index=2 is still running and the Job doesn’t have SuccessCriteriaMet.
- Simulate upgrade by enabling the feature for the API server and control plane.
Then, verify that the pod with index=2 is terminated and the Job has SuccessCriteriaMet and Complete conditions.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
We will introduce the new CompletionsReached and SuccessPolicy reason labels to the jobs_finished_total metric,
which indicate the following situations:
- CompletionsReached indicates a job is declared as Complete because the number of succeeded job pods meets .spec.completions.
- SuccessPolicy indicates a job is declared as Complete because the job meets .spec.successPolicy.
As we discussed in this thread, the new CompletionsReached reason label is used to count the successful jobs instead of the existing "" reason label.
How can someone using this feature know that it is working for their instance?
- Job API .status
  - The Job controller will add a SuccessCriteriaMet condition with SuccessPolicy reason to conditions.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
The 99th percentile over a day for Job syncs is <= 15s for a client-side 50 QPS limit.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: job_sync_duration_seconds (existing): can be used to see how much the feature enablement increases the time spent in the sync job
  - Components exposing the metric:
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
No.
Will enabling / using this feature result in any new API calls?
Yes, if the Job meets the SuccessPolicy,
the job-controller must make an additional API call to update the condition with SuccessCriteriaMet.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, it will increase the size of existing API objects only when the .spec.successPolicy is set.
- API type(s): Job
- Estimated increase in size: the .spec.successPolicy.rules.succeededIndexes field is impacted. In the worst case, the size of succeededIndexes can be estimated at about 64KiB (see Risks and Mitigations).
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Yes, the job-controller will consume more CPU and memory to compute the set of indexes from the succeededIndexes.
Especially, if there are many of them (approaching the maximum size of them, 64KiB),
the consumed resources might be non-negligible.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The job controller will declare that the job is “Succeeded” based on the .status.completedIndexes.
So, in this case, the job controller can not correctly evaluate the successPolicy
because the .status.completedIndexes won’t be updated.
What are other known failure modes?
None.
What steps should be taken if SLOs are not being met to determine the problem?
If a successPolicy isn’t respected even though the job doesn’t match other policies such as a pod failure policy and backoffLimit:
- Check reachability between Kubernetes components.
- Consider increasing the logging level to trace when the issues occur.
- Check the job controller’s job_sync_duration_seconds metric to see whether the job controller’s processing duration has increased.
If many requests are rejected or re-queued many times, or the job controller’s processing duration has increased, consider tuning the parameters for APF.
Implementation History
- 2023.06.06: This KEP is ready for review.
- 2023.06.09: API design is updated.
- 2023.10.03: API design is updated.
- 2024.02.07: API is finalized for the alpha stage.
- 2024.03.09: “Criteria” is replaced with “Rules”.
- 2024.06.11: Beta Graduation.
- 2024.07.26: “CompletionsReached” reason is added and new reason labels are added to the “jobs_finished_total” metric.
- 2025.02.07: GA Graduation.
Drawbacks
Adds more complexity to the criteria for terminating a Job.
Alternatives
Relax a validation for the “completions” of the indexed job
Currently, the indexed job is restricted to .spec.completions != nil.
By relaxing the validation, the indexed job can be declared as succeeded when some indexes succeeded,
similar to NonIndexed jobs.
Alternative API Name, “Criteria”
The name criteria would be well suited to express the behavior of the successPolicy compared with rules.
However, we decided to adopt the API name rules to keep consistency with the existing podFailurePolicy.rules.
// SuccessPolicy describes when a Job can be declared as succeeded based on the success of some indexes.
type SuccessPolicy struct {
// Criteria represents the list of alternative criteria for declaring the jobs
// as successful before its completion. Once any of the criteria are met,
// the "SuccessCriteriaMet" condition is added, and the lingering pods are removed.
// The terminal state for such a job has the "Complete" condition.
// Additionally, these criteria are evaluated in order; Once the job meets one of the criteria,
// other criteria are ignored.
//
// +optional
Criteria []SuccessPolicyCriteria
}
// SuccessPolicyCriteria describes a criteria for declaring a Job as succeeded.
// Each criteria must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyCriteria struct{
...
}
...
// These are valid conditions of a job.
const (
// JobSuccessCriteriaMet means the job has met its success criteria.
JobSuccessCriteriaMet JobConditionType = "SuccessCriteriaMet"
...
)
Hold succeededIndexes as []int typed in successPolicy
We can hold the succeededIndexes, which determine whether a job can be declared as succeeded, as an []int.
// SuccessPolicyRule describes rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
// Specifies a set of required indexes.
// The job is declared as succeeded if a set of indexes are succeeded.
// The list of indexes must come within 0 to `.spec.completions` and
// must not contain duplicates. At least one element is required.
//
// +optional
SucceededIndexes []int
...
}
However, if we allow users to set all succeededIndexes (0-10^5) in .spec.successPolicy.rules.succeededIndexes
and don’t limit the list size, there are cases where the size of a job object is too big.
In the worst case, all allowed indexes (0-10^5) are added to succeededIndexes, and the size of succeededIndexes
is SUM[9*10^n*(n+1)]+2+6≈5.6656MiB, where:
- n starts from 0 and goes up to 5.
- 1 of n+1 means a separator that indicates ,.
- 2 is the sum of the index 0 and ,.
- 6 is the size of indexes 10^5.
So, if we select this alternative API design, we need to limit the size of succeededIndexes.
Acceptable percentage of total succeeded indexes in the succeededCount field
We can accept a percentage of total succeeded indexes in the succeededCount field for the job with autoscaling semantics.
However, there is no effective use case for typical stories using elastic horovod or PyTorch elastic training,
as all pods must be completed.
// SuccessPolicyRule describes rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
...
// Specifies the required number of indexes when .spec.completionMode =
// "Indexed".
// Value can be an absolute number (ex: 5) or an absolute percentage of total indexes
// when the job can be declared as succeeded (ex: 50%).
// The absolute number is calculated from the percentage by rounding up.
//
// +optional
SucceededCount *intstr.IntOrString
...
}
Match succeededIndexes using CEL
We can accept a set of required indexes represented by CEL in the succeededIndexes field.
However, it is difficult to represent the set of indexes without regularity.
// SuccessPolicyRule describes rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
...
// Specifies a set of required indexes using CEL.
// For example, if the completed indexes are only the last index, they are
// represented as (job.completions -1).
//
// +optional
SucceededIndexesMatchExpression *string
...
}
Use JobSet instead of Indexed Job
The JobSet is a custom resource for managing a group of Jobs as a unit.
Some of the stories are better served using JobSet. Specifically, cases that make assumptions about what an index represents could be mapped as jobs in JobSet, with names representing the semantics of those different groups of pods.
However, both Job level and JobSet level successPolicies would be valuable since there are some cases in which we want to launch a Job by a single podTemplate.
Possibility for the lingering pods to continue running after the job meets the successPolicy
In some cases, a batch workload can be declared as succeeded once a number of pods succeed, while the lingering pods keep running.
So, it is possible to introduce a new field, whenCriteriaAchieved, to make the action for the lingering pods configurable.
Additional Story
As a simulation researcher/engineer for fluid dynamics/biochemistry,
I want to mark the job as successful and let the lingering pods continue once some indexes succeed,
because I set the minimum number of samples required in .successPolicy.rules.succeededCount and
perform the same simulation in all indexes.
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededCount: 5
      succeededIndexes: "1-9"
      whenCriteriaAchieved: continue
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./sample"]
Job API
// SuccessPolicyRule describes a rule for declaring a Job as succeeded based on succeeded indexes.
type SuccessPolicyRule struct {
	...
	// Specifies the action to be taken on the pods that have not yet finished
	// (succeeded or failed) when the job achieves the requirements.
	// Possible values are:
	// - Continue indicates that no pods would be terminated.
	//   When a lingering pod fails, the terminating policies (backoffLimit,
	//   backoffLimitPerIndex, podFailurePolicy, etc.) are ignored and the pod isn't re-created.
	// - ContinueWithRecreations indicates that no pods would be terminated.
	//   When a lingering pod fails, the terminating policies (backoffLimit,
	//   backoffLimitPerIndex, podFailurePolicy, etc.) are followed and the pod is re-created.
	// - Terminate indicates that unfinished pods would be terminated.
	//
	// Default value is Terminate.
	WhenCriteriaAchieved WhenCriteriaAchievedSuccessPolicy
}
// WhenCriteriaAchievedSuccessPolicy specifies the action to be taken on the pods
// when the job achieves the requirements.
// +enum
type WhenCriteriaAchievedSuccessPolicy string
const (
	// No pods would be terminated when the job meets the successPolicy.
	// When a lingering pod fails, the terminating policies (backoffLimit,
	// backoffLimitPerIndex, podFailurePolicy, etc.) are ignored and the pod isn't re-created.
	ContinueWhenCriteriaAchievedSuccessPolicy WhenCriteriaAchievedSuccessPolicy = "Continue"
	// No pods would be terminated when the job meets the successPolicy.
	// When a lingering pod fails, the terminating policies (backoffLimit,
	// backoffLimitPerIndex, podFailurePolicy, etc.) are followed and the pod is re-created.
	ContinueWithRecreationsWhenCriteriaAchievedSuccessPolicy WhenCriteriaAchievedSuccessPolicy = "ContinueWithRecreations"
	// Unfinished pods would be terminated when the job meets the successPolicy.
	TerminateWhenCriteriaAchievedSuccessPolicy WhenCriteriaAchievedSuccessPolicy = "Terminate"
)
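The three values documented in the comments above can be summarized as a small decision table. This is a sketch only: the type lingeringPodActions and the function actionsFor are hypothetical names introduced for illustration, and the shortened constant names stand in for the longer ones in the proposed API.

```go
package main

import "fmt"

// WhenCriteriaAchieved is a shortened stand-in for the proposed
// WhenCriteriaAchievedSuccessPolicy type.
type WhenCriteriaAchieved string

const (
	Continue                WhenCriteriaAchieved = "Continue"
	ContinueWithRecreations WhenCriteriaAchieved = "ContinueWithRecreations"
	Terminate               WhenCriteriaAchieved = "Terminate"
)

// lingeringPodActions is a hypothetical summary of what happens to pods
// that are still running once the job meets the successPolicy.
type lingeringPodActions struct {
	// TerminateRunning is true when the running pods are terminated.
	TerminateRunning bool
	// RecreateOnFailure is true when a lingering pod that fails later is
	// re-created according to the terminating policies.
	RecreateOnFailure bool
}

// actionsFor maps a policy value to the behavior described in the API comments.
func actionsFor(policy WhenCriteriaAchieved) lingeringPodActions {
	switch policy {
	case Continue:
		return lingeringPodActions{TerminateRunning: false, RecreateOnFailure: false}
	case ContinueWithRecreations:
		return lingeringPodActions{TerminateRunning: false, RecreateOnFailure: true}
	default:
		// Terminate is the documented default.
		return lingeringPodActions{TerminateRunning: true, RecreateOnFailure: false}
	}
}

func main() {
	for _, p := range []WhenCriteriaAchieved{Continue, ContinueWithRecreations, Terminate, ""} {
		fmt.Printf("%q -> %+v\n", p, actionsFor(p))
	}
}
```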
Evaluation
We need more discussion on whether to support continue and continueWithRecreations in whenCriteriaAchieved.
The main discussion points are:
- After the job meets any successPolicy with whenCriteriaAchieved=continue and the job gets the SuccessCriteriaMet condition, what should happen when the lingering pods fail? We may be able to select one of the actions "a: Failed pods follow terminating policies like backoffLimit and podFailurePolicy" or "b: Failed pods are terminated immediately, and the terminating policies are ignored". As an alternative, we could select action b for whenCriteriaAchieved=continue and then introduce a new whenCriteriaAchieved value, continueWithRecreations, for action a.
  - terminate: The currently supported behavior. All pods would be terminated by the job controller.
  - continue: This behavior isn't supported in the alpha stage. The lingering pods wouldn't be terminated when the job meets the successPolicy. Additionally, when the lingering pods fail, the terminating policies are ignored and the pods are not re-created.
  - continueWithRecreations: This behavior isn't supported in the alpha stage. The lingering pods wouldn't be terminated when the job meets the successPolicy. Additionally, when the lingering pods fail, the pods are re-created following the terminating policies.
Transition of “status.conditions”
When a job with whenCriteriaAchieved=continue is submitted, the job's status.conditions transition as follows.
Note that the Job doesn't have an actual Running condition in status.conditions.
stateDiagram-v2
[*] --> Running
Running --> Failed: Exceeded backoffLimit
Running --> FailureTarget: Matched FailurePolicy with action=FailJob
FailureTarget --> Failed: All pods are terminated
Failed --> [*]
Running --> SuccessCriteriaMet: Matched SuccessPolicy
SuccessCriteriaMet --> SuccessCriteriaMet: Wait for all pods to be finalized
SuccessCriteriaMet --> Complete: All pods are finished
Complete --> [*]
Possibility for introducing a new CronJob concurrentPolicy, “ForbidUntilJobSuccessful”
It is potentially possible to add a new CronJob concurrentPolicy, ForbidUntilJobSuccessful,
with which the CronJob creates Jobs based on the Job’s SuccessCriteriaMet condition.
type ConcurrentPolicy string
const (
	...
	// ForbidUntilJobSuccessful means that the CronJob creates Jobs based on Job's SuccessCriteriaMet condition.
	ForbidUntilJobSuccessful ConcurrentPolicy = "ForbidUntilJobSuccessful"
)
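The text does not spell out how the CronJob controller would use the condition. One possible reading, sketched below as an assumption, is that a new Job is created only when every Job from earlier runs already has SuccessCriteriaMet=True, even if its lingering pods are still running. The types and the function mayCreateNextJob are hypothetical, and the treatment of failed Jobs is deliberately left out.

```go
package main

import "fmt"

// jobCondition mirrors only the fields of a Job condition that this sketch needs.
type jobCondition struct {
	Type   string
	Status string
}

// successCriteriaMet reports whether the conditions contain SuccessCriteriaMet=True.
func successCriteriaMet(conditions []jobCondition) bool {
	for _, c := range conditions {
		if c.Type == "SuccessCriteriaMet" && c.Status == "True" {
			return true
		}
	}
	return false
}

// mayCreateNextJob is a hypothetical check for ForbidUntilJobSuccessful:
// it allows a new Job only when every Job from earlier runs already has
// the SuccessCriteriaMet condition. Each element of previousJobs holds the
// conditions of one Job.
func mayCreateNextJob(previousJobs [][]jobCondition) bool {
	for _, conditions := range previousJobs {
		if !successCriteriaMet(conditions) {
			return false
		}
	}
	return true
}

func main() {
	stillRunning := []jobCondition{}
	criteriaMet := []jobCondition{{Type: "SuccessCriteriaMet", Status: "True"}}

	fmt.Println(mayCreateNextJob([][]jobCondition{stillRunning}))
	fmt.Println(mayCreateNextJob([][]jobCondition{criteriaMet}))
}
```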
Possibility for the configurable reason for the “SuccessCriteriaMet” condition
It is possible to add a configurable reason for the “SuccessCriteriaMet” condition. The machine-readable reason would be useful when external programs like custom controllers implement a mechanism so that the CustomJob API can change its actions based on the reason, similar to the PodFailurePolicyReason (KEP-4443).
Additional Story
As a developer of a CustomJob API built on the Job API, like JobSet, I want to implement the reconcile logic so that the controller can change its actions based on the reason why the job succeeded.
So, it should be included in the reason field instead of the message field, since the reason field should be machine-readable.
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: "0"
      setSuccessCriteriaMetReason: "LeaderSucceeded"
    - succeededIndexes: "1-9"
      setSuccessCriteriaMetReason: "WorkersSucceeded"
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./sample"]
status:
  conditions:
  - type: "SuccessCriteriaMet"
    status: "True"
    reason: "LeaderSucceeded"
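To illustrate why a machine-readable reason helps, here is a minimal sketch of how a custom controller might branch on it. The condition struct carries only the fields used here, and the action names returned by nextAction are invented for illustration; only the reasons "LeaderSucceeded" and "WorkersSucceeded" come from the manifest above.

```go
package main

import "fmt"

// condition mirrors only the fields of a Job condition that this sketch needs.
type condition struct {
	Type   string
	Status string
	Reason string
}

// successReason returns the reason of the SuccessCriteriaMet=True condition, if any.
func successReason(conditions []condition) (string, bool) {
	for _, c := range conditions {
		if c.Type == "SuccessCriteriaMet" && c.Status == "True" {
			return c.Reason, true
		}
	}
	return "", false
}

// nextAction picks a hypothetical follow-up action based on the reason.
// The returned action names are placeholders, not part of any API.
func nextAction(conditions []condition) string {
	reason, ok := successReason(conditions)
	if !ok {
		return "wait"
	}
	switch reason {
	case "LeaderSucceeded":
		return "collect-leader-output"
	case "WorkersSucceeded":
		return "aggregate-worker-results"
	default:
		return "no-op"
	}
}

func main() {
	conditions := []condition{
		{Type: "SuccessCriteriaMet", Status: "True", Reason: "LeaderSucceeded"},
	}
	fmt.Println(nextAction(conditions))
}
```

Because the reason is compared as a whole string, the controller needs no parsing logic.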
Job API
Either of the following API designs is possible, but the first, setSuccessCriteriaMetReason, was preferred
during the JobSuccessPolicy alpha stage discussions, because the second, SuccessCriteriaMetReasonSuffix, would decrease machine-readability:
consumers would need to parse the reason by splitting on the separator "As".
Furthermore, making the reason carry a merged value is undesirable, and
reduced machine-readability would undermine the value of having this reason in the status.reason field instead of the status.message field.
Set the entire reason
// SuccessPolicyRule describes a rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
	// SetSuccessCriteriaMetReason specifies the CamelCase reason for the "SuccessCriteriaMet" condition.
	// Once the job meets this rule, the specified reason is set to the "status.reason".
	// The default value is "JobSuccessPolicy".
	//
	// +optional
	SetSuccessCriteriaMetReason *string
}
Set the suffix of the reason
// SuccessPolicyRule describes a rule for declaring a Job as succeeded.
// Each rule must have at least one of the "succeededIndexes" or "succeededCount" specified.
type SuccessPolicyRule struct {
	// SuccessCriteriaMetReasonSuffix specifies the CamelCase suffix of the reason for the "SuccessCriteriaMet" condition.
	// Once the job meets this rule, "JobSuccessPolicy" and the specified suffix are combined with "As".
	// For example, if the specified suffix is "LeaderSucceeded", it is represented as "JobSuccessPolicyAsLeaderSucceeded".
	//
	// +optional
	SuccessCriteriaMetReasonSuffix *string
}
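The parsing burden that the suffix design places on consumers can be sketched as follows. The function reasonSuffix is a hypothetical helper; it assumes the combined form "JobSuccessPolicyAs" followed by the suffix, as described in the field comment.

```go
package main

import (
	"fmt"
	"strings"
)

// reasonSuffix extracts the user-specified suffix from a combined reason
// such as "JobSuccessPolicyAsLeaderSucceeded". It returns false when the
// reason does not have the combined form. This helper is hypothetical.
func reasonSuffix(reason string) (string, bool) {
	const prefix = "JobSuccessPolicyAs"
	if !strings.HasPrefix(reason, prefix) || len(reason) == len(prefix) {
		return "", false
	}
	return strings.TrimPrefix(reason, prefix), true
}

func main() {
	for _, r := range []string{"JobSuccessPolicyAsLeaderSucceeded", "JobSuccessPolicy"} {
		suffix, ok := reasonSuffix(r)
		fmt.Printf("%s -> %q %v\n", r, suffix, ok)
	}
}
```

Every consumer would have to repeat this kind of string handling, whereas a reason set in its entirety can be compared directly.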