KEP-3850: Backoff Limits Per Index For Indexed Jobs
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- backoffLimitPerIndex inside new runPolicy
- Mark Job Complete if some indexes failed
- Support backoffLimitPerIndex when restartPolicy=OnFailure
- Mutually exclusive backoffLimit and backoffLimitPerIndex
- Use bool field
- Use enum field
- Global exponential backoff delay
- Exponential backoff delay with in-memory tracking
- Alternative ways to support high number of completions
- Skip uncountedTerminatedPods when backoffLimitPerIndex is used
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP extends the Job API to support indexed jobs where the backoff limit is per index, and the Job can continue execution despite some of its indexes failing.
Motivation
Currently, the indexes of an indexed job share a single backoff limit. When the job reaches this shared backoff limit, the job controller marks the entire job as failed, and the resources are cleaned up, including indexes that have yet to run to completion.
As a result, the current implementation does not cover the situation where the workload is truly embarrassingly parallel and each index is independent of other indexes.
For instance, if indexed jobs were used as the basis for a suite of long-running integration tests, then each test run would only be able to find a single test failure.
Other popular batch services like AWS Batch use a separate backoff limit for each index, showing that this is a common use case that should be supported by Kubernetes.
Goals
- allow counting failures towards the backoff limit independently for each index,
- allow the Job to continue execution despite some of its indexes failing,
- allow failing an index (stopping pod recreation for the index) using the pod failure policy.
Non-Goals
- allow controlling the number of retries per index when the pod’s restartPolicy=OnFailure (see Support backoffLimitPerIndex when restartPolicy=OnFailure).
Proposal
We propose a new policy for running Indexed Jobs in which the backoff limit controls the number of retries per index. When the new policy is used all indexes execute until their success or failure. We also propose a new API field to control the number of failed indexes.
Additionally, we propose a new action in PodFailurePolicy , called FailIndex, to short-circuit failing of the index before the backoff limit per index is reached.
User Stories (Optional)
Story 1
As a CI/CD platform administrator, I want to use Indexed Jobs to run suites of integration tests, one suite per index. A failure of one suite should not interrupt running of other suites. Additionally, I would like to be able to control the maximal number of retries per index.
The following Job configuration could satisfy my use case:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]
In this case, we run 10 indexes representing the test suites. We allow for one failure per index.
Story 2
As the CI/CD platform administrator from Story 1, I want to be able to control the failures with the pod failure policy. In particular, I want to use the pod failure policy to avoid restarts of some indexes, based on exit codes.
The following Job configuration could satisfy my use case:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]
  podFailurePolicy:
    rules:
    - action: FailIndex
      onExitCodes:
        operator: In
        values: [42]
Story 3
As the CI/CD platform administrator from Story 1, I want to be able to fail the entire Job if the number of failed indexes exceeds 50%. I want to do this in order to cut down the cost of running the tests in case of compilation issues that would result in all tests failing.
The following Job configuration could satisfy my use case:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimitPerIndex: 1
  maxFailedIndexes: 5
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: job-container
        image: job-image
        command: ["./tests-runner"]
Notes/Constraints/Caveats (Optional)
Performance benchmark
We assess the performance of the Beta implementation in comparison to
indexed jobs with regular backoffLimit using the two integration tests
(BenchmarkLargeIndexedJob and BenchmarkLargeFailureHandling)
in the PR #121393.
In the BenchmarkLargeIndexedJob test, the measured part creates N pods,
marks them as Succeeded, and waits for the Job status to be updated accordingly.
This is a sanity test for the backoffLimitPerIndex, to demonstrate that the
new branches of code don’t have significant performance impact.
Here are the results (lines re-ordered from smallest to the largest N):
go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeIndexedJob" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10-48 1 3034342185 ns/op 14391160 B/op 164352 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=100-48 1 3050613253 ns/op 111100464 B/op 1324757 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=1000-48 1 19382609963 ns/op 1133953568 B/op 13079710 allocs/op
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10_000-48 1 222696805443 ns/op 11610639800 B/op 131946944 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10-48 1 3025650312 ns/op 14757368 B/op 166282 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=100-48 1 3045479158 ns/op 114324072 B/op 1345524 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=1000-48 1 19384632203 ns/op 1161105080 B/op 13216319 allocs/op
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10_000-48 1 223635439324 ns/op 11911685592 B/op 133325939 allocs/op
In the BenchmarkLargeFailureHandling test, the measured part of the test marks
N running pods as Failed and waits for the job status to be updated accordingly.
In order to make the test comparable for regular indexed jobs and jobs with
backoffLimitPerIndex we set the max backoff delay due to pod failures to 10ms.
Here are the results (lines re-ordered from smallest to the largest N):
go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeFailureHandling" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10-48 1 2021272442 ns/op 13813736 B/op 165760 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=100-48 1 3036166978 ns/op 109866704 B/op 1310651 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=1000-48 1 21049273834 ns/op 1074301144 B/op 12832549 allocs/op
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10_000-48 1 202327947010 ns/op 10926201704 B/op 131423197 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10-48 1 3016501067 ns/op 14676224 B/op 175301 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=100-48 1 3038839798 ns/op 112090728 B/op 1323948 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=1000-48 1 21057643253 ns/op 1096364096 B/op 13008669 allocs/op
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10_000-48 1 202373728278 ns/op 11185209520 B/op 132578325 allocs/op
The above results show that jobs using .spec.backoffLimitPerIndex are
about 1% slower than regular indexed jobs. In practice the difference
is expected to be covered by the exponential backoff delay due to pod failures.
Risks and Mitigations
The Job object too big
With the new field .status.failedIndexes the Job object can be significantly
larger as every failed index is recorded in the field.
Note that a similar risk is also present for Indexed Jobs, regarding the
already existing .status.completedIndexes field (see
Indexed Jobs can break with high number of parallelism or completions).
In order to mitigate this risk we first constrain the .spec.maxFailedIndexes
to 10^5, which is the same limit as for .spec.parallelism currently.
Second, we validate that the fields are inside one of the following sets of scalability limits:
1. .spec.completions<=10^5, .spec.parallelism<=10^5, .spec.maxFailedIndexes<=10^5
2. .spec.completions unlimited (<= max int32 ~2*10^9), .spec.parallelism<=10^4, .spec.maxFailedIndexes<=10^4
In (1.), in the worst case scenario, every index is either present
in completedIndexes or failedIndexes, but not in both. Thus the total
sum of both fields is limited by (5+1)*10^5=0.572Mi, where:
- 5 is the maximal number of digits in the indexes,
- 1 is for separation character,
- 10^5 is the total number of listed indexes.
In (2.) the worst case scenario for the completedIndexes field is when every
third index is not in the field, because it corresponds to either a failed or
a hanging index, so it is a “gap”. Then, between every gap we have two indexes
listed. Thus, the size of the completedIndexes field is limited
by: (10+1)*2*(10^4+10^4)=0.42Mi, where:
- 10 is the maximal number of digits in the indexes
- 1 is for the separation character
- 2*(10^4+10^4) is the number of indexes explicitly listed in the field - two indexes per gap.
The size of the failedIndexes field is limited by: (10+1)*10^4=0.105Mi, where:
- 10 is the maximal number of digits in the indexes,
- 1 is for the separation character
- 10^4 is the maximal number of indexes present in the field.
Thus, the size of both fields is capped at 0.572Mi for the limits in (1.) and
0.525Mi for the limits in (2.).
For comparison, before the introduction of .status.failedIndexes, the max
size of the .status.completedIndexes was limited by (5+1)*10^5*2/3=0.382Mi in
the (1.) case, and (10+1)*2*10^4=0.21Mi in the (2.) case. This means an increase
of 0.19Mi.
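The arithmetic behind these bounds can be reproduced with a short Go program. This is only an illustrative sketch: the helper names `mib` and `listBytes` are hypothetical, and the size model (one byte per digit plus one separator byte per listed index) is the same approximation used in the text above.

```go
package main

import "fmt"

// mib converts a byte count to mebibytes (1 Mi = 2^20 bytes).
func mib(bytes int) float64 {
	return float64(bytes) / (1 << 20)
}

// listBytes estimates the size of an index list as
// (max digits per index + 1 separator) * number of listed indexes.
func listBytes(maxDigits, listed int) int {
	return (maxDigits + 1) * listed
}

func main() {
	// Limits (1.): completions <= 10^5, so an index has at most 5 digits.
	both1 := listBytes(5, 100000)

	// Limits (2.): indexes up to max int32 have at most 10 digits.
	completed2 := listBytes(10, 2*(10000+10000))
	failed2 := listBytes(10, 10000)

	fmt.Printf("(1.) both fields:      %.3f Mi\n", mib(both1))
	fmt.Printf("(2.) completedIndexes: %.3f Mi\n", mib(completed2))
	fmt.Printf("(2.) failedIndexes:    %.3f Mi\n", mib(failed2))
	fmt.Printf("(2.) both fields:      %.3f Mi\n", mib(completed2+failed2))
}
```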
The values of the limits are aligned with the values for the soft limits proposed
as a fix for regular indexed jobs (see here).
However, in case when backoffLimitPerIndex is used we propose these limits
to be hard.
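As a rough illustration of how the two sets of limits could be enforced as hard limits, here is a minimal Go sketch. The function `validateScalabilityLimits` and its error messages are hypothetical and do not mirror the actual kube-apiserver validation code; the sketch also assumes non-negative inputs and only encodes limits (1.) and (2.) plus the rule that maxFailedIndexes may not exceed completions.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	limitHigh = 100000 // 10^5
	limitLow  = 10000  // 10^4
)

// validateScalabilityLimits is a hypothetical helper that accepts a Job
// configuration if it fits limits (1.) or limits (2.).
// maxFailedIndexes is nil when the field is not set.
func validateScalabilityLimits(completions, parallelism int32, maxFailedIndexes *int32) error {
	if maxFailedIndexes != nil && *maxFailedIndexes > completions {
		return errors.New("maxFailedIndexes must not exceed completions")
	}
	if completions <= limitHigh {
		// Limits (1.): maxFailedIndexes <= 10^5 already follows from the check above.
		if parallelism > limitHigh {
			return errors.New("parallelism must be <= 10^5")
		}
		return nil
	}
	// Limits (2.): completions is above 10^5.
	if parallelism > limitLow {
		return errors.New("parallelism must be <= 10^4 when completions > 10^5")
	}
	if maxFailedIndexes == nil {
		return errors.New("maxFailedIndexes is required when completions > 10^5")
	}
	if *maxFailedIndexes > limitLow {
		return errors.New("maxFailedIndexes must be <= 10^4 when completions > 10^5")
	}
	return nil
}

func main() {
	maxFailed := int32(5000)
	fmt.Println(validateScalabilityLimits(1000000, 10000, &maxFailed))
	fmt.Println(validateScalabilityLimits(1000000, 10000, nil))
}
```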
We believe that the scalability limits should be enough for most Job use cases. For workloads requiring larger jobs, users should be able to create multiple Jobs, orchestrated by JobSet.
Exponential backoff delay issue
Currently, a pod is recreated by the Job controller with exponential backoff delay (10s, 20s, 40s …), counted from the last failure time.
One complication is that the last failure time for failed pods may increase with
time, as it falls back to now in some cases (see in code).
Thus, there is a risk that due to the presence
of pods hitting the fallback the last failure time is continuously bumped,
thus shifting the time to recreate the pod.
This risk is present both when computing the exponential backoff delay globally (as for regular indexed Jobs) and per index, as proposed in this KEP (see Exponential backoff delay per index).
In order to mitigate this risk currently the time of last failure is recorded
in-memory (globally for all pods within a Job). And a new failed pod may bump
it only until it is added to the uncountedTerminatedPods structure.
However, tracking the last failure time per index might be costly for memory consumption (see Exponential backoff delay with in-memory tracking ).
Thus, in order to mitigate this risk we propose to compute the finish time for
a pod as the first available value of the following (avoiding the ever-increasing
fallback to now):
1. max finishAt of all containers, if specified for all containers
2. LastTransitionTime for the Ready=False condition
3. deletionTimestamp - deletionGracePeriodSeconds if deletionTimestamp is set
Here (3.) is used to mark the moment of deletion, which is used to approximate
the current behavior. (2.) is used when Kubelet loses track of one of its containers;
the Ready=False condition is set by Kubelet when transitioning a pod to the Failed
phase: https://github.com/kubernetes/kubernetes/blob/release-1.27/pkg/kubelet/status/status_manager.go#L1060-L1068.
When none of the above conditions is satisfied to compute the finish time we
fall back to the pod’s creation time.
This fix can be considered a preparatory PR before the KEP, as to some extent it solves the preexisting issue.
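The fallback chain for the pod finish time described above could look roughly like the following Go sketch. The `podInfo` and `containerInfo` types are hypothetical, simplified stand-ins for the Pod API and are not the real Kubernetes structs; only the order of the fallbacks is taken from the text.

```go
package main

import (
	"fmt"
	"time"
)

// containerInfo is a hypothetical, simplified view of a container status.
type containerInfo struct {
	FinishAt *time.Time // nil when no finish time is recorded
}

// podInfo is a hypothetical, simplified view of a Pod.
type podInfo struct {
	CreationTime         time.Time
	Containers           []containerInfo
	ReadyFalseTransition *time.Time // LastTransitionTime of the Ready=False condition
	DeletionTimestamp    *time.Time
	DeletionGracePeriod  time.Duration
}

// finishTime returns the first available value from the fallback chain,
// never falling back to the current time.
func finishTime(p podInfo) time.Time {
	// (1.) max finish time, only if it is recorded for all containers.
	if len(p.Containers) > 0 {
		var latest time.Time
		all := true
		for _, c := range p.Containers {
			if c.FinishAt == nil {
				all = false
				break
			}
			if c.FinishAt.After(latest) {
				latest = *c.FinishAt
			}
		}
		if all {
			return latest
		}
	}
	// (2.) transition time of the Ready=False condition.
	if p.ReadyFalseTransition != nil {
		return *p.ReadyFalseTransition
	}
	// (3.) approximate moment of deletion.
	if p.DeletionTimestamp != nil {
		return p.DeletionTimestamp.Add(-p.DeletionGracePeriod)
	}
	// Last resort: the pod's creation time.
	return p.CreationTime
}

func main() {
	created := time.Date(2023, 4, 27, 12, 0, 0, 0, time.UTC)
	deleted := created.Add(5 * time.Minute)
	p := podInfo{
		CreationTime:        created,
		DeletionTimestamp:   &deleted,
		DeletionGracePeriod: 30 * time.Second,
	}
	fmt.Println(finishTime(p))
}
```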
Too fast Job status updates
In this KEP the Job controller needs to keep updating the new status field
.status.failedIndexes to reflect the current status of the Job. This can raise
concerns of overwhelming the API server with status updates.
First, observe that the new field does not entail additional Job status updates.
When a pod terminates (either failure or success), it triggers Job status update
to increment the status.failed or .status.succeeded counter fields. These
updates are also used to update the pre-existing status.completedIndexes
field, and the new status.failedIndexes field.
Second, in order to mitigate this risk there is already a mechanism present in the Job controller, to bulk Job status updates per Job.
The way the mechanism works is that Job controller maintains a queue of syncJob
invocations per job (see in code).
New items are added to the queue with a delay (1s for pod events, such as:
delete, add, update). The delay allows for deduplication of the sync per Job.
One place where a new item is added to the queue, specific to this KEP, is when the exponential backoff delay hasn’t yet elapsed for any index (which would allow pod recreation); in that case we requeue the next Job status update. The delay is computed as the minimum of the delays computed for all indexes requiring pod recreation, but not less than 1s.
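The requeue delay rule from the last paragraph (the minimum over all per-index delays, with a floor of 1s) can be sketched in Go as follows. The function name `requeueDelay` and the constant `minRequeueDelay` are hypothetical and are not taken from the Job controller code.

```go
package main

import (
	"fmt"
	"time"
)

// minRequeueDelay is the 1s floor mentioned in the text.
const minRequeueDelay = time.Second

// requeueDelay takes the remaining backoff delays of all indexes that
// require pod recreation and returns the delay for the next sync.
// The boolean is false when no index is waiting for recreation.
func requeueDelay(remaining ...time.Duration) (time.Duration, bool) {
	if len(remaining) == 0 {
		return 0, false
	}
	shortest := remaining[0]
	for _, d := range remaining[1:] {
		if d < shortest {
			shortest = d
		}
	}
	if shortest < minRequeueDelay {
		shortest = minRequeueDelay
	}
	return shortest, true
}

func main() {
	d, ok := requeueDelay(4*time.Second, 2500*time.Millisecond, 20*time.Second)
	fmt.Println(d, ok)
}
```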
Design Details
We introduce a new Job API field, called .spec.backoffLimitPerIndex.
When set it limits the number of retries, counted independently for all indexes.
Additionally, we propose the .spec.maxFailedIndexes to control
the maximal number of failed indexes. Once the number is exceeded the entire
Job is marked Failed and its execution is terminated.
We also propose to extend the PodFailurePolicy with a new action, called
FailIndex to allow an index to fail fast before reaching the backoff limit
per index.
Job API
// PodFailurePolicyAction specifies how a Pod failure is handled.
// +enum
type PodFailurePolicyAction string

const (
	// This is an action which might be taken on a pod failure - mark the
	// Job's index as failed to avoid restarts within this index. This action
	// can only be used when backoffLimitPerIndex is set.
	PodFailurePolicyActionFailIndex PodFailurePolicyAction = "FailIndex"
	...
)

...

// JobSpec describes how the job execution will look like.
type JobSpec struct {
	...
	// Specifies the limit for the number of retries within an
	// index before marking this index as failed. When enabled the number of
	// failures per index is kept in the pod's
	// batch.kubernetes.io/job-index-failure-count annotation. It can only
	// be set when Job's completionMode=Indexed, and the Pod's restart
	// policy is Never. The field is immutable.
	// +optional
	BackoffLimitPerIndex *int32

	// Specifies the maximal number of failed indexes before marking the Job as
	// failed, when backoffLimitPerIndex is set. Once the number of failed
	// indexes exceeds this number the entire Job is marked as Failed and its
	// execution is terminated. When left as null the job continues execution of
	// all of its indexes and is marked with the `Complete` Job condition.
	// It can only be specified when backoffLimitPerIndex is set.
	// It can be null or up to completions. It is required and must be
	// less than or equal to 10^4 when completions is greater than 10^5.
	// +optional
	MaxFailedIndexes *int32
	...
}

type JobStatus struct {
	...
	// FailedIndexes holds the failed indexes when backoffLimitPerIndex is set.
	// The indexes are represented in the text format analogous as for the
	// `completedIndexes` field, i.e. they are kept as decimal integers
	// separated by commas. The numbers are listed in increasing order. Three or
	// more consecutive numbers are compressed and represented by the first and
	// last element of the series, separated by a hyphen.
	// For example, if the failed indexes are 1, 3, 4, 5 and 7, they are
	// represented as "1,3-5,7".
	// +optional
	FailedIndexes *string
}
Note that the PodFailurePolicyAction type is already defined in master with
three possible enum values: Ignore, FailJob and Count (see here).
We allow both .spec.backoffLimit and .spec.backoffLimitPerIndex to be specified on the same Job.
This allows for a controlled downgrade. Also, when .spec.backoffLimitPerIndex
is specified, then we default .spec.backoffLimit to max int32 value. This way
we ensure old clients of the API wouldn’t break when reading or trying to modify
the .spec.backoffLimit that has nil value.
Tracking the number of failures per index
In order to determine if the backoff limit per index is exceeded we keep
track of the number of failures per index. For this purpose we use the Pod
annotation, batch.kubernetes.io/job-index-failure-count, which holds the value
of the number of pod failures for a given index. It is set to 0 for the first
pod created for a given index.
When the Job controller sees a failed pod corresponding to a given index, and the
value of the annotation batch.kubernetes.io/job-index-failure-count is greater
than or equal to the configured backoff limit per index, then the index is marked
as failed and added to .status.failedIndexes.
When Job controller creates replacement pods for failed pods for a given
index it checks if the index isn’t finished yet (it is not in
.status.failedIndexes nor .status.completedIndexes).
Then, if x is the highest batch.kubernetes.io/job-index-failure-count
for the index, the newly created pod will have the annotation set to x+1.
An exception is when the newly failed pod matches the Ignore action in pod
failure policy. In this case the replacement pod does not increment the
value in the annotation.
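The annotation-based bookkeeping described above can be illustrated with a small Go sketch. The helper names (`failureCount`, `indexFailed`, `replacementFailureCount`) are hypothetical and the annotations are modeled as a plain map; treating a missing or malformed annotation as 0 is an assumption of this sketch, not something the KEP specifies.

```go
package main

import (
	"fmt"
	"strconv"
)

const failureCountAnnotation = "batch.kubernetes.io/job-index-failure-count"

// failureCount reads the failure count from a pod's annotations.
// Assumption: a missing or malformed value is treated as 0.
func failureCount(annotations map[string]string) int32 {
	v, err := strconv.ParseInt(annotations[failureCountAnnotation], 10, 32)
	if err != nil || v < 0 {
		return 0
	}
	return int32(v)
}

// indexFailed reports whether a failed pod exhausts the backoff limit
// per index, i.e. its failure count is >= the limit.
func indexFailed(failedPod map[string]string, backoffLimitPerIndex int32) bool {
	return failureCount(failedPod) >= backoffLimitPerIndex
}

// replacementFailureCount returns the annotation value for the replacement
// pod: x+1, or x when the failure matched the Ignore action.
func replacementFailureCount(failedPod map[string]string, ignored bool) int32 {
	x := failureCount(failedPod)
	if ignored {
		return x
	}
	return x + 1
}

func main() {
	first := map[string]string{failureCountAnnotation: "0"}
	fmt.Println(indexFailed(first, 1))                 // the index can still be retried
	fmt.Println(replacementFailureCount(first, false)) // count for the replacement pod
}
```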
In order to keep track of the number of failures per index, the Job controller
removes finalizers of a failed pod for a given index, only once the replacement
pod (with incremented value of batch.kubernetes.io/job-index-failure-count) is
created, or the index is marked as failed in .status.failedIndexes. This means
that these are the main steps when handling a failed pod to prepare it for
deletion:
1. Pod is recognized as failed
2. pod UID is recorded in Job status (.status.uncountedTerminatedPods)
3. the replacement Pod is created
4. Pod’s finalizer is removed
Here, the new feature adds a dependency between steps (3.) and (4.) as previously these steps could be performed in any order. Note that, typically when a pod is deleted or fails the replacement pod is created with a backoff delay, starting from 10s. This means, that after the proposed change the pod finalizer removal will be paused for at least 10s, until the backoff elapses and the replacement pod is created. While this may result in pods hanging around before garbage collection, it does not affect directly the rate of pod recreation.
Note that, the first step (1.) will also be impacted by KEP-3939: Consider Terminating pods as active pods in Jobs.
Failed indexes format
The format of the .status.failedIndexes field is analogous to the one used for
successful indexes (represented by the completedIndexes field), which is a
text format grouping consecutive integers into ranges. In a special case, when
the indexes are non-consecutive they are represented by comma-separated numbers.
In the worst-case scenario this is a string of comma-separated even values. In
order to constrain the size of the field we cap the number of completions
(see The Job object too big for more details).
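To make the text format concrete, here is a minimal Go sketch that encodes a set of indexes following the rule stated in the API comment: runs of three or more consecutive numbers are compressed into "first-last", everything else is listed individually. The function `formatIndexes` is a hypothetical illustration, not the Job controller's implementation, and it assumes the input contains no duplicates.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatIndexes encodes distinct indexes as a comma-separated string in
// increasing order, compressing runs of three or more consecutive numbers.
func formatIndexes(indexes []int) string {
	sorted := append([]int(nil), indexes...) // do not mutate the caller's slice
	sort.Ints(sorted)

	var parts []string
	for i := 0; i < len(sorted); {
		// Find the end j of the run of consecutive numbers starting at i.
		j := i
		for j+1 < len(sorted) && sorted[j+1] == sorted[j]+1 {
			j++
		}
		if j-i >= 2 {
			// Three or more consecutive numbers: "first-last".
			parts = append(parts, strconv.Itoa(sorted[i])+"-"+strconv.Itoa(sorted[j]))
		} else {
			// One or two numbers: list them individually.
			for k := i; k <= j; k++ {
				parts = append(parts, strconv.Itoa(sorted[k]))
			}
		}
		i = j + 1
	}
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(formatIndexes([]int{1, 3, 4, 5, 7})) // prints "1,3-5,7"
}
```

The worst case mentioned above (only even indexes) produces no ranges at all, which is why the size bound counts one separator per listed index.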
Job completion
When backoff limit per index is used, then we execute indexes until all of them
are completed (either failed or succeeded), or the number of failed indexes
exceeds the specified .spec.maxFailedIndexes.
Then, the Job is marked as completed (the Complete Job condition type) when
all indexes are succeeded. The Job is marked as failed (the Failed Job condition)
when at least one index is failed. The Failed condition is added once
all indexes completed their execution (either failed or succeeded), or when
the number of failed indexes exceeds the specified .spec.maxFailedIndexes.
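The completion rules above can be summarized as a small decision function. The following Go sketch is illustrative only: `terminalCondition` is a hypothetical helper that works on plain counters instead of the real Job status, and it returns an empty string while the Job should keep running.

```go
package main

import "fmt"

// terminalCondition returns "Complete", "Failed", or "" (still running)
// for a Job using backoffLimitPerIndex. maxFailedIndexes is nil when
// .spec.maxFailedIndexes is not set.
func terminalCondition(completions, succeededIndexes, failedIndexes int, maxFailedIndexes *int) string {
	// Exceeding maxFailedIndexes fails the Job immediately.
	if maxFailedIndexes != nil && failedIndexes > *maxFailedIndexes {
		return "Failed"
	}
	// Otherwise wait until every index has either succeeded or failed.
	if succeededIndexes+failedIndexes < completions {
		return ""
	}
	if failedIndexes > 0 {
		return "Failed"
	}
	return "Complete"
}

func main() {
	limit := 5
	fmt.Println(terminalCondition(10, 2, 6, &limit)) // more than 5 failed indexes
	fmt.Println(terminalCondition(10, 10, 0, nil))   // all indexes succeeded
}
```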
FailIndex action
In order to allow early termination of indexes with the FailIndex action
we add the corresponding index to the set of failed indexes represented by
.status.failedIndexes. This action can only be used if backoff limit per index
is used.
Exponential backoff delay per index
First, we solve the issue of increasing failure time for deleted pods when the
finalizer removal is delayed, by modifying the definition of the pod finish time,
to avoid falling back to now (see also Exponential backoff delay issue).
Second, we compute the backoff delay within each index independently. The number
of consecutive failures per-index can be derived from the
batch.kubernetes.io/job-index-failure-count annotation of the last failed pod,
plus one. This is because any successful pod marks the index as successful and
stops retries. Note that, using the annotation value means that failed pods
matching the Ignore rule are skipped in the calculation, but this behavior is
consistent with handling ignored pod failures for regular backoff limit.
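A minimal Go sketch of the per-index delay computation follows. It uses the 10s, 20s, 40s progression mentioned earlier in this KEP and derives the number of consecutive failures as the annotation value of the last failed pod plus one. The function name `backoffDelay` is hypothetical, and the cap `maxDelay` of 6 minutes is an assumption made for this sketch rather than a value stated in this document.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 10 * time.Second
	maxDelay  = 6 * time.Minute // assumed cap, not specified in this KEP
)

// backoffDelay computes the delay before recreating a pod for an index,
// given the job-index-failure-count annotation of the last failed pod.
func backoffDelay(lastFailedPodFailureCount int32) time.Duration {
	// Number of consecutive failures for the index.
	failures := lastFailedPodFailureCount + 1

	delay := baseDelay
	// Double the delay for every failure after the first one.
	for i := int32(1); i < failures; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}

func main() {
	for count := int32(0); count < 4; count++ {
		fmt.Printf("annotation=%d -> delay=%v\n", count, backoffDelay(count))
	}
}
```

Because each index only reads its own annotation, a burst of failures in one index does not lengthen the delay for the others.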
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
Unit tests will be added along with any new code introduced. In particular, the following scenarios will be covered with unit tests:
- handling or ignoring of .spec.backoffLimitPerIndex by the Job controller when the feature gate is enabled or disabled, respectively,
- handling or ignoring of the pod failure policy rule with the FailIndex action when the JobBackoffLimitPerIndex feature gate is enabled or disabled, respectively,
- validation of a job configuration with respect to .spec.backoffLimitPerIndex by kube-apiserver (including limits for .spec.maxFailedIndexes, .spec.parallelism and .spec.completions), when the feature gate is enabled or disabled,
- marking of the Job as Complete only once all indexes are completed,
- termination of Job execution and marking it as failed when .spec.maxFailedIndexes is exceeded,
- calculation of the exponential backoff delay per index when backoffLimitPerIndex is used,
- a fuzzer roundtrip test for API when backoffLimit is set to max int32.
The core packages (with their unit test coverage) which are going to be modified during the implementation:
- k8s.io/kubernetes/pkg/controller/job: 27 Apr 2023 - 90.4%
- k8s.io/kubernetes/pkg/apis/batch/validation: 27 Apr 2023 - 98.5%
Integration tests
The following scenarios will be covered with integration tests:
- enabling, disabling and re-enabling of the JobBackoffLimitPerIndex feature gate (code),
- handling of the .spec.backoffLimitPerIndex when the FailIndex action is used (code),
- handling of the .spec.backoffLimitPerIndex when .spec.maxFailedIndexes isn’t set (code),
- handling of the .spec.backoffLimitPerIndex when .spec.maxFailedIndexes is set (code),
- handling of the .spec.backoffLimit when .spec.backoffLimitPerIndex is set (code),
- handling of the exponential backoff delay per index when .spec.backoffLimitPerIndex is set (code).
The [k8s-triage] page for the BackoffLimitPerIndex integration tests.
More integration tests might be added to ensure good code coverage based on the actual implementation.
e2e tests
The following scenario is covered with e2e tests for Beta:
The [k8s-triage] page for the BackoffLimitPerIndex e2e tests.
Graduation Criteria
Alpha
- the feature implemented behind the JobBackoffLimitPerIndex feature flag
- change the logic of computing the exponential backoff delay (see here)
- user-facing documentation, including the warning for setting completions > 10^5
- The JobBackoffLimitPerIndex feature flag disabled by default
- Tests: unit and integration
Beta
- Address reviews and bug reports from Alpha users
- Implement the job_finished_indexes_total metric
- E2e tests are in Testgrid and linked in KEP
- Move the new reason declarations from Job controller to the API package
- Evaluate performance of Job controller for jobs using backoff limit per index with benchmarks at the integration or e2e level (discussion pointers from Alpha review: thread1 and thread2)
- The feature flag enabled by default
GA
- Address reviews and bug reports from Beta users
- Write a blog post about the feature
- Revisit extending the hands-on guide for Pod failure policy to use FailIndex
- Graduate e2e tests as conformance tests
- Lock the JobBackoffLimitPerIndex feature gate
Upgrade / Downgrade Strategy
Upgrade
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying
.spec.backoffLimitPerIndex.
There is no difference in behavior of Jobs if .spec.backoffLimitPerIndex is
not set.
Downgrade
A downgrade to a version which does not support this feature should not require
any additional configuration changes. Jobs which specified
.spec.backoffLimitPerIndex (to make use of this feature) will be
handled in a default way, i.e. using the .spec.backoffLimit.
However, since the .spec.backoffLimit defaults to max int32 value
(see here) it might require a manual setting of the .spec.backoffLimit
to ensure failed pods are not retried indefinitely.
Version Skew Strategy
This feature is limited to control plane.
Note that kube-apiserver can be in the N+1 skew version relative to the kube-controller-manager (see here). In that case, the Job controller operates on the version of the Job object that already supports the new Job API.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: JobBackoffLimitPerIndex
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Using the feature gate is the recommended way. When the feature is disabled
the Job controller manager handles pod failures in the default way, even if
.spec.backoffLimitPerIndex is set.
What happens if we reenable the feature if it was previously rolled back?
The Job controller starts to handle pod failures according to the specified
.spec.backoffLimitPerIndex or .spec.maxFailedIndexes fields.
Are there any tests for feature enablement/disablement?
Yes, there is an integration test which tests the following path: enablement -> disablement -> re-enablement.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This change does not impact how the rollout or rollback fail.
The change is opt-in, thus a rollout doesn’t impact already running pods.
The rollback might affect how pod failures are handled, since they will
be counted only against .spec.backoffLimit, which is defaulted to max int32
value, when using .spec.backoffLimitPerIndex (see here).
Thus, similarly as in case of a downgrade (see here)
it might be required to manually set .spec.backoffLimit to ensure failed pods
are not retried indefinitely.
What specific metrics should inform a rollback?
A substantial increase in the job_sync_duration_seconds.
Also, a substantial increase in the total number of pods, as it may take additional time to get the finalizers removed.
Additionally, a substantial increase in the difference of
terminated_pods_tracking_finalizer_total for the add and delete labels may
indicate that it takes too long to delete the finalizers.
The feature is opt-in so in case of issues it is enough not to use the backoffLimitPerIndex API field.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
The Upgrade->downgrade->upgrade testing was done manually using the alpha
version in 1.28 with the following steps:
- Start the cluster with the JobBackoffLimitPerIndex feature gate enabled:
kind create cluster --name per-index --image kindest/node:v1.28.0 --config config.yaml
using config.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  "JobBackoffLimitPerIndex": true
nodes:
- role: control-plane
- role: worker
Then, create the job using .spec.backoffLimitPerIndex=1:
kubectl create -f job.yaml
using job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: job-longrun
spec:
  parallelism: 3
  completions: 3
  completionMode: Indexed
  backoffLimitPerIndex: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: busybox:1.36.1
        command: ["sleep"]
        args: ["1800"] # 30min
        imagePullPolicy: IfNotPresent
Wait for the pods to be running and delete the 0-indexed pod:
kubectl delete pods -l job-name=job-longrun,batch.kubernetes.io/job-completion-index=0 --grace-period=1
Wait for the replacement pod to be created and repeat the deletion.
Check job status and confirm .status.failedIndexes="0"
kubectl get jobs -ljob-name=job-longrun -oyaml
Also, notice that .status.active=2, because the pod for a failed index is not
re-created.
- Simulate downgrade by disabling the feature for api server and control-plane.
Then, verify that 3 pods are running again, and the .status.failedIndexes is
gone by:
kubectl get jobs -ljob-name=job-longrun -oyaml
this will produce output similar to:
...
status:
  active: 3
  failed: 2
  ready: 2
- Simulate upgrade by re-enabling the feature for api server and control-plane.
Then, delete 1-indexed pod:
kubectl delete pods -l job-name=job-longrun,batch.kubernetes.io/job-completion-index=1 --grace-period=1
Wait for the replacement pod to be created and repeat the deletion.
Check job status and confirm .status.failedIndexes="1"
kubectl get jobs -ljob-name=job-longrun -oyaml
Also, notice that .status.active=2, because the pod for a failed index is not
re-created.
This demonstrates that the feature is working again for the job.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
By the presence of the .spec.backoffLimitPerIndex field in the jobs.
For Beta we are also considering introducing the job_finished_indexes_total metric (see also here).
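As a minimal sketch of the presence check, the function below inspects Job manifests that have already been parsed into dictionaries (for example from JSON output of the API). The function name and the sample Jobs are hypothetical; only the .spec.backoffLimitPerIndex field name comes from this KEP.

```python
def uses_backoff_limit_per_index(job: dict) -> bool:
    """Return True if the Job manifest sets .spec.backoffLimitPerIndex."""
    # A value of 0 is treated as "set", so compare against None rather than truthiness.
    return job.get("spec", {}).get("backoffLimitPerIndex") is not None


# Hypothetical sample manifests, reduced to the fields relevant here.
jobs = [
    {"metadata": {"name": "job-longrun"},
     "spec": {"completionMode": "Indexed", "backoffLimitPerIndex": 1}},
    {"metadata": {"name": "plain-job"}, "spec": {"backoffLimit": 6}},
]
in_use = [j["metadata"]["name"] for j in jobs if uses_backoff_limit_per_index(j)]
print(in_use)
```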
How can someone using this feature know that it is working for their instance?
- Job API .status
  - field: failedIndexes will not be empty as indexes fail
- Pod API
  - annotation: batch.kubernetes.io/job-index-failure-count is present for pods created by Jobs with this feature enabled
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This feature does not propose SLOs.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
    - job_sync_duration_seconds (existing): can be used to see how much the feature enablement increases the time spent in the sync job
    - job_finished_indexes_total (new): can be used to determine if the indexes are marked failed
  - Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?
For Beta we will introduce a new metric job_finished_indexes_total
with labels status=(failed|succeeded), and backoffLimit=(perIndex|global).
It will count the number of failed and succeeded indexes across jobs using
backoffLimitPerIndex, or regular Indexed Jobs (using only .spec.backoffLimit).
It might be useful to determine the global ratio of failed vs. succeeded indexes
when backoffLimitPerIndex is used.
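How such a labeled counter could yield the ratio of failed to succeeded indexes is sketched below. This is an in-memory illustration only, not the kube-controller-manager metric code; the function names and sample data are hypothetical, while the label values follow the description above.

```python
from collections import Counter

# (status, backoffLimit) -> number of finished indexes.
finished_indexes = Counter()


def record_finished_index(status: str, backoff_limit: str) -> None:
    """Increment the counter for one finished index."""
    assert status in ("failed", "succeeded")
    assert backoff_limit in ("perIndex", "global")
    finished_indexes[(status, backoff_limit)] += 1


def failed_ratio(backoff_limit: str) -> float:
    """Share of failed indexes among all finished indexes for one label value."""
    failed = finished_indexes[("failed", backoff_limit)]
    total = failed + finished_indexes[("succeeded", backoff_limit)]
    return failed / total if total else 0.0


# Hypothetical sample: one failed and three succeeded indexes with per-index limits.
record_finished_index("failed", "perIndex")
for _ in range(3):
    record_finished_index("succeeded", "perIndex")
print(failed_ratio("perIndex"))  # 0.25
```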
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, but only when the .spec.backoffLimitPerIndex field is set.
API type(s): Job
Estimated increase in size:
- The new .status.failedIndexes field in Status and the pre-existing .status.completedIndexes field are impacted. When the scalability limits are respected, the maximal increase of the total size of both fields can be estimated as 190Ki (see The Job object too big for more details).
- The new .spec.backoffLimitPerIndex field of *int32 is 12 bytes.
API type(s): Pod
Estimated increase in size: the new annotation batch.kubernetes.io/job-index-failure-count, which keeps the current number of retries per index, is around 50 bytes.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
We don’t expect this increase to be captured by existing SLOs/SLIs.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Tracking the number of failures per index adds the dependency of removing finalizers only after pod recreation. This may keep pods around longer (around 10s, the backoff for pod recreation) before actual deletion (requested or by PodGC).
This can increase RAM consumption, but only for a short period of time. Also, it only affects failing pods.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No. This feature does not introduce any resource-exhausting operations.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
No change from existing behavior of the Job controller.
What are other known failure modes?
None.
What steps should be taken if SLOs are not being met to determine the problem?
N/A.
Implementation History
- 2023-01-23: Initial version of the KEP PR Backoff Limit Per Job #3774
- 2023-04-26: The KEP PR Backoff limit per Job Index #3967 takes over from #3774
- 2023-05-08: The KEP PR ready for review
- 2023-06-07: The KEP PR merged
- 2023-07-13: The implementation PR Support BackoffLimitPerIndex in Jobs #118009 under review
- 2023-07-18: Merge the API PR Extend the Job API for BackoffLimitPerIndex
- 2023-07-18: Merge the Job Controller PR Support BackoffLimitPerIndex in Jobs
- 2023-08-04: Merge user-facing docs PR Docs update for Job’s backoff limit per index (alpha in 1.28)
- 2023-08-06: Merge KEP update reflecting decisions during the implementation phase Update for KEP3850 “Backoff Limit Per Index”
- 2023-10-02: Update KEP-3850 “Backoff Limit Per Index” for Beta
- 2023-10-20: Introduce the job_finished_indexes_total metric
- 2023-10-23: Graduate BackoffLimitPerIndex to Beta
- 2023-10-24: Indicate Job Backoff Limit Per Index reason consts are beta
- 2023-10-25: Backoff limit per index e2e test
- 2023-11-02: Add remaining e2e tests for Job BackoffLimitPerIndex based on KEP
- 2023-11-02: Benchmark job with backoff limit per index
- 2023-11-02: Update KEP3850 “BackoffLimitPerIndex for Indexed Jobs”
- 2025-02-07: KEP3850: graduate Backoff Limit Per Index for Job to stable
- 2025-02-25: Add Job e2e for tracking failure count per index
- 2025-03-01: Graduate Backoff Limit Per Index as stable
Drawbacks
Alternatives
backoffLimitPerIndex inside new runPolicy
We could nest the new fields (maxFailedIndexes and backoffLimitPerIndex) inside
another field. Proposed alternative names for the field:
1. runPolicy
2. completionPolicy
3. failurePolicy
For example:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimit: 4
  runPolicy:
    backoffLimitPerIndex: true
    maxFailedIndexes: 1
  ...
The option (3.) suggests that the fields are about declaring the Job as failed.
However, the backoffLimitPerIndex field not only allows counting failures
towards the backoff limit per index, but also allows all indexes to execute
despite failures; thus more generic names, like (1.) and (2.), are preferred.
Also, the options (1.) and (2.) may be reused in the context of success policy, which is the subject of Job success/completion policy. It might be beneficial for the API to consider the conditions for the Job success or failure under the same field.
Reasons for deferring / rejecting
It is not clear what the best name is going forward. Also, it seems that
backoffLimitPerIndex should be next to backoffLimit. It was discussed
and the consensus is that “top-level” is fine (see here).
Mark Job Complete if some indexes failed
The alternative to the proposed Job completion strategy.
Allow execution of all indexes, up to .spec.maxFailedIndexes of
failed indexes. Then, mark the Job Complete even if some indexes failed.
The Job is marked Failed only if the number of failed indexes exceeds the
specified .spec.maxFailedIndexes limit. In that case, the reason
field could be FailedIndexes, and the message field would list up to a
couple of the failed indexes.
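The completion rule of this rejected alternative can be sketched as follows. This is a hypothetical illustration of the rule as described above, not code from any implementation; the function name, the message wording and the cut-off of three listed indexes are assumptions.

```python
def job_condition_alternative(failed_indexes: set, max_failed_indexes: int) -> tuple:
    """Return (condition, reason, message) once all indexes have finished.

    Under the alternative, the Job is Failed only when the number of failed
    indexes exceeds max_failed_indexes; otherwise it is Complete.
    """
    if len(failed_indexes) > max_failed_indexes:
        # Assumption: list at most three failed indexes in the message.
        shown = ",".join(str(i) for i in sorted(failed_indexes)[:3])
        return ("Failed", "FailedIndexes", f"failed indexes include: {shown}")
    return ("Complete", "", "")


print(job_condition_alternative({2}, max_failed_indexes=1))
print(job_condition_alternative({2, 5}, max_failed_indexes=1))
```

The first call yields Complete although index 2 failed, which is the behavior the reasons below describe as less intuitive for API users.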
Reasons for deferring / rejecting
This approach is less intuitive to the end-users of the API, compared to the proposal. In particular, in some cases it would require custom logic in the user’s controller to determine if the Job is failed.
Support backoffLimitPerIndex when restartPolicy=OnFailure
We’ve considered supporting the backoffLimitPerIndex when pod’s restartPolicy=OnFailure.
Reasons for deferring / rejecting
When restartPolicy=OnFailure it is the Kubelet’s responsibility to restart the pod.
If the maximal number of restarts were instead enforced by the Job controller,
race conditions would be possible. For example, in between the checks by the
Job controller, the Kubelet could execute more restarts than the specified
.spec.backoffLimit. The problematic counting of failures with
restartPolicy=OnFailure has been ticketed in the issue
“When restartPolicy=OnFailure the calculation for number of retries is not accurate”.
We believe that this feature can be supported well by using the pod-level API, started in this KEP: Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure.
Once the pod-level API is done, we could consider supporting .spec.backoffLimitPerIndex
when restartPolicy=OnFailure is set in the pod’s spec. In this case we could set the pod-level
maxRestartTimes field based on the Job-level .spec.backoffLimit, leaving the
responsibility of enforcing the limit to the Kubelet.
We will re-assess the decision once the Pod-level API graduates to GA in the
KEP: Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure.
For example, when maxRestartTimes is specified for restartPolicy=OnFailure, then
we could support maxFailedIndexes, which would allow controlling the number of
failed indexes (those that exceeded maxRestartTimes and are marked failed).
Mutually exclusive backoffLimit and backoffLimitPerIndex
We’ve also considered making the backoffLimit and backoffLimitPerIndex
fields mutually exclusive.
Reasons for deferring / rejecting
There is no way to control downgrade, as the value of backoffLimit would
always default to 6. Also, old API clients may error trying to read or modify
Job objects with backoffLimit=nil.
Use bool field
We’ve considered using a bool backoffLimitPerIndex field. Here is an example:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimit: 1
  backoffLimitPerIndex: true
  ...
Reasons for deferring / rejecting
It does not allow specifying both .spec.backoffLimit and .spec.backoffLimitPerIndex
in the same config. While setting both fields can be confusing in regular use,
it can be helpful to support the use case of controlled downgrade.
Use enum field
We’ve considered using an enum backoffLimitTarget: Job|Index field (another
name for this concept could be backoffLimitGranularity), to specify that the
failures should be tracked per-index. Here, the default would be Job. Here is
an example:
apiVersion: batch/v1
kind: Job
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  backoffLimit: 1
  backoffLimitTarget: Index
  ...
Reasons for deferring / rejecting
No targets other than Job and Index will be added in the foreseeable
future. Thus, it seems like an unnecessary complication. The dedicated name
backoffLimitPerIndex also seems to better reflect the user’s intention.
As in the bool field case (see Use bool field), it does not allow setting
both .spec.backoffLimit and .spec.backoffLimitPerIndex to control the downgrade.
Global exponential backoff delay
We could also consider keeping the exponential backoff delay global, and
enabling a per-index delay through a dedicated API field in a future KEP, say backoffDelayPerIndex.
Reasons for deferring / rejecting
The idea of using backoffLimitPerIndex is to make the indexes independent.
Thus, failures or successes in one index should not influence backoff delays
for another index. We are leaving the decision to community feedback and
discussions, though.
Exponential backoff delay with in-memory tracking
Instead of modifying the definition of pod’s finish time (see Exponential backoff delay issue) we could keep track of the “failure time” for failed pods in-memory.
Reasons for deferring / rejecting
As the number of failed indexes is capped at 10^5, keeping track of failure times for all pods would take at least 8B per failed pod, which is around 1Mi per Job in the worst-case scenario. This is a non-negligible memory increase.
The extra tracking information is not needed if counting pods as terminated is done in KEP-3939: Consider terminating pods in job controller. In this case we can assume that the failure time of each pod does not change after its phase is terminal.
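The worst-case estimate above can be checked with a short calculation. The constant names are illustrative; the 8-byte timestamp size and the 10^5 cap are taken from the text, and the calculation ignores any per-entry overhead of the in-memory data structure.

```python
MAX_FAILED_INDEXES = 10**5  # cap on the number of failed indexes, per the text
BYTES_PER_TIMESTAMP = 8     # lower bound for one stored failure time

total_bytes = MAX_FAILED_INDEXES * BYTES_PER_TIMESTAMP
total_mib = total_bytes / 2**20

print(f"{total_bytes} bytes = {total_mib:.2f} MiB")  # 800000 bytes = 0.76 MiB
```

The raw timestamps alone come to roughly 0.76 MiB per Job, consistent with the "around 1Mi" figure once data structure overhead is taken into account.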
Alternative ways to support high number of completions
In the current proposal the high number of completions (like 10^6) is supported
by specifying the .spec.maxFailedIndexes field. This way the size
of the failedIndexes field is controlled.
See below for alternative approaches proposed.
Keep failedIndexes field as a bitmap
In order to squeeze in more failed indexes we could use a bitmap.
Reasons for deferring / rejecting
- it is not human readable, whereas readability is useful for manual inspection
- it is harder to parse by user-provided controllers
- it introduces a format different from the one used to keep the succeeded indexes in .status.completedIndexes
Keep the list of failed indexes in a dedicated API object
The idea is to keep the heavy fields outside of the Job API object itself. It could be a new API object, for example JobFailedIndexes.
Reasons for deferring / rejecting
This approach significantly increases the complexity of the Job controller that needs to register and manage another API object. This may also have performance impact as the Job controller needs to query the object. Finally, it is also a complication to the end users who want to fetch the list of failed indexes.
Implicit limit on the number of failed indexes
An alternative is to have an implicit limit on the number of failed indexes, for
example, by capping the size of the .status.failedIndexes field at
300KB. This would allow running a job with completions at the level of 10^6, without
an explicit limit on the maximal number of failed indexes.
Reasons for deferring / rejecting
It may behave unpredictably, impacting the user experience. For example,
when a user sets maxFailedIndexes to 10^6, the Job may complete if the indexes
are consecutive, but the Job may also fail if the size of the object exceeds the
limits due to non-consecutive indexes failing.
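The dependence on consecutiveness can be illustrated with a simplified encoder for the interval notation used by .status.completedIndexes (for example "1,3-5,7"). This is a sketch under assumptions: the function name is hypothetical, the exact output of the real controller may differ in details such as how two-element runs are written, and the 300KB threshold is the example value from the text above.

```python
def format_indexes(indexes) -> str:
    """Compress indexes into a comma-separated list of numbers and first-last ranges."""
    ordered = sorted(set(indexes))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next index is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


consecutive = format_indexes(range(10**6))     # every index failed
sparse = format_indexes(range(0, 10**6, 2))    # only every second index failed

print(format_indexes([1, 3, 4, 5, 7]))         # 1,3-5,7
print(len(consecutive), len(sparse) > 300 * 1000)
```

All 10^6 indexes failing collapses to a single short range, while half as many non-consecutive failures produce a string far above the example size limit, which is why the outcome would be hard for users to predict.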
Skip uncountedTerminatedPods when backoffLimitPerIndex is used
It’s been proposed (see link) that when backoffLimitPerIndex is used, we could
skip the interim step of recording terminated pods in .status.uncountedTerminatedPods.
Reasons for deferring / rejecting
First, if we stop using .status.uncountedTerminatedPods it means that
.status.failed can no longer track the number of failed pods. Thus, it would
require a change of semantics to denote just the number of failed indexes. This
has downsides:
- two different semantics of the field, depending on the used feature
- lost information about some failed pods within an index (some users may care to investigate succeeded indexes with at least one failed pod)
Second, it would only optimize the unhappy path, where there are failures. Also, the saving is only 1 request per 500 failed pods, which does not seem essential.