KEP-4443: More granular Job failure reasons for PodFailurePolicy
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of
the rule in the podFailurePolicy.rules slice.
The purpose of giving the rule a name is to expose more detailed failure information inside the
JobFailed condition reason. When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the
format: PodFailurePolicy_{ruleName}. This will allow users to set multiple pod failure policy rules
and distinguish which one (if any) triggered a Job failure.
Motivation
Higher-level K8s APIs are built by composition, using primitive K8s APIs as building blocks to implement more advanced features. Higher-level APIs that use the Job API as a building block need to distinguish between different types of Job failures in order to make informed decisions about how to react to them.
Currently, no mechanism exists in the Job API to propagate granular failure reason information (e.g., container exit codes) up to be
programmatically consumed by higher level software managing Jobs. A PodFailurePolicy can be configured to add a condition reason of PodFailurePolicy
to the JobFailed condition added to the Job when it fails, but different pod failure policies targeting different container exit codes all use the
same condition reason of PodFailurePolicy. This prevents higher level APIs like JobSet from distinguishing them and being able to take different
actions depending on the type of Job failure that occurred.
For a concrete use case, see the JobSet Configurable Failure Policy KEP, which illuminated the need for more granular pod failure policy reasons.
Goals
For pod failure policies to be able to communicate different failure types to higher-level APIs.
Non-Goals
- Modifying PodFailurePolicy behavior
- The Job controller using this new field for any purpose not explicitly defined in the proposal.
Proposal
The proposal is to add an optional Name field to the PodFailurePolicyRule. If unset, it will default to the index of the PodFailurePolicyRule in the PodFailurePolicy.Rules slice.
When a PodFailurePolicyRule matches a pod failure and the Action is FailJob, the Job
controller will append the name of the pod failure policy rule which triggered the failure
to the JobFailed condition
reason. The exact format of the JobFailed condition reason will be PodFailurePolicy_{ruleName}.
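The naming scheme above can be sketched in Go as follows. This is only an illustrative sketch, not the Job controller's implementation: the `rule` type and the `failureReason` helper are hypothetical stand-ins, and the sketch assumes an unset name is represented by a nil (or empty) string pointer.

```go
package main

import (
	"fmt"
	"strconv"
)

// reasonPrefix mirrors the existing "PodFailurePolicy" condition reason.
const reasonPrefix = "PodFailurePolicy"

// rule is a hypothetical, trimmed-down stand-in for PodFailurePolicyRule.
type rule struct {
	Name   *string // optional; nil means the user did not set a name
	Action string
}

// failureReason builds the JobFailed condition reason for the rule at idx.
// It uses the rule's name when set and falls back to the rule's index
// in the slice otherwise.
func failureReason(rules []rule, idx int) string {
	suffix := strconv.Itoa(idx)
	if n := rules[idx].Name; n != nil && *n != "" {
		suffix = *n
	}
	return reasonPrefix + "_" + suffix
}

func main() {
	name := "ExitCode2"
	rules := []rule{
		{Name: &name, Action: "FailJob"},
		{Action: "FailJob"}, // unnamed rule
	}
	fmt.Println(failureReason(rules, 0)) // PodFailurePolicy_ExitCode2
	fmt.Println(failureReason(rules, 1)) // PodFailurePolicy_1
}
```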
User Stories (Optional)
Story 1
As a user, I am using a JobSet to manage a group of jobs, and I want to be able to decide whether to fail the JobSet or not, based on the exact container exit code that caused a child job failure.
Example JobSet for this use case:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: fail-jobset-example
spec:
  failurePolicy:
    rules:
    # If Job fails due to a pod failing with exit code 2, fail the JobSet immediately, without attempting any restarts.
    - action: FailJobSet
      targetReplicatedJobs:
      - workers
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode2 # Job failure reason format: PodFailurePolicy_{ruleName}
    maxRestarts: 10
  replicatedJobs:
  - name: workers
    replicas: 10
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        # If a pod fails with exit code 2, fail the job with the user-defined reason.
        podFailurePolicy:
          rules:
          - name: "ExitCode2" # Will be appended as a suffix to the "PodFailurePolicy" condition reason.
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [2]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]
Story 2
As a user, I am using a JobSet to manage a group of jobs, each running an HPC simulation. Each job runs a simulation with different random initial parameters. When a simulation ends, the application will exit with one of two exit codes:
- Exit code 2, which indicates the simulation produced an invalid result due to bad starting parameters, and should not be retried.
- Exit code 3, which indicates the simulation produced an invalid result but the initial parameters were reasonable, so the simulation should be restarted.
When a Job fails due to a pod failing with exit code 2, I want my job management software to leave the Job in a failed state.
When a Job fails due to a pod failing with exit code 3, I want my job management software to restart the Job.
Example JobSet for this use case:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: restart-job-example
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: {{topologyDomain}} # 1:1 job replica to topology domain assignment
spec:
  failurePolicy:
    rules:
    # If Job fails due to a pod failing with exit code 2, leave it in a failed state.
    - action: FailJob
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode2 # Job failure reason format: PodFailurePolicy_{ruleName}
    # If Job fails due to a pod failing with exit code 3, restart that Job.
    - action: RestartJob
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode3 # Job failure reason format: PodFailurePolicy_{ruleName}
    maxRestarts: 10
  replicatedJobs:
  - name: simulations
    replicas: 10
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        # Pod failure policy rules, the names of which are referenced in the JobSet failure policy.
        podFailurePolicy:
          rules:
          - name: ExitCode2
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [2]
          - name: ExitCode3
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [3]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]
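To illustrate how higher-level software might consume these reasons, the following Go sketch maps a failed Job's condition reason to an action. It is not JobSet's implementation: the `condition` and `failureRule` types, the `actionFor` helper, the first-match semantics, and the `defaultAction` parameter are all assumptions made for illustration. The only detail taken from the existing Job API is that the JobFailed condition has type "Failed".

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down Job status condition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// failureRule is a hypothetical stand-in for a higher-level failure policy rule.
type failureRule struct {
	Action              string
	OnJobFailureReasons []string
}

// actionFor returns the action of the first rule that lists the reason found
// on the Job's "Failed" condition. It returns defaultAction when the Job has
// no such condition or when no rule lists the reason.
func actionFor(rules []failureRule, conds []condition, defaultAction string) string {
	reason := ""
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" {
			reason = c.Reason
			break
		}
	}
	if reason == "" {
		return defaultAction
	}
	for _, r := range rules {
		for _, want := range r.OnJobFailureReasons {
			if want == reason {
				return r.Action
			}
		}
	}
	return defaultAction
}

func main() {
	rules := []failureRule{
		{Action: "FailJob", OnJobFailureReasons: []string{"PodFailurePolicy_ExitCode2"}},
		{Action: "RestartJob", OnJobFailureReasons: []string{"PodFailurePolicy_ExitCode3"}},
	}
	conds := []condition{{Type: "Failed", Status: "True", Reason: "PodFailurePolicy_ExitCode3"}}
	fmt.Println(actionFor(rules, conds, "NoMatch")) // RestartJob
}
```

With only the shared PodFailurePolicy reason, both rules in this sketch would be indistinguishable, which is the gap the Name field is meant to close.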
Notes/Constraints/Caveats (Optional)
It should be noted that upon pod failure, the Job’s pod failure policy rules are evaluated in order, and only the first matching rule is executed, even if multiple rules match a pod failure.
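A minimal Go sketch of this first-match behavior is shown below. The `exitCodeRule` type and `firstMatch` helper are hypothetical and only model the `In` operator on exit codes; the real rule matching also covers other operators and pod conditions.

```go
package main

import "fmt"

// exitCodeRule is a hypothetical simplification of a pod failure policy rule
// that matches when the container exit code is in Values ("In" operator only).
type exitCodeRule struct {
	Name   string
	Action string
	Values []int32
}

// firstMatch evaluates the rules in order and returns the index of the first
// rule matching exitCode, or -1 if none match. Later matching rules are
// never considered.
func firstMatch(rules []exitCodeRule, exitCode int32) int {
	for i, r := range rules {
		for _, v := range r.Values {
			if v == exitCode {
				return i
			}
		}
	}
	return -1
}

func main() {
	rules := []exitCodeRule{
		{Name: "ExitCode2", Action: "FailJob", Values: []int32{2}},
		{Name: "ExitCode2Or3", Action: "FailJob", Values: []int32{2, 3}},
	}
	fmt.Println(firstMatch(rules, 2)) // 0: both rules match, only the first is used
	fmt.Println(firstMatch(rules, 3)) // 1
	fmt.Println(firstMatch(rules, 1)) // -1
}
```

As a consequence, when rules overlap, the name that ends up in the condition reason is always that of the earliest matching rule.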
Risks and Mitigations
There is a risk in making a field that was previously managed exclusively by the controller configurable by the user. However, as described in the Validation section below, we validate against malformed or invalid inputs.
Design Details
Defaulting
There will be no defaulting for the new pod failure policy Name field.
When Name is unset, the Job controller will set the reason suffix to the index of the rule in the podFailurePolicy.rules slice (e.g. PodFailurePolicy_{index}).
Validation
- Validate all pod failure policy rule names are unique.
- Validate pod failure policy rule names do not have integer values of any of the existing indexes (unless the value is the rule’s own index).
- We will validate the Job failure condition reason that would be generated from a given PodFailurePolicy rule name (i.e., PodFailurePolicy_{ruleName}) will be a valid reason (CamelCase, max length of 128 characters, and matches the regex defined here).
- We will also validate the pod failure policy rule name does not conflict with any K8s internal reasons used by the Job controller.
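The first three checks above could look roughly like the Go sketch below. It is a sketch under stated assumptions, not the proposed validation code: `validateRuleNames` is a hypothetical helper, an empty string stands for an unset name, and `reasonRegexp` is assumed to be the pattern commonly used for condition reasons since the exact regex the KEP links to is not reproduced here. The check against K8s internal reasons is omitted.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

const (
	reasonPrefix    = "PodFailurePolicy_"
	maxReasonLength = 128 // limit stated in this KEP
)

// reasonRegexp is an ASSUMED pattern for valid condition reasons.
var reasonRegexp = regexp.MustCompile(`^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$`)

// validateRuleNames checks the rule names in slice order; "" means unset.
func validateRuleNames(names []string) []error {
	var errs []error
	seen := map[string]int{}
	for i, name := range names {
		if name == "" {
			continue // unset: the controller falls back to the rule index
		}
		// Names must be unique across rules.
		if j, dup := seen[name]; dup {
			errs = append(errs, fmt.Errorf("rules[%d].name %q duplicates rules[%d].name", i, name, j))
		} else {
			seen[name] = i
		}
		// A name must not equal another rule's index (its own index is allowed).
		if idx, err := strconv.Atoi(name); err == nil && strconv.Itoa(idx) == name &&
			idx >= 0 && idx < len(names) && idx != i {
			errs = append(errs, fmt.Errorf("rules[%d].name %q collides with the index of rules[%d]", i, name, idx))
		}
		// The generated reason must be a valid condition reason.
		reason := reasonPrefix + name
		if len(reason) > maxReasonLength || !reasonRegexp.MatchString(reason) {
			errs = append(errs, fmt.Errorf("rules[%d].name %q yields invalid reason %q", i, name, reason))
		}
	}
	return errs
}

func main() {
	for _, err := range validateRuleNames([]string{"ExitCode2", "1", "ExitCode2", "bad name"}) {
		fmt.Println(err)
	}
}
```

Note that the prefix counts toward the length limit in this sketch, so the usable name length is shorter than the limit on the full reason.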
Business logic
When a PodFailurePolicyRule matches a pod failure and the Action is FailJob, the Job controller will
set the JobFailed condition reason deterministically in the format PodFailurePolicy_{ruleName}. This
suffix will be added to the condition here.
Note: If the PodFailurePolicy feature gate is disabled, but the PodFailurePolicyName feature gate is enabled, there will be
no adverse effect and neither feature will actually be used, since the only place the proposed new field Name will be
used is inside of a code block
protected by the PodFailurePolicy feature gate.
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- k8s.io/kubernetes/pkg/controller/job: 02/05/2024 - 91.5%
- k8s.io/kubernetes/pkg/apis/batch/v1: 06/05/2024 - 87.3%
- k8s.io/kubernetes/pkg/apis/batch/v1beta1: 06/05/2024 - 78.3%
- k8s.io/kubernetes/pkg/apis/batch/validation: 06/05/2024 - 87.7%
Integration tests
- Test that when the feature flag is enabled and a Job’s PodFailurePolicy triggers a Job failure due to a matching PodFailurePolicyRule, the JobFailed condition has a reason of PodFailurePolicy_{Name}.
- Test that when the feature flag is off but was previously enabled, and an existing Job already has the JobFailed condition reason set with the new suffix (i.e., PodFailurePolicy_{Name}), the Job controller does not overwrite the reason to PodFailurePolicy, and it remains set to the existing value.
- Add test cases for both onPodConditions and onExitCodes to ensure the Name or the rule’s index (when Name is empty) is properly added.
e2e tests
We will add a test case similar to the integration test case:
- When the feature flag is enabled and a Job’s PodFailurePolicy triggers a Job failure due to a matching PodFailurePolicyRule, check that the JobFailed condition has the PodFailurePolicy_{Name} reason set correctly.
Graduation Criteria
Alpha
- Feature implemented behind a feature flag
- Initial unit and integration tests are implemented
- Documentation is updated
Beta
- Address reviews and bug reports from Alpha users
- Feature is stable in Alpha for 1 release cycle
- Feature flag enabled by default
GA
- Address reviews and bug reports from Beta users
- Feature is stable in Beta for 2 full release cycles
Upgrade / Downgrade Strategy
After a user upgrades their cluster to a k8s version which supports this feature, the user can use this feature by simply specifying the new field in their podFailurePolicy config.
When a user downgrades from a k8s version that supports this field to one that does not support this field:
- for existing Jobs, this new field will be ignored by the Job controller, resulting in the condition reason being set to the previous default of PodFailurePolicy for any Job failures triggered by a pod failure policy.
- for new Jobs, the kube-apiserver would remove this field when the Job is submitted.
Version Skew Strategy
This feature is limited to control plane, so the version skew with kubelet does not matter.
In case kube-apiserver is running in HA mode, and the versions are skewed, then
the old version of kube-apiserver (from before this change) may not handle
the new Name field if it is set in a Job PodFailurePolicy spec.
In case the version of the kube-controller-manager leader is skewed (old), the
built-in Job controller would reconcile the Jobs with the new Name field and
simply drop the field, thereby not using it when setting the JobFailed condition
reason.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
- Upgrade to k8s version 1.31+
- Enable feature flag PodFailurePolicyName
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: PodFailurePolicyName
  - Components depending on the feature gate:
    - kube-controller-manager
    - kube-apiserver
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?
No
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, by disabling the feature flag PodFailurePolicyName.
What happens if we reenable the feature if it was previously rolled back?
For new Jobs, the apiserver will stop wiping out the new field (Name).
For existing Jobs, the Job controller will stop ignoring the new field, and begin
using it as described in previous sections.
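The apiserver-side behavior described here could be sketched in Go as below. This is an assumption-laden illustration rather than the planned code: the `rule` type and both helpers are hypothetical, and preserving the field when the stored object already uses it follows the common Kubernetes convention for gated fields, which this KEP does not spell out.

```go
package main

import "fmt"

// rule is a hypothetical, trimmed-down stand-in for PodFailurePolicyRule.
type rule struct {
	Name   *string
	Action string
}

// namesInUse reports whether any rule already has a name set.
func namesInUse(rules []rule) bool {
	for _, r := range rules {
		if r.Name != nil {
			return true
		}
	}
	return false
}

// dropDisabledNames clears the Name field on incoming rules when the
// PodFailurePolicyName gate is disabled. ASSUMPTION: names are preserved if
// the previously stored rules (oldRules) already use the field, so that
// existing Jobs keep their data across a disable/re-enable cycle.
func dropDisabledNames(newRules, oldRules []rule, gateEnabled bool) {
	if gateEnabled || namesInUse(oldRules) {
		return
	}
	for i := range newRules {
		newRules[i].Name = nil
	}
}

func main() {
	n := "ExitCode2"
	created := []rule{{Name: &n, Action: "FailJob"}}
	dropDisabledNames(created, nil, false) // new Job, gate disabled
	fmt.Println(created[0].Name == nil)    // true: the name was wiped
}
```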
Are there any tests for feature enablement/disablement?
We can add unit tests for:
- feature enabled and field set
- feature disabled and field set
- feature disabled after Jobs have JobFailed condition with reason set using the new format
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
If any component has not yet rolled out, or fails to roll out, the existing default behavior will continue to apply, but there is no downtime during partial rollout or rollback.
What specific metrics should inform a rollback?
A substantial increase in the job_sync_duration_seconds metric may suggest the
processing of the configured job pod failure policy rules consumes too much time.
An operator can also observe job_pods_finished_total to check if the reason count
of taken actions (FailJob, Count or Ignore) correlates with the expected
changes based on the Job workload specificity.
Additionally, an operator should check if the failed Jobs have the correct condition
reason set on the JobFailed condition, as described in the design details.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Feature is not implemented yet so we cannot test these paths.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
PodFailurePolicy reason format will be deprecated in GA and replaced by PodFailurePolicy_{RuleName}.
Until then, we will maintain both based on conditional logic behind a feature flag.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: job_sync_duration_seconds (existing): can be used to see how much the feature enablement increases the time spent in the sync job
  - Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
If the optional name field is specified, the podFailurePolicy object size will increase by 1 byte per
character in the name string. The name field will be no longer than 128 characters, thus the max size
increase of the PodFailurePolicy object will be 128 bytes.
Otherwise, if unset, it will default to the index of the rule, and thus increase by 1 byte per digit in
the index number.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Implementation History
- KEP Published: 02/05/2024
Drawbacks
It is a less intuitive user experience to have the PodFailurePolicy rule name be appended as a suffix to the Job failure reason, rather than
Alternatives
We discussed the idea of having a new optional PodFailurePolicyRule field SetConditionReason, which will enable
the user to explicitly set the condition reason they want on the JobFailed condition set on the Job when that pod failure
policy rule triggers a Job failure. However, ultimately it was decided we didn’t want to open up the reason field to be
explicitly set by the user to any arbitrary value, as this would be tricky to validate, and would diverge from the current
paradigm of having only machine-set reasons which are determined programmatically.