KEP-4443: More granular Job failure reasons for PodFailurePolicy

KEP-4443: More granular Job failure reasons for PodFailurePolicyRule

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes to extend the Job API by adding an optional Name field to PodFailurePolicyRule. If unset, it would default to the index of the rule in the podFailurePolicy.rules slice.

The purpose of giving the rule a name is to expose more detailed failure information inside the JobFailed condition reason. When a pod failure policy rule triggers a Job failure, the rule name would be appended as a suffix to the JobFailed condition reason, in the format: PodFailurePolicy_{ruleName}. This will allow users to set multiple pod failure policy rules and distinguish which one (if any) triggered a Job failure.

Motivation

Higher level K8s APIs are built by composition, using primitive K8s APIs as building blocks to implement more advanced features. Higher level APIs that use the Job API as a building block need to distinguish between different types of Job failures in order to make informed decisions about how to react to them.

Currently, no mechanism exists in the Job API to propagate granular failure reason information (e.g., container exit codes) up to be programmatically consumed by higher level software managing Jobs. A PodFailurePolicy can be configured to add a condition reason of PodFailurePolicy to the JobFailed condition added to the Job when it fails, but different pod failure policies targeting different container exit codes all use the same condition reason of PodFailurePolicy. This prevents higher level APIs like JobSet from distinguishing them and being able to take different actions depending on the type of Job failure that occurred.

For a concrete use case, see the JobSet Configurable Failure Policy KEP, which illuminated the need for more granular pod failure policy reasons.

Goals

For pod failure policies to be able to communicate different failure types to higher level APIs.

Non-Goals

Modifying PodFailurePolicy behavior
The Job controller using this new field for any purpose not explicitly defined in the proposal.

Proposal

The proposal is to add an optional Name field to the PodFailurePolicyRule. If unset, it will default to the index of the PodFailurePolicyRule in the PodFailurePolicy.Rules slice.

When a PodFailurePolicyRule matches a pod failure and the Action is FailJob, the Job controller will append the name of the pod failure policy rule which triggered the failure to the JobFailed condition reason. The exact format of the JobFailed condition reason will be PodFailurePolicy_{ruleName}.

User Stories (Optional)

Story 1

As a user, I am using a JobSet to manage a group of jobs, and I want to be able to decide whether to fail the JobSet or not, based on the exact container exit code that caused a child job failure.

Example JobSet for this use case:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: fail-jobset-example
spec:
  failurePolicy:
    rules:
    # If Job fails due to a pod failing with exit code 2, fail the JobSet immediately, without attempting any restarts.
    - action: FailJobSet
      targetReplicatedJobs:
      - workers
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode2 # Job failure reason format: PodFailurePolicy_{ruleName}
    maxRestarts: 10
  replicatedJobs:
  - name: workers
    replicas: 10
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        # If a pod fails with exit code 2, fail the job with the user-defined reason.
        podFailurePolicy:
          rules:
          - name: "ExitCode2"  # Will be appended as a suffix to the "PodFailurePolicy" condition reason.
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [2]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]

Story 2

As a user, I am using a JobSet to manage a group of jobs, each running an HPC simulation. Each job runs a simulation with different random initial parameters. When a simulation ends, the application will exit with one of two exit codes:

Exit code 2, which indicates the simulation produced an invalid result due to bad starting parameters, and should not be retried.
Exit code 3, which indicates the simulation produced an invalid result but the initial parameters were reasonable, so the simulation should be restarted.

When a Job fails due to a pod failing with exit code 2, I want my job management software to leave the Job in a failed state.

When a Job fails due to a pod failing with exit code 3, I want my job management software to restart the Job.

Example JobSet for this use case:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: restart-job-example
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: {{topologyDomain}} # 1:1 job replica to topology domain assignment
spec:
  failurePolicy:
    rules:
    # If Job fails due to a pod failing with exit code 2, leave it in a failed state.
    - action: FailJob
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode2  # Job failure reason format: PodFailurePolicy_{ruleName}
    # If Job fails due to a pod failing with exit code 3, restart that Job.
    - action: RestartJob
      targetReplicatedJobs:
      - simulations
      onJobFailureReasons:
      - PodFailurePolicy_ExitCode3  # Job failure reason format: PodFailurePolicy_{ruleName}
    maxRestarts: 10
  replicatedJobs:
  - name: simulations
    replicas: 10
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        # Pod failure policy rules, the names of which are referenced in the JobSet failure policy.
        podFailurePolicy:
          rules:
          - name: ExitCode2
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [2]
          - name: ExitCode3
            action: FailJob
            onExitCodes:
              containerName: main
              operator: In
              values: [3]
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: python:3.10
              command: ["..."]
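
The way a higher level controller might consume these reasons can be sketched as follows. This is a minimal Python illustration, not JobSet's implementation: the function name `choose_action`, the plain-dict rule representation, and the first-match-wins lookup are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical, simplified copy of the JobSet failure policy rules from Story 2.
RULES = [
    {"action": "FailJob", "onJobFailureReasons": ["PodFailurePolicy_ExitCode2"]},
    {"action": "RestartJob", "onJobFailureReasons": ["PodFailurePolicy_ExitCode3"]},
]


def choose_action(job_failed_reason: str, rules: list) -> Optional[str]:
    """Return the action of the first rule that lists the given JobFailed
    condition reason, or None when no rule lists it."""
    for rule in rules:
        if job_failed_reason in rule["onJobFailureReasons"]:
            return rule["action"]
    return None
```

With these rules, a child Job whose JobFailed condition reason is PodFailurePolicy_ExitCode3 maps to RestartJob, while a reason that no rule lists (for example BackoffLimitExceeded) maps to no action in this sketch.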

Notes/Constraints/Caveats (Optional)

It should be noted that upon pod failure, the Job’s pod failure policy rules are evaluated in order, and only the first matching rule is executed, even if multiple rules match a pod failure.
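
The first-match behavior matters because the matching rule determines which name ends up in the reason. A minimal Python sketch of the ordering, assuming rules that only use onExitCodes and ignoring containerName and onPodConditions; `first_matching_rule` is a hypothetical helper, not the Job controller's code:

```python
from typing import Optional


def first_matching_rule(rules: list, exit_code: int) -> Optional[int]:
    """Return the index of the first rule whose onExitCodes requirement
    matches the exit code, or None if no rule matches."""
    for index, rule in enumerate(rules):
        requirement = rule["onExitCodes"]
        matched = exit_code in requirement["values"]
        if requirement["operator"] == "NotIn":
            matched = not matched
        if matched:
            # Later rules are never consulted once one rule has matched.
            return index
    return None
```

For example, if rule 0 matches exit code 2 and rule 1 matches exit codes 2 and 3, a pod failing with exit code 2 is attributed to rule 0 only, so the reason would carry rule 0's name.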

Risks and Mitigations

There is a risk in making a field that was previously managed exclusively by the controller partially configurable by the user. However, as described in the Validation section below, we validate against malformed or invalid inputs.

Design Details

Defaulting

There will be no defaulting for the new pod failure policy Name field. When Name is unset, the Job controller will set the reason suffix to the index of the rule in the podFailurePolicy.rules slice (e.g. PodFailurePolicy_{index}).
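
The fallback described above can be sketched in a few lines of Python. This is an illustration only; `job_failed_reason` and `BASE_REASON` are names invented for this example:

```python
BASE_REASON = "PodFailurePolicy"


def job_failed_reason(rule_index: int, rule_name: str = "") -> str:
    """Build the JobFailed condition reason for the rule that matched.

    The rule name is used as the suffix when set; otherwise the rule's
    index in the podFailurePolicy.rules slice is used.
    """
    suffix = rule_name if rule_name else str(rule_index)
    return f"{BASE_REASON}_{suffix}"
```

Under this sketch, a rule named ExitCode2 yields PodFailurePolicy_ExitCode2, and an unnamed rule at index 1 yields PodFailurePolicy_1.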

Validation

Validate all pod failure policy rule names are unique.
Validate pod failure policy rule names do not have integer values of any of the existing indexes (unless the value is the rule’s own index).
We will validate that the Job failure condition reason that would be generated from a given PodFailurePolicy rule name (i.e., PodFailurePolicy_{ruleName}) is a valid reason (CamelCase, max length of 128 characters, and matches the regex defined here).
We will also validate that the pod failure policy rule name does not conflict with any K8s internal reasons used by the Job controller.
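
A minimal Python sketch of the first three checks follows. It is not the proposed apiserver validation: `validate_rule_names` is a hypothetical helper, the regular expression is an assumed stand-in for the linked regex, CamelCase is not enforced beyond what that pattern requires, and the check against internal Job controller reasons is omitted.

```python
import re

MAX_REASON_LENGTH = 128  # limit stated in this KEP
# Assumed stand-in for the linked regex; not taken from this document.
REASON_PATTERN = re.compile(r"^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$")


def validate_rule_names(names: list) -> list:
    """Validate rule names, where names[i] is the name of rule i ('' if unset).

    Returns a list of error strings; an empty list means the names are valid.
    """
    errors = []
    seen = set()
    for index, name in enumerate(names):
        if not name:
            continue  # unset names fall back to the rule index
        if name in seen:
            errors.append(f"rules[{index}].name: duplicate name {name!r}")
        seen.add(name)
        # An integer name must not equal the index of a different rule.
        if name.isascii() and name.isdecimal():
            value = int(name)
            if value != index and value < len(names):
                errors.append(f"rules[{index}].name: {name!r} is the index of another rule")
        reason = f"PodFailurePolicy_{name}"
        if len(reason) > MAX_REASON_LENGTH or not REASON_PATTERN.match(reason):
            errors.append(f"rules[{index}].name: {reason!r} is not a valid condition reason")
    return errors
```

In this sketch, the names ["ExitCode2", "ExitCode3"] pass, ["0", ""] passes because rule 0 uses its own index, and ["1", ""] is rejected because "1" would collide with the fallback suffix of the unnamed rule at index 1.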

Business logic

When a PodFailurePolicyRule matches a pod failure and the Action is FailJob, the Job controller will set the JobFailed condition reason deterministically in the format PodFailurePolicy_{ruleName}. This suffix will be added to the condition reason here.

Note: If the PodFailurePolicy feature gate is disabled, but the PodFailurePolicyName feature gate is enabled, there will be no adverse effect and neither feature will actually be used, since the only place the proposed new field Name will be used is inside of a code block protected by the PodFailurePolicy feature gate.
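
The interaction of the two feature gates can be sketched as below. This is a hedged Python illustration of the note above, not the Job controller's code; `reason_for_fail_job` and its boolean parameters are hypothetical, and returning None stands for "this code path does not run".

```python
from typing import Optional


def reason_for_fail_job(
    rule_index: int,
    rule_name: str,
    pod_failure_policy_enabled: bool,
    pod_failure_policy_name_enabled: bool,
) -> Optional[str]:
    """Return the JobFailed reason set when a FailJob rule matches."""
    if not pod_failure_policy_enabled:
        # The rules are not evaluated, so the Name field is never read.
        return None
    if not pod_failure_policy_name_enabled:
        # Existing behavior: the reason carries no suffix.
        return "PodFailurePolicy"
    # New behavior: suffix is the rule name, or the rule index when unset.
    return f"PodFailurePolicy_{rule_name or rule_index}"
```

The outer check mirrors the statement that the new field is only read inside code protected by the PodFailurePolicy feature gate.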

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/controller/job: 02/05/2024 - 91.5%
k8s.io/kubernetes/pkg/apis/batch/v1: 06/05/2024 - 87.3%
k8s.io/kubernetes/pkg/apis/batch/v1beta1: 06/05/2024 - 78.3%
k8s.io/kubernetes/pkg/apis/batch/validation: 06/05/2024 - 87.7%

Integration tests

Test that when the feature flag is enabled and a Job’s PodFailurePolicy triggers a Job failure due to a matching PodFailurePolicyRule, the JobFailed condition has a reason of PodFailurePolicy_{Name}.
Test that when the feature flag is disabled after previously being enabled, and an existing Job already has the JobFailed condition reason set with the new suffix (i.e., PodFailurePolicy_{Name}), the Job controller does not overwrite the reason with PodFailurePolicy and the reason remains set to the existing value.
Add test cases for both onPodConditions and onExitCodes to ensure the Name or the rule’s index (when Name is empty) is properly added.

e2e tests

We will add a test case similar to the integration test case:

When the feature flag is enabled and a Job’s PodFailurePolicy triggers a Job failure, due to a matching PodFailurePolicyRule, check that the JobFailed condition has the PodFailurePolicy_{Name} reason set correctly.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Initial unit and integration tests are implemented
Documentation is updated

Beta

Address reviews and bug reports from Alpha users
Feature is stable in Alpha for 1 release cycle
Feature flag enabled by default

GA

Address reviews and bug reports from Beta users
Feature is stable in Beta for 2 full release cycles

Upgrade / Downgrade Strategy

After a user upgrades their cluster to a k8s version which supports this feature, the user can use this feature by simply specifying the new field in their podFailurePolicy config.

When a user downgrades from a k8s version that supports this field to one that does not support this field:

for existing Jobs, this new field will be ignored by the Job controller, resulting in the condition reason being set to the previous default of PodFailurePolicy for any Job failures triggered by a pod failure policy.
for new Jobs, the kube-apiserver would remove this field when the Job is submitted.

Version Skew Strategy

This feature is limited to control plane, so the version skew with kubelet does not matter.

In case kube-apiserver is running in HA mode, and the versions are skewed, then the old version of kube-apiserver (from before this change) may not handle the new Name field if it is set in a Job PodFailurePolicy spec.

In case the version of the kube-controller-manager leader is skewed (old), the built-in Job controller would reconcile the Jobs with the new Name field and simply drop the field, thereby not using it when setting the JobFailed condition reason.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

Upgrade to k8s version 1.31+
Enable feature flag PodFailurePolicyName

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: PodFailurePolicyName
- Components depending on the feature gate:
  - kube-controller-manager
  - kube-apiserver
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane?
- Will enabling / disabling the feature require downtime or reprovisioning of a node?

Does enabling the feature change any default behavior?

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the feature flag PodFailurePolicyName.

What happens if we reenable the feature if it was previously rolled back?

For new Jobs, the apiserver will stop wiping out the new field (Name). For existing Jobs, the Job controller will stop ignoring the new field, and begin using it as described in previous sections.

Are there any tests for feature enablement/disablement?

We can add unit tests for:

feature enabled and field set
feature disabled and field set
feature disabled after Jobs have JobFailed condition with reason set using the new format

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If any component has not yet rolled out, or fails to roll out, the existing default behavior will continue to apply, and there is no downtime during partial rollout or rollback.

What specific metrics should inform a rollback?

A substantial increase in the job_sync_duration_seconds metric may suggest the processing of the configured job pod failure policy rules consumes too much time.

An operator can also observe job_pods_finished_total to check if the reason count of taken actions (FailJob, Count or Ignore) correlates with the expected changes based on the Job workload specificity.

Additionally, an operator should check if the failed Jobs have the correct reason set on the JobFailed condition, as described in the design details.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Feature is not implemented yet so we cannot test these paths.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

PodFailurePolicy reason format will be deprecated in GA and replaced by PodFailurePolicy_{RuleName}. Until then, we will maintain both based on conditional logic behind a feature flag.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
- Details:

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - job_sync_duration_seconds (existing): can be used to see how much the feature enablement increases the time spent in the sync job
- Components exposing the metric: kube-controller-manager

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

If the optional name field is specified, the podFailurePolicy object size will increase by 1 byte per character in the name string. The name field will be no longer than 128 characters, thus the max size increase of the PodFailurePolicy object will be 128 bytes. Otherwise, if unset, it will default to the index of the rule, and thus increase by 1 byte per digit in the index number.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2024-05-24: KEP Published
2026-06-09: Update the milestone to v1.38

Drawbacks

It is a less intuitive user experience to have the PodFailurePolicy rule name be appended as a suffix to the Job failure reason, rather than

Alternatives

We discussed the idea of having a new optional PodFailurePolicyRule field SetConditionReason, which would enable the user to explicitly set the condition reason they want on the JobFailed condition added to the Job when that pod failure policy rule triggers a Job failure. However, ultimately it was decided we didn’t want to open up the reason field to be explicitly set by the user to any arbitrary value, as this would be tricky to validate, and would diverge from the current paradigm of having only machine-set reasons which are determined programmatically.