KEP-4017: Pod Index Label

Implementation History
STABLE Implemented
Created 2023-05-16
Latest v1.32
Milestones
Beta v1.28
Stable v1.32
Ownership
Owning SIG
SIG Apps
Primary Authors

KEP-4017: Pod Index Label for StatefulSets and Indexed Jobs

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Currently, StatefulSet pods do not have their pod index as a label or annotation, and Indexed Jobs only set the pod completion index as an annotation. This KEP proposes to set the pod index as a label on pods for StatefulSets and Indexed Jobs.

Motivation

StatefulSets, similar to Indexed Jobs, assign a pod ordinal to each of their pods (this is analogous to the pod completion index for Indexed Jobs).

Indexed Jobs set the pod index as an annotation, which means it can be utilized by the downward API. However, it cannot easily be used to select pods by index, to filter metrics/logs, or to route network traffic to a pod with a specific index (e.g., index 0).

StatefulSets include the pod index as part of the pod name. This means that the only way for a StatefulSet pod to know its own index is to parse the pod name, which is a somewhat hacky solution. Furthermore, the pod index cannot be utilized by the downward API, nor used for use cases such as filtering metrics/logs or routing traffic to a pod with a specific index.

To address these issues, we propose to set the pod index as a label on pods for both StatefulSets and Indexed Jobs. This will allow the pod index to be utilized via the downward API, so StatefulSet pods can easily know their own index. In addition, this will allow metrics and logs to more easily be filtered by pod index, for both StatefulSets and Indexed Jobs.
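To make the contrast concrete, the following is a minimal sketch of the two ways a workload could recover its index: parsing the pod name versus reading the proposed label. The function names are hypothetical, and the labels are modelled as a plain dictionary rather than fetched from a cluster.

```python
from typing import Dict, Optional


def ordinal_from_name(pod_name: str) -> int:
    """Recover the ordinal by parsing a StatefulSet pod name such as 'web-2'."""
    # StatefulSet pod names end in '-<ordinal>', so split on the last hyphen.
    _, _, suffix = pod_name.rpartition("-")
    return int(suffix)


def ordinal_from_labels(labels: Dict[str, str]) -> Optional[int]:
    """Read the ordinal from the proposed label; None if the label is absent."""
    value = labels.get("apps.kubernetes.io/pod-index")
    return int(value) if value is not None else None


print(ordinal_from_name("example-statefulset-2"))
print(ordinal_from_labels({"apps.kubernetes.io/pod-index": "2"}))
```

The name-based variant depends on the naming convention, while the label-based variant is a plain key lookup that also tells the caller when the label is missing.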

Goals

  • Set pod index as label on pods for StatefulSet/Indexed Jobs.
  • Adding the label should not be disruptive to existing workloads.

Non-Goals

Proposal

At a high level, the proposal is to modify the StatefulSet and Job controllers to set the pod index as a pod label at pod creation time (for jobs, this would only apply to jobs in Indexed completion mode). The details of this are outlined in the Design Details section below.

  • StatefulSet pod label: apps.kubernetes.io/pod-index
  • IndexedJob pod label: batch.kubernetes.io/job-completion-index (same as existing annotation)
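As an illustration of how these label keys could be consumed through the downward API, here is a small sketch that builds a container environment entry as a Python dictionary mirroring the manifest structure. The helper name and the POD_INDEX variable name are assumptions made for this example only.

```python
STATEFULSET_INDEX_LABEL = "apps.kubernetes.io/pod-index"
JOB_INDEX_LABEL = "batch.kubernetes.io/job-completion-index"


def index_env_var(env_name: str, label_key: str) -> dict:
    """Build a container env entry that reads a pod label via the downward API."""
    return {
        "name": env_name,
        "valueFrom": {
            "fieldRef": {"fieldPath": f"metadata.labels['{label_key}']"}
        },
    }


# POD_INDEX is a hypothetical variable name chosen for this example.
env = index_env_var("POD_INDEX", STATEFULSET_INDEX_LABEL)
print(env["valueFrom"]["fieldRef"]["fieldPath"])
```

Serialized to YAML, the resulting dictionary would sit under a container's env list in the pod template.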

User Stories (Optional)

Story 1

As a user, I would like to look up a job’s pod logs by its index.

Story 2

As a user, I would like to target traffic to a specific pod index (e.g., index 0) in a StatefulSet or Indexed Job. Instead of creating a service which matches an entire Job, I’d like to create a service which matches only the “head” pod, which will be more performant, especially for a large number of pods.
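A rough sketch of what such a "head"-only selection could look like is shown below. It models equality-based label selection in plain Python instead of talking to a cluster. The selector contents and pod data are illustrative assumptions, and the job name is a sample value.

```python
from typing import Dict


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """Equality-based selection: every selector pair must appear in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical Service selector that targets only the index-0 ("head") pod.
head_selector = {
    "batch.kubernetes.io/job-name": "sample-indexed-job",
    "batch.kubernetes.io/job-completion-index": "0",
}

# Three illustrative pods of one Indexed Job, each carrying its index label.
pods = [
    {
        "name": f"sample-indexed-job-{i}",
        "labels": {
            "batch.kubernetes.io/job-name": "sample-indexed-job",
            "batch.kubernetes.io/job-completion-index": str(i),
        },
    }
    for i in range(3)
]

selected = [pod["name"] for pod in pods if matches(head_selector, pod["labels"])]
print(selected)
```

Without the index label, the narrowest selector available would be the job name, which matches every pod of the Job.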

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

One thing that must be considered is how enabling this new feature will interact with existing workloads. There are a couple of options:

  1. Only inject the label on newly created pods, so an existing StatefulSet/Indexed Job may include some pods with the label and some without it. This means that to utilize the label via the downward API, or to use the label for pod selection, the user will need to recreate the StatefulSet/Indexed Job so the label is present on all pods.

  2. Inject the label only on pods for newly created StatefulSets/Indexed Jobs. We can track this by annotating newly created StatefulSets/Indexed Jobs to distinguish existing ones from newly created ones. Using this strategy, for a given StatefulSet/IndexedJob, either none of the pods have this label, or all of them do, which will provide a more consistent user experience. However, in the case of a cluster downgrade to a version without this feature, new pods would start getting created without this label again.

  3. Inject the label on all pods (pods existing prior to feature enablement and pods created after feature enablement). However, retroactively modifying pods of existing workloads would risk being too disruptive to existing workloads which may have logic depending on pod labels, so this option should not be considered.

Neither option 1 nor option 2 will be disruptive to existing workloads. Option 1 is more straightforward and does not risk locking us into adding this somewhat hacky annotation to StatefulSets/Jobs indefinitely, as Option 2 does. On the other hand, outside of the cluster downgrade edge case, Option 2 will ensure consistency within a single StatefulSet/Indexed Job and therefore a more predictable user experience.

After considering these trade-offs, I propose we move forward with Option 1 for simplicity and to avoid being stuck adding this annotation to StatefulSets/Indexed Jobs. In addition, the downside of existing workloads having only a subset of pods with the new label will not cause any serious issues.

Design Details

The StatefulSet controller will only need a minor update to the newStatefulSetPod function, to set the pod ordinal as the label apps.kubernetes.io/pod-index. This call is downstream from the newVersionedStatefulSetPod call, which generates the StatefulSet pods before creating them as necessary in CreateStatefulPod.

Similarly, the Job controller would need to add the completion index as a label at the point where it adds the corresponding annotation.
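The intended behaviour can be modelled with a short sketch. This is a Python illustration only, not the controller code (which is written in Go). The function name is borrowed from the text, while its signature, the gate_enabled parameter, and the dictionary layout are assumptions.

```python
from typing import Dict

POD_INDEX_LABEL = "apps.kubernetes.io/pod-index"


def new_statefulset_pod(set_name: str, ordinal: int,
                        template_labels: Dict[str, str],
                        gate_enabled: bool = True) -> dict:
    """Model of pod construction: the label is set once, at creation time."""
    labels = dict(template_labels)  # copy so the template is not mutated
    if gate_enabled:
        # The label value is the ordinal rendered as a decimal string.
        labels[POD_INDEX_LABEL] = str(ordinal)
    return {"metadata": {"name": f"{set_name}-{ordinal}", "labels": labels}}


pod = new_statefulset_pod("example-statefulset", 1, {"app": "example"})
print(pod["metadata"]["labels"])
```

Because the label is written only while the pod object is being built, pods that already exist are never touched, which matches Option 1 in Risks and Mitigations.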

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/controller/job: 10/09/2024 - 92%
  • k8s.io/kubernetes/pkg/controller/statefulset: 10/09/2024 - 85.6%
Integration tests
  • Existing integration tests will be updated as a criterion for GA
e2e tests

The e2e test checks the value of the label: https://github.com/kubernetes/kubernetes/blob/d9c46d8ecb1ede9be30545c9803e17682fcc4b50/test/e2e/apps/job.go#L435-L467

Graduation Criteria

We will release the feature directly in Beta, since there is no benefit in having an alpha release: we are simply adding a new label, so there is very little risk (unlike removing an existing label, which other things may depend on).

Beta

  • Feature implemented behind the PodIndexLabel feature gate.
  • Unit and integration tests passing.
  • Docs are clear that it is managed by the workload controller(s), and it is NOT guaranteed for every pod.
  • Docs are clear about what happens if two pods get the same value (it is set by workload controllers, nothing in the API system will prevent collisions from happening).

GA

  • the PodIndexLabel feature-gate will be locked and the code will ignore it
  • Add integration/e2e test for StatefulSet controller, PodIndexLabel feature
  • Update existing integration test for IndexedJob to validate the label value

Upgrade / Downgrade Strategy

After a user upgrades their cluster to a version which supports this feature (and has the feature gate enabled) the user will need to redeploy their StatefulSets / Indexed Jobs so that all pods have the pod index label, since after the upgrade only newly created pods will have this pod index label added.
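One way an operator might decide which workloads still need a redeploy is to list the pods that lack the label. The sketch below works on pod objects already loaded as dictionaries. The helper name and the sample data are hypothetical.

```python
from typing import List


def pods_missing_label(pods: List[dict], label_key: str) -> List[str]:
    """Return the names of pods that do not carry the given label key."""
    missing = []
    for pod in pods:
        labels = pod["metadata"].get("labels") or {}
        if label_key not in labels:
            missing.append(pod["metadata"]["name"])
    return missing


# Illustrative data: one pod created before the upgrade, one created after.
pods = [
    {"metadata": {"name": "example-statefulset-0",
                  "labels": {"app": "example"}}},
    {"metadata": {"name": "example-statefulset-1",
                  "labels": {"app": "example",
                             "apps.kubernetes.io/pod-index": "1"}}},
]

stale = pods_missing_label(pods, "apps.kubernetes.io/pod-index")
print(stale)
```

An empty result for a given workload suggests that all of its pods were created after the feature was enabled.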

Version Skew Strategy

N/A. This feature doesn’t require coordination between control plane components; the changes to each controller are self-contained.

If there were version skew between the control plane components and the node components, where the control plane components were at version N where this feature exists, and the node components were at version N-1 where this feature does not exist, there would be no adverse effects; the new label would simply be added to StatefulSet/Indexed Job pods.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

The feature can be safely rolled back.

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: PodIndexLabel
    • Components depending on the feature gate:
      • kube-controller-manager
  • Other
    • Describe the mechanism:
    • Will enabling / disabling the feature require downtime of the control plane?
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
Does enabling the feature change any default behavior?

Yes, a new label is added to pods created for StatefulSets (apps.kubernetes.io/pod-index) and Indexed Jobs (batch.kubernetes.io/job-completion-index).

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. If the feature gate is disabled, the StatefulSet/Job controller will not add the pod index as a label. Already existing pods will not be modified.

What happens if we reenable the feature if it was previously rolled back?

The StatefulSet/Job controller will begin adding the pod index as a label to pods created while the feature is enabled, and existing pods will be unaffected.

Are there any tests for feature enablement/disablement?

Given that this feature doesn’t introduce any new API field, enablement/disablement tests will not provide reasonable value and won’t be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It will not impact already running workloads.

What specific metrics should inform a rollback?
  • Users can monitor queue related metrics (e.g., queue depth and work duration) to make sure they aren’t growing.
  • For Indexed Jobs, users can also monitor job_sync_duration_seconds.
  • For StatefulSets: the kube_statefulset_status_replicas metric can be compared against the kube_statefulset_replicas metric to check the actual number of pods matched by the StatefulSet’s selector against the expected number of replicas. A divergence between these values during steady-state operation can indicate that the number of replicas created by the StatefulSet does not match the expected number of replicas.

On a large scale (across a large number of StatefulSets), the distribution of the ratio of these two metrics should not change when enabling this feature. If this ratio changes significantly after enabling this feature, it could indicate a problem and that a rollback is necessary.
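The ratio check described above can be sketched as follows, assuming the two gauge values have already been scraped per StatefulSet. The sample data, the 0.05 tolerance, and the function names are assumptions for illustration. As noted under the SLOs, a real threshold would need tuning per cluster.

```python
from typing import Dict, List, Tuple


def replica_ratio(status_replicas: float, spec_replicas: float) -> float:
    """Ratio of kube_statefulset_status_replicas to kube_statefulset_replicas."""
    if spec_replicas == 0:
        # A StatefulSet scaled to zero is treated as converged in this sketch.
        return 1.0
    return status_replicas / spec_replicas


def diverging(samples: Dict[str, Tuple[float, float]],
              tolerance: float = 0.05) -> List[str]:
    """Names of StatefulSets whose ratio deviates from 1.0 by more than tolerance."""
    return [
        name
        for name, (status, spec) in samples.items()
        if abs(replica_ratio(status, spec) - 1.0) > tolerance
    ]


# Hypothetical scrape: name -> (status_replicas, spec_replicas).
samples = {"db": (3, 3), "cache": (1, 3), "idle": (0, 0)}
print(diverging(samples))
```

Comparing the list before and after enabling the feature gives a rough signal of whether the distribution of ratios has shifted.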

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

For StatefulSet

  1. A kind Kubernetes 1.31 cluster was created
# k version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.0
  1. A sample StatefulSet was created. Since the default value of the PodIndexLabel feature gate is true, the pods had the following labels:
# k get pods -oyaml | grep '    name: example-statefulset-\|index'
      apps.kubernetes.io/pod-index: "0"
    name: example-statefulset-0
      apps.kubernetes.io/pod-index: "1"
    name: example-statefulset-1
      apps.kubernetes.io/pod-index: "2"
    name: example-statefulset-2
  1. The controller-manager yaml was modified to disable the feature gate, for testing downgrades:
# k logs -f -n kube-system kube-controller-manager-kind-1.31-dra-control-plane  | grep feature
I1008 21:12:12.361613       1 flags.go:64] FLAG: --feature-gates=":DynamicResourceAllocation=true,:PodIndexLabel=false"
I1008 21:12:30.602829       1 controllermanager.go:749] "Controller is disabled by a feature gate" controller="storageversion-garbage-collector-controller" requiredFeatureGates=["APIServerIdentity","StorageVersionAPI"]
I1008 21:12:30.653581       1 controllermanager.go:749] "Controller is disabled by a feature gate" controller="service-cidr-controller" requiredFeatureGates=["MultiCIDRServiceAllocator"]

The controller did not re-write the pod labels, as expected

# k get pods -oyaml | grep '    name: example-statefulset-\|index'
      apps.kubernetes.io/pod-index: "0"
    name: example-statefulset-0
      apps.kubernetes.io/pod-index: "1"
    name: example-statefulset-1
      apps.kubernetes.io/pod-index: "2"
    name: example-statefulset-2
  1. The statefulset was deleted and re-created, pods were created without the index label
# k get pods -oyaml | grep '    name: example-statefulset-\|index'
    name: example-statefulset-0
    name: example-statefulset-1
    name: example-statefulset-2
  1. The controller-manager yaml was modified to enable the feature gate, for testing upgrade
# k logs -f -n kube-system kube-controller-manager-kind-1.31-dra-control-plane  | grep feature
I1008 21:14:46.348747       1 flags.go:64] FLAG: --feature-gates=":DynamicResourceAllocation=true"

The controller-manager did not update the labels

# k get pods -oyaml | grep '    name: example-statefulset-\|index'
    name: example-statefulset-0
    name: example-statefulset-1
    name: example-statefulset-2
  1. The statefulset was deleted and re-created, pods were created with the index label
# k get pods -oyaml | grep '    name: example-statefulset-\|index'
      apps.kubernetes.io/pod-index: "0"
    name: example-statefulset-0
      apps.kubernetes.io/pod-index: "1"
    name: example-statefulset-1
      apps.kubernetes.io/pod-index: "2"
    name: example-statefulset-2

For IndexedJob

  1. A kind Kubernetes 1.31 cluster was created
# k version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.0
  1. A sample IndexedJob was created. Since the default value of the PodIndexLabel feature gate is true, the pods had the following labels:
# k get pods -oyaml | grep '    name: sample-indexed-job-[0-9]\|job-completion-index'
      batch.kubernetes.io/job-completion-index: "0"
      batch.kubernetes.io/job-completion-index: "0"
    name: sample-indexed-job-0-8sgb7
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "1"
      batch.kubernetes.io/job-completion-index: "1"
    name: sample-indexed-job-1-f9mz4
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "2"
      batch.kubernetes.io/job-completion-index: "2"
    name: sample-indexed-job-2-5gxwz
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
  1. The controller-manager yaml was modified to disable the feature gate, for testing downgrades:
# k logs -f -n kube-system kube-controller-manager-kind-1.31-dra-control-plane  | grep feature
I1010 02:33:21.331424       1 flags.go:64] FLAG: --feature-gates=":DynamicResourceAllocation=true,:PodIndexLabel=false"

The controller did not re-write the pod labels, as expected

# k get pods -oyaml | grep '    name: sample-indexed-job-[0-9]\|job-completion-index'
      batch.kubernetes.io/job-completion-index: "0"
      batch.kubernetes.io/job-completion-index: "0"
    name: sample-indexed-job-0-8sgb7
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "1"
      batch.kubernetes.io/job-completion-index: "1"
    name: sample-indexed-job-1-f9mz4
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "2"
      batch.kubernetes.io/job-completion-index: "2"
    name: sample-indexed-job-2-5gxwz
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
  1. The IndexedJob was deleted and re-created, pods were created without the index label (some of the output is truncated for brevity)
# k get pods -oyaml | grep -A 4 '    name: sample-indexed-job-[0-9]\|labels'
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-0-8ttb5
--
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-1-tvjqc
--
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-2-r75jw
  1. The controller-manager yaml was modified to enable the feature gate, for testing upgrade
# k logs -f -n kube-system kube-controller-manager-kind-1.31-dra-control-plane  | grep feature
I1010 02:39:22.329026       1 flags.go:64] FLAG: --feature-gates=":DynamicResourceAllocation=true,:PodIndexLabel=true"

The controller-manager did not update the labels

# k get pods -oyaml | grep -A 4 '    name: sample-indexed-job-[0-9]\|labels'
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-0-8ttb5
--
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-1-tvjqc
--
    labels:
      batch.kubernetes.io/controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      batch.kubernetes.io/job-name: sample-indexed-job
      controller-uid: bf96f9c0-b7ec-4c7e-9a4c-9cca20b26d35
      job-name: sample-indexed-job
    name: sample-indexed-job-2-r75jw
  1. The IndexedJob was deleted and re-created, pods were created with the index label
# k get pods -oyaml | grep '    name: sample-indexed-job-[0-9]\|job-completion-index'
      batch.kubernetes.io/job-completion-index: "0"
      batch.kubernetes.io/job-completion-index: "0"
    name: sample-indexed-job-0-d7d7m
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "1"
      batch.kubernetes.io/job-completion-index: "1"
    name: sample-indexed-job-1-gg9sv
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
      batch.kubernetes.io/job-completion-index: "2"
      batch.kubernetes.io/job-completion-index: "2"
    name: sample-indexed-job-2-nfxlr
            fieldPath: metadata.labels['batch.kubernetes.io/job-completion-index']
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

N/A

How can an operator determine if the feature is in use by workloads?
  • Check if StatefulSet pods have the label apps.kubernetes.io/pod-index.
  • Check if Indexed Job pods have the label batch.kubernetes.io/job-completion-index.
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .metadata
    • Condition name:
    • Other field:
      • .metadata.labels['apps.kubernetes.io/pod-index'] for StatefulSets
      • .metadata.labels['batch.kubernetes.io/job-completion-index'] for Indexed Jobs
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
  • Jobs: 99th percentile over a day for Job syncs is <= 15s for a client-side 50 QPS limit.
  • StatefulSets: the ratio of kube_statefulset_status_replicas/kube_statefulset_replicas should be near 1.0, although as unhealthy replicas are often an application error rather than a problem with the stateful set controller, this will need to be tuned by an operator on a per-cluster basis.
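As a rough illustration of the Job SLO, the sketch below evaluates a 99th-percentile bound over raw sync durations using the nearest-rank method. In practice the percentile would more likely be derived from the job_sync_duration_seconds histogram by the monitoring system. The raw-sample input and the function names here are assumptions.

```python
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), 1-based."""
    if not values:
        raise ValueError("at least one sample is required")
    ordered = sorted(values)
    # Integer arithmetic for the ceiling avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def meets_job_sync_slo(durations_s: Sequence[float],
                       limit_s: float = 15.0) -> bool:
    """True if the 99th percentile of the sync durations is within the limit."""
    return percentile_nearest_rank(durations_s, 99) <= limit_s


# Hypothetical day of samples: 99 fast syncs and a single slow outlier.
durations = [0.2] * 99 + [30.0]
print(meets_job_sync_slo(durations))
```

A single outlier in a hundred samples does not breach the bound, whereas two would.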
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Jobs:

  • Metric name: job_sync_duration_seconds, job_sync_total.
  • Components exposing the metric: kube-controller-manager

StatefulSets:

  • Metric name: statefulset_reconcile_delay
    • [Optional] Aggregation method: quantile
    • Components exposing the metric: kube-controller-manager
  • Metric name: kube_statefulset_replicas
    • [Optional] Aggregation method: gauge
    • Components exposing the metric: kube-controller-manager
  • Metric name: kube_statefulset_status_replicas
    • [Optional] Aggregation method: gauge
    • Components exposing the metric: kube-controller-manager
  • Metric name: kube_statefulset_ordinals_start
    • [Optional] Aggregation method: gauge
    • Components exposing the metric: kube-controller-manager
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

New pod label of size 34B plus value of size N where N is the number of digits in the pod ordinal. Worst case for N would be the max number of pods per cluster. Per the docs on large clusters this is 150,000 (6 digits). So max label size would be 34 + 6 = 40B.
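The arithmetic above can be written out as a small sketch. The key size of 34 bytes and the 150,000-pod figure are taken from the text rather than measured. The key size is a parameter so that the length of whichever label key is in use can be substituted.

```python
def max_label_bytes(key_bytes: int, max_index: int) -> int:
    """Label size = key size + number of decimal digits in the largest index."""
    return key_bytes + len(str(max_index))


# 34 is the key size quoted in the text; 150,000 is the cited pod limit.
print(max_label_bytes(34, 150_000))
```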

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Pods cannot be created; this feature does not change that behavior.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2023-05-17: KEP published
  • 2023-07-14: Feature merged with feature gate in beta
  • 2024-10-09: Feature graduated to GA

Drawbacks

Alternatives

Add pod index as annotation

This was discussed, but there are use cases where a label is required (filtering metrics, logs). See discussion link.

Add pod index as both label and annotation

This was discussed, but there are no concrete use cases that require an annotation and that a label cannot fulfill. See discussion link.

Infrastructure Needed (Optional)