KEP-5067: Pod Generation

Implementation History
STABLE Implemented
Created 2025-01-21
Latest v1.35
Milestones
Alpha v1.33
Beta v1.34
Stable v1.35
Ownership
Owning SIG
SIG Node

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This proposal aims to allow the pod status to express which pod updates are currently being reflected in the pod status. The idea is to leverage the existing metadata.Generation field and add a new status.observedGeneration field to the pod status.

Motivation

One of the motivations for this KEP comes from the existing ResizeStatus field. In its original implementation, the ResizeStatus field was written to by both the API server and the Kubelet, creating a race condition on updates. Removing the Proposed state makes the Kubelet the only writer to that field, but leaves a gap in knowing whether the latest resize has been acknowledged by the Kubelet. The changes proposed in this KEP resolve this gap.

The ResizeStatus is used as an example here, but in practice this issue can be generally found in any type of pod update.

Goals

  • Provide a general solution for the pod status to express which pod update is currently being reflected.

Non-Goals

  • Expand the set of mutable fields.

Proposal

Current behavior

The pod metadata.generation field does exist today, and is documented as “a sequence number representing a specific generation of the desired state. Set by the system and monotonically increasing, per-resource.”

Its current behavior in pods is:

  • Pod metadata.generation is not populated by default by the system.
  • The client can custom-set the metadata.generation on pod create.
  • metadata.generation cannot be updated by the client.
  • metadata.generation does not get incremented by the system when the podspec is updated.
  • metadata.generation does get incremented by the system when DeletionTimestamp is set.

API Changes

Generation

The metadata.generation field is currently unused on pods, but we can start leveraging it to help track which pod state is currently being reflected in the pod status. For consistency, pod metadata.generation will be incremented whenever the pod has changes that the kubelet needs to actuate.

ObservedGeneration

A new optional status.observedGeneration field will be added to the pod status. The kubelet will set this to communicate which pod state is being expressed in the current pod status. This is analogous to the status.observedGeneration that exists in other resources’ statuses, such as StatefulSets.

The status.observedGeneration may not necessarily be a reflection of every single field in the pod status. Instead, it reports the latest generation that the kubelet has seen. This means that status.observedGeneration captures the kubelet’s decision to admit a change and its acknowledgment that it has seen the pod’s new spec. There are cases when status.observedGeneration may lag: other status values may already reflect a newer generation, but the next update from the kubelet SHOULD bring status.observedGeneration to the current value. It also will not necessarily reflect that the kubelet has completed actuation of certain fields. There is a field-by-field analysis written up in this doc. We will have to carefully document the nuanced meaning of status.observedGeneration to avoid confusion.

Likewise, a new optional observedGeneration field will be added to the pod’s status.condition struct. This is to keep parity with the metav1.Condition struct.

The net result will be a new status.observedGeneration field for the kubelet to express which generation the top-level status relates to, and a new status.conditions[i].observedGeneration field for the writer of that condition to express which generation that condition relates to.
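
As a rough illustration of the resulting shape, the sketch below uses simplified stand-in types rather than the real Kubernetes API types. The helper names kubeletHasSeenLatestSpec and conditionIsCurrent are hypothetical and exist only to show how a client might compare the fields.

```go
package main

// Simplified stand-ins for the Kubernetes API types; only the fields
// discussed in this section are shown.
type ObjectMeta struct {
	// Generation is managed by the API server.
	Generation int64 `json:"generation,omitempty"`
}

type PodCondition struct {
	Type   string `json:"type"`
	Status string `json:"status"`
	// ObservedGeneration is the generation the writer of this condition
	// based the condition on.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

type PodStatus struct {
	// ObservedGeneration is the latest generation the kubelet has seen.
	ObservedGeneration int64          `json:"observedGeneration,omitempty"`
	Conditions         []PodCondition `json:"conditions,omitempty"`
}

type Pod struct {
	Metadata ObjectMeta `json:"metadata"`
	Status   PodStatus  `json:"status"`
}

// kubeletHasSeenLatestSpec (hypothetical helper) reports whether the
// top-level status was written for the current generation. It says nothing
// about whether the kubelet has finished actuating that generation.
func kubeletHasSeenLatestSpec(p Pod) bool {
	return p.Status.ObservedGeneration != 0 &&
		p.Status.ObservedGeneration == p.Metadata.Generation
}

// conditionIsCurrent (hypothetical helper) reports whether the named
// condition was written for the current generation.
func conditionIsCurrent(p Pod, condType string) bool {
	for _, c := range p.Status.Conditions {
		if c.Type == condType {
			return c.ObservedGeneration == p.Metadata.Generation
		}
	}
	return false
}
```

Note that a true result from kubeletHasSeenLatestSpec only means the kubelet has acknowledged the latest spec, in line with the caveats above.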

Notes/Constraints/Caveats (Optional)

Because there is only one status.observedGeneration field in the pod status, only one writer can set it. This is consistent with other object types that have status.observedGeneration, and the expectation is that the primary controller for the object sets the field. For a pod, that primary controller is the kubelet, so once a pod is bound to a node, we expect that the kubelet on that node is the sole writer of status.observedGeneration.

Risks and Mitigations

Custom-set metadata.generation

Today, it is possible for a client to custom-set a pod’s metadata.generation on creation, but once set, the metadata.generation cannot be updated. That means that there is a possibility for an external client to be setting metadata.generation on pod create and depend on that fixed value somehow in its own reconciliation logic.

That said, the metadata.generation field is described as “set by the system and monotonically increasing”, and thus it should be clear enough that metadata.generation was not intended to be used in this way.

Infinite loop caused by misbehaving mutating webhooks

It is possible that mutating webhooks exist today that overwrite a pod’s status. These older webhooks would not know about status.observedGeneration and could be clearing it. This would cause an infinite loop of the kubelet attempting to update a pod’s status.observedGeneration and the webhook clearing it.

Status-mutating webhooks could break more pod features than just what is proposed in this KEP, so we will not attempt to solve this here. This risk can be mitigated by improving the documentation on webhooks and how they can be written to avoid these kinds of scenarios.

Symptoms of this scenario that users can look out for include:

  • Unexpected sudden spikes in pod status update API calls that occur right after either upgrading the kubelet or creating a new status-mutating webhook.
  • status.observedGeneration remains unchanged after a pod sync loop even when metadata.generation is changing. This occurs because the API server would be preserving any existing value of status.observedGeneration whenever the webhook attempts to clear it.

Design Details

API server and generation

For a newly created pod, the API server will set metadata.generation to 1. For any updates to the PodSpec, the API server will increment metadata.generation by 1.

As described in the field-by-field analysis doc above, the PodSpec mutable fields today are:

  • Resources
  • Ephemeral Containers
  • Container image
  • ActiveDeadlineSeconds
  • TerminationGracePeriodSeconds
  • Tolerations

If any new mutable fields are added to the PodSpec in the future, they will also cause the API server to increment metadata.generation.

The pod metadata.generation will also continue to be incremented on graceful delete or deferred delete, just as the API server does today for pods and other objects.

Pod updates that would not result in metadata.generation being incremented include:

  • Changes to metadata (with the exception of DeletionTimestamp). This means that if a Pod uses the downward API to make pod metadata available to containers, Pod behavior can change without the generation being incremented. We will consider this working as intended.
  • Changes to status.

The logic to set new pods’ metadata.generation to 1 and to increment metadata.generation on update will run after all mutating webhooks have finished.
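
The create and update rules above can be sketched as follows. This is a minimal illustration using simplified stand-in types, not the actual API server code: prepareForCreate and prepareForUpdate are hypothetical names, the standard library's reflect.DeepEqual stands in for whatever equality check the API server uses, and the deletion case is omitted.

```go
package main

import "reflect"

// PodSpec and Pod are simplified stand-ins for the real API types.
type PodSpec struct {
	Image       string
	Tolerations []string
}

type Pod struct {
	Generation int64             // metadata.generation
	Labels     map[string]string // example of metadata that does not bump generation
	Spec       PodSpec
}

// prepareForCreate (hypothetical) ignores any client-supplied generation
// and starts every new pod at 1.
func prepareForCreate(pod *Pod) {
	pod.Generation = 1
}

// prepareForUpdate (hypothetical) ignores any client-supplied generation,
// carries the stored value forward, and increments it only when the spec
// has changed.
func prepareForUpdate(newPod, oldPod *Pod) {
	newPod.Generation = oldPod.Generation
	if !reflect.DeepEqual(newPod.Spec, oldPod.Spec) {
		newPod.Generation++
	}
}
```

Under these assumptions, a pod stored with generation 0 (for example, one created before the API server was upgraded) moves to generation 1 on its first spec update, while a label-only update leaves the generation untouched.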

Client requests to update generation

Any attempts by clients to set or modify the metadata.generation field themselves will be ignored and overridden by the API Server. This is consistent with existing behavior of the metadata.generation field in all other objects.

Kubelet and observedGeneration

When the Kubelet updates the pod status as part of the pod sync loop, it will set the status.observedGeneration in the pod to reflect the pod metadata.generation corresponding to the snapshot of the pod currently being synced. That means if the pod spec gets updated concurrently while the kubelet is performing a pod sync loop on a previous update, the status.observedGeneration will be behind metadata.generation.

Outside of the pod sync loop, another place where the kubelet sets the pod status is when a pod is rejected (during HandlePodAdditions). This code will also be modified to populate status.observedGeneration to express which metadata.generation was rejected. In this case, the kubelet will not be updating the pod through the sync loop (due to the rejection).

The only other place where the pod status is updated is on readiness and probe updates, but we will leave status.observedGeneration unchanged there, since the probe that updated the status is the one synced in the last pod sync loop.
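
The two kubelet write paths described above can be sketched as below. The types are simplified stand-ins, and statusForSync and statusForRejection are hypothetical names rather than actual kubelet functions.

```go
package main

// Pod is a simplified snapshot of a pod as seen by the kubelet.
type Pod struct {
	Name       string
	Generation int64 // metadata.generation of this snapshot
}

// PodStatus is a simplified stand-in for the pod status.
type PodStatus struct {
	Phase              string
	Reason             string
	ObservedGeneration int64
}

// statusForSync (hypothetical) builds the status written during a pod sync
// loop iteration. It stamps the generation of the snapshot being synced,
// not whatever the API server holds by the time the write happens.
func statusForSync(snapshot Pod, phase string) PodStatus {
	return PodStatus{Phase: phase, ObservedGeneration: snapshot.Generation}
}

// statusForRejection (hypothetical) builds the status written when the
// kubelet rejects a pod. The rejected generation is recorded here because
// the sync loop will not run for this pod.
func statusForRejection(pod Pod, reason string) PodStatus {
	return PodStatus{Phase: "Failed", Reason: reason, ObservedGeneration: pod.Generation}
}
```

If the spec moves to generation 4 while a sync of the generation 3 snapshot is in flight, the status written by that sync carries observedGeneration 3, and a later sync of the newer snapshot brings it to 4.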

Mutable Fields Analysis

The field-by-field analysis doc referenced above goes into further detail about what status.observedGeneration means in relation to the other fields in the pod status. Here is a summary of the conclusions:

  • For some fields, the allocated spec is reflected directly in the pod status, so their associated generation is reflected directly by status.observedGeneration. This is the case for allocated resources, resize status, ephemeral containers.
  • For other fields, the status is an indirect result of actuating the PodSpec, and the associated generation for those fields is the generation before the one reflected by status.observedGeneration. This is the case for actual resources, container image, activeDeadlineSeconds, and terminationGracePeriodSeconds.

To keep things simple and avoid having to add a new field to track the latter, the kubelet will output the PodSpec metadata.generation that was observed at the time of the current sync loop, even though the kubelet has not actuated the change yet. We will document this very clearly and explicitly to avoid confusion.

Other writers of pod status

There are other writers of pod status besides the Kubelet including the scheduler and the node lifecycle controller. These should also populate status.observedGeneration whenever they make a status update (excluding any updates to pod conditions, which have their own dedicated observedGeneration field). Once the pod is bound to a node, however, the expectation is that only the kubelet will be writing to status.observedGeneration.

The scheduler and node lifecycle controller also write pod conditions. Whenever they set a pod condition, they should just populate the condition.observedGeneration field with the relevant generation.
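
A minimal sketch of how a condition writer might stamp the generation is shown below. The types are simplified stand-ins and setCondition is a hypothetical helper; the feature gate check is omitted for brevity.

```go
package main

// PodCondition and Pod are simplified stand-ins for the real API types.
type PodCondition struct {
	Type               string
	Status             string
	ObservedGeneration int64
}

type Pod struct {
	Generation int64 // metadata.generation
	Conditions []PodCondition
}

// setCondition (hypothetical) writes a condition and records the pod
// generation the writer was looking at. An existing condition of the same
// type is replaced; otherwise the condition is appended.
func setCondition(pod *Pod, condType, status string) {
	updated := PodCondition{
		Type:               condType,
		Status:             status,
		ObservedGeneration: pod.Generation,
	}
	for i := range pod.Conditions {
		if pod.Conditions[i].Type == condType {
			pod.Conditions[i] = updated
			return
		}
	}
	pod.Conditions = append(pod.Conditions, updated)
}
```

Because each condition carries its own observedGeneration, conditions written by different components can legitimately refer to different generations of the same pod.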

Client requests to update observedGeneration

During a status update, if the incoming update clears status.observedGeneration back to 0, the API server will preserve the previously existing value. All other updates to status.observedGeneration will be permitted by the API validation, including regressions to lower values.
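
The preservation rule can be sketched as below, assuming a simplified PodStatus type and a hypothetical prepareStatusUpdate function standing in for the API server's status update handling.

```go
package main

// PodStatus is a simplified stand-in for the pod status.
type PodStatus struct {
	ObservedGeneration int64
}

// prepareStatusUpdate (hypothetical) keeps the stored observedGeneration
// when the incoming update would clear it to 0. Any non-zero incoming
// value is accepted as-is, including one lower than the stored value.
func prepareStatusUpdate(incoming *PodStatus, existing PodStatus) {
	if incoming.ObservedGeneration == 0 {
		incoming.ObservedGeneration = existing.ObservedGeneration
	}
}
```

With a stored value of 5, an incoming 0 is written as 5, while an incoming 3 is written as 3.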

Mirror pods

For this KEP, we will not treat mirror pods in any special way. Due to the way they are currently implemented in the kubelet and apiserver, this means:

  1. If a mirror pod’s spec is modified manually by a client via the apiserver, its metadata.generation will be bumped accordingly.
  2. If a static pod’s manifest is updated, the kubelet treats this as a pod deletion followed by a pod creation, which will reset the metadata.generation of the corresponding mirror pod to 1.
  3. The kubelet does not currently propagate the mirror pod’s metadata.generation to the place where the pod status is updated today, so the observedGeneration fields of mirror pods will remain unpopulated.

Future enhancements

We may at some future point reconsider mirror pods and potentially populate metadata.generation and status.observedGeneration on them.

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests

Unit tests will be implemented to cover code changes that implement the feature, in the API server code and the kubelet code.

Core packages touched:

  • pkg/registry/core/pod/strategy.go: 2025-06-16 - 71.1
  • pkg/registry/core/pod/util.go: 2025-06-16 - 74
  • pkg/apis/core/validation/validation.go: 2025-06-16 - 84.6
  • pkg/kubelet: 2025-06-16 - 71
  • pkg/kubelet/status: 2025-06-16 - 86.8

Integration tests

Integration tests were added to cover the node lifecycle controller. See this commit:

e2e tests

E2E tests will be implemented to cover the following cases:

  • Verify that newly created pods have a metadata.generation set to 1.
  • Verify that PodSpec updates (such as tolerations or container images), resize requests, adding ephemeral containers, and binding requests cause the metadata.generation to be incremented by 1 for each update.
  • Verify that deletion of a pod causes the metadata.generation to be incremented by 1.
  • Issue ~500 pod updates (1 every 100ms) and verify that metadata.generation and status.observedGeneration converge to the final expected value.
  • Verify that various conditions each have observedGeneration populated.
  • Verify that mirror pods have metadata.generation and observedGeneration fields set to 1, and that they never change.

Added tests:

Graduation Criteria

Alpha

  • Initial e2e tests completed and enabled
  • metadata.generation functionality implemented
  • status.observedGeneration functionality implemented behind feature flag
  • status.conditions[i].observedGeneration field added to the API
  • status.conditions[i].observedGeneration functionality implemented behind feature flag

Beta

  • metadata.generation, status.observedGeneration, and status.conditions[i].observedGeneration functionality has been implemented and has been running as alpha for at least one release

GA

  • No major bugs reported for three months.
  • No negative user feedback.
  • Promote the primary e2e tests to Conformance.

Upgrade / Downgrade Strategy

API server should be upgraded before Kubelets. Kubelets should be downgraded before the API server.

Version Skew Strategy

Previous versions of clients unaware of metadata.generation functionality would either not set the pod metadata.generation field (having the effective value of 0) or set it to some custom value, though the latter is unlikely. In either case, the API server will ignore whatever value of metadata.generation is set by the client and will manage metadata.generation itself (setting it to 1 for newly created pods, or incrementing it for pod updates).

Already running pods will likewise either not have the pod metadata.generation set (and thus have a default value of 0), or will have a custom value if metadata.generation was explicitly set by a client. On the first update after the API server is upgraded, the API server will increment the value of metadata.generation by 1 from whatever it was set to previously. This means that the first update to a pod that did not yet have a metadata.generation will now have a metadata.generation of 1.

If a pod that has a metadata.generation set or incremented via the new API server is later updated by an older API server, the older API server will not modify the metadata.generation field and it will stay fixed at its current value. That means that if there is version skew between multiple apiservers, the metadata.generation may or may not be incremented. To address this, we will not feature-gate the logic in the apiserver that increments metadata.generation. That means that by the time ObservedGeneration goes to beta, there will be 2 versions of apiservers updating the metadata.generation field, removing the issue.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: PodObservedGenerationTracking
    • Components depending on the feature gate: kubelet, kube-controller-manager, kube-scheduler
  • Other
    • Describe the mechanism:
      • Writers to status.observedGeneration will propagate the pod’s metadata.generation to status.observedGeneration if the feature gate is enabled OR if status.observedGeneration is already set. We will not attempt to clear status.observedGeneration if set in order to avoid an infinite loop between attempting to clear the field and the API server preserving the existing value when an incoming update attempts to clear it.
    • Will enabling / disabling the feature require downtime of the control plane?
      • No.
    • Will enabling / disabling the feature require downtime or reprovisioning of a node?
      • No.

Does enabling the feature change any default behavior?

The pod’s metadata.generation field is currently unset by default and both new observedGeneration fields will be new fields in the pod status, so the feature will not introduce any breaking changes of default behavior.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The status.observedGeneration feature can be disabled by setting the flag to ‘false’ and restarting the kubelet. Disabling the feature in the kubelet means that the kubelet will not propagate metadata.generation to status.observedGeneration for new pods. For existing pods, if status.observedGeneration is already set, the kubelet will continue to propagate metadata.generation to status.observedGeneration. The kubelet will not attempt to clear status.observedGeneration if set in order to avoid an infinite loop between the kubelet attempting to clear the field and the API server preserving the existing value when an incoming update attempts to clear it.
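
The "feature gate enabled OR already set" rule can be sketched as a single decision function. The name observedGenerationToWrite and its parameters are hypothetical; in a real component the first argument would come from the PodObservedGenerationTracking feature gate.

```go
package main

// observedGenerationToWrite (hypothetical) returns the value a writer
// should put in status.observedGeneration. The generation is propagated
// when the feature gate is enabled, or when the field is already set on
// the pod, so that a disabled writer never tries to clear it.
func observedGenerationToWrite(featureEnabled bool, generation, existingObserved int64) int64 {
	if featureEnabled || existingObserved != 0 {
		return generation
	}
	// Gate disabled and field never set: leave it unpopulated.
	return 0
}
```

With the gate disabled, a new pod keeps an unset observedGeneration, while a pod that already had the field populated continues to track metadata.generation.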

Likewise, the conditions[i].observedGeneration feature can be disabled by setting the flag to ‘false’ in the kubelet, node lifecycle controller, and scheduler. When the feature flag is disabled, the condition’s observedGeneration will no longer be populated.

The metadata.generation functionality will intentionally not be behind a feature gate so cannot be disabled except by downgrading the API server.

What happens if we reenable the feature if it was previously rolled back?

The API server will start incrementing metadata.generation, the kubelet will start setting status.observedGeneration, and writers of pod conditions will start populating those conditions’ observedGeneration fields.

Are there any tests for feature enablement/disablement?

Unit tests will be added to cover the code that implements the feature, and will cover the cases of the feature gate being both enabled and disabled.

The following unit test covers what happens if a feature gate is disabled after objects have been written with the new field (in this case, the field should persist).

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout or rollback won’t have significant impact on any components, even if they restart mid-rollout. Already running workloads likewise won’t be significantly impacted.

What specific metrics should inform a rollback?

If users see the metadata.generation and status.observedGeneration fields are not being updated or are significantly misaligned, that indicates that the feature is not working as expected.

Some metrics to look at that could indicate a problem include:

  • kubelet_pod_start_total_duration_seconds
  • kubelet_pod_status_sync_duration_seconds
  • kubelet_pod_worker_duration_seconds

You could also check the Pod Startup Latency SLI.

Any of these being significantly elevated could indicate an issue with the feature.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Testing steps:

  1. Create test pod with old version of API server and node; expected outcome: generation and observedGeneration fields are not populated
  2. Upgrade API server
  3. Send an update request to the running pod; expected outcome: generation is set to 1 and observedGeneration fields are not populated
  4. Create a new pod; expected outcome: generation is set to 1 and observedGeneration fields are not populated
  5. Create upgraded node
  6. Create second test pod on the upgraded node; expected outcome: generation and observedGeneration fields are set to 1
  7. Restart the upgraded node with the feature disabled
  8. Send an update request to the second pod; expected outcome: generation and observedGeneration continue to be updated so are set to 2
  9. Restart the upgraded node with the feature enabled
  10. Send an update request to the second pod; expected outcome: generation and observedGeneration are set to 3

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

They can check if metadata.generation is set on the pod and that observedGeneration is being updated.

How can someone using this feature know that it is working for their instance?
  • API .status
    • Other field: metadata.generation, status.observedGeneration, status.conditions[].observedGeneration

Each pod should have its metadata.generation set, starting at 1 and incremented by 1 for each update.

Each pod’s status.observedGeneration should be populated to reflect the metadata.generation that was last observed by the kubelet.

Each pod’s status.conditions[].observedGeneration should be populated to reflect the metadata.generation that was last observed by the component owning the corresponding condition.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We can reuse the Pod Startup Latency SLI/SLO here.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

We can reuse the Pod Startup Latency SLI/SLO here.

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

Yes, enabling this feature could result in additional API calls. If the pod sync loop results in a new status where status.observedGeneration is the only status field changed, there will be a new status update call. This can occur when a pod generation is updated in the middle of a pod sync loop, but the next sync loop does not have any other status changes.

Will enabling / using this feature result in introducing new API types?

No, this feature does not introduce any new API types.

Will enabling / using this feature result in any new calls to the cloud provider?

No, there will not be any new calls to the cloud provider.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Enabling this feature would negligibly increase the size of pods, since they will have new fields metadata.generation, status.observedGeneration, and conditions[i].observedGeneration populated.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No, this feature will not result in any noticeable performance change.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No, this feature will not increase resource usage.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No, this feature will not result in resource exhaustion.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The feature depends on the API server. If the API server is unavailable, the new fields will not be updated.

What are other known failure modes?

Other failure modes are described under Risks and Mitigations.

Detection and mitigation of the infinite status-update loop by a badly-behaving admission webhook is covered in these docs: https://kubernetes.io/docs/concepts/cluster-administration/admission-webhooks-good-practices/#why-good-webhook-design-matters .

What steps should be taken if SLOs are not being met to determine the problem?

One could disable the feature gate and restart the API server. Additionally, one could investigate the apiserver and/or kubelet logs for errors.

Detection and mitigation of the infinite status-update loop by a badly-behaving admission webhook is covered in these docs. Specifically, the section about detecting loops caused by competing controllers can be helpful.

Implementation History

  • 2025-01-21: initial KEP draft created
  • 2025-02-12: PR feedback addressed, KEP moved to “implementable” and merged
  • 2025-06-05: proposed promotion to beta
  • 2025-09-23: proposed promotion to stable
  • 2026-01-20: mark as implemented after GA release

Drawbacks

We are not currently aware of any drawbacks.

Alternatives

We could fully reflect the version of the spec that the Kubelet is operating on, such as by adding a “desired resources” field to the status or “observed spec” field to the status where we copy the whole podspec in.

ObservedGeneration is preferable over these alternatives because it expresses the same amount of information while being significantly more concise, and because it is consistent with what other resources such as StatefulSet are already doing.