KEP-5067: Pod Generation
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This proposal aims to allow the pod status to express which pod updates are
currently being reflected in the pod status. The idea is to leverage the
existing metadata.generation field and add a new status.observedGeneration field to
the pod status.
Motivation
One of the motivations for this KEP comes from the existing ResizeStatus field. In its original implementation, the ResizeStatus field was written to by both the API server and the Kubelet, creating a race condition on updates. Removing the Proposed state makes the Kubelet the only writer to that field, but leaves a gap in knowing whether the latest resize has been acknowledged by the Kubelet. The changes proposed in this KEP resolve this gap.
The ResizeStatus is used as an example here, but in practice this issue can be generally found in any type of pod update.
Goals
- Provide a general solution for the pod status to express which pod update is currently being reflected.
Non-Goals
- Expand the set of mutable fields.
Proposal
Current behavior
The pod metadata.generation field does exist today, and is
documented as “a sequence number representing a specific generation of the
desired state. Set by the system and monotonically increasing, per-resource.”
Its current behavior in pods is:
- Pod metadata.generation is not populated by default by the system.
- The client can custom-set the metadata.generation on pod create.
- metadata.generation cannot be updated by the client.
- metadata.generation does not get incremented by the system when the podspec is updated.
- metadata.generation does get incremented by the system when DeletionTimestamp is set.
API Changes
Generation
The metadata.generation field is currently unused on pods, but we can start
leveraging it to help track which pod state is currently being reflected in the
pod status. For consistency, pod
metadata.generation will be incremented whenever the pod has changes that the kubelet
needs to actuate.
ObservedGeneration
A new optional status.observedGeneration field will be added to the pod status.
Kubelet will set this to communicate which pod state is being expressed in the
current pod status. This is analogous to the status.observedGeneration that exists
in other resources’ statuses such as StatefulSets.
The status.observedGeneration may not necessarily be a reflection of every single
field in the pod status. Instead, it reports the latest generation that the kubelet
has seen. This means that status.observedGeneration captures the kubelet’s
decision to admit a change, and acknowledge that it has seen the pod’s new spec.
There are cases when status.observedGeneration may lag behind: other status values
may already reflect a newer generation, but the next update from the kubelet SHOULD bring
status.observedGeneration to the current value.
It also will not necessarily be able to reflect that the kubelet has completed actuation
of certain fields. There is a field-by-field analysis written up in this doc. We will
have to carefully document the nuanced meaning of status.observedGeneration to avoid confusion.
Likewise, a new optional observedGeneration field will be added to the pod’s
status.condition struct. This is to keep parity with the
metav1.Condition struct.
The net result will be a new status.observedGeneration field for the kubelet
to express which generation the top-level status relates to, and a new status.conditions[i].observedGeneration field for the writer of that condition to express which generation that condition
relates to.
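As a rough illustration of how a client might read these two fields, here is a minimal sketch. The Pod and PodCondition types and both helper functions are simplified, hypothetical stand-ins invented for this example; they are not the real API types or any existing helper.

```go
package main

import "fmt"

// PodCondition and Pod are simplified, hypothetical stand-ins for the real API
// types; only the fields discussed above are modeled.
type PodCondition struct {
	Type               string
	ObservedGeneration int64 // status.conditions[i].observedGeneration
}

type Pod struct {
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
	Conditions         []PodCondition
}

// kubeletHasSeenLatest reports whether the top-level status was written for
// the pod's current generation. It does not imply that actuation is complete.
func kubeletHasSeenLatest(p Pod) bool {
	return p.Generation != 0 && p.ObservedGeneration == p.Generation
}

// conditionIsCurrent reports whether the writer of the named condition set it
// against the pod's current generation.
func conditionIsCurrent(p Pod, condType string) bool {
	for _, c := range p.Conditions {
		if c.Type == condType {
			return c.ObservedGeneration == p.Generation
		}
	}
	return false
}

func main() {
	p := Pod{
		Generation:         3,
		ObservedGeneration: 2, // the kubelet has not yet reported generation 3
		Conditions:         []PodCondition{{Type: "PodScheduled", ObservedGeneration: 3}},
	}
	fmt.Println(kubeletHasSeenLatest(p), conditionIsCurrent(p, "PodScheduled"))
}
```

The two checks are deliberately independent: the top-level field describes what the kubelet has seen, while each condition carries the generation its own writer observed.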
Notes/Constraints/Caveats (Optional)
Because there is only a single status.observedGeneration in the pod status, only
one writer can set it. This is consistent with other object types that have
status.observedGeneration, and the expectation is that the primary controller for
the object sets the field. For a pod, that primary controller is the kubelet, so
once a pod is bound to a node, we expect that the kubelet on that node is the
sole writer of status.observedGeneration.
Risks and Mitigations
Custom-set metadata.generation
Today, it is possible for a client to custom-set a pod’s metadata.generation on creation, but once
set, the metadata.generation cannot be updated. That means that there is a
possibility for an external client to be setting metadata.generation on pod
create and depend on that fixed value somehow in its own reconciliation logic.
That said, the metadata.generation field is described as “set by the system and monotonically increasing”,
and thus it should be clear enough that metadata.generation was not intended to be
used in this way.
Infinite loop caused by misbehaving mutating webhooks
It is possible that today there exist mutating webhooks that overwrite a pod’s
status. These older webhooks would not know about status.observedGeneration and
could be clearing it. This would cause an infinite loop of the kubelet
attempting to update a pod’s status.observedGeneration and the webhook
clearing it.
Status-mutating webhooks could break more pod features than just what is proposed in this KEP, so we will not attempt to solve this here. This risk can be mitigated by improving the documentation on webhooks and how they can be written to avoid these kinds of scenarios.
Symptoms of this scenario that users can look out for include:
- Unexpected sudden spikes in pod status update API calls that occur right after either upgrading the kubelet or creating a new status-mutating webhook.
- status.observedGeneration remains unchanged after a pod sync loop even when metadata.generation is changing. This occurs because the API server would be preserving any existing value of status.observedGeneration whenever the webhook attempts to clear it.
Design Details
API server and generation
For a newly created pod, the API server will set metadata.generation to 1. For any updates
to the PodSpec, the API server will increment metadata.generation by 1.
As described in the field-by-field analysis doc above, the PodSpec mutable fields today are:
- Resources
- Ephemeral Containers
- Container image
- ActiveDeadlineSeconds
- TerminationGracePeriodSeconds
- Tolerations
If any new mutable fields are added to the PodSpec in the future, they will also
cause the API server to increment metadata.generation.
The pod metadata.generation will also continue to be incremented on graceful delete or
deferred delete, just as the API server currently does for pods and other
objects today.
Pod updates that would not result in metadata.generation being incremented include:
- Changes to metadata (with the exception of DeletionTimestamp). This means that if a Pod uses the downward API to make pod metadata available to containers, Pod behavior can change without the generation being incremented. We will consider this working as intended.
- Changes to status.
The logic to set new pods’ metadata.generation to 1 and to increment metadata.generation
on update will run after all mutating webhooks have finished.
Client requests to update generation
Any attempts by clients to set or modify the metadata.generation field themselves will be
ignored and overridden by the API Server. This is consistent with existing
behavior of the metadata.generation field in all other objects.
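The rules in the two sections above can be sketched roughly as follows. This is an illustration only: the Pod and PodSpec types are reduced to a few fields, the function names prepareForCreate and prepareForUpdate are hypothetical, and the increment on graceful delete is left out for brevity.

```go
package main

import "fmt"

// PodSpec and Pod are simplified, hypothetical stand-ins for the real types.
type PodSpec struct {
	Image                         string
	TerminationGracePeriodSeconds int64
}

type Pod struct {
	Generation int64  // metadata.generation
	Labels     string // stand-in for the rest of metadata
	Spec       PodSpec
}

// prepareForCreate ignores any client-supplied generation and starts at 1.
func prepareForCreate(p *Pod) {
	p.Generation = 1
}

// prepareForUpdate discards the client-supplied generation, carries over the
// stored value, and increments it only when the spec differs.
func prepareForUpdate(newPod, oldPod *Pod) {
	newPod.Generation = oldPod.Generation
	if newPod.Spec != oldPod.Spec {
		newPod.Generation++
	}
}

func main() {
	pod := &Pod{Generation: 42, Spec: PodSpec{Image: "app:v1"}}
	prepareForCreate(pod)
	fmt.Println(pod.Generation) // 1: the custom value 42 is overridden

	specUpdate := &Pod{Generation: 99, Spec: PodSpec{Image: "app:v2"}}
	prepareForUpdate(specUpdate, pod)
	fmt.Println(specUpdate.Generation) // 2: spec changed

	labelUpdate := &Pod{Labels: "tier=web", Spec: specUpdate.Spec}
	prepareForUpdate(labelUpdate, specUpdate)
	fmt.Println(labelUpdate.Generation) // 2: metadata-only change
}
```

In the proposal this logic runs after all mutating webhooks have finished, so a spec change introduced by a webhook would be counted as well.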
Kubelet and observedGeneration
When the Kubelet updates the pod status as part of the pod sync loop,
it will set the status.observedGeneration in the pod to reflect the pod metadata.generation corresponding
to the snapshot of the pod currently being synced. That means if the
pod spec gets updated concurrently while the kubelet is performing a pod sync loop
on a previous update, the status.observedGeneration will be behind metadata.generation.
Outside of the pod sync loop, another place where the kubelet sets the pod status is
when a pod is rejected (during HandlePodAdditions). This
code will also be modified to populate status.observedGeneration to express which
metadata.generation was rejected. In this case, the kubelet will not be updating the pod
through the sync loop (due to the rejection).
The only other place where the pod status is updated is the readiness and probe
updates, but we will leave status.observedGeneration unchanged here as the probe that
updated the status would be the one synced in the last pod sync loop.
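A minimal sketch of the two write paths described above, assuming simplified, hypothetical Pod and PodStatus types. The function names statusForSync and statusForRejection and the reason string are invented for illustration and do not correspond to kubelet code.

```go
package main

import "fmt"

// Pod and PodStatus are simplified, hypothetical stand-ins for the real types.
type Pod struct {
	Name       string
	Generation int64 // metadata.generation
}

type PodStatus struct {
	Phase              string
	Reason             string
	ObservedGeneration int64 // status.observedGeneration
}

// statusForSync stamps the status with the generation of the snapshot being
// synced, not the newest generation known to the API server.
func statusForSync(snapshot Pod, phase string) PodStatus {
	return PodStatus{Phase: phase, ObservedGeneration: snapshot.Generation}
}

// statusForRejection records which generation of the pod was rejected.
func statusForRejection(pod Pod, reason string) PodStatus {
	return PodStatus{Phase: "Failed", Reason: reason, ObservedGeneration: pod.Generation}
}

func main() {
	snapshot := Pod{Name: "web", Generation: 2}
	latest := Pod{Name: "web", Generation: 3} // spec updated while the sync ran

	st := statusForSync(snapshot, "Running")
	fmt.Println(st.ObservedGeneration < latest.Generation) // status lags by one sync

	fmt.Println(statusForRejection(latest, "ExampleRejectionReason").ObservedGeneration)
}
```

The lag in the first case is expected: the next sync loop picks up the newer snapshot and reports its generation.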
Mutable Fields Analysis
The field-by-field analysis doc referenced above
goes into further detail about what status.observedGeneration means in relation
to the other fields in the pod status. Here is a summary of the conclusions:
- For some fields, the allocated spec is reflected directly in the pod status, so their associated generation is reflected directly by status.observedGeneration. This is the case for allocated resources, resize status, and ephemeral containers.
- For other fields, the status is an indirect result of actuating the PodSpec, and the associated generation for those fields is from the generation before what is reflected by status.observedGeneration. This is the case for actual resources, container image, activeDeadlineSeconds, and terminationGracePeriodSeconds.
To keep things simple and avoid having to add a new field to track the latter, the
kubelet will output the PodSpec metadata.generation that was observed at the time
of the current sync loop, even though the kubelet has not actuated the change yet.
We will document this very clearly and explicitly to avoid confusion.
Other writers of pod status
There are other writers of pod status besides the Kubelet including the scheduler
and the node lifecycle controller. These should also populate status.observedGeneration whenever
they make a status update (excluding any updates to pod conditions, which
have their own dedicated observedGeneration field). Once the pod is bound to a node,
however, the expectation is that only the kubelet will be writing to status.observedGeneration.
The scheduler and node lifecycle controller also write pod conditions. Whenever
they set a pod condition, they should just populate the condition.observedGeneration field
with the relevant generation.
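As an illustration of a condition writer recording the generation it acted on, here is a minimal sketch. The types are simplified, hypothetical stand-ins, and setCondition is an invented helper rather than an existing function in the scheduler or node lifecycle controller.

```go
package main

import "fmt"

// PodCondition and Pod are simplified, hypothetical stand-ins for the real types.
type PodCondition struct {
	Type               string
	Status             string
	ObservedGeneration int64 // status.conditions[i].observedGeneration
}

type Pod struct {
	Generation int64 // metadata.generation
	Conditions []PodCondition
}

// setCondition adds or updates a condition and stamps it with the generation
// of the pod the writer was looking at when it made its decision.
func setCondition(p *Pod, condType, status string) {
	for i := range p.Conditions {
		if p.Conditions[i].Type == condType {
			p.Conditions[i].Status = status
			p.Conditions[i].ObservedGeneration = p.Generation
			return
		}
	}
	p.Conditions = append(p.Conditions, PodCondition{
		Type:               condType,
		Status:             status,
		ObservedGeneration: p.Generation,
	})
}

func main() {
	pod := &Pod{Generation: 1}
	setCondition(pod, "PodScheduled", "False") // e.g. a scheduler marking the pod unschedulable
	pod.Generation = 2                         // tolerations updated by a client
	setCondition(pod, "PodScheduled", "True")
	fmt.Printf("%+v\n", pod.Conditions)
}
```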
Client requests to update observedGeneration
During status update, if the incoming update clears status.observedGeneration back
to 0, the API server will preserve the previously existing value. All other updates to status.observedGeneration will be permitted by the API validation, including
regressions to lower values.
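The preservation rule can be sketched as follows. This is a minimal illustration using a hypothetical one-field PodStatus type and an invented function name, not the API server's actual code.

```go
package main

import "fmt"

// PodStatus is a simplified, hypothetical stand-in for the real type.
type PodStatus struct {
	ObservedGeneration int64 // status.observedGeneration
}

// preserveObservedGeneration keeps the stored value when an incoming status
// update clears the field to 0. Any non-zero value is accepted as-is, even if
// it is lower than the stored one.
func preserveObservedGeneration(newStatus, oldStatus *PodStatus) {
	if newStatus.ObservedGeneration == 0 {
		newStatus.ObservedGeneration = oldStatus.ObservedGeneration
	}
}

func main() {
	stored := &PodStatus{ObservedGeneration: 5}

	cleared := &PodStatus{ObservedGeneration: 0} // e.g. written by an older client
	preserveObservedGeneration(cleared, stored)
	fmt.Println(cleared.ObservedGeneration) // preserved

	regressed := &PodStatus{ObservedGeneration: 3}
	preserveObservedGeneration(regressed, stored)
	fmt.Println(regressed.ObservedGeneration) // regression permitted
}
```

Treating 0 as "unset" is what lets writers that do not know about the field update the status without erasing it.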
Mirror pods
For this KEP, we will not treat mirror pods in any special way. Due to the way they are currently implemented in the kubelet and apiserver, this means:
- If a mirror pod’s spec is modified manually by a client via the apiserver, its metadata.generation will be bumped accordingly.
- If a static pod’s manifest is updated, the kubelet treats this as a pod deletion followed by a pod creation, which will reset the metadata.generation of the corresponding mirror pod to 1.
- The kubelet does not currently propagate the mirror pod’s metadata.generation to the place where the pod status is updated today, so the observedGeneration fields of mirror pods will remain unpopulated.
Future enhancements
We may at some future point reconsider mirror pods and potentially populate
metadata.generation and status.observedGeneration on them.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
Unit tests will be implemented to cover code changes that implement the feature, in the API server code and the kubelet code.
Core packages touched:
- pkg/registry/core/pod/strategy.go: 2025-06-16 - 71.1
- pkg/registry/core/pod/util.go: 2025-06-16 - 74
- pkg/apis/core/validation/validation.go: 2025-06-16 - 84.6
- pkg/kubelet: 2025-06-16 - 71
- pkg/kubelet/status: 2025-06-16 - 86.8
Integration tests
Integration tests are added to cover the node lifecycle controller. See this commit:
e2e tests
E2E tests will be implemented to cover the following cases:
- Verify that newly created pods have a metadata.generation set to 1.
- Verify that PodSpec updates (such as tolerations or container images), resize requests, adding ephemeral containers, and binding requests cause the metadata.generation to be incremented by 1 for each update.
- Verify that deletion of a pod causes the metadata.generation to be incremented by 1.
- Issue ~500 pod updates (1 every 100ms) and verify that metadata.generation and status.observedGeneration converge to the final expected value.
- Verify that various conditions each have observedGeneration populated.
- Verify that mirror pods have metadata.generation and observedGeneration fields set to 1, and that they never change.
Added tests:
- pod generation should start at 1 and increment per update: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=Pod%20Generation
- custom-set generation on new pods and graceful delete: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=Pod%20Generation
- issue 500 podspec updates and verify generation and observedGeneration eventually converge: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=Pod%20Generation
- pod rejected by kubelet should have updated generation and observedGeneration: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=Pod%20Generation
- pod observedGeneration field set in pod conditions: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=Pod%20Generation
- pod-resize-scheduler-tests: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=pod-resize-scheduler-tests
- mirror pod updates: SIG Node, https://storage.googleapis.com/k8s-triage/index.html?test=mirror%20pod%20updates
Graduation Criteria
Alpha
- Initial e2e tests completed and enabled
- metadata.generation functionality implemented
- status.observedGeneration functionality implemented behind feature flag
- status.conditions[i].observedGeneration field added to the API
- status.conditions[i].observedGeneration functionality implemented behind feature flag
Beta
- metadata.generation, status.observedGeneration, status.conditions[i].observedGeneration functionality have been implemented and running as alpha for at least one release
GA
- No major bugs reported for three months.
- No negative user feedback.
- Promote the primary e2e tests to Conformance.
Upgrade / Downgrade Strategy
API server should be upgraded before Kubelets. Kubelets should be downgraded before the API server.
Version Skew Strategy
Previous versions of clients unaware of metadata.generation functionality would either
not set the pod metadata.generation field (having the effective value of 0) or set it to
some custom value, though the latter is unlikely. In either case, the API server
will ignore whatever value of metadata.generation is set by the client and will manage
metadata.generation itself (setting it to 1 for newly created pods, or incrementing it
for pod updates).
Already running pods will likewise either not have the pod metadata.generation set (and
thus have a default value of 0), or will have a custom value if metadata.generation was
explicitly set by a client. On the first update after the API server is upgraded,
the API server will increment the value of metadata.generation by 1 from whatever it
was set to previously. This means that the first update to a pod that did not
yet have a metadata.generation will now have a metadata.generation of 1.
If a pod that has a metadata.generation set or incremented via the new API server is
later updated by an older API server, the older API server will not modify the
metadata.generation field and it will stay fixed at its current value. That means that
if there is version skew between multiple apiservers, the metadata.generation may or may
not be incremented. To address this, we will not feature-gate the logic in the
apiserver that increments metadata.generation. That means that by the time
ObservedGeneration goes to beta, there will be 2 versions of apiservers updating
the metadata.generation field, removing the issue.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: PodObservedGenerationTracking
  - Components depending on the feature gate: kubelet, kube-controller-manager, kube-scheduler
- Other
  - Describe the mechanism:
    - Writers to status.observedGeneration will propagate the pod’s metadata.generation to status.observedGeneration if the feature gate is enabled OR if status.observedGeneration is already set. We will not attempt to clear status.observedGeneration if set in order to avoid an infinite loop between attempting to clear the field and the API server preserving the existing value when an incoming update attempts to clear it.
  - Will enabling / disabling the feature require downtime of the control plane?
    - No.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
    - No.
Does enabling the feature change any default behavior?
The pod’s metadata.generation field is currently unset by default and
both new observedGeneration fields will be new fields in the pod status, so the feature
will not introduce any breaking changes of default behavior.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The status.observedGeneration feature can be disabled by setting the flag to ‘false’
and restarting the kubelet. Disabling the feature in the kubelet means that the kubelet will not propagate
metadata.generation to status.observedGeneration for new pods. For existing
pods, if status.observedGeneration is already set, the kubelet will continue
to propagate metadata.generation to status.observedGeneration. The kubelet will not attempt to clear
status.observedGeneration if set in order to avoid an infinite loop between
the kubelet attempting to clear the field and the API server preserving the
existing value when an incoming update attempts to clear it.
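The "enabled OR already set" rule for status.observedGeneration can be sketched as a single function. The name nextObservedGeneration and its parameters are hypothetical and chosen for this illustration only.

```go
package main

import "fmt"

// nextObservedGeneration returns the value a writer would put into
// status.observedGeneration. The generation is propagated when the feature
// gate is enabled, or when the field is already set on the pod, so that a
// disabled gate never tries to clear an existing value.
func nextObservedGeneration(gateEnabled bool, generation, currentObserved int64) int64 {
	if gateEnabled || currentObserved != 0 {
		return generation
	}
	return 0 // gate disabled and field never set: leave it unpopulated
}

func main() {
	fmt.Println(nextObservedGeneration(true, 4, 0))  // gate on: propagate
	fmt.Println(nextObservedGeneration(false, 4, 3)) // gate off, already set: keep propagating
	fmt.Println(nextObservedGeneration(false, 4, 0)) // gate off, new pod: stay unset
}
```

Continuing to propagate for pods that already carry a value avoids both a stale observedGeneration and the clear-and-preserve loop described above.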
Likewise, the conditions[i].observedGeneration feature can be disabled by setting
the flag to ‘false’ in the kubelet, node lifecycle controller, and scheduler. When
the feature flag is disabled, the condition’s observedGeneration will no longer be populated.
The metadata.generation functionality will intentionally not be behind a feature gate, so it cannot be
disabled except by downgrading the API server.
What happens if we reenable the feature if it was previously rolled back?
The API server will start incrementing metadata.generation, the kubelet will start
setting status.observedGeneration, and writers of pod conditions will start
populating those conditions’ observedGeneration fields.
Are there any tests for feature enablement/disablement?
Unit tests will be added to cover the code that implements the feature, and will cover the cases of the feature gate being both enabled and disabled.
The following unit test covers what happens if I disable a feature gate after having objects written with the new field (in this case, the field should persist).
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A rollout or rollback won’t have significant impact on any components, even if they restart mid-rollout. Already running workloads likewise won’t be significantly impacted.
What specific metrics should inform a rollback?
If users see the metadata.generation and status.observedGeneration fields
are not being updated or are significantly misaligned, that indicates that
the feature is not working as expected.
Some metrics to look at that could indicate a problem include:
- kubelet_pod_start_total_duration_seconds
- kubelet_pod_status_sync_duration_seconds
- kubelet_pod_worker_duration_seconds
You could also check the Pod Startup Latency SLI.
Any of these being significantly elevated could indicate an issue with the feature.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Testing steps:
- Create test pod with old version of API server and node; expected outcome: generation and observedGeneration fields are not populated
- Upgrade API server
- Send an update request to the running pod; expected outcome: generation is set to 1 and observedGeneration fields are not populated
- Create a new pod; expected outcome: generation is set to 1 and observedGeneration fields are not populated
- Create upgraded node
- Create second test pod on the upgraded node; expected outcome: generation and observedGeneration fields are set to 1
- Restart the upgraded node with the feature disabled
- Send an update request to the second pod; expected outcome: generation and observedGeneration continue to be updated so are set to 2
- Restart the upgraded node with the feature enabled
- Send an update request to the second pod; expected outcome: generation and observedGeneration are set to 3
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
They can check if metadata.generation is set on the pod and that observedGeneration
is being updated.
How can someone using this feature know that it is working for their instance?
- API .status
  - Other field: metadata.generation, status.observedGeneration, status.conditions[].observedGeneration
Each pod should have its metadata.generation set, starting at 1 and incremented by 1 for each update.
Each pod’s status.observedGeneration should be populated to reflect the metadata.generation that was last
observed by the kubelet.
Each pod’s status.conditions[].observedGeneration should be populated to reflect the metadata.generation
that was last observed by the component owning the corresponding condition.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
We can reuse the Pod Startup Latency SLI/SLO here.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
We can reuse the Pod Startup Latency SLI/SLO here.
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
Yes, enabling this feature could result in additional API calls. If the pod
sync loop results in a new status where status.observedGeneration is the only
status field changed, there will be a new status update call. This can occur
when a pod generation is updated in the middle of a pod sync loop, but the
next sync loop does not have any other status changes.
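To make the extra-call case concrete, here is a small sketch of a status writer that only calls the API when the status differs from the last one it sent. The fakeClient type and the syncStatus function are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// PodStatus is a simplified, hypothetical stand-in for the real type.
type PodStatus struct {
	Phase              string
	ObservedGeneration int64
}

// fakeClient counts status update calls instead of talking to an API server.
type fakeClient struct{ statusUpdates int }

func (c *fakeClient) UpdateStatus(PodStatus) { c.statusUpdates++ }

// syncStatus sends an update only when the new status differs from the last
// one sent, so a change in ObservedGeneration alone still triggers a call.
func syncStatus(c *fakeClient, last *PodStatus, next PodStatus) {
	if *last == next {
		return
	}
	c.UpdateStatus(next)
	*last = next
}

func main() {
	client := &fakeClient{}
	last := PodStatus{}

	syncStatus(client, &last, PodStatus{Phase: "Running", ObservedGeneration: 1})
	syncStatus(client, &last, PodStatus{Phase: "Running", ObservedGeneration: 1}) // no change, no call
	syncStatus(client, &last, PodStatus{Phase: "Running", ObservedGeneration: 2}) // only the generation moved

	fmt.Println(client.statusUpdates)
}
```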
Will enabling / using this feature result in introducing new API types?
No, this feature does not introduce any new API types.
Will enabling / using this feature result in any new calls to the cloud provider?
No, there will not be any new calls to the cloud provider.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Enabling this feature would negligibly increase the size of pods, since
they will have new fields metadata.generation, status.observedGeneration,
and conditions[i].observedGeneration populated.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No, this feature will not result in any noticeable performance change.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No, this feature will not increase resource usage.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, this feature will not result in resource exhaustion.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The feature depends on the API server. If the API server is unavailable, the new fields will not be updated.
What are other known failure modes?
Other failure modes are described under Risks and Mitigations.
Detection and mitigation of the infinite status-update loop by a badly-behaving admission webhook is covered in these docs: https://kubernetes.io/docs/concepts/cluster-administration/admission-webhooks-good-practices/#why-good-webhook-design-matters
What steps should be taken if SLOs are not being met to determine the problem?
One could disable the feature gate and restart the API server. Additionally, one could investigate the apiserver and/or kubelet logs for errors.
Detection and mitigation of the infinite status-update loop by a badly-behaving admission webhook is covered in these docs. Specifically, the section about detecting loops caused by competing controllers can be helpful.
Implementation History
- 2025-01-21: initial KEP draft created
- 2025-02-12: PR feedback addressed, KEP moved to “implementable” and merged
- 2025-06-05: proposed promotion to beta
- 2025-09-23: proposed promotion to stable
- 2026-01-20: mark as implemented after GA release
Drawbacks
We are not currently aware of any drawbacks.
Alternatives
We could fully reflect the version of the spec that the Kubelet is operating on, such as by adding a “desired resources” field to the status or “observed spec” field to the status where we copy the whole podspec in.
ObservedGeneration is preferable to these alternatives because it expresses the same amount of information while being significantly more concise, and because it is consistent with what other resources such as StatefulSet are already doing.