KEP-3541: Add Recreate Update Strategy to StatefulSet
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
StatefulSets currently offer two update strategies: OnDelete (manual) and RollingUpdate (automatic, default). When using RollingUpdate with the default podManagementPolicy: OrderedReady, StatefulSets follow sequential ordering where each individual pod must be Running and Ready before the controller proceeds to update the next pod. Even with the maxUnavailable option (which allows multiple pods to be updated simultaneously), the controller still requires each pod to reach Ready state before moving forward, so a stuck pod halts the entire update process. While podManagementPolicy: Parallel allows pods to be updated simultaneously without waiting for Ready state, stuck pods remain and are not automatically replaced. This design protects data safety for stateful workloads but creates a critical operational problem.
When a StatefulSet update results in pods that fail to reach Ready state (due to configuration errors, resource constraints, etc.), the rolling update process becomes permanently stuck. Even after a corrected configuration is applied, the controller will not automatically replace the broken pods, so the stuck pods must be deleted manually.
This behavior has generated significant user frustration across multiple GitHub issues (#67250, #60164, #109597), with users reporting:
- Broken CI/CD pipelines requiring manual intervention
- Inability to automatically recover from configuration mistakes
- Operational burden in managing stateful applications
This KEP proposes adding a new Recreate update strategy to StatefulSets, mirroring the behavior of Deployments’
Recreate strategy. This strategy deletes all pods, waits for full termination, then creates new pods according
to podManagementPolicy. This provides a simple, predictable way to handle stuck pods and enables automated recovery for workloads that can tolerate downtime (CI/CD environments, stateless applications using StatefulSet
for pod identity, applications with external data storage, and use cases like LeaderWorkerSet). The Recreate
strategy offers a clean parallel with existing Kubernetes patterns, simplifies controller logic, and provides users with explicit control over update behavior.
Motivation
Current Behavior and Problems
StatefulSets with RollingUpdate strategy follow this algorithm:
if podManagementPolicy: OrderedReady (default)
- Update pods in reverse ordinal order (N-1, N-2, …, 0)
- For each pod, wait until it becomes Running and Ready before proceeding to the next
- If any pod fails to become Ready, the entire update process halts
- Even when a corrected configuration is applied, stuck pods are never automatically replaced
if podManagementPolicy: Parallel
- Update all pods simultaneously (or up to maxUnavailable at a time if specified)
- Pods are created/deleted without waiting for Ready state
- Stuck pods do not block other pods from being updated
- Even when a corrected configuration is applied, stuck pods are never automatically replaced
The current approach was designed for stateful workloads where data persistence is critical, pod identity and storage are tightly coupled, or automatic pod deletion could cause data loss.
This behavior has significant impact across multiple scenarios:
CI/CD Pipeline Failures: Teams report broken deployments that require manual intervention, breaking automation:
# Example: A typo in image name breaks the entire update
apiVersion: apps/v1
kind: StatefulSet
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v2.0.0-typo # ImagePullBackOff
# Update gets stuck, requires manual pod deletion
Operational Overhead: Platform teams must either build custom controllers or manually delete stuck pods to recover from stuck updates.
Why Existing Solutions Are Insufficient
MaxUnavailable doesn’t address the core issue. The maxUnavailable option in RollingUpdate strategy allows multiple pods to be updated simultaneously, but its behavior depends on podManagementPolicy.
spec:
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 2 # Can update 2 pods at once
With podManagementPolicy: Parallel + maxUnavailable: 2, multiple pods can be updated simultaneously, but if any pod fails to reach Ready state, it remains stuck and requires manual cleanup. Stuck pods don’t block other pods from updating, but they are never automatically replaced (see Custom Controllers below).
With podManagementPolicy: OrderedReady, updates happen one pod at a time in reverse ordinal order. If any pod fails to reach Ready state, the entire rolling update process halts completely, even with maxUnavailable configured. The controller waits indefinitely for stuck pods to become Ready.
Example Scenario with podManagementPolicy: OrderedReady:
- StatefulSet with 5 replicas
- Update pod app-4 first
- app-4 gets stuck in ImagePullBackOff
- Even after fixing the image name, app-4 remains stuck
- Update process cannot proceed to app-3, app-2, app-1, or app-0
- Manual intervention still required: kubectl delete pod app-4
Custom Controllers
Some teams have built custom controllers to delete stuck pods, but this:
- Duplicates StatefulSet controller logic
- Creates maintenance burden
- May conflict with StatefulSet controller behavior
- Lacks integration with StatefulSet status and events
Proposed Solution Benefits
Adding a Recreate update strategy to StatefulSets addresses these issues:
- Automated recovery: stuck pods are cleared and replaced during updates without manual intervention
- Simplicity: a clean algorithm with no complexity around timeout tracking or transient failure detection
- Consistency with Kubernetes patterns: mirrors the Deployment Recreate strategy
- Handles all stuck scenarios, regardless of whether pods are stuck in ImagePullBackOff, Pending, CrashLoopBackOff, or any other state
Goals
- Add a new Recreate update strategy type to StatefulSet, providing a third option alongside OnDelete and RollingUpdate
- Align StatefulSet update strategies with Deployment patterns for API consistency
- Enable automated recovery from stuck pod states without manual intervention
- Provide a simple, predictable update behavior for workloads that can tolerate downtime
- Support use cases like CI/CD environments, stateless applications, external storage applications, and LeaderWorkerSet patterns
- Add Progressing state condition to StatefulSet status for all strategies
Non-Goals
- Change default behavior of StatefulSet updates (opt-in via explicit type: Recreate configuration)
- Add timeout-based progressive failure detection (use Recreate for simplicity)
- Change Recreate deletion semantics (all pods are always deleted simultaneously but recreate ordering follows podManagementPolicy)
- Replace Deployment-style revision management (StatefulSets continue to directly manage Pods)
Proposal
User Stories
Story 1: CI/CD Platform Team
Context: A platform team manages hundreds of StatefulSet deployments across development and staging environments. Their CI/CD system requires end-to-end automation, but StatefulSet rolling updates break automation when pods get stuck. The team either has to implement custom “garbage collection” logic or accept that automated deployments will fail and require manual intervention. Since these are non-production environments, downtime during updates is acceptable.
Solution: With updateStrategy: type: Recreate configured, when an update with incorrect configuration is applied, all pods are deleted and new pods are created. If they fail, the deployment fails quickly and clearly. When a corrected configuration is applied, the Recreate strategy deletes all broken pods and creates fresh ones, allowing the CI/CD pipeline to complete without manual intervention. The downtime is acceptable in CI/CD environments where fast, automated recovery is more important than uptime.
Story 2: Stateless Web Application
Context: A web application uses StatefulSet for predictable pod naming but doesn’t store critical data locally. When resource limit typos cause pods to get stuck in Pending state, the entire update halts even though pod replacement is safe. The application can tolerate brief downtime during updates.
Solution: With updateStrategy: type: Recreate configured, when an update encounters issues, all pods are deleted and recreated cleanly. This eliminates the need for manual pod deletion since stuck pods are automatically cleared. The brief downtime is acceptable for this stateless application that primarily uses StatefulSet for pod identity rather than stateful semantics.
Story 3: Development/Experiment Environment
Context: Developers using StatefulSet for experiments face constant frustration - every time a rolling update breaks due to configuration errors, they must manually delete stuck pods after applying fixes. This manual intervention disrupts the development workflow. Uptime is not a concern in development environments.
Solution: With updateStrategy: type: Recreate configured, developers get fast, clean resets - when an update fails, applying a fix automatically deletes all broken pods and creates fresh ones. This enables a smoother development experience without requiring cluster operator intervention or manual pod cleanup. The Recreate strategy’s simplicity makes it ideal for rapid iteration in development.
Story 4: External Data Storage
Context: A database application stores all persistent data on network-attached storage (not local pod storage). Pod replacement is completely safe since no local data would be lost, but the StatefulSet controller treats it as a traditional stateful workload and requires manual intervention. The application can tolerate brief downtime for clean updates.
Solution: With updateStrategy: type: Recreate configured, the controller automatically deletes and recreates all pods during updates, which is safe for this architecture since all data persists externally. The Recreate strategy provides clean, predictable updates without concerns about stuck pods, and the brief downtime is acceptable given the data safety guarantees from external storage.
Story 5: LeaderWorkerSet (LWS) Use Case
Context: Developers use StatefulSet as the high-level controller workload for LWS. However, it behaves more like a Deployment - there’s no ordering dependency between different replicas. They only need the ordinal index for pod identification. When a replica fails during updates, the entire StatefulSet update gets stuck, even though there’s no actual ordering requirement between replicas. The LeaderWorkerSet pattern can tolerate brief downtime for updates.
Solution: With updateStrategy: type: Recreate configured, all replicas are cleanly deleted and recreated during updates, eliminating stuck pod scenarios entirely. This aligns perfectly with the deployment-like nature of LWS workloads, providing simple and predictable updates for applications that use StatefulSet primarily for pod identity rather than traditional stateful semantics. The Recreate strategy’s “all or nothing” approach matches the LWS pattern where all workers restart together.
Notes/Constraints/Caveats
- Strategy Type Change Does Not Trigger Rollout: changing only .spec.updateStrategy.type from RollingUpdate to Recreate (or vice versa) does not trigger a new rollout. This is consistent with Deployment behavior. The StatefulSet controller uses the controller-revision-hash label to identify pod revisions, which is computed from .spec.template content only.
The Recreate behavior will only be triggered when users either:
- Make a change to .spec.template
- Force a rollout using kubectl rollout restart
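The point that only template changes produce a new revision can be illustrated with a small Go sketch. The types below are simplified, hypothetical stand-ins for the real API types, and FNV-1a over the JSON encoding is an assumption made for brevity; the actual controller uses its own hashing scheme.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// PodTemplate and StatefulSetSpec are hypothetical, simplified types;
// only the fields needed for this illustration are present.
type PodTemplate struct {
	Image string `json:"image"`
}

type StatefulSetSpec struct {
	UpdateStrategyType string      `json:"updateStrategyType"`
	Template           PodTemplate `json:"template"`
}

// revisionHash hashes only the pod template, so fields outside the
// template (such as the update strategy type) never affect the result.
func revisionHash(spec StatefulSetSpec) string {
	data, err := json.Marshal(spec.Template)
	if err != nil {
		panic(err) // not expected for this plain struct
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	before := StatefulSetSpec{UpdateStrategyType: "RollingUpdate", Template: PodTemplate{Image: "myapp:v1"}}

	strategyOnly := before
	strategyOnly.UpdateStrategyType = "Recreate" // strategy change only

	templateChange := strategyOnly
	templateChange.Template.Image = "myapp:v2" // template change

	fmt.Println("strategy change keeps revision:", revisionHash(before) == revisionHash(strategyOnly))
	fmt.Println("template change keeps revision:", revisionHash(strategyOnly) == revisionHash(templateChange))
}
```

Because the strategy type is not part of the hashed input, switching to Recreate alone leaves every pod at the current revision and nothing is deleted.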
Risks and Mitigations
Risk: Unintended Data Loss
Risk Description: If Recreate strategy is used on StatefulSets with local persistent data and PersistentVolumeClaims, the downtime could affect applications expecting sequential updates. However, data on PVCs is preserved since Recreate only deletes pods, not volumes.
Mitigation Strategies:
- Documentation: Clear guidance on when to use Recreate strategy - suitable for workloads that can tolerate downtime
- No Default Change: Opt-in behavior - existing workloads continue using safe RollingUpdate (current behavior unchanged)
- Explicit Strategy Selection: Users must explicitly set type: Recreate, preventing accidental usage
- Clear Events: Events emitted during the recreate process to show deletion and recreation phases
- Status Conditions: StatefulSet status clearly reflects the recreate process state
- PVC Preservation: PersistentVolumeClaims are not deleted, so data on volumes persists across recreate operations
Design Details
Detailed Algorithm Specification
Current RollingUpdate Algorithm
FOR i = replicas-1 To i >= 0 DO i--
    If pod[i] needs update Then
        wait_for_predecessors_ready(i+1 to replicas-1)
        If !pod[i].Running Or !pod[i].Ready Then
            return // STUCK - wait for manual intervention
        ENDIF
        update_pod(i)
        wait_until_ready(pod[i])
    ENDIF
ENDFOR
The algorithm halts when pod[i] is not Running or Ready, even if a fix is applied.
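The halting behavior can be made concrete with a minimal Go sketch of a single reconcile pass. This is a simplified model of the pseudocode above, not the controller's actual code: the Pod type, the Action values, and nextRollingStep are hypothetical names, and "Running and Ready" is collapsed into one boolean.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a StatefulSet pod.
type Pod struct {
	Revision string // value of the controller-revision-hash label
	Ready    bool   // true when the pod is Running and Ready
}

// Action describes what one reconcile pass would do.
type Action string

const (
	Halt   Action = "halt"   // a pod is not Ready, so the pass stops
	Update Action = "update" // the pod at the returned ordinal is replaced
	Done   Action = "done"   // every pod is Ready and at the target revision
)

// nextRollingStep walks ordinals from highest to lowest and reports the
// first pod that either blocks progress or needs an update.
func nextRollingStep(pods []Pod, updateRevision string) (int, Action) {
	for i := len(pods) - 1; i >= 0; i-- {
		if !pods[i].Ready {
			return i, Halt
		}
		if pods[i].Revision != updateRevision {
			return i, Update
		}
	}
	return -1, Done
}

func main() {
	// app-4 was updated to a broken revision and never became Ready.
	pods := []Pod{
		{"v1", true}, {"v1", true}, {"v1", true}, {"v1", true}, {"v2-typo", false},
	}
	// A corrected revision "v3" is now the target, yet the pass still
	// stops at app-4 because that pod is not Ready.
	ordinal, action := nextRollingStep(pods, "v3")
	fmt.Println(ordinal, action)
}
```

In this model the target revision plays no role once an unready pod is encountered, which is why applying a fix alone does not unblock the update.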
Proposed Recreate Strategy Algorithm
// Recreate Strategy Algorithm
// Uses controller-revision-hash label to identify pod revision (same as RollingUpdate)
// updateRevision = hash of current spec.template (computed by controller)
current_phase = determine_phase()
IF current_phase == "NeedsDeletion" THEN
    // Phase 1: Delete all pods with old revision
    emit_event("RecreateStarted", "Deleting all pods for Recreate update")
    set_condition("Progressing", status="True", reason="RecreateInProgress")
    // Delete ALL pods owned by this StatefulSet that have old revision
    // This handles orphaned pods with ordinals >= replicas
    FOR each pod in pods:
        IF pod.Labels["controller-revision-hash"] != updateRevision THEN
            IF pod.DeletionTimestamp == nil THEN
                delete_pod(pod)
            ENDIF
        ENDIF
    ENDFOR
    return // Reconcile again after deletions are issued
ENDIF
IF current_phase == "WaitingTermination" THEN
    // Phase 2: Wait for all old-revision pods to be fully removed from etcd
    // Controller watches pods and will reconcile when deletions complete
    // Note: Only emit event on first entry to this phase (tracked via condition)
    return
ENDIF
IF current_phase == "ReadyForCreation" THEN
    // Phase 3: Create pods with new revision according to podManagementPolicy
    IF podManagementPolicy == OrderedReady THEN
        // Create in ascending ordinal order; only create the next ordinal when predecessor is Running and Ready
        i = lowest ordinal in [0, replicas-1] such that pod i does not exist
        IF i is defined THEN
            IF i == 0 OR (pod i-1 exists AND is Running and Ready) THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDIF
    ELSE
        // Parallel: create all missing pods at once
        FOR i = 0 TO replicas-1:
            IF pod with ordinal i does not exist THEN
                create_pod(i, updateRevision)
            ENDIF
        ENDFOR
    ENDIF
    return // Reconcile again to check creation progress
ENDIF
IF current_phase == "Complete" THEN
    // All replicas exist with current revision
    set_condition("Progressing", status="True", reason="RecreateComplete")
    return
ENDIF
// Helper: Determine current phase based on pod states
FUNCTION determine_phase():
    pods = get_all_pods_for_statefulset() // All pods owned by this StatefulSet
    old_revision_pods_active = 0 // Old revision, not yet deleted
    old_revision_pods_terminating = 0 // Old revision, has DeletionTimestamp
    new_revision_pods = 0 // Current revision (not terminating)
    FOR each pod in pods:
        IF pod.Labels["controller-revision-hash"] != updateRevision THEN
            // Pod has old revision
            IF pod.DeletionTimestamp == nil THEN
                old_revision_pods_active++
            ELSE
                old_revision_pods_terminating++
            ENDIF
        ELSE
            // Pod has current revision
            IF pod.DeletionTimestamp == nil THEN
                new_revision_pods++
            ENDIF
            // Note: new revision pods with DeletionTimestamp are ignored
            // (could happen if user manually deleted, will be recreated)
        ENDIF
    ENDFOR
    // Phase 1: Any old-revision pods that haven't been deleted yet
    IF old_revision_pods_active > 0 THEN
        return "NeedsDeletion"
    ENDIF
    // Phase 2: Old pods are terminating, wait for full removal
    IF old_revision_pods_terminating > 0 THEN
        return "WaitingTermination"
    ENDIF
    // Phase 3: No old pods remain, but we don't have enough new pods yet
    IF new_revision_pods < replicas THEN
        return "ReadyForCreation"
    ENDIF
    // Phase 4: All replicas exist with current revision
    return "Complete"
END FUNCTION
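The phase helper above translates fairly directly into Go. The following is a minimal sketch under simplifying assumptions: Pod is a hypothetical struct carrying only the revision label and a terminating flag (standing in for a non-nil DeletionTimestamp), and determinePhase takes its inputs as parameters instead of reading controller state.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a pod owned by the StatefulSet.
type Pod struct {
	Revision    string // value of the controller-revision-hash label
	Terminating bool   // true when DeletionTimestamp is set
}

// Phase names match the strings used in the pseudocode.
type Phase string

const (
	NeedsDeletion      Phase = "NeedsDeletion"
	WaitingTermination Phase = "WaitingTermination"
	ReadyForCreation   Phase = "ReadyForCreation"
	Complete           Phase = "Complete"
)

// determinePhase derives the current phase purely from observed pod state,
// which is what makes the algorithm safe to resume after a controller restart.
func determinePhase(pods []Pod, updateRevision string, replicas int) Phase {
	var oldActive, oldTerminating, newActive int
	for _, p := range pods {
		switch {
		case p.Revision != updateRevision && !p.Terminating:
			oldActive++
		case p.Revision != updateRevision:
			oldTerminating++
		case !p.Terminating:
			newActive++
			// New-revision pods that are terminating are ignored.
		}
	}
	switch {
	case oldActive > 0:
		return NeedsDeletion
	case oldTerminating > 0:
		return WaitingTermination
	case newActive < replicas:
		return ReadyForCreation
	default:
		return Complete
	}
}

func main() {
	pods := []Pod{{"old", true}, {"old", true}}
	fmt.Println(determinePhase(pods, "new", 2))
}
```

Since no phase is stored anywhere, each reconcile recomputes it from the pod list, so retried deletions and creations converge on the same outcome.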
Key Characteristics:
- Uses controller-revision-hash label (same as RollingUpdate) to identify old vs new pods
- All old-revision pods are fully terminated before any new pods are created
- Guarantees old and new pods never run simultaneously
- Deletes all old-revision pods including orphans with ordinals >= replicas
- Since all pods are forcibly deleted, updates cannot become permanently blocked
- Explicit downtime: Users opt-in knowing there will be unavailability between deletion and creation phases
- Safe to retry deletions and creations on controller restart
- Recreation phase respects podManagementPolicy
API Changes
Spec Changes
// StatefulSetUpdateStrategyType is a string enumeration type that represents the update strategy type for StatefulSets
type StatefulSetUpdateStrategyType string
const (
    // RollingUpdateStatefulSetStrategyType indicates that pods in a StatefulSet will be updated in reverse ordinal order
    RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
    // OnDeleteStatefulSetStrategyType indicates that pods in a StatefulSet will only be updated when manually deleted
    OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
    // RecreateStatefulSetStrategyType indicates that all pods will be fully terminated before new ones are created
    RecreateStatefulSetStrategyType StatefulSetUpdateStrategyType = "Recreate"
)
Example Usage:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 10
  updateStrategy:
    type: Recreate
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
Behavior:
- When update is triggered (e.g., template change):
  - All pods (web-0 through web-9) are deleted simultaneously
  - Controller waits for all pods to fully terminate
  - All new pods (web-0 through web-9) are created according to their .spec.podManagementPolicy
- Downtime occurs between deletion and recreation phases
- No stuck pod scenarios - all pods are forcibly deleted
Status Changes
// StatefulSetConditionType describes the condition types
type StatefulSetConditionType string
const (
    // Progress for a StatefulSet is considered when a new pod is created, deleted, or becomes ready.
    StatefulSetProgressing StatefulSetConditionType = "Progressing"
    StatefulSetAvailable StatefulSetConditionType = "Available"
)
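As a rough illustration of how the Progressing condition might be recorded during a Recreate update, the sketch below uses a hypothetical, simplified Condition struct and a setCondition helper. Neither is taken from the Kubernetes codebase; the reason strings are the ones used in the pseudocode in this KEP.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical version of a StatefulSet status
// condition; the real type also carries timestamps and a message.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition replaces the condition with the same Type, or appends it
// when no such condition exists yet. It returns the updated slice.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	// Deletion phase starts.
	conds = setCondition(conds, Condition{"Progressing", "True", "RecreateInProgress"})
	// All replicas exist at the new revision.
	conds = setCondition(conds, Condition{"Progressing", "True", "RecreateComplete"})
	fmt.Printf("%+v\n", conds)
}
```

Keeping a single entry per condition type means observers can track the update by watching the reason change on one condition.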
Implementation Changes
The implementation requires changes to the StatefulSet controller in pkg/controller/statefulset/stateful_set_control.go:
Strategy Type Handling:
- Add new case for RecreateStatefulSetStrategyType in update strategy switch statement
- Implement separate update path for Recreate strategy alongside existing RollingUpdate and OnDelete paths
Recreate Update Logic:
- Phase 1 - Deletion: Iterate through all pods and delete them (similar to scale-down operation)
- Phase 2 - Wait for Termination: Check all pods for deletionTimestamp; reconcile periodically until all pods are fully terminated
- Phase 3 - Recreation: Create all new pods according to spec.podManagementPolicy
Status Condition Management:
- Add Progressing condition to StatefulSet status
Validation:
- API validation in pkg/apis/apps/validation/validation.go
- Validate type: Recreate can be set on StatefulSet
- No additional fields required for Recreate strategy (unlike RollingUpdate which has partition, maxUnavailable)
Respect Ordering Semantics:
- Recreate strategy according to podManagementPolicy settings
- All pods deleted at once and then re-created according to podManagementPolicy settings
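How the recreation phase could respect podManagementPolicy is sketched below. The function name ordinalsToCreate and the Pod type are hypothetical, and "Running and Ready" is reduced to a single boolean. The function returns the ordinals one reconcile pass would create.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified new-revision pod.
type Pod struct {
	Ready bool // true when the pod is Running and Ready
}

// ordinalsToCreate returns the ordinals to create in one reconcile pass.
// existing maps ordinal -> pod for pods that currently exist.
func ordinalsToCreate(existing map[int]Pod, replicas int, policy string) []int {
	var missing []int
	for i := 0; i < replicas; i++ {
		if _, ok := existing[i]; !ok {
			missing = append(missing, i)
		}
	}
	// Parallel creates every missing pod at once.
	if policy == "Parallel" || len(missing) == 0 {
		return missing
	}
	// OrderedReady creates only the lowest missing ordinal, and only
	// once its predecessor (which must exist) is Ready.
	next := missing[0]
	if next == 0 || existing[next-1].Ready {
		return []int{next}
	}
	return nil
}

func main() {
	existing := map[int]Pod{0: {Ready: true}}
	fmt.Println(ordinalsToCreate(existing, 3, "OrderedReady"))
	fmt.Println(ordinalsToCreate(existing, 3, "Parallel"))
}
```

Deletion is the same under both policies; only this creation step differs, which keeps the ordering guarantee for workloads that rely on it.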
Comparison with Existing Solutions
| Solution | Sequential Ordering | Automatic Recovery | Downtime | Behavior When Pod Stuck | Use Case |
|---|---|---|---|---|---|
| RollingUpdate (default) | Yes | No | No | Halts completely, waits forever | Traditional stateful apps |
| RollingUpdate + maxUnavailable | Yes (batched) | No | No | Still halts completely | Faster updates, but same stuck problem |
| OnDelete | Yes (manual) | No | No | Fully manual control | Maximum safety/control |
| Recreate (proposed) | No | Yes | Yes | All pods deleted and recreated | CI/CD, stateless apps, external storage, LWS |
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- pkg/apis/apps/validation/validation.go: 2025-10-13 - 92.8%
- pkg/controller/statefulset/stateful_set_control.go: 2025-10-13 - 91.5%
- pkg/controller/statefulset/stateful_pod_control.go: 2025-10-13 - 89.6%
- pkg/registry/apps/statefulset/strategy.go: 2025-10-13 - 83.9%
Integration tests
We should cover below scenarios:
- Without type: Recreate: Existing StatefulSets with RollingUpdate and OnDelete continue to work unchanged (backward compatibility)
- With type: Recreate configured:
  - All pods are deleted when update is triggered (template spec change)
  - Controller waits for all pods to fully terminate (no pods with deletionTimestamp remain)
  - All new pods are created after termination complete
  - Status condition Progressing=True
  - Status condition Progressing=True with reason=RecreateComplete after pods created
  - Recreate strategy respects podManagementPolicy
- PVC preservation: PersistentVolumeClaims are not deleted during Recreate (only pods are deleted)
- Stuck pod handling: Pods stuck in any state are forcibly deleted (ImagePullBackOff, Pending, CrashLoopBackOff, etc.)
- Validation: API validation accepts type: Recreate on StatefulSet
- For alpha, add a test to verify that we cannot switch strategies from Recreate to RollingUpdate or OnDelete. Later, for beta, we will need to add a test to verify that we can switch strategies
e2e tests
The following e2e tests will be added to test/e2e/apps/statefulset.go:
- StatefulSet with type: Recreate successfully deletes and recreates all pods during update
- Recreate works with stuck pods (ImagePullBackOff scenario - pods are deleted and new ones created)
- Recreate waits for full termination before creating new pods (no mixed old/new state)
- Recreate preserves PersistentVolumeClaims (data persists across recreation)
- Recreate respects podManagementPolicy during recreation
- StatefulSets without type: Recreate maintain current RollingUpdate/OnDelete behavior (backward compatibility)
- Controller restart during Recreate resumes correctly from last phase
Graduation Criteria
Alpha
- Feature implemented behind a feature flag.
- Unit and integration tests passed as designed in Test Plan.
Beta
- Feature is enabled by default
- Address reviews and bug reports from Alpha users
- Users are able to switch strategies from Recreate to RollingUpdate or OnDelete
- e2e tests:
- Add links to testgrid results
- Verify zero flakes over 2+ weeks
GA
- No negative feedback from developers.
- Consider conformance test if feature becomes widely adopted and part of core contract
- Ensure existing conformance tests for basic RollingUpdate continue to pass
Upgrade / Downgrade Strategy
Upgrade
This feature is protected by the feature-gate StatefulSetRecreateStrategy, which must be enabled on both kube-apiserver and kube-controller-manager.
Component Dependencies:
- kube-apiserver: Validates and persists the type: Recreate strategy in the StatefulSet spec
- kube-controller-manager: Implements the Recreate strategy logic (delete all, wait for termination, create all)
Upgrade Sequence
- Enable feature gate on kube-apiserver first
- Enable feature gate on kube-controller-manager
- Create/update StatefulSets with updateStrategy.type: Recreate
Partial Upgrade Behavior
If apiserver has feature enabled but kube-controller-manager does not:
- API server accepts type: Recreate strategy
- Strategy type is persisted in etcd
- Kube-controller-manager ignores Recreate type and falls back to default RollingUpdate behavior
- No errors, but Recreate behavior is not active
If apiserver does NOT have feature enabled but kube-controller-manager does:
- API server rejects create/update requests that set type: Recreate with a validation error
- Users cannot create or switch to Recreate until the apiserver has the feature enabled.
- Kube-controller-manager cannot process Recreate in this skew because no StatefulSet with type: Recreate can be stored.
Enable the feature gate on kube-apiserver first, then kube-controller-manager to ensure smooth transition.
Downgrade
- The older apiserver does not recognize type: Recreate and will reject create/update requests that set it.
- StatefulSets that already have type: Recreate stored in etcd remain stored, but any update that touches the spec may be rejected unless the strategy is changed back to RollingUpdate/OnDelete first
- The controller in the older version ignores Recreate and behaves as RollingUpdate for those existing objects
Version Skew Strategy
This feature has dependencies between control plane components.
kube-apiserver v1.xx+1 (feature enabled) and kube-controller-manager v1.xx (no feature)
- API accepts type: Recreate, controller ignores it
- StatefulSets fall back to default RollingUpdate behavior
- StatefulSets are functional, just without Recreate strategy feature
- No errors or warnings
kube-apiserver v1.xx (no feature) and kube-controller-manager v1.xx+1 (feature enabled)
- API server rejects create/update requests that set type: Recreate with a validation error
- Users cannot create or update StatefulSets to use Recreate until apiserver is upgraded and the feature is enabled
- Enable the feature on kube-apiserver first, then on kube-controller-manager
Mixed control plane during rolling upgrade
- During control plane upgrade, apiservers and controller-managers may have different versions, and the feature may be enabled or disabled. The behavior depends on the leader’s version:
  - If leader has feature enabled: Recreate strategy is processed correctly
  - If leader has feature disabled: Recreate strategy is ignored, falls back to RollingUpdate behavior
  - Leader may change during upgrade, causing behavior to switch between Recreate and RollingUpdate
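The skew cases described in this section can be summarized as a small decision model. The Go sketch below is illustrative only: apiserverAccepts and controllerBehavior are hypothetical helpers that encode the behavior stated in this KEP, not functions from the Kubernetes codebase.

```go
package main

import "fmt"

const (
	RollingUpdate = "RollingUpdate"
	Recreate      = "Recreate"
)

// apiserverAccepts models validation: a request setting Recreate is only
// accepted when the apiserver has the feature gate enabled.
func apiserverAccepts(gateEnabled bool, strategy string) bool {
	return strategy != Recreate || gateEnabled
}

// controllerBehavior models the controller: with the gate disabled, a
// stored Recreate strategy is ignored and treated as RollingUpdate.
func controllerBehavior(gateEnabled bool, strategy string) string {
	if strategy == Recreate && !gateEnabled {
		return RollingUpdate
	}
	return strategy
}

func main() {
	// Enumerate the four gate combinations for a request that sets Recreate.
	for _, api := range []bool{true, false} {
		for _, kcm := range []bool{true, false} {
			if !apiserverAccepts(api, Recreate) {
				fmt.Printf("apiserver=%v kcm=%v: rejected by validation\n", api, kcm)
				continue
			}
			fmt.Printf("apiserver=%v kcm=%v: behaves as %s\n", api, kcm, controllerBehavior(kcm, Recreate))
		}
	}
}
```

The model makes the recommended ordering visible: only the combination with both gates enabled yields Recreate behavior, and enabling the apiserver first never produces an object the controller cannot handle.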
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: StatefulSetRecreateStrategy
  - Components depending on the feature gate:
    - kube-apiserver
    - kube-controller-manager
Does enabling the feature change any default behavior?
No. Enabling the StatefulSetRecreateStrategy feature gate does not change any default behavior.
The type: Recreate strategy is opt-in. When not explicitly set:
- StatefulSets behave exactly as they do today (default RollingUpdate behavior)
- All existing StatefulSet update strategies continue to work unchanged
The feature only activates when users explicitly configure spec.updateStrategy.type: Recreate in their StatefulSet spec.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the feature can be disabled.
What happens if we reenable the feature if it was previously rolled back?
The feature works normally again. StatefulSets with type: Recreate in their spec will immediately start using Recreate behavior for the next update.
Are there any tests for feature enablement/disablement?
No. Unit and integration tests will be added to cover feature gate enablement/disablement scenarios.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout Failures:
- If apiserver and controller-manager have different feature gate states, type: Recreate may be accepted but ignored (falls back to RollingUpdate)
- API validation accepts type: Recreate as valid strategy type (no complex validation needed)
Rollback Failures:
- If the strategy type was not changed back, StatefulSets with type: Recreate will fall back to RollingUpdate behavior and Recreate behavior will be ignored.
Impact on Running Workloads:
- No impact on StatefulSets without type: Recreate
- StatefulSets with type: Recreate will experience downtime during updates (i.e. all pods are deleted before new ones are created)
What specific metrics should inform a rollback?
- statefulset_unavailable_replicas shows how many StatefulSet replicas are unavailable
- workqueue_depth{name="statefulset"} shows the current depth of the StatefulSet controller queue
- workqueue_queue_duration_seconds{name="statefulset"} shows how long items wait in queue before processing
- workqueue_retries_total{name="statefulset"} shows retry counts which may indicate processing failures
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
No. Tests will be added to cover upgrade and rollback scenarios.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. This feature adds a new strategy type Recreate to spec.updateStrategy.type. No deprecations of existing fields or APIs nor removals of existing functionality.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
- By querying StatefulSets using kubectl:
kubectl get statefulsets -A -o json | \
jq '.items[] | select(.spec.updateStrategy.type == "Recreate") |
{namespace: .metadata.namespace, name: .metadata.name, strategy: .spec.updateStrategy.type}'
- By checking StatefulSet status conditions:
kubectl get statefulsets -A -o json | \
jq '.items[] | select(.status.conditions[]? | select(.type=="Progressing"))'
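For tooling that prefers Go over jq, the first query can be approximated as follows. This is a minimal sketch that parses the JSON produced by kubectl get statefulsets -A -o json using only the fields it needs; the statefulSetList type and recreateStatefulSets function are hypothetical helpers, and the sample input is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statefulSetList mirrors only the fields of the list output that the
// filter needs; everything else in the JSON is ignored.
type statefulSetList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			UpdateStrategy struct {
				Type string `json:"type"`
			} `json:"updateStrategy"`
		} `json:"spec"`
	} `json:"items"`
}

// recreateStatefulSets returns "namespace/name" for every StatefulSet in
// the list whose update strategy type is Recreate.
func recreateStatefulSets(listJSON []byte) ([]string, error) {
	var list statefulSetList
	if err := json.Unmarshal(listJSON, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, item := range list.Items {
		if item.Spec.UpdateStrategy.Type == "Recreate" {
			names = append(names, item.Metadata.Namespace+"/"+item.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Made-up sample input with one Recreate and one RollingUpdate StatefulSet.
	sample := []byte(`{"items":[
		{"metadata":{"namespace":"ci","name":"web"},"spec":{"updateStrategy":{"type":"Recreate"}}},
		{"metadata":{"namespace":"prod","name":"db"},"spec":{"updateStrategy":{"type":"RollingUpdate"}}}
	]}`)
	names, err := recreateStatefulSets(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```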
How can someone using this feature know that it is working for their instance?
- [] Events
- API .status
  - Condition name: Progressing
- Metrics (existing metrics from kube-state-metrics)
  - kube_statefulset_replicas
  - kube_statefulset_status_replicas_ready
  - kube_statefulset_status_replicas_current
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
- 100% of StatefulSets without type: Recreate behave identically to pre-feature behavior
- 99% of Recreate updates complete within (pod termination time + pod startup time + 30s)
- 0% of pods are left in mixed old/new spec states after Recreate update
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics (existing kube-state-metrics metrics)
  - Metric(s) name: kube_statefulset_status_replicas_available, kube_statefulset_status_replicas_ready, kube_statefulset_status_replicas_current
    - Components exposing the metric: kube-state-metrics
  - Metric name: statefulset_unavailable_replicas
    - Components exposing the metric: kube-controller-manager
  - These metrics reflect the StatefulSet .status (availableReplicas, readyReplicas, currentReplicas). They have labels statefulset and namespace, so operators can filter by StatefulSet to monitor a specific StatefulSet during Recreate
  - During Recreate updates, the values show the transition from all pods deleted (0 available) to all new pods created and ready
Are there any missing metrics that would be useful to have to improve observability of this feature?
No. The existing StatefulSet metrics provide sufficient observability for the Recreate strategy.
Dependencies
Does this feature depend on any specific services running in the cluster?
No.
Scalability
Will enabling / using this feature result in any new API calls?
No new types of API calls. If the feature gate is enabled but no StatefulSet uses type: Recreate, then no additional API calls occur.
When Recreate strategy is used during an update, the following existing API call types are made:
- Pod Deletion (DELETE /api/v1/namespaces/{ns}/pods/{name})
- Pod Creation (POST /api/v1/namespaces/{ns}/pods)
- StatefulSet Status Update (PUT /apis/apps/v1/namespaces/{ns}/statefulsets/{name}/status)
- Event Creation (POST /api/v1/namespaces/{ns}/events)
Will enabling / using this feature result in introducing new API types?
No. A new strategy type Recreate is added to the existing StatefulSetUpdateStrategyType enum, but no new API types are introduced.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes, minor increases in size when type: Recreate is used.
Per StatefulSet using Recreate strategy:
- Spec: ~8 bytes (strategy type enum value: “Recreate”)
- Status: ~150-200 bytes when Progressing condition is active
- Total: ~160-210 bytes per StatefulSet
For a cluster with 1000 StatefulSets using Recreate strategy:
- Total increase: ~160-210 KB
- Impact: Negligible compared to typical etcd usage (multi-GB scale)
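The estimate above is simple addition and scaling. A short sketch of the arithmetic, using only the byte figures quoted in this section (the constant and function names are hypothetical):

```python
SPEC_BYTES = 8             # "Recreate" enum value, as estimated above
STATUS_BYTES = (150, 200)  # Progressing condition, low/high estimate


def per_statefulset_bytes():
    """Return the (low, high) added bytes per StatefulSet using Recreate."""
    low, high = STATUS_BYTES
    return SPEC_BYTES + low, SPEC_BYTES + high


def cluster_bytes(count):
    """Scale the per-object estimate to `count` StatefulSets."""
    low, high = per_statefulset_bytes()
    return count * low, count * high


if __name__ == "__main__":
    low, high = cluster_bytes(1000)
    print(f"{low / 1000:.0f}-{high / 1000:.0f} KB")
```

The exact sums are 158-208 bytes per StatefulSet and 158-208 KB for 1000 StatefulSets, which the figures above round to ~160-210.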
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.
API Server Operations:
- GET/LIST StatefulSets: No impact (strategy type is standard enum field, standard deserialization)
- CREATE/UPDATE StatefulSets: Minimal impact (~10-20μs for validating strategy type enum).
StatefulSet Controller Reconciliation:
- With feature enabled but strategy not set to Recreate: No additional overhead.
- With Recreate strategy: Same overhead as manual pod deletion + creation operations.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No.
- Etcd Operations:
  - Minimal increase in object size when using Recreate strategy (~8 bytes for strategy type enum value + ~150-200 bytes for status conditions when active).
- Memory/CPU:
  - Memory (per StatefulSet): ~8 bytes for strategy type enum value.
  - CPU: Strategy type comparison on each reconciliation: ~1-2μs (simple string comparison).
- Network I/O:
  - An additional ~8 bytes per StatefulSet spec when Recreate strategy is set, and ~150-200 bytes per status update when Progressing condition is active.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No, the feature does not introduce new node resource exhaustion risks beyond those of the existing mechanisms.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The feature behaves similarly to existing controllers, which depend on API server and etcd availability.
- API Server Unavailable: StatefulSet controller cannot read/write StatefulSet or Pod objects, so all updates halt.
- etcd Unavailable: Similar to API server unavailability, no state changes can be persisted.
No special handling is required as this feature only changes the update progression logic, not the fundamental dependency on API server/etcd availability.
What are other known failure modes?
N/A
What steps should be taken if SLOs are not being met to determine the problem?
- Examine metrics (kube_statefulset_status_replicas_available, kube_statefulset_status_replicas_ready)
  - If kube_statefulset_status_replicas_available is stuck at 0 for an extended period → pods may be stuck in termination
  - If kube_statefulset_status_replicas_current is increasing but kube_statefulset_status_replicas_ready is not → pods may be failing to start
- Check if pods are stuck in termination (long grace periods, finalizers blocking deletion)
- Verify pod startup time is reasonable (image pull, initialization containers, readiness probes)
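The metric checks in the steps above can be expressed as a small triage helper. This is a sketch only: it assumes two samples of the replica counts taken some time apart, and the 300s cutoff for "extended period" is a hypothetical value, since this section does not define one.

```python
def diagnose(prev, curr, zero_available_s, threshold_s=300):
    """Map two metric samples to the likely causes listed above.

    prev, curr: dicts with 'available', 'ready' and 'current' replica counts,
                sampled at two points in time (layout assumed for this sketch).
    zero_available_s: how long 'available' has been 0, in seconds.
    threshold_s: hypothetical cutoff for "extended period".
    """
    findings = []
    if curr["available"] == 0 and zero_available_s >= threshold_s:
        findings.append("pods may be stuck in termination")
    if curr["current"] > prev["current"] and curr["ready"] <= prev["ready"]:
        findings.append("pods may be failing to start")
    return findings
```

An empty result means neither pattern matched, in which case the remaining manual checks (finalizers, grace periods, image pulls, probes) still apply.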
Implementation History
- 2022-09-26: Initial KEP Created
- 2025-07-29: Updated the KEP after changing the ownership
- 2025-10-13: Pivoted KEP from EnforcedRollingUpdate strategy to podProgressTimeoutSeconds field based on sig-apps feedback. This approach better handles transient vs permanent failures and aligns with Deployment semantics.
- 2025-12-01: Pivoted KEP from podProgressTimeoutSeconds to Recreate strategy based on sig-apps meeting (meeting recording). Key feedback:
  - Progress deadline seconds in Deployments do not terminate pods, but the podProgressTimeoutSeconds proposal would terminate pods
  - Deleting/terminating pods based on readiness signals is problematic and disruptive
  - Group consensus favored Recreate for simplicity and consistency with existing Kubernetes APIs
Drawbacks
Downtime Requirement
The Recreate strategy causes downtime during updates since all pods are deleted before new ones are created:
- Service Interruption: Application is completely unavailable during the deletion/recreation window
- Not Suitable for All Workloads: Traditional stateful applications requiring high availability cannot use this strategy
- User Expectation Management: Users must understand and accept downtime implications
Mitigation:
- Clear documentation emphasizing downtime implications
- Explicit opt-in via type: Recreate (no accidental usage)
- Recommendation to use for appropriate workloads (CI/CD, stateless apps, development environments)
Limited Rollback Options
During a Recreate update, there’s no gradual rollback:
- If new version has issues, all pods are affected (no gradual detection)
- Cannot compare old vs new pods side-by-side during rollout
- Must wait for full recreation cycle to attempt fixes
Mitigation:
- Clear events and status conditions during Recreate process
- Users can choose RollingUpdate for gradual rollouts where needed
- Quick feedback loop due to fast recreation (all pods start together)
Alternatives
Alternative 1: PodProgressTimeoutSeconds Field in RollingUpdate Strategy
Extend the existing RollingUpdate strategy with a podProgressTimeoutSeconds field (similar to Deployment’s progressDeadlineSeconds) that allows timeout-based detection of stuck pods.
API Example:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podProgressTimeoutSeconds: 600 # Wait 10 minutes per pod
      maxUnavailable: 1
Algorithm: For each pod in reverse ordinal order, delete and create new pod, wait for Ready state with timeout. If pod doesn’t become Ready within podProgressTimeoutSeconds, delete and recreate it.
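To make the retry behavior of this alternative concrete, here is a small in-memory simulation of the algorithm as described. It is a sketch under stated assumptions, not controller code: the `ready_times` input, the action tuples, and the `max_attempts` cap are all hypothetical (the cap exists only so the simulation terminates; the proposal does not define one).

```python
def rolling_update_with_timeout(ready_times, timeout_s, max_attempts=3):
    """Simulate the timeout-based rolling update on in-memory data.

    ready_times: {ordinal: [seconds until Ready for attempt 1, attempt 2, ...]};
                 None (or a missing entry) means the attempt never becomes Ready.
    timeout_s:   stands in for podProgressTimeoutSeconds.
    """
    actions = []
    for ordinal in sorted(ready_times, reverse=True):  # reverse ordinal order
        attempts = ready_times[ordinal]
        for attempt in range(max_attempts):
            actions.append(("delete_and_create", ordinal))
            t = attempts[attempt] if attempt < len(attempts) else None
            if t is not None and t <= timeout_s:
                actions.append(("ready", ordinal))
                break
            # Not Ready within the timeout: the pod is deleted and recreated.
            actions.append(("timeout", ordinal))
        else:
            # Attempts exhausted; sequential ordering means lower ordinals wait.
            actions.append(("gave_up", ordinal))
            break
    return actions
```

The simulation shows the cost named in the cons below: every failed attempt blocks for the full timeout before any other pod is touched.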
Pros:
- Maintains sequential ordering guarantees
- Distinguishes transient failures (slow image pulls) from permanent failures (misconfig)
- Works with existing maxUnavailable and partition fields
- Allows fine-grained control over timeout per workload
Cons:
- Complexity: Requires tracking per-pod creation timestamps and deadline state across reconciliation loops
- Timeout Configuration Burden: Users must choose appropriate timeout values (too short = unnecessary churn, too long = slow recovery)
- Doesn’t Solve All Scenarios: Still blocks on transient issues until timeout expires
- Controller Complexity: Adds significant complexity to StatefulSet controller logic
Why Not Chosen as Primary Solution: Based on sig-apps meeting feedback (meeting link), the group favored the simpler Recreate strategy approach. Key concerns raised:
- Progress deadline in Deployments does not terminate pods when deadline is reached, but this proposal would
- Using readiness signals to terminate pods is problematic and disruptive
- The timeout-based approach adds complexity that may not be necessary for the primary use cases (CI/CD, stateless apps, external storage)
- Recreate strategy is “pretty bare” and has direct parallel with Deployment patterns, making it easier to implement and understand
Alternative 2: EnforcedRollingUpdate Strategy
Add a new update strategy type EnforcedRollingUpdate that immediately deletes and replaces stuck pods without timeout during rolling updates.
API Example:
spec:
  updateStrategy:
    type: EnforcedRollingUpdate
    enforcedRollingUpdate:
      maxUnavailable: 1
Algorithm: When pod[i] needs update, delete it immediately regardless of current state, create new pod, wait for Ready.
Pros:
- Simpler than timeout-based approach (no deadline tracking)
- Maintains some ordering through sequential updates
- Immediate action on stuck pods
Cons:
- Cannot distinguish transient from permanent failures (network delays, CI/CD pipeline delays, slow image pulls)
- Still maintains sequential ordering, which adds complexity
- Doesn’t solve initial deployment failure, only works when spec changes
Why Not Chosen: Similar concerns as Alternative 1, but Recreate is even simpler by removing ordering requirements entirely.
Alternative 3 (Now Primary Solution): Recreate Strategy
NOTE: This alternative was chosen as the primary solution for this KEP based on sig-apps meeting feedback.
Add a Recreate update strategy (matching Deployment’s Recreate strategy) that deletes all pods before creating new ones.
API Example:
spec:
  updateStrategy:
    type: Recreate
Algorithm: Delete all pods, wait for termination, create all new pods according to spec.podManagementPolicy.
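The three steps of this algorithm can be sketched as a sequence of actions over pod ordinals. This is an illustrative simulation rather than the controller implementation: the function name and action tuples are hypothetical, and the creation order shown (ascending ordinals, waiting for Ready only under OrderedReady) is assumed from how StatefulSets create pods today.

```python
def recreate_actions(ordinals, policy="OrderedReady"):
    """Return the ordered actions for one simulated Recreate update.

    ordinals: pod ordinals currently owned by the StatefulSet.
    policy:   value of spec.podManagementPolicy ("OrderedReady" or "Parallel").
    """
    actions = []
    # Step 1: delete every pod, regardless of its current state.
    for ordinal in ordinals:
        actions.append(("delete", ordinal))
    # Step 2: wait until all old pods have terminated.
    actions.append(("wait_terminated", None))
    # Step 3: create the new pods according to podManagementPolicy.
    for ordinal in sorted(ordinals):
        actions.append(("create", ordinal))
        if policy == "OrderedReady":
            # The next pod is created only after this one is Ready.
            actions.append(("wait_ready", ordinal))
    return actions
```

Whatever the policy, every delete precedes every create, which is what gives the clean-state guarantee listed in the pros below.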
Pros:
- No complexity around stuck pods or timeout tracking
- All pods deleted before new ones created, guaranteeing clean state
- Simple, predictable behavior aligned with Deployment patterns
- Can quickly replace all pods regardless of their current state
- No need to configure timeouts or tune parameters
Cons:
- No ordering during deletion (all at once). Ordering during creation applies only when podManagementPolicy is OrderedReady
- Not suitable for traditional stateful workloads requiring zero-downtime updates
Why Chosen as Primary Solution: Based on sig-apps meeting discussion, this approach is:
- Simpler to implement and understand (matches existing Deployment Recreate pattern)
- Addresses the primary use cases (CI/CD, stateless apps, external storage, LeaderWorkerSet)
- Avoids concerns about terminating pods based on readiness/timeout signals
- Provides explicit opt-in behavior where users accept downtime for automated recovery
Alternative 4: Add Force Flag to RollingUpdate
Add a boolean field like spec.updateStrategy.rollingUpdate.forceUpdate: true.
Pros:
- Minimal API change
Cons:
- Same issue as Alternative 1; cannot distinguish transient from permanent failures
- Less discoverable than dedicated field
- Boolean flag doesn’t allow tuning timeout per workload
Why Not Chosen: Recreate strategy is clearer about behavior and simpler to implement.
Alternative 5: Enhance Parallel Policy
Extend podManagementPolicy: Parallel to automatically replace stuck pods during updates.
Pros:
- Reuses existing field
- Already has parallel semantics
Cons:
- Loses sequential ordering guarantees
- Confuses semantics of podManagementPolicy (affects both scaling and updates) vs updateStrategy (updates only)
- Less explicit than dedicated strategy type
- Doesn’t automatically delete all pods for clean state
Why Not Chosen: Recreate strategy as a dedicated update strategy type is clearer and more explicit. It also aligns better with Deployment patterns.
Infrastructure Needed (Optional)
N/A