KEP-3335: StatefulSet Slice

KEP-3335: StatefulSet Start Ordinal

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Alternative API changes
- Alternatives without any API changes
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The goal of this feature is to allow a StatefulSet to be migrated (cross namespace, cross cluster, split into segments) without disrupting the underlying application.

A StatefulSet of N replicas implicitly numbers pods from ordinal 0 to N-1. The end ordinal (N-1) can be controlled with the replicas field. The goal of this feature is to allow a StatefulSet’s first ordinal to start from any natural number k. This enables StatefulSet ordinals from k to N+k-1.

Motivation

This feature is motivated by the use case of orchestrating the migration of a StatefulSet across a namespace or a Kubernetes cluster without disruption. Existing approaches to this problem include:

Backup and restore: This approach takes a backup of an application (StatefulSet, underlying storage), and re-creates it in a different location. This introduces application downtime for the period between termination of the old StatefulSet and re-creation of the new one.
Pod level migration: Using --cascade=orphan when deleting a StatefulSet preserves the pods. This allows an application operator to evict and reschedule pods individually. However, as pods are ephemeral, this requires the application operator to emulate the behavior of the StatefulSet, to reschedule pods as they restart, or are evicted and rescheduled.

Migrating a StatefulSet in slices allows for gradual migration of the application, as only a subset of replicas are migrated at any time. Consider the scenario of transferring pod ordinal ownership from a source StatefulSet with N pods to a destination StatefulSet with 0 pods. Further, to maintain application availability, no more than d pods should be unavailable at any time during the transfer. An orchestrator can manipulate .spec.replicas and .spec.ordinals.start to perform this migration:

Validate the source StatefulSet (replicas=N, ordinals.start=0).
Validate the destination StatefulSet (replicas=0, ordinals.start=N).
Adjust PDBs on source and destination to distribute the budget d.
While the destination StatefulSet has fewer than N replicas (source has k+1 replicas, destination has N-k-1 replicas):
1. Scale down the source StatefulSet by 1. Replica k will be terminated (replicas=k, ordinals.start=0)
  1. This should allow the application to determine that replica k is no longer available.
  2. PDBs associated with each StatefulSet should be adjusted to reflect a reduced availability budget based on d
2. Move any dependencies of replica k to the destination (cluster or namespace)
  1. This may include namespace resources (PVC, ConfigMap) or cluster scoped resources (PV)
3. Scale up the destination StatefulSet by 1. Replica k will be started (replicas=N-k, ordinals.start=k)
  1. The replica k should re-advertise its new network identity, through application peer discovery, and network endpoints that reference the existing pod ordinal should be updated.
After the loop completes, the source StatefulSet should have 0 replicas and the destination StatefulSet N replicas.
Clean up the source StatefulSet, and any unused resources safely.
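
The transfer loop above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: Slice and Migrate are hypothetical names, no Kubernetes API is called, and moving dependencies (PVCs, PVs) and adjusting PDBs are reduced to a comment. It checks that the two slices stay adjacent at every step, so no ordinal is owned by both StatefulSets.

```go
package main

import (
	"errors"
	"fmt"
)

// Slice is a hypothetical stand-in for the two StatefulSet fields an
// orchestrator manipulates; it is not the real apps/v1 type.
type Slice struct {
	Start    int // models .spec.ordinals.start
	Replicas int // models .spec.replicas
}

// Migrate transfers n ordinals from a source slice to a destination slice,
// one at a time and highest ordinal first. It returns the final slices and
// the number of steps taken.
func Migrate(n int) (src, dst Slice, steps int, err error) {
	src = Slice{Start: 0, Replicas: n}
	dst = Slice{Start: n, Replicas: 0}
	for dst.Replicas < n {
		// Scale down the source: its highest ordinal is terminated.
		src.Replicas--
		// (Dependencies of that ordinal, such as PVCs, would be moved here.)
		// Scale up the destination so that it claims the freed ordinal.
		dst.Start--
		dst.Replicas++
		// Invariant: the slices stay adjacent, so no ordinal is owned twice
		// and none is skipped.
		if src.Start+src.Replicas != dst.Start {
			return src, dst, steps, errors.New("slices overlap or leave a gap")
		}
		steps++
		fmt.Printf("step %d: source=%+v destination=%+v\n", steps, src, dst)
	}
	return src, dst, steps, nil
}

func main() {
	if _, _, _, err := Migrate(5); err != nil {
		fmt.Println("migration failed:", err)
	}
}
```

In this sketch exactly one ordinal is unowned between the scale-down and the scale-up of each step, which corresponds to an availability budget of d = 1.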

StatefulSets are implicitly numbered starting at ordinal 0. When pods are being deployed, they are created sequentially in order from pod 0 to pod N-1. When pods are being deleted, they are terminated in reverse order from pod N-1 to pod 0. This behavior limits the migration scenario where an application operator wants to scale down pods in the source StatefulSet and scale up pods in the destination StatefulSet. If pod N-1 is removed from the source StatefulSet, there is no mechanism to create only pod N-1 in a destination StatefulSet without creating pods [0, N-2] as well. To do so would lead to the presence of duplicate pod ordinals (eg: pod 0 would exist in both StatefulSets).

Extending StatefulSet to start at an ordinal k (eg: N-1) would allow the destination StatefulSet to skip over pods [0, N-2] when creating pod N-1. This allows the original StatefulSet to be sliced at ordinal k between the source and destination StatefulSet.

Goals

StatefulSet controller manages pods for a slice of a StatefulSet, within the range [k, N+k-1]

Non-Goals

Updating a PDB to safeguard more than one StatefulSet slice
- As StatefulSet slices are scaled up or down, corresponding PDBs can also be adjusted. For example, a PDB corresponding to a slice of k replicas could be adjusted to MinAvailable: k-1 on scale up or down events. Providing guidance and functionality to adjust these PDBs is outside the scope of this KEP.
Orchestrating pod movement from one StatefulSet slice to another
Managing network connectivity between pods in different StatefulSet slices
Orchestrating storage lifecycle of PVCs and PVs across different StatefulSet slices
- Referenced PV/PVCs will need to be migrated in order for a new StatefulSet to reference data that was used by an existing StatefulSet. Orchestration complexity will depend on how volumes are used (RWO with .spec.volumeClaimTemplates on a StatefulSet, RWX with pod .spec.volumes). If using StatefulSet PVC Auto-Deletion (KEP-1847), whenDeleted and whenScaled should be set to Retain on the existing StatefulSet prior to migration.

Proposal

This KEP solves the problem of managing subsets of the replicas in a StatefulSet by introducing the concept of a slice. A slice consists of a start ordinal k, and a number of replicas N. To control the starting and ending ordinal of each slice, a new struct ordinals is introduced to StatefulSetSpec.

User Stories (Optional)

The main motivation of this KEP is to support a more flexible StatefulSet, a building block in an ecosystem where Stateful applications can be migrated across Kubernetes clusters with more automation. Below are two high level user stories that share the problem of having a StatefulSet locked into a specific configuration. To fully automate these two scenarios, additional building blocks are needed around volume management and networking (see Notes/Constraints/Caveats).

Story 1

Migrating across namespaces: Many organizations use namespaces for team isolation. Consider a team that is migrating a StatefulSet to a new namespace in a cluster. Migration could be motivated by a branding change, or a requirement to move out of a shared namespace. Consider the StatefulSet my-app with replicas: 5, running in a shared namespace.

name: my-app
namespace: shared
replicas: 5
-----------------------------------------------
[ nginx-0, nginx-1, nginx-2, nginx-3, nginx-4 ]

To move two pods, the my-app StatefulSet in the shared namespace can be scaled down to replicas: 3, ordinals.start: 0, and an analogous StatefulSet in the app-team namespace scaled up to replicas: 2, ordinals.start: 3. This allows for pod ordinals to be managed during migration. The application operator should manage network connectivity, volumes and slice orchestration (when to migrate and by how many replicas).

name: my-app                     name: my-app
namespace: shared                namespace: app-team
replicas: 3                      replicas: 2
ordinals.start: 0                ordinals.start: 3
------------------------------   ---------------------
[ nginx-0, nginx-1, nginx-2 ]    [ nginx-3, nginx-4 ]

The ordinals.start and replicas fields should be updated jointly, depending on the requirements of the migration.

Story 2

Migrating across clusters: Organizations taking a multi cluster approach may need to move workloads across clusters due to capacity constraints, infrastructure constraints, or for better application isolation. Similar to namespace migration, the application operator should manage network connectivity, volumes and slice orchestration.

Story 3

Non-Zero Based Indexing: A user may want to number their StatefulSet starting from ordinal 1, rather than ordinal 0. Using 1-based numbering may be easier to reason about and conceptualize (eg: ordinal k is the k-th replica, not the (k+1)-th replica).

Notes/Constraints/Caveats (Optional)

The following caveats apply to migrating a StatefulSet (scaling down one slice and scaling up another). They are outside the scope of this KEP, but are relevant to the migration user journey that motivates this feature.

Networking: Managing services and networking during migration is outside the scope of this proposal. Cross cluster migration can leverage Multi-Cluster Services to establish connectivity between pods in different slices. The application operator must set up Multi-Cluster Services in the clusters, and the underlying Stateful application must be configured appropriately. Cross namespace migration can leverage a fallback domain, referring to services from both slices. Similarly this requires an application to be aware of both services.

Storage: StatefulSets that use volumeClaimTemplates, will create pods that consume per replica PVCs. PVs are cluster scoped resources, but are bound one-to-one with namespace scoped PVCs. If the underlying storage is to be re-used in the new namespace, PVs must be unbound and manipulated appropriately.

Orchestration: Consider migrating from namespace A to B. To preserve StatefulSet at-most-one semantics, pods should only be migrated when safe to do so. If migrating across namespaces, a pod with ordinal i should be scaled down in slice A before it is scaled up in slice B.

Risks and Mitigations

This KEP proposes a new field spec.ordinals.start with a default value of 0. StatefulSet will maintain current behavior, if this field is unset.

To mitigate risk, this feature will be rolled out with an alpha feature gate for experimentation. In Beta, new functionality should only take effect if the field spec.ordinals.start is set to a value greater than 0.

Design Details

StatefulSet Spec Changes

A new struct, ordinals, is introduced to the StatefulSetSpec. In this KEP, the struct has only a single field, Start. A struct is added (rather than putting start directly into spec) to allow for the ordinals struct to change over time. If future use cases of StatefulSet require further ordinal controls (eg: ordinal numbering based on failure domains), new fields related to the numbering and grouping of StatefulSet ordinals can be added to this struct.

type StatefulSetSpec struct {
        // Ordinals controls how the stateful set creates pod and
        // persistent volume claim names.
        // The default behavior assigns a number starting with zero
        // and incremented by one for each additional replica requested.
        // +optional
        Ordinals struct {
                // Start is the number representing the
                // first index that is used to represent replica ordinals.
                // If set, replica ordinals will be numbered
                // [ordinals.start, ordinals.start + replicas).
                // If unspecified, defaults to 0.
                // +optional
                Start int32
        }
}
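
As a rough illustration of how Start and the replica count together determine pod identities, the sketch below lists the pod names for the half-open range [start, start+replicas). PodNames is a hypothetical helper, not part of the StatefulSet controller. It relies on the documented pod naming convention of StatefulSet name, a hyphen, then the ordinal, and assumes a non-negative replica count.

```go
package main

import "fmt"

// PodNames lists the pod names a StatefulSet called name would own for
// ordinals [start, start+replicas). replicas is assumed to be non-negative.
func PodNames(name string, start, replicas int32) []string {
	names := make([]string, 0, replicas)
	for i := int32(0); i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d", name, start+i))
	}
	return names
}

func main() {
	fmt.Println(PodNames("web", 0, 3)) // start left at its default of 0
	fmt.Println(PodNames("web", 3, 2)) // start set to 3
}
```

With start left at 0 the names match today's behavior; with start set to 3 and two replicas the pods are web-3 and web-4.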

Control Loop Changes

In the main control loop, StatefulSet will attempt to create pod replicas [k, N+k-1]

When scaling up pods: If ordinal i in range [k, N+k-1] does not exist, pod i will be created.
When scaling down pods: If ordinal j exists but is not in range [k, N+k-1], pod j will be terminated.

RollingUpdate Partition Changes

Since ordinals.start changes the offset of the replica ordinals, this affects the partition field used for RollingUpdate. As partition specifies an ordinal index, the partition field must be in the range [k, N+k-1], to be valid.
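
The scale-up and scale-down rules of the control loop, together with the partition range stated above, can be sketched as two small functions. This is a simplified model rather than the controller's actual code: Plan and PartitionInRange are hypothetical names, pods are represented by their ordinals only, and pod management policy and readiness are ignored.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan compares the ordinals that currently exist with the desired range
// [start, start+replicas) and returns the ordinals to create and to terminate.
// existing is assumed to contain no duplicates.
func Plan(existing []int, start, replicas int) (create, terminate []int) {
	have := make(map[int]bool, len(existing))
	for _, o := range existing {
		have[o] = true
		if o < start || o >= start+replicas {
			terminate = append(terminate, o) // exists but is out of range
		}
	}
	for o := start; o < start+replicas; o++ {
		if !have[o] {
			create = append(create, o) // in range but does not exist
		}
	}
	sort.Ints(terminate)
	return create, terminate
}

// PartitionInRange reports whether partition lies in [start, start+replicas-1].
func PartitionInRange(partition, start, replicas int) bool {
	return partition >= start && partition <= start+replicas-1
}

func main() {
	// Pods 0..4 exist; the spec now asks for start=3, replicas=4, i.e. [3, 6].
	create, terminate := Plan([]int{0, 1, 2, 3, 4}, 3, 4)
	fmt.Println(create, terminate)         // [5 6] [0 1 2]
	fmt.Println(PartitionInRange(2, 3, 4)) // false
	fmt.Println(PartitionInRange(6, 3, 4)) // true
}
```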

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/apis/apps/validation: 2023-02-05: 90.5%
k8s.io/kubernetes/pkg/controller/statefulset: 2023-02-05: 85.7%
k8s.io/kubernetes/pkg/registry/apps/statefulset: 2023-02-05: 65.2%

E2E tests

k8s.io/kubernetes/test/e2e/apps/statefulset
- Testgrid
- k8s-triage

Integration tests

k8s.io/kubernetes/test/integration/statefulset.TestStatefulSetStartOrdinal
- Testgrid
- k8s-triage

Graduation Criteria

Alpha

Feature functionality implemented but hidden behind a feature gate
Add unit and integration tests

Beta

Validate with user workloads
Enable feature gate for e2e pipelines
Add e2e tests

GA

Real-world usage
- The LeaderWorkerSet API (LWS)

Upgrade / Downgrade Strategy

Upgrades: This feature adds a new field (ordinals.start) to the StatefulSet. The default value for the new field maintains the existing behavior of StatefulSet.

Downgrades: When using ordinals.start, downgrades are not backwards compatible. Versions of StatefulSet not implementing this feature will attempt to re-create all replicas from [0, N-1], and terminate any pods of ordinal N or greater, where N is the number of replicas.

Version Skew Strategy

There are only kube-controller-manager changes involved (in addition to the apiserver changes for dealing with the new StatefulSet field). Node components are not involved so there is no version skew between nodes and the control plane.

An n-1 kube-controller-manager will have the same effect (when applicable) as rolling back to a version where this feature is not enabled. See Rollout, upgrade and rollback planning for details.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: StatefulSetStartOrdinal
- Components depending on the feature gate:
  - kube-controller-manager: Controls which replica ordinals are created
  - kube-apiserver: Manages the new policy field ordinals.start
Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No

Does enabling the feature change any default behavior?

No, if the new StatefulSet field .spec.ordinals.start is unset, StatefulSet will retain existing behavior.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, disabling the feature gate will cause the new field to be ignored.

Note that if disabled, StatefulSet will implicitly number ordinals starting from 0. This can cause churn of pods when disabled, if ordinals.start is not 0. Care should be taken when disabling this feature on a spec that has ordinals.start set, and should only be done when pod churn or disruption can be tolerated.

What happens if we reenable the feature if it was previously rolled back?

StatefulSet will see the ordinals.start field and scale pods to start from this ordinal. This can cause churn of pods when enabled, if ordinals.start is not 0. Care should be taken when enabling this feature on a spec that has ordinals.start set, and should only be done when pod churn or disruption can be tolerated.

Are there any tests for feature enablement/disablement?

Existing e2e tests will validate that existing behavior is preserved when the feature is enabled but not in use (eg: when the new ordinals.start field is not specified).

Additionally, unit tests for validating enablement/disablement will be added in Beta.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If a control plane rollout disables this feature, the StatefulSet controller will update the ordinal numbers it controls. This will result in some pods being deleted while other pods are created. The StatefulSet controller scales up pods before it deletes pods, so the StatefulSet should not manage fewer than the number of replicas defined in the spec. Disabling the feature may have an effect on the stateful workload that is being run. If the stateful application expects a specific ordinal number to be available, it may result in an application failing to reach quorum, or rebalancing data based on the number of available replicas.

What specific metrics should inform a rollback?

The kube_statefulset_status_replicas metric can be monitored against the kube_statefulset_replicas metric to compare the expected number of replicas with the actual number of pods matched by this StatefulSet’s selector. If these metrics diverge during steady state operations, this can indicate that the number of replicas being created by the StatefulSet does not match the expected number of replicas.

On a large scale (across a large number of StatefulSets) the distribution of the ratio of these two metrics should not change when enabling this feature. If this ratio changes significantly after enabling this feature, it could indicate a problem.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

A manual upgrade->downgrade->upgrade scenario was performed:

Create a cluster on a version that doesn’t use this feature (eg: 1.26)
Upgrade a cluster to a version that uses this feature (eg: 1.27)
Install a StatefulSet that uses the .spec.ordinals.start field (eg: 2)
Validate the StatefulSet creates the correct pods
Downgrade the cluster to the prior version that doesn’t use this feature
Validate the StatefulSet follows the documented rollback scenario and pods are re-created so the start ordinal is 0
Upgrade the cluster to the newer version that uses this feature
Validate the StatefulSet pods are modified to start at .spec.ordinals.start

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No removals or deprecations are tied to this rollout. The rollout is enabled by the feature flag StatefulSetStartOrdinal.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

An operator can check the .spec.ordinals.start field on the StatefulSet to determine if this StatefulSet has a non-default start ordinal defined. The operator can also check if the kube_statefulset_ordinals_start metric is set. If .spec.ordinals is set on the StatefulSet, this metric will be populated. This metric can be counted across StatefulSets in a Kubernetes cluster, to identify the number of StatefulSets using this feature.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details: The user can inspect the pods that are created by the StatefulSet which match the StatefulSet’s selector.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature does not state an SLO.

For checking correctness, the kube_statefulset_status_replicas metric can be compared against the kube_statefulset_replicas metric, that is, the actual number of pods matched by this StatefulSet’s selector against the expected number of replicas. Under steady state, these two metrics should be equal. Note that these two metrics can diverge if application replicas don’t start up for other reasons (eg: StatefulSet is using PodManagementPolicy: OrderedReady, and pod-k doesn’t become ready, preventing pod-k+1 from being created).

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metric name: statefulset_reconcile_delay
- [Optional] Aggregation method: quantile
- Components exposing the metric: pkg/controller/statefulset
Metric name: kube_statefulset_replicas
- [Optional] Aggregation method: gauge
- Components exposing the metric: pkg/controller/statefulset
Metric name: kube_statefulset_status_replicas
- [Optional] Aggregation method: gauge
- Components exposing the metric: pkg/controller/statefulset
Metric name: kube_statefulset_ordinals_start
- [Optional] Aggregation method: gauge
- Components exposing the metric: pkg/controller/statefulset

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature depends on the API server to determine the health of a pod, in order to control pods with particular ordinal numbers. There are no other external dependencies.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, StatefulSet adds an additional .spec.ordinals field. If set, this adds a nested integer, .spec.ordinals.start.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No. The runtime for pod control loop remains the same with this feature.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No. Resource usage remains the same with this feature.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No. This feature runs only on the control plane (StatefulSet controller within kube-controller-manager). It also doesn’t result in any increased node usage, as the number of expected StatefulSet replicas remains constant whether or not this feature is in use (that is, whether or not .spec.ordinals.start is set).

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

In the event of API server/etcd unavailability, the StatefulSet control loop will be unable to list pod resources. This will prevent the control loop from being able to reconcile pod resources in the cluster. When API server and etcd become available again, the control loop will adjust to reconcile resources, according to the .spec.ordinals.start and .spec.replicas fields.

What are other known failure modes?

Rollback: On feature rollback a user workload may be disrupted due to replica ordinal changes. See Rollout, upgrade and rollback planning for context.
- Detection: This issue can affect any workloads that are using a non-zero .spec.ordinals.start field prior to rollback. StatefulSets that are using this field can be identified through the kube_statefulset_ordinals_start metric.
- Mitigations: To mitigate, pods can be orphaned from their StatefulSet by deleting it with --cascade=orphan, which prevents the StatefulSet from deleting replica pods until the application operator has a chance to react to the feature rollback.
- Testing: Unit tests exist to validate that the storage specification is preserved on rollback. This means that if the feature is re-enabled after rollback, the .spec.ordinals.start field will be preserved on the StatefulSet.

What steps should be taken if SLOs are not being met to determine the problem?

The StatefulSet should be validated to check if the correct number of replicas are running, and the replica ordinal numbering matches what is specified in the .spec.ordinals.start field. This can be done by looking at the running pods, and seeing if they are numbered from .spec.ordinals.start to .spec.ordinals.start + .spec.replicas - 1.
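
The check described above could be scripted along the following lines. This is a minimal sketch that assumes the pod names have already been collected (for example from a pod listing) and that they follow the StatefulSet naming convention of set name, a hyphen, then the ordinal. CheckOrdinals and ordinalOf are hypothetical helpers, not existing tooling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ordinalOf extracts the trailing ordinal from a pod name of the form
// <setName>-<ordinal>. ok is false if the name does not match.
func ordinalOf(setName, podName string) (ordinal int, ok bool) {
	prefix := setName + "-"
	if !strings.HasPrefix(podName, prefix) {
		return 0, false
	}
	n, err := strconv.Atoi(strings.TrimPrefix(podName, prefix))
	if err != nil || n < 0 {
		return 0, false
	}
	return n, true
}

// CheckOrdinals compares running pod names against the expected range
// [start, start+replicas-1]. It returns the ordinals that are missing and
// the ordinals that exist but fall outside the range.
func CheckOrdinals(setName string, start, replicas int, pods []string) (missing, unexpected []int) {
	seen := make(map[int]bool, len(pods))
	for _, p := range pods {
		o, ok := ordinalOf(setName, p)
		if !ok {
			continue // not a pod of this StatefulSet
		}
		seen[o] = true
		if o < start || o > start+replicas-1 {
			unexpected = append(unexpected, o)
		}
	}
	for o := start; o <= start+replicas-1; o++ {
		if !seen[o] {
			missing = append(missing, o)
		}
	}
	return missing, unexpected
}

func main() {
	missing, unexpected := CheckOrdinals("web", 3, 2, []string{"web-3", "web-0", "other-4"})
	fmt.Println(missing, unexpected) // [4] [0]
}
```

A non-empty result in either list suggests the controller has not finished reconciling, or is stuck.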

If this is not the case, it could indicate that the StatefulSet controller is stuck reconciling. The StatefulSet controller creates new pod ordinals before it deletes lower pod ordinals, so the controller may be stuck reconciling higher-ordinal pods. This can happen if a higher-ordinal pod cannot be scheduled, so any pending or terminating pods matched by the selector can be inspected to determine why the StatefulSet status is not converging to the expected .spec.replicas.

If further problems are experienced, the feature can be rolled back. Note the caveats around Rollback prior to doing so.

Implementation History

2022-06-02: KEP created.
2022-10-06: Alpha implementation.
2023-02-09: Beta graduation.
2024-06-04: Stable graduation.

Drawbacks

Downgrades are not gracefully supported, and are not backwards compatible. Cluster downgrades can cause a disruption to StatefulSet workloads if performed while ordinals.start is not 0.

Alternatives

Alternative API changes

ReverseOrderedReady: A new PodManagementPolicy value called ReverseOrderedReady could be added. This would allow a StatefulSet to be started and actuated from the highest ordinal (current default is from the lowest ordinal). For the cross-cluster migration use case, this would allow for a source StatefulSet to be scaled down and a target StatefulSet to be scaled up. The downside with this API is that pod management policy is not a mutable field. So if an orchestrator uses this behavior to scale up a StatefulSet in a destination cluster, and then wants to revert the PodManagementPolicy back to default, the StatefulSet would need to be deleted and re-created.

KEP-3521: KEP-3521 proposes a Pod .spec level API that enables a pod to be paused at the initial scheduling phase of pod lifecycle. This provides granular control of which pods should be started and running (active) and which pods shouldn’t be scheduled (standby). An orchestrator can leverage control over specific pod scheduling, without making changes to the StatefulSet controller, as the StatefulSet controller is in control of creating pods.

If the StatefulSet controller is using OrderedReady Pod Management, pausing scheduling can result in a pod being marked as not Ready. This will prevent the StatefulSet controller from actuating updates to higher ordinal pods (eg: pod m will not be created if pod n is unhealthy, where m > n). This may increase orchestrator complexity, by requiring an orchestrator of a migration to leverage Parallel Pod Management during a migration, and then re-create a StatefulSet (using --cascade=orphan) to revert back to OrderedReady if desired.

Additionally, if modifying a StatefulSet template is undesired, a webhook must be introduced to mark Pods as paused when they are created. This adds a layer of complexity to an orchestrator operator, since it needs both an operator component that is capable of making changes through the API server, and a webhook that is reading from a consistent migration state.

Alternatives without any API changes

Orphan Pods: Users can orphan pods from a StatefulSet, migrate pods across a namespace or cluster, and create a new StatefulSet to manage pods upon migration. In the case of pod eviction or failure, pods will need to be manually recreated, requiring manual intervention and constant monitoring.

Backup/Restore: Users can backup and restore a StatefulSet (and underlying storage) in a new namespace or cluster. Doing so requires the existing StatefulSet to be deleted, for underlying storage to be backed up and restored, resulting in downtime for the stateful application.