KEP-3017: Pod Healthy Policy for PDB

KEP-3017: Unhealthy Pod Eviction Policy for PDBs

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Abandoned Alternative Implementation
- Changes to the disruption controller
- Changes to the definition of healthy in a PDB according to the policy used.
Future Work

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Pod Disruption Budgets currently don’t provide a way for users to specify how to handle pods that are Running but not Healthy (Ready). In this KEP, we add a new field, unhealthyPodEvictionPolicy, that lets users specify what should happen to these pods: whether they can always be evicted, or whether they should be kept when the application guarded by the Pod Disruption Budget is disrupted and not available.

Motivation

Pod Disruption Budgets are currently being used for two different purposes:

Provide best-effort constraints on voluntary disruption to preserve availability on a set of pods.
Prevent data-loss by blocking eviction of pods until any data unique to a soon-to-be evicted pod has been copied/shared/replicated to other pod(s).

Both use-cases have rough edges with the current implementation.

For users who only want to make sure a minimum number of pods are available, it is possible to end up in situations where pods that are Running but not Healthy (Ready) cannot be evicted, even when the total number of pods is higher than the threshold set in the PDB (https://github.com/kubernetes/kubernetes/issues/72320). This can block automated tooling such as the cluster-autoscaler, as well as node drains.

For users who leverage PDBs to prevent data-loss, the solution is unsafe (racy, as described in https://github.com/kubernetes/kubernetes/pull/105296#issuecomment-929209150) and arguably uses the API in a way it was not designed for.

The first use-case is the primary goal of PDBs, but feedback suggests that enough users leverage PDBs for the second use-case that changing the behavior in a way that doesn’t support it is not an option, particularly because Kubernetes doesn’t provide any alternative solution for this problem.

Goals

Prevent PDBs from deadlocking eviction due to non-Healthy (non-Ready) pods.
Make sure users who rely on PDBs for data-safety can continue to do so.

Non-Goals

Providing a safe solution for preventing data-loss. This is not because it isn’t important, but because it is unclear whether a PDB is the right tool for this.
Allowing customization of healthiness detection for pods guarded by a PodDisruptionBudget.

Proposal

The core issue here is whether a pod that is Running but not Healthy (Ready) is considered disrupted, and thus should be evicted without being potentially constrained by a Pod Disruption Budget.

Currently, we only allow evicting Running pods that are not yet Healthy (Ready) when enough pods are healthy (.status.currentHealthy is at least equal to .status.desiredHealthy). This gives the application the best chance to achieve availability and prevents data loss by disallowing disruption of starting pods that have not become Healthy (Ready) yet.

We also want to allow unconditional eviction of Running pods for applications that do not have such strict constraints. This will allow cluster administrators to evict misbehaving applications that are guarded by a PDB and proceed with node drain.

Adding an unhealthyPodEvictionPolicy field to the PDB API will allow the user to specify which behavior is desired. This will be handled consistently by the eviction API and by any other APIs that might use PDBs. If an unhealthyPodEvictionPolicy is not provided, the default will be the current behavior.

The behavior for pods in Pending, Succeeded or Failed phase will stay the same and such pods will always be considered for eviction.

Risks and Mitigations

Design Details

API

// PodDisruptionBudgetSpec is a description of a PodDisruptionBudget.
type PodDisruptionBudgetSpec struct {
    // ... (existing fields elided)
    // UnhealthyPodEvictionPolicy defines the criteria for when unhealthy pods
    // should be considered for eviction. The current implementation considers a pod
    // healthy if it has a status.conditions item with type="Ready" and status="True".
    //
    // Valid policies are IfHealthyBudget and AlwaysAllow.
    // If no policy is specified, the default behavior will be used,
    // which corresponds to the IfHealthyBudget policy.
    //
    // Additional policies may be added in the future.
    // Clients making eviction decisions should disallow eviction of unhealthy pods
    // if they encounter an unrecognized policy in this field.
    UnhealthyPodEvictionPolicy *UnhealthyPodEvictionPolicyType `json:"unhealthyPodEvictionPolicy,omitempty" protobuf:"bytes,4,opt,name=unhealthyPodEvictionPolicy"`
}

// UnhealthyPodEvictionPolicyType defines the criteria for when unhealthy pods
// should be considered for eviction.
// +enum
type UnhealthyPodEvictionPolicyType string

const (
    // IfHealthyBudget policy means that running pods (status.phase="Running")
    // that are not yet healthy can be evicted only if the guarded application is
    // not disrupted (status.currentHealthy is at least equal to status.desiredHealthy).
    // Healthy pods will be subject to the PDB for eviction.
    IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"

    // AlwaysAllow policy means that all running pods (status.phase="Running")
    // that are not yet healthy are considered disrupted and can be evicted
    // regardless of whether the criteria in a PDB are met. This means prospective
    // running pods of a disrupted application might not get a chance to become healthy.
    // Healthy pods will be subject to the PDB for eviction.
    AlwaysAllow UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)

Changes to the eviction API

The eviction API will be updated to use the unhealthyPodEvictionPolicy of a PDB to determine whether a pod that is Running but not Ready can be evicted regardless of the value of disruptionsAllowed. This will only be a behavioral change when users have specified an unhealthyPodEvictionPolicy, and will not require the actual API to change.
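The decision described above can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the actual kube-apiserver code: the pod and pdb structs and the canEvict function are hypothetical stand-ins that carry only the fields discussed in this KEP, and the check for healthy pods is reduced to disruptionsAllowed being positive.

```go
package main

import "fmt"

// UnhealthyPodEvictionPolicyType mirrors the API type proposed in this KEP.
type UnhealthyPodEvictionPolicyType string

const (
	IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"
	AlwaysAllow     UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)

// pod is a hypothetical stand-in holding only what this sketch needs.
type pod struct {
	Phase string // "Pending", "Running", "Succeeded" or "Failed"
	Ready bool   // true when the Ready condition has status "True"
}

// pdb is a hypothetical stand-in for the relevant PDB spec and status fields.
type pdb struct {
	Policy             *UnhealthyPodEvictionPolicyType
	CurrentHealthy     int32
	DesiredHealthy     int32
	DisruptionsAllowed int32
}

// canEvict reports whether this sketch would allow evicting p under b.
func canEvict(p pod, b pdb) bool {
	// Pods that are not Running are always considered for eviction.
	if p.Phase != "Running" {
		return true
	}
	// Healthy pods stay subject to the budget under every policy.
	if p.Ready {
		return b.DisruptionsAllowed > 0
	}
	// Running but not Ready: the policy decides. A nil policy means IfHealthyBudget.
	policy := IfHealthyBudget
	if b.Policy != nil {
		policy = *b.Policy
	}
	switch policy {
	case AlwaysAllow:
		return true
	case IfHealthyBudget:
		return b.CurrentHealthy >= b.DesiredHealthy
	default:
		// Unrecognized policy: disallow eviction of unhealthy pods.
		return false
	}
}

func main() {
	alwaysAllow := AlwaysAllow
	unready := pod{Phase: "Running", Ready: false}
	disrupted := pdb{CurrentHealthy: 1, DesiredHealthy: 2}

	fmt.Println("default policy:", canEvict(unready, disrupted))
	disrupted.Policy = &alwaysAllow
	fmt.Println("AlwaysAllow:", canEvict(unready, disrupted))
}
```

The default branch follows the guidance in the API comment that clients should disallow eviction of unhealthy pods when they encounter an unrecognized policy.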

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

We assess that the eviction API has adequate test coverage for the places that might be impacted by this enhancement. Thus, no additional tests are needed prior to implementing this enhancement.

Unit tests

The core packages (with their unit test coverage) which are going to be modified during the implementation:

k8s.io/kubernetes/pkg/apis/policy/validation: 5 October 2022 - 93%
k8s.io/kubernetes/pkg/apis/policy/v1: 5 October 2022 - 60%
k8s.io/kubernetes/pkg/registry/policy/poddisruptionbudget: 8 November 2022 - 62.5%
k8s.io/kubernetes/pkg/registry/core/pod/storage: 8 November 2022 - 74.2%

Alpha implementation:

k8s.io/kubernetes/pkg/apis/policy/validation: 7 December 2022 - 93.1%
k8s.io/kubernetes/pkg/apis/policy/v1: 7 December 2022 - 60%
k8s.io/kubernetes/pkg/registry/policy/poddisruptionbudget: 7 December 2022 - 75%
k8s.io/kubernetes/pkg/registry/core/pod/storage: 7 December 2022 - 78%

Integration tests

Integration tests covering:

The current behavior stays unchanged when the policy is not specified.
Correct behavior for both policies in the eviction API.
Feature gate disablement.

TestEvictionWithUnhealthyPodEvictionPolicy : https://storage.googleapis.com/k8s-triage/index.html?test=UnhealthyPodEvictionPolicy

e2e tests

Introduce tests covering:

Create a Deployment and a PDB with IfHealthyBudget policy and check that evictions work accordingly
Create a Deployment and a PDB with AlwaysAllow policy and check that evictions work accordingly

TBD

Graduation Criteria

Alpha

Feature gate disabled by default.
Unit and integration tests passing.

Beta

Feature gate enabled by default.
Integration test which exercises the functionality.
We want to keep the spec.unhealthyPodEvictionPolicy field null by default when not specified. This should preserve the original behavior and behave the same as the IfHealthyBudget value. This should be tested and documented.
A manual test of the upgrade->downgrade->upgrade path will be performed once 1.27 is released.
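The null-by-default behavior mentioned in the Beta criteria can be illustrated with a small Go sketch. The spec struct below is a hypothetical, trimmed-down stand-in for PodDisruptionBudgetSpec that reuses only the JSON tag from the API section, and effectivePolicy and encode are helper names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UnhealthyPodEvictionPolicyType mirrors the API type proposed in this KEP.
type UnhealthyPodEvictionPolicyType string

const (
	IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"
	AlwaysAllow     UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)

// spec is a hypothetical stand-in for PodDisruptionBudgetSpec with one field.
type spec struct {
	UnhealthyPodEvictionPolicy *UnhealthyPodEvictionPolicyType `json:"unhealthyPodEvictionPolicy,omitempty"`
}

// effectivePolicy returns the policy to act on; a nil field behaves as IfHealthyBudget.
func effectivePolicy(s spec) UnhealthyPodEvictionPolicyType {
	if s.UnhealthyPodEvictionPolicy == nil {
		return IfHealthyBudget
	}
	return *s.UnhealthyPodEvictionPolicy
}

// encode marshals s to JSON text; the error is ignored because spec always marshals cleanly.
func encode(s spec) string {
	b, _ := json.Marshal(s)
	return string(b)
}

func main() {
	unset := spec{}
	alwaysAllow := AlwaysAllow
	set := spec{UnhealthyPodEvictionPolicy: &alwaysAllow}

	fmt.Println(encode(unset), effectivePolicy(unset)) // prints: {} IfHealthyBudget
	fmt.Println(encode(set), effectivePolicy(set))     // prints: {"unhealthyPodEvictionPolicy":"AlwaysAllow"} AlwaysAllow
}
```

Because the field is a pointer with omitempty, an unset policy is absent from the serialized object rather than being written out as an explicit default.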

GA

Every bug report is fixed.
Introduce E2E tests for this field and confirm their stability.
Verify existing E2E and conformance tests for PDBs and Eviction.
The eviction API ignores the feature gate.

Deprecation

N/A

Upgrade / Downgrade Strategy

No changes are required for existing clusters to use the enhancement.

Version Skew Strategy

This feature doesn’t depend on the node version.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: PDBUnhealthyPodEvictionPolicy
- Components depending on the feature gate:
  - kube-apiserver

Does enabling the feature change any default behavior?

No, the behavior is only changed when users specify the unhealthyPodEvictionPolicy in the PodDisruptionBudget spec.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, in that case the eviction API will just use the default behavior.

What happens if we reenable the feature if it was previously rolled back?

The eviction API will again start using the unhealthyPodEvictionPolicy if provided on a PDB.

Are there any tests for feature enablement/disablement?

TestPodDisruptionBudgetStrategy

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Bugs could affect the /evictions endpoint, which would return a server error in that case. This cannot directly affect workloads, but could cause node drains to stall, which would affect the cluster during an upgrade.

When the rollback occurs, existing filled .spec.unhealthyPodEvictionPolicy fields will be ignored and the old eviction behavior will be enforced for these PDBs.
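The rollback and re-enablement behavior can be sketched as a single function. This is an illustrative assumption about how the gate and the stored field interact, not the actual implementation: policyForEviction is a hypothetical helper and gateEnabled stands in for the PDBUnhealthyPodEvictionPolicy feature gate.

```go
package main

import "fmt"

// UnhealthyPodEvictionPolicyType mirrors the API type proposed in this KEP.
type UnhealthyPodEvictionPolicyType string

const (
	IfHealthyBudget UnhealthyPodEvictionPolicyType = "IfHealthyBudget"
	AlwaysAllow     UnhealthyPodEvictionPolicyType = "AlwaysAllow"
)

// policyForEviction returns the policy the eviction API would act on.
// With the gate disabled, a stored policy is ignored and the old behavior applies.
func policyForEviction(gateEnabled bool, stored *UnhealthyPodEvictionPolicyType) UnhealthyPodEvictionPolicyType {
	if !gateEnabled || stored == nil {
		return IfHealthyBudget
	}
	return *stored
}

func main() {
	stored := AlwaysAllow
	// Enable, roll back, then re-enable: the stored field takes effect again.
	for _, gate := range []bool{true, false, true} {
		fmt.Printf("gate=%v policy=%s\n", gate, policyForEviction(gate, &stored))
	}
}
```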

What specific metrics should inform a rollback?

Failing eviction requests could be an indicator. The apiserver_request_total{resource = "pods", subresource = "eviction"} metric can be observed to detect an increased rate of failing evictions.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

A manual test was performed, as follows:

Create a cluster in 1.25.
Upgrade to 1.26.
Create Deployment A and PDB A targeting the pods of Deployment A using the AlwaysAllow UnhealthyPodEvictionPolicy.
Downgrade to 1.25.
Verify that eviction continues to work without using the UnhealthyPodEvictionPolicy.
Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
Upgrade to 1.26.
Verify that eviction of pods for Deployment A and StatefulSet B uses the default behavior. Verify that the AlwaysAllow UnhealthyPodEvictionPolicy can be set again on the PDB of Deployment A, and test the eviction behavior.

A manual test was performed, as follows:

Create a cluster in 1.26.
Upgrade to 1.27.
Create Deployment A and PDB A targeting the pods of Deployment A using the AlwaysAllow UnhealthyPodEvictionPolicy.
Downgrade to 1.26.
Verify that eviction continues to work without using the UnhealthyPodEvictionPolicy (PDBUnhealthyPodEvictionPolicy feature gate disabled by default).
Create another StatefulSet B and PDB B targeting the pods of StatefulSet B.
Upgrade to 1.27.
Verify that eviction of pods for Deployment A uses the AlwaysAllow UnhealthyPodEvictionPolicy and eviction of pods for StatefulSet B uses the default behavior.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

N/A

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By checking .spec.unhealthyPodEvictionPolicy field of the PodDisruptionBudget. Pods belonging to this PDB should be evicted according to this policy.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details: kube-apiserver logs and audit logs that track eviction requests can be examined to see if the UnhealthyPodEvictionPolicy feature is working properly.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature should not have an impact on the eviction request latency or availability. Eviction requests should follow the existing latency SLOs for serving mutating or read-only API calls.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

The following indicators should conform to the existing kube-apiserver SLIs.

Metrics
- Metric name: apiserver_request_total
  - [Optional] Aggregation method: resource = "pods", subresource = "eviction"
  - Components exposing the metric: kube-apiserver
- Metric name: apiserver_request_duration_seconds
  - [Optional] Aggregation method: resource = "pods", subresource = "eviction"
  - Components exposing the metric: kube-apiserver

Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

No, the eviction API already fetches the PDB from the API server.

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

API: PodDisruptionBudget
Estimated increase in size: New field of about 15B

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from the existing behavior of the eviction API.

What are other known failure modes?

None.

What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24.
2022-11-11: Initial alpha implementation merged into 1.26
2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy)
2023-02-06: Update for beta promotion
2024-01-30: Update for stable promotion

Drawbacks

If the current behavior is sufficient, we should not make this change. However, the evidence is that it doesn’t address the needs of users.

Alternatives

Changing the default behavior was considered but rejected for two reasons:

We can’t change the behavior of a GA API.
There are two separate use-cases for this feature and changing the behavior to support only one of them would create problems for other users.

Abandoned Alternative Implementation

There is a noticeable difference to the original KEP as some behaviours were dropped.

Changes to the disruption controller

We have removed changes to the disruption controller and computation of disruptionsAllowed. We have kept the scope only to the eviction API. It is better to split these changes into separate features in order to have simpler (less confusing), and more well-defined behavior for each feature.

You can see possible follow-ups for customizing the definition of healthiness in Future Work.

Changes to the definition of healthy in a PDB according to the policy used.

We have decided that the eviction policy should not change the meaning of a healthy pod, as a single powerful field could introduce more confusion about how it affects the status of the PodDisruptionBudget and the Eviction API.

The PodRunning policy counted running pods and changed the computation of disruptionsAllowed. It was removed from the original KEP.

const (
	// PodRunning policy means that pods that are in the Running phase
	// are considered healthy by the disruption controller, regardless of
	// whether they are Ready or not. Any pods that are in the Running
	// phase will be counted when computing "disruptionsAllowed" and
	// will be subject to the PDB for eviction.
	PodRunning PodHealthyPolicy = "PodRunning"
)

Future Work

The current implementation considers a pod healthy if it has a .status.conditions item with type="Ready" and status="True". These pods are tracked via the .status.currentHealthy field in the PDB status.

This might not be enough for all use cases. For example, the user might want to specifically handle pods that have their PVC on a specific node’s local storage. Such a pod should block the node from being drained and going down, to prevent possible data loss, even when the pod is not ready (discussion).

To support this, a new custom mechanism for defining healthiness needs to be defined to optionally replace the default implementation. This could be achieved with the help of user-defined Pod Readiness Gates, by introducing a new field in a PodDisruptionBudget that could receive either a list of condition types or a logical expression referencing these condition types to conclude whether the pod is healthy or not. This field and other options should be explored in an additional KEP.

The disruption controller would update the existing fields in a PodDisruptionBudget status based on the custom healthiness. The eviction API would react to the existing fields in the same way as it does now, in combination with the PDBUnhealthyPodEvictionPolicy feature proposed here.
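As a rough illustration of the "list of condition types" option, the following Go sketch evaluates healthiness from a set of required conditions. Everything here is hypothetical: no such PDB field exists, the condition struct and isHealthy function are invented for the example, and the condition type example.com/data-replicated is a made-up readiness gate name.

```go
package main

import "fmt"

// condition mirrors the type/status pair of a pod condition.
type condition struct {
	Type   string
	Status string
}

// isHealthy reports whether every required condition type is present with status "True".
// With no required types it falls back to the current definition: Ready must be "True".
func isHealthy(conds []condition, required []string) bool {
	if len(required) == 0 {
		required = []string{"Ready"}
	}
	status := make(map[string]string, len(conds))
	for _, c := range conds {
		status[c.Type] = c.Status
	}
	for _, t := range required {
		if status[t] != "True" {
			return false
		}
	}
	return true
}

func main() {
	conds := []condition{
		{Type: "Ready", Status: "True"},
		{Type: "example.com/data-replicated", Status: "False"},
	}
	fmt.Println("default definition:", isHealthy(conds, nil))
	fmt.Println("custom definition:", isHealthy(conds, []string{"Ready", "example.com/data-replicated"}))
}
```

A logical expression over condition types, the other option mentioned above, would need an expression language and is not sketched here.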