KEP-3836: Kube-proxy improved ingress connectivity reliability

Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risk
- Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives

Summary

The service controller in the Kubernetes cloud controller manager (KCCM) configures the load balancer and its corresponding health check (HC) following service-related events. The load balancer then uses the configured health check to determine which instances are candidates for traffic load balancing. For certain cloud providers (for example, GCP), the KCCM configures the HC to target the Kubernetes service proxy for this information. This KEP focuses on Kube-proxy, since it’s the only service proxy under the responsibility of the Kubernetes project.

With respect to the HC, services fall into two classes:

externalTrafficPolicy: Cluster / eTP:Cluster (default)
externalTrafficPolicy: Local / eTP:Local

For eTP:Cluster services, Kube-proxy currently answers according to its healthz state (specifically, whether the data-plane programming is known to be stale). For eTP:Local services, Kube-proxy only reports whether the service for which the load balancer was created has a Ready endpoint running on the node. This KEP focuses only on the former case.

This KEP proposes three changes:

That Kube-proxy provides a mechanism for load balancers to do connection draining for terminating Nodes. Kube-proxy does this by inspecting a field on the Node object which indicates that the “node is terminating/deleting” and, once it sees that indicator, failing its healthz and subsequently the LB HC. The primary case discussed for this scenario was downscaling by the cluster autoscaler (CA). Given that the CA taints the Node which is to be downscaled and deleted with the taint ToBeDeletedByClusterAutoscaler, it seems most appropriate to use that taint here. It is unfortunate to have this taint spread around the code base, but no better indicator has been thought of so far. The .spec.unschedulable field has been discussed as well. Setting that field is usually followed by eviction and Node termination, but it is not as directly linked to the termination/deletion of the Node as the taint is. Users can, for example, decide to cordon all nodes at a given moment in time. Using .spec.unschedulable as this signal would then break ingress traffic for all eTP:Cluster services.
That Kube-proxy adds a /livez path to its health check server (proxier_health.go) corresponding to the old healthz semantics (i.e: whether the data-plane programming is known to be stale).
This KEP does not attempt to align cloud providers for health checking eTP:Cluster services. It recognizes cloud providers have valid reasons for doing this differently depending on their implementations. However, it does want to recommend ways for cloud providers to do health checking in a better way on Kubernetes clusters. The KEP hence proposes that a document be added to https://kubernetes.io/docs/concepts/ which can act as a formal guide and be utilized as knowledge sharing with cloud providers.
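The difference between the two paths proposed above can be illustrated with a minimal sketch. It is not the code in proxier_health.go: the type proxyHealth, its fields, and the helpers newMux and probe are hypothetical names chosen for illustration, and only the standard library is used.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// proxyHealth is a hypothetical stand-in for the state kept by Kube-proxy's
// health check server; the real type in proxier_health.go differs.
type proxyHealth struct {
	dataPlaneStale  atomic.Bool // data-plane programming is known to be stale
	nodeTerminating atomic.Bool // termination indicator was seen on the Node
}

// statusFor maps a health verdict to the HTTP status codes used by the HC.
func statusFor(healthy bool) int {
	if healthy {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// newMux serves /healthz (fails on staleness OR node termination) and
// /livez (fails on staleness only, i.e. the old healthz semantics).
func newMux(h *proxyHealth) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(statusFor(!h.dataPlaneStale.Load() && !h.nodeTerminating.Load()))
	})
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(statusFor(!h.dataPlaneStale.Load()))
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status code.
func probe(handler http.Handler, path string) int {
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &proxyHealth{}
	mux := newMux(h)
	h.nodeTerminating.Store(true) // e.g. the CA taint was observed
	fmt.Println(probe(mux, "/healthz"), probe(mux, "/livez")) // 503 200
}
```

With the node marked as terminating, a load balancer probing /healthz would stop sending new connections, while a livenessProbe pointed at /livez would keep passing.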

Motivation

The motivation for each change is:

Nodes used as an intermediate nexthop for eTP:Cluster services would allow all connections passing through them to shut down gracefully while the node is being terminated.
Adding this new /livez path will allow vendors / users of Kube-proxy to specify a livenessProbe which isn’t impacted by any node termination indicator. It will indicate Kube-proxy health only, just as /healthz does today. This is low-hanging fruit; opting in requires modifying the Kube-proxy DaemonSet spec.
Cloud providers have very different ways of ascertaining if a load balancer should target a specific node for eTP:Cluster services. We would like to highlight the benefits of certain methods and the pitfalls of others in a formal document, so that they are known. The hope is that this document acts as a source of knowledge for cloud providers on how to adapt their implementations to the mechanics of a Kubernetes cluster.

Goals

Offer better connection draining for terminating Nodes, for load balancers which support that.

Non-Goals

Aligning cloud provider HCs for eTP:Cluster services. Cloud providers like Azure/AWS do not configure their HCs to point to the service proxy. Instead they connect to the NodePort defined for the service. For them to benefit from the proposals of this KEP, they would need to change their implementation.
Having Kube-proxy include its healthz state AND its current answer w.r.t. the local endpoints when it answers the HC for eTP:Local services. Kube-proxy is currently defined as “unhealthy” when 2 * syncPeriod passes during which it knows that it needs to update the data plane (iptables/ipvs) but has not actually done so. Not including the healthz state can cause Kube-proxy to indicate to a load balancer that it should send traffic to a Node simply because the endpoint is scheduled there, even though Kube-proxy might be unhealthy and might not have written the rules required to forward traffic to the endpoint. This has, however, been agreed to be a bug and will be treated as such, rather than following the KEP cycle.
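To make the staleness rule and the eTP:Local bug described above concrete, here is a small sketch under stated assumptions. The function names proxyHealthy and localHCPasses are hypothetical, and treating exactly 2 * syncPeriod as already unhealthy is a choice made for this example, not something the text specifies.

```go
package main

import (
	"fmt"
	"time"
)

// proxyHealthy applies the rule quoted above: Kube-proxy counts as unhealthy
// once a needed data-plane update has been outstanding for 2 * syncPeriod.
// pendingFor is how long an update has been outstanding (0 if none is pending).
func proxyHealthy(pendingFor, syncPeriod time.Duration) bool {
	return pendingFor < 2*syncPeriod
}

// localHCPasses is a hypothetical eTP:Local answer that combines the presence
// of Ready local endpoints with the healthz state, which is what the bug fix
// mentioned above would amount to.
func localHCPasses(readyLocalEndpoints int, healthy bool) bool {
	return readyLocalEndpoints > 0 && healthy
}

func main() {
	syncPeriod := 30 * time.Second
	healthy := proxyHealthy(90*time.Second, syncPeriod) // pending for 3 * syncPeriod
	// An endpoint is scheduled on the node, but the rules may not be written.
	fmt.Println(healthy, localHCPasses(1, healthy)) // false false
}
```

Without the healthz term, the second value would be true and the load balancer would keep sending traffic to a node whose rules might be out of date.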

Proposal

Risk

The risks are:

Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because, if we modify /healthz to fail when this taint gets added on the Node, the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted. As far as we can tell, no vendor sets a livenessProbe, and neither does kubeadm, so the risk is low.
If Kube-proxy is unable to watch the Node object (because it fails to read from the API server, for example), all Kube-proxy instances might start failing the HCs at once. That said, Kube-proxy already watches the Node object today and is already susceptible to this risk.

Mitigations

Such problems are expected to surface during the Beta phase, when the feature gate will be enabled by default. The mitigation at that point would be to set the feature gate to “off” and fall back to the current behavior. Alternatively, vendors can start using the /livez path, which keeps the old semantics. We will also make the document we would like to write a graduation criterion for Beta, and mention in it that this is explicitly not recommended when deploying Kube-proxy. As such, any vendor doing this would get a heads-up during Alpha.
If Kube-proxy starts failing when reading from the API server, it should assume that the last state seen still holds. For Kube-proxy to be made aware of this, it needs to invoke serviceInformer.Informer().SetWatchErrorHandler(DefaultWatchErrorHandler) when initializing its informers. Any errors observed by client-go when watching from the API server will be reported to DefaultWatchErrorHandler.
Metrics should inform on Kube-proxy health and include information about its healthz/livez state. These can then be correlated with networking metrics surrounding new/established connections on the node. Ex: a failing healthz should correlate with a total drop in the count of new connections and with a zero-or-negative rate of established connections. E2E tests should also be designed with this specific goal in mind, i.e. validating the impact of a failing Kube-proxy on ingress connectivity. Kube-proxy currently has a lot of metrics regarding its health, but no direct red/green indicator of what the end result of its health is. A couple of such metrics could be proxy_healthz_total/proxy_livez_total with labels for the HTTP status codes: 503 / 200.
Users who depend on such behavior can disable the feature by setting the feature gate to off.
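As an illustration of the red/green metrics suggested above, the following sketch counts health check responses per metric name and HTTP status code. It uses only the standard library rather than a real metrics library; the healthCounters type and the label name code are assumptions, while the metric names come from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// healthCounters is a hypothetical, stdlib-only stand-in for counters with a
// status code label; a real implementation would use a metrics library.
type healthCounters struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newHealthCounters() *healthCounters {
	return &healthCounters{counts: make(map[string]uint64)}
}

// seriesKey builds a key such as proxy_healthz_total{code="503"}.
// The label name "code" is an assumption of this sketch.
func seriesKey(metric string, code int) string {
	return fmt.Sprintf("%s{code=\"%d\"}", metric, code)
}

// observe records one health check response for the given metric and code.
func (c *healthCounters) observe(metric string, code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[seriesKey(metric, code)]++
}

// get returns the current count for the given metric and code.
func (c *healthCounters) get(metric string, code int) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[seriesKey(metric, code)]
}

func main() {
	c := newHealthCounters()
	c.observe("proxy_healthz_total", 503)
	c.observe("proxy_healthz_total", 503)
	c.observe("proxy_livez_total", 200)
	fmt.Println(c.get("proxy_healthz_total", 503), c.get("proxy_livez_total", 503)) // 2 0
}
```

In a health check server, observe would be called once per served request, right after the status code is decided.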

Design Details

Implement Kube-proxy change invoking client-go’s SetWatchErrorHandler on watch errors from the API server. This addresses the second point in Mitigations.
Implement change in Kube-proxy which will react to changes on the Node object and, once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object, start failing its healthz state.
Write document to be published at: https://kubernetes.io/docs/concepts/ which details: a) how determining node/instance health can best be done for Kubernetes clusters b) how Kube-proxy will do it once the changes proposed in this KEP are merged c) what some pitfalls with other methods might be.
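The first two steps above can be sketched together: detect the taint on Node updates, and keep the last state seen when the watch fails. This is a stdlib-only illustration, not the client-go wiring. The taint type is a minimal stand-in for the core/v1 Taint type, and nodeTracker, its methods, and errWatchFailed are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// taint is a minimal stand-in for the Kubernetes core/v1 Taint type.
type taint struct{ Key string }

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// errWatchFailed is a sample error used to simulate a failed watch.
var errWatchFailed = errors.New("watch: connection to API server lost")

// hasTerminationTaint reports whether the cluster autoscaler taint is present.
func hasTerminationTaint(taints []taint) bool {
	for _, t := range taints {
		if t.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

// nodeTracker (hypothetical) remembers the last termination state seen.
type nodeTracker struct {
	mu           sync.RWMutex
	terminating  bool
	lastWatchErr error
}

// onNodeUpdate would be called from the Node event handler.
func (n *nodeTracker) onNodeUpdate(taints []taint) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.terminating = hasTerminationTaint(taints)
}

// onWatchError would be called from a watch error handler. It records the
// error but deliberately leaves the last state seen untouched.
func (n *nodeTracker) onWatchError(err error) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.lastWatchErr = err
}

// isTerminating is what the /healthz handler would consult.
func (n *nodeTracker) isTerminating() bool {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.terminating
}

func main() {
	n := &nodeTracker{}
	n.onNodeUpdate([]taint{{Key: toBeDeletedTaint}})
	n.onWatchError(errWatchFailed) // state must survive the watch error
	fmt.Println(n.isTerminating()) // true
}
```

The same holds in the other direction: a healthy node stays healthy through a watch error, which avoids all instances failing their HCs at once when the API server is unreachable.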

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Update the Kube-proxy unit tests to include the healthz answer for the healthcheck server test suite.

k8s.io/kubernetes/pkg/proxy/healthcheck: 09/Feb/2023 - 68.8%

Integration tests

This feature is not readily integration tested, so we will use unit and E2E.

e2e tests

Add E2E tests for connectivity to services on terminating nodes and validate graceful termination of TCP connections.

Graduation Criteria

Alpha

E2E tests which highlight the existing problem are written before any feature implementation is made.
Feature implemented behind a feature flag.
Document written at https://kubernetes.io/docs/concepts/

Beta

No issues reported.
Decision on final field on Node object to be used as an indicator of “node is terminating/deleting”

GA

No issues reported during two releases.

Upgrade / Downgrade Strategy

Any upgrade to a version enabling the feature, succeeded by a downgrade to a version disabling it, is not expected to impact ingress in any way, given that Kube-proxy is healthy on all cluster nodes. Should any Kube-proxy not be healthy, then ingress for eTP:Cluster services won’t be using that node as a nexthop for ingress traffic. This would also have been the case in the preceding version.

Version Skew Strategy

This doesn’t touch load balancer / HC API, so even though an old Kube-proxy might talk to a newer control plane, there’s no real concern.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: KubeProxyDrainingTerminatingNodes
- Components depending on the feature gate: Kube-proxy

Does enabling the feature change any default behavior?

Yes. For eTP:Cluster services, Kube-proxy currently doesn’t include any logic about terminating / deleting nodes when determining if it’s healthy. With the feature enabled it will: the addition of the taint ToBeDeletedByClusterAutoscaler will cause Kube-proxy to fail its healthz.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by setting the feature gate back to its previous value.

What happens if we reenable the feature if it was previously rolled back?

The behavior will be restored immediately.

Are there any tests for feature enablement/disablement?

Not needed, since the feature is purely in-memory and has no consequences for any persistent data.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This change is localized to Kube-proxy only. On upgrades Kube-proxy will restart, so client connectivity is impacted in any case. If applications are running on Nodes which are tainted with ToBeDeletedByClusterAutoscaler, but which are experiencing a delay in draining, then ingress SLAs might be impacted: ingress connectivity for new connections may drop below what’s accepted. But all application pods should be running on these terminating Nodes in that case.

What specific metrics should inform a rollback?

The metric: proxy_healthz_total (with label: 503) mentioned in Monitoring requirements will inform on red healthz. proxy_livez_total (with label: 503) will inform on red livez state. If the healthz count is increasing but the livez does not: then a problem might have occurred with the node related reconciliation logic.
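The comparison described above can be written down as a small decision function. This is a sketch: the function name diagnose and the returned strings are hypothetical, and the inputs are assumed to be the increase of the 503-labelled counters over the same time window.

```go
package main

import "fmt"

// diagnose interprets the increase of the 503 counts of proxy_healthz_total
// and proxy_livez_total over the same window.
func diagnose(healthz503Delta, livez503Delta uint64) string {
	switch {
	case healthz503Delta > 0 && livez503Delta == 0:
		// Only /healthz is red: the cause is node related (termination
		// indicator or the node reconciliation logic), not the data plane.
		return "node-related"
	case healthz503Delta > 0 && livez503Delta > 0:
		// Both are red: Kube-proxy itself is unhealthy.
		return "proxy-unhealthy"
	case livez503Delta > 0:
		// /livez red while /healthz is green is not expected in this
		// sketch, since /healthz also fails whenever /livez does.
		return "inconsistent"
	default:
		return "healthy"
	}
}

func main() {
	fmt.Println(diagnose(12, 0)) // node-related
}
```

If the "node-related" verdict appears on nodes that are not being deleted, that would point at the node reconciliation logic and could inform a rollback.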

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Given that the feature is purely in-memory for kube-proxy and determines the way it reports /healthz: upgrade-rollback-upgrade doesn’t introduce additional value on top of regular feature tests.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

Two new metrics, proxy_healthz_total/proxy_livez_total, will count the number of successful/unsuccessful health check invocations, labeled by HTTP status code (503 and 200). These metrics can then be correlated with impacted ingress connectivity for endpoints running on those nodes.

How can an operator determine if the feature is in use by workloads?

By connecting to a service of type: LoadBalancer with eTP:Cluster through a terminating/tainted Node and validating that any new connections are blocked while established connections are unaffected.

How can someone using this feature know that it is working for their instance?

For eTP:Cluster: their connections will terminate gracefully when the node used as a nexthop for their connection is terminating or tainted.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name: proxy_healthz_total
- Metric name: proxy_livez_total

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

Not any different than today.

What are other known failure modes?

Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because: if we modify /healthz to fail when this taint gets added on the Node, then the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted.
- Detection: node is tainted with ToBeDeletedByClusterAutoscaler upon which Kube-proxy fails its /healthz check and starts Crashlooping. Confirm this by validating that Kube-proxy has a livenessProbe defined which targets /healthz.
- Mitigations:
  - While in beta: disable the feature gate KubeProxyDrainingTerminatingNodes.
  - While in stable: update the livenessProbe to target /livez. ToBeDeletedByClusterAutoscaler is a taint placed on the Node by the cluster-autoscaler and indicates that the node will be deleted. Kube-proxy is therefore going to terminate soon in any case. If a Crashlooping Kube-proxy is problematic in such a situation (ex: it needs to handle service/endpoint updates until the node is completely gone), then updating the livenessProbe to /livez provides a mitigation and resolves the issue once the update has rolled out.
- Diagnostics:
  - The metric proxy_healthz_total aggregated over the label 503 is increasing while the metric proxy_livez_total aggregated over the label 503 remains unchanged. This indicates and confirms that the /healthz endpoint is failing, and that the reason is: the node is being deleted. This is the difference between /healthz and /livez.
- Testing:
  - Configure Kube-proxy with a livenessProbe targeting /healthz and delete a Node. Kube-proxy on that Node should start failing its /healthz and start Crashlooping. Apply the fixes proposed in Mitigations and verify that it resolves the issue.

What steps should be taken if SLOs are not being met to determine the problem?

There are no SLOs for this KEP, see: “What are the reasonable SLOs (Service Level Objectives) for the enhancement?”

Implementation History

2023-02-03: Initial proposal
