KEP-3836: Kube-proxy improved ingress connectivity reliability
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Summary
The service controller in the Kubernetes cloud controller manager (KCCM) configures the load balancer and its corresponding health check (HC) following service-related events. The load balancer then uses the configured health check to determine which instances are candidates for traffic load balancing. For certain cloud providers (for example, GCP), the KCCM configures the HC to target the Kubernetes service proxy for this information. This KEP focuses on Kube-proxy, since it’s the only service proxy under the responsibility of the Kubernetes project.
With respect to the HC, we can define two classes of services:
- externalTrafficPolicy: Cluster / eTP:Cluster (default)
- externalTrafficPolicy: Local / eTP:Local
For eTP:Cluster services, Kube-proxy currently returns an answer
following its healthz state (specifically, whether the data-plane programming
is known to be stale). For eTP:Local services, Kube-proxy only reports whether the
service for which the load balancer was created has a Ready endpoint
running on the node. This KEP focuses only on the former case.
This KEP proposes three changes:
- That Kube-proxy provides a mechanism for load balancers to do connection draining for terminating Nodes. Kube-proxy does this by inspecting a field on the Node object which indicates that the “node is terminating/deleting” and, when seen, starts failing its healthz and subsequently the LB HC. When discussing this scenario, the primary case this applied to was downscaling by the cluster autoscaler (CA). Given that the CA taints the Node which is to be downscaled and deleted with the taint ToBeDeletedByClusterAutoscaler, it would seem most appropriate to use that here. It is unfortunate to have this taint spread around the code base, but for now no better indicator has been thought of. The .spec.unschedulable field has been discussed as well. Setting that field is usually followed by eviction and Node termination, but it doesn’t have as strong a direct link to the termination/deletion of the Node as the taint does. Users can, for example, decide to cordon all nodes at a given moment in time. Using .spec.unschedulable as this signal would then cause ingress traffic for all eTP:Cluster services to break.
- That Kube-proxy adds a /livez path to its health check server (proxier_health.go) corresponding to the old healthz semantics (i.e. whether the data-plane programming is known to be stale).
- This KEP does not attempt to align cloud providers for health checking eTP:Cluster services. It recognizes cloud providers have valid reasons for doing this differently depending on their implementations. However, it does want to recommend better ways for cloud providers to do health checking on Kubernetes clusters. The KEP hence proposes that a document be added to https://kubernetes.io/docs/concepts/ which can act as a formal guide and be used for knowledge sharing with cloud providers.
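The first two proposed changes can be sketched with the Go standard library alone. This is a minimal illustration, not Kube-proxy's actual health check server: the proxyHealth type, its two flags, and the helper functions are hypothetical names introduced here. It assumes /healthz should fail when the data plane is stale or the Node is terminating, while /livez keeps the old semantics and fails only when the data plane is stale.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// proxyHealth is a hypothetical stand-in for Kube-proxy's health state.
type proxyHealth struct {
	dataPlaneStale  atomic.Bool // true when data-plane programming is known to be stale
	nodeTerminating atomic.Bool // true once the terminating indicator was seen on the Node
}

// statusFor maps a health verdict to the HTTP status code of the probe answer.
func statusFor(healthy bool) int {
	if healthy {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// newMux serves /healthz (fails when stale OR terminating) and
// /livez (fails only when stale, i.e. the old healthz semantics).
func newMux(h *proxyHealth) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(statusFor(!h.dataPlaneStale.Load() && !h.nodeTerminating.Load()))
	})
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(statusFor(!h.dataPlaneStale.Load()))
	})
	return mux
}

// probe issues an in-process GET against the mux and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &proxyHealth{}
	mux := newMux(h)

	// Simulate the Node being marked for deletion.
	h.nodeTerminating.Store(true)
	fmt.Println("/healthz:", probe(mux, "/healthz"))
	fmt.Println("/livez:", probe(mux, "/livez"))
}
```

With the Node marked as terminating, the load balancer HC against /healthz starts failing while a livenessProbe pointed at /livez would keep passing.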
Motivation
The motivation for each change is:
- Nodes used as intermediate nexthop for eTP:Cluster services would allow all connections passing through the node to shut down gracefully while it is being terminated.
- Adding this new /livez path will allow vendors / users of Kube-proxy to specify a livenessProbe which isn’t impacted by any node termination indicator. It will indicate Kube-proxy health only, just as is the case today. This is low-hanging fruit which requires modifying the Kube-proxy DaemonSet spec to opt in.
- Cloud providers have very different ways of ascertaining if a load balancer should target a specific node for eTP:Cluster services. We would like to highlight the benefits of certain methods and the pitfalls of others in a formal document, so that this is known. The hope is that this information acts as a source of knowledge for cloud providers on how to adapt their implementations to the mechanics of a Kubernetes cluster.
Goals
- Offer better connection draining for terminating Nodes, for load balancers which support that.
Non-Goals
- Aligning cloud provider HCs for eTP:Cluster services. Cloud providers like Azure/AWS do not configure their HCs to point to the service proxy. Instead they connect to the NodePort defined for the service. For them to benefit from the proposals of this KEP, they would need to change their implementation.
- That Kube-proxy includes its healthz state AND its current answer w.r.t. the local endpoints when it answers the HC for eTP:Local services. Kube-proxy is currently defined as “unhealthy” when 2 * syncPeriod passes in which it knows that it needs to update the data plane (iptables/ipvs), but has not actually done so. Not including the healthz state can cause Kube-proxy to indicate to a load balancer that it should send traffic to a Node simply because the endpoint is scheduled there, even though Kube-proxy might not be healthy and might not have managed to write the rules required to forward traffic to the endpoint. It has however been agreed that this is a bug and it will be treated as such, as opposed to following the KEP cycle.
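The “unhealthy” definition above can be expressed as a small predicate. The sketch below is an illustration of the 2 * syncPeriod rule as worded in this KEP, not a copy of Kube-proxy's code: the function name, its parameters, and the demo constant are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// demoSyncPeriod is an arbitrary value chosen for this example only.
const demoSyncPeriod = 30 * time.Second

// isHealthy reports whether the proxy counts as healthy under the rule
// "unhealthy once a needed data-plane update has been pending for 2 * syncPeriod".
// pendingFor is how long the oldest not-yet-applied update has been waiting;
// zero means nothing is pending.
func isHealthy(pendingFor, syncPeriod time.Duration) bool {
	if pendingFor == 0 {
		return true
	}
	return pendingFor < 2*syncPeriod
}

func main() {
	fmt.Println(isHealthy(0, demoSyncPeriod))                // nothing pending
	fmt.Println(isHealthy(45*time.Second, demoSyncPeriod))   // pending, still within 2 * syncPeriod
	fmt.Println(isHealthy(2*demoSyncPeriod, demoSyncPeriod)) // pending for 2 * syncPeriod
}
```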
Proposal
Risk
The risks are:
- Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because, if we modify /healthz to fail when this taint gets added to the Node, the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted. As far as we can tell, no vendor sets a livenessProbe, nor does kubeadm, so the risk is low.
- By not being able to watch the Node object (while failing to read from the API server, for example) we might have all Kube-proxy instances start failing the HCs at once. That being said, Kube-proxy currently watches the Node object and is already susceptible to this risk.
Mitigations
- Such problems are expected to surface during the Beta phase, when the feature gate will be enabled by default. The mitigation at that point would be to set the feature gate to “off” and default back to the current behavior, or alternatively to start using the /livez path, which keeps the old semantics. We will also make the document we would like to write a criterion for graduating to Beta, and mention this as an explicit recommendation of what not to do when deploying Kube-proxy. As such, any vendor doing this would get a heads-up during Alpha.
- If Kube-proxy starts failing when reading from the API server, it should just assume that the last state seen continues. For Kube-proxy to be made aware of this, it needs to invoke serviceInformer.Informer().SetWatchErrorHandler(DefaultWatchErrorHandler) when initializing its informers. Any errors observed by client-go when watching from the API server will be reported to DefaultWatchErrorHandler.
- Metrics should inform on Kube-proxy health and include information about its healthz/livez state; this can then be correlated with networking metrics surrounding new/established connections on the node. Ex: a failing healthz should correlate with a total drop in the count of new connections and with a zero-or-negative rate of established connections. E2E tests should also be designed with this specific goal in mind, i.e. validating the impact of a failing Kube-proxy on ingress connectivity. Kube-proxy currently has a lot of metrics regarding its health, but no direct red/green indicator of the end result of its health. Two such metrics could be proxy_healthz_total / proxy_livez_total with labels for the HTTP status codes: 503 / 200.
- The feature can be disabled for users who depend on such behavior by flipping the feature gate to off.
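The “assume that the last state seen continues” mitigation can be illustrated without client-go. The sketch below uses a hypothetical nodeState cache: onNodeUpdate stands for a successfully delivered Node event and onWatchError stands for whatever callback would be registered through SetWatchErrorHandler. None of these names come from Kube-proxy; they only show that a watch error is recorded without touching the cached state.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// nodeState is a hypothetical cache of the last Node state Kube-proxy observed.
type nodeState struct {
	mu          sync.Mutex
	terminating bool // last known answer to "is this Node terminating?"
	watchErrors int  // number of watch errors seen, useful for logs or metrics
}

// onNodeUpdate records the state carried by a successfully delivered Node event.
func (s *nodeState) onNodeUpdate(terminating bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.terminating = terminating
}

// onWatchError is what a watch error callback would invoke. It only counts the
// error and deliberately leaves the last known state untouched.
func (s *nodeState) onWatchError(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.watchErrors++
}

// snapshot returns the cached state and the number of watch errors seen so far.
func (s *nodeState) snapshot() (terminating bool, watchErrors int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.terminating, s.watchErrors
}

func main() {
	s := &nodeState{}
	s.onNodeUpdate(false) // last event seen: Node is not terminating

	// The API server becomes unreachable; the cached state must not flip.
	s.onWatchError(errors.New("example watch error: connection refused"))

	terminating, errs := s.snapshot()
	fmt.Println("terminating:", terminating, "watch errors:", errs)
}
```

Because the error path never writes to the cached state, an API server outage cannot by itself make every Kube-proxy in the cluster start failing its HC at once.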
Design Details
- Implement a Kube-proxy change invoking client-go’s SetWatchErrorHandler on watch errors from the API server. This addresses the second point in Mitigations.
- Implement a change in Kube-proxy which reacts to changes on the Node object and, once the taint ToBeDeletedByClusterAutoscaler is placed on the Node object, starts failing its healthz state.
- Write a document to be published at https://kubernetes.io/docs/concepts/ which details: a) how determining node/instance health can best be done for Kubernetes clusters, b) how Kube-proxy will do it once the changes proposed in this KEP are merged, c) what some pitfalls with other methods might be.
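Detecting the taint on a Node update could look roughly like the following. To stay self-contained, the sketch defines a local taint type carrying only the fields it needs instead of importing the Kubernetes API types; the function name and the example taint values are assumptions for illustration.

```go
package main

import "fmt"

// taint is a local stand-in mirroring only the fields of a Node taint used here.
type taint struct {
	Key    string
	Effect string
}

// toBeDeletedTaint is the taint key the cluster autoscaler places on Nodes
// that it is about to scale down and delete.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// hasToBeDeletedTaint reports whether the given Node taints include the
// cluster autoscaler's deletion taint.
func hasToBeDeletedTaint(taints []taint) bool {
	for _, t := range taints {
		if t.Key == toBeDeletedTaint {
			return true
		}
	}
	return false
}

func main() {
	// A cordoned Node alone is not treated as terminating.
	before := []taint{{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"}}
	// The same Node after the cluster autoscaler marks it for deletion.
	after := append(before, taint{Key: toBeDeletedTaint, Effect: "NoSchedule"})

	fmt.Println(hasToBeDeletedTaint(before), hasToBeDeletedTaint(after))
}
```

A Node update handler would call such a check on every event and flip the state that /healthz reports on.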
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
Update the Kube-proxy unit tests to include the healthz answer for the
healthcheck server test suite.
- k8s.io/kubernetes/pkg/proxy/healthcheck: 09/Feb/2023 - 68.8%
Integration tests
This feature is not readily integration tested, so we will use unit and E2E.
e2e tests
- Add E2E tests for connectivity to services on terminating nodes and validate graceful termination of TCP connections.
Graduation Criteria
Alpha
- E2E tests which highlight the existing problem are coded before any feature implementation is made.
- Feature implemented behind a feature flag.
- Document written at https://kubernetes.io/docs/concepts/
Beta
- No issues reported.
- Decision on final field on Node object to be used as an indicator of “node is terminating/deleting”
GA
- No issues reported during two releases.
Upgrade / Downgrade Strategy
Any upgrade to a version enabling the feature, succeeded by a downgrade to a
version disabling it, is not expected to impact ingress in any way, provided
that Kube-proxy is healthy on all cluster nodes. Should any Kube-proxy not be
healthy, ingress for eTP:Cluster services won’t use that node as a
nexthop for ingress traffic. This would also have been the case in the preceding
version.
Version Skew Strategy
This doesn’t touch load balancer / HC API, so even though an old Kube-proxy might talk to a newer control plane, there’s no real concern.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: KubeProxyDrainingTerminatingNodes
  - Components depending on the feature gate: Kube-proxy
Does enabling the feature change any default behavior?
Yes. For eTP:Cluster services: Kube-proxy currently doesn’t include any logic
about terminating / deleting nodes when determining if it’s healthy. This will
be the case going forward, whereby the addition of the taint
ToBeDeletedByClusterAutoscaler will cause Kube-proxy to fail its healthz.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, by setting the feature gate back to off.
What happens if we reenable the feature if it was previously rolled back?
The behavior will be restored immediately.
Are there any tests for feature enablement/disablement?
Not needed, since the feature is a purely in-memory thing with no consequences for any persistent data.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
This change is localized to Kube-proxy only. On upgrades Kube-proxy will
restart, so client connectivity is impacted in any case. If applications are
running on Nodes which are tainted with ToBeDeletedByClusterAutoscaler, but
which are experiencing a delay in draining, then ingress SLAs might be impacted,
whereby ingress connectivity for new connections experiences a drop below what’s
accepted. But all application pods should be running on these terminating Nodes
in that case.
What specific metrics should inform a rollback?
The metric proxy_healthz_total (with label: 503), mentioned in Monitoring
Requirements, will inform on a red healthz state.
proxy_livez_total (with label: 503) will inform on a red livez state. If the
healthz count is increasing but the livez count is not, then a problem might
have occurred with the node-related reconciliation logic.
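The comparison of the two 503 counters can be written down as a small decision function. This is a sketch of the reasoning only: the counters type, the verdict names, and the idea of comparing two scrapes are assumptions made for the example, and a real setup would express this as a query in the monitoring system.

```go
package main

import "fmt"

// counters holds the 503 counts of both metrics at one scrape (hypothetical type).
type counters struct {
	healthz503 uint64 // proxy_healthz_total with label 503
	livez503   uint64 // proxy_livez_total with label 503
}

type verdict string

const (
	noNewFailures verdict = "no new failures"
	nodeRelated   verdict = "healthz failing, livez fine: node-related logic or node termination"
	proxyRelated  verdict = "livez failing: Kube-proxy itself is unhealthy"
)

// diagnose compares two consecutive scrapes and classifies the failure.
func diagnose(prev, cur counters) verdict {
	healthzUp := cur.healthz503 > prev.healthz503
	livezUp := cur.livez503 > prev.livez503
	switch {
	case livezUp:
		return proxyRelated
	case healthzUp:
		return nodeRelated
	default:
		return noNewFailures
	}
}

func main() {
	prev := counters{healthz503: 4, livez503: 1}
	cur := counters{healthz503: 9, livez503: 1}
	fmt.Println(diagnose(prev, cur))
}
```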
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Given that the feature is purely in-memory for kube-proxy and determines the way it reports /healthz: upgrade-rollback-upgrade doesn’t introduce additional value on top of regular feature tests.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Monitoring Requirements
Two new metrics, proxy_healthz_total and proxy_livez_total, will count the
number of successful/unsuccessful health check invocations, labeled by HTTP
status code (503 and 200). These metrics can then be correlated with impacted
ingress connectivity for endpoints running on those nodes.
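One way to count health check answers per status code is to wrap the handler and record the code it writes. The sketch below uses only the standard library: codeCounter is a hypothetical stand-in for a counter metric with a status code label, and the helper names are invented for the example. A real implementation would register the counters with Kube-proxy's metrics framework instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// codeCounter counts responses per HTTP status code (stand-in for a labeled counter metric).
type codeCounter struct {
	mu     sync.Mutex
	counts map[int]int
}

func newCodeCounter() *codeCounter { return &codeCounter{counts: map[int]int{}} }

func (c *codeCounter) inc(code int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[code]++
}

func (c *codeCounter) get(code int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[code]
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}

// counted wraps a handler so every response increments the counter for its status code.
func counted(c *codeCounter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		c.inc(rec.code)
	})
}

// fixedStatus returns a handler that always answers with the given status code.
func fixedStatus(code int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(code)
	})
}

// hit sends one in-process request through the handler.
func hit(h http.Handler) {
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/healthz", nil))
}

func main() {
	healthzTotal := newCodeCounter() // plays the role of proxy_healthz_total
	failing := counted(healthzTotal, fixedStatus(http.StatusServiceUnavailable))
	hit(failing)
	hit(failing)
	fmt.Println("503:", healthzTotal.get(503), "200:", healthzTotal.get(200))
}
```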
How can an operator determine if the feature is in use by workloads?
- By connecting to a service of type: LoadBalancer and eTP:Cluster through a terminating/tainted Node and validating that any new connections are blocked and established connections are fine.
How can someone using this feature know that it is working for their instance?
For eTP:Cluster: their connections will terminate gracefully when the node
used as a nexthop for their connection is terminating or tainted.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
No
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: proxy_healthz_total
  - Metric name: proxy_livez_total
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
Dependencies
No
Does this feature depend on any specific services running in the cluster?
No
Scalability
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Not any different than today.
What are other known failure modes?
- Vendors of Kubernetes which deploy Kube-proxy and specify a livenessProbe targeting /healthz are expected to start seeing a CrashLooping Kube-proxy when the Node gets tainted with ToBeDeletedByClusterAutoscaler. This is because, if we modify /healthz to fail when this taint gets added to the Node, the livenessProbe will fail, causing the Kubelet to restart the Pod until the Node is deleted.
  - Detection: the node is tainted with ToBeDeletedByClusterAutoscaler, upon which Kube-proxy fails its /healthz check and starts CrashLooping. Confirm this by validating that Kube-proxy has a livenessProbe defined which targets /healthz.
  - Mitigations:
    - While in beta: disable the feature gate KubeProxyDrainingTerminatingNodes.
    - While in stable: update the livenessProbe to target /livez. ToBeDeletedByClusterAutoscaler is a taint placed on the Node by the cluster-autoscaler and indicates that the node will be deleted. Kube-proxy is therefore going to terminate soon in any case. If a CrashLooping Kube-proxy is problematic in such a situation (ex: it needs to handle service/endpoint updates until the node is completely gone), then updating the livenessProbe to /livez provides a mitigation and resolves the issue once the update has rolled out.
  - Diagnostics:
    - The metric proxy_healthz_total aggregated over the label 503 is increasing while the metric proxy_livez_total aggregated over the label 503 remains unchanged. This indicates and confirms that the /healthz endpoint is failing, and that the reason is that the node is being deleted. This is the difference between /healthz and /livez.
  - Testing:
    - Configure Kube-proxy with a livenessProbe targeting /healthz and delete a Node. Kube-proxy on that Node should start failing its /healthz and start CrashLooping. Apply the fixes proposed in Mitigations and verify that they resolve the issue.
What steps should be taken if SLOs are not being met to determine the problem?
There are no SLOs for this KEP, see: “What are the reasonable SLOs (Service Level Objectives) for the enhancement?”
Implementation History
- 2023-02-03: Initial proposal