KEP-3458: Remove transient node predicates from KCCM's service controller
KEP-3458: Remove transient node predicates from KCCM’s service controller
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Summary
The service controller in the Kubernetes cloud controller manager (KCCM) currently adds/removes Nodes from the load balancers’ node set in the following cases:
a) When a node gets the taint ToBeDeletedByClusterAutoscaler added/removed
b) When a node goes Ready / NotReady
b) however only applies to services with externalTrafficPolicy: Cluster. In
both cases: removing the Node in question from the load balancers’ node set will
cause all connections on that node to get terminated instantly. This can be
considered a bug / sub-optimal behavior for nodes which are experiencing
transient readiness state or for terminating nodes, since connections are not
allowed to drain in those cases, even though the load balancer might support
that. Moreover: on large clusters with a lot nodes and entropy, re-syncing load
balancers like this can lead to rate-limiting by the cloud provider due to an
excessive amount of update calls.
As to enable connection draining, reduce cloud provider API calls and simplify
the KCCMs sync loop: this KEP proposes that the service controller stops
synchronizing the load balancer node set in these cases. Seeing as how this has
always been the case, a new feature gate StableLoadBalancerNodeSet will
be introduced, which will be used to enable the more optimal behavior.
Motivation
Abruptly terminating connections in the cases defined by a) and b) above can be seen as buggy behavior and should be improved. By enabling connection draining, applications are allowed profit from graceful shutdown / termination, for what concerns cluster ingress connectivity. Users of Kubernetes will also see a reduction in the amount of cloud API calls, for what concerns calls stemming from syncing load balancers with the Kubernetes cluster state.
Addressing b) is not useful for ingress load balancing. A load balancer needs to know if the networking data plane is running fine and this is determined by the configured health check. Cloud providers define their own health check, and no one does the same. The following describes what the health check looks like on the three major public cloud providers:
- GCP: probes port 10256 (Kube-proxy’s healthz port)
- AWS: if ELB; probes the first
NodePortdefined on the service spec - Azure: probes all
NodePortdefined on the service spec.
All clouds take an approach of trying to ascertain if traffic can be forwarded to the endpoint, which is a completely valid health check for load balancer services. There are drawbacks to all of these ways of doing - but cloud providers themselves are deemed best suited for what concerns: determining what is the best mechanism to use for their load balancers / cloud’s mode of operation. Their mechanism is beyond the scope of this KEP, i.e: this KEP does not attempt to “align them”.
Goals
- Stop re-configuring the load balancers’ node set for cases a) and b) above
Non-Goals
- Stop re-configuring the load balancers’ node set for fully deleted /
newly added cluster nodes, or for nodes which get annotated with
node.kubernetes.io/exclude-from-external-load-balancers. - Enable load balancer connection draining while Node is draining. This requires health check changes.
Proposal
Risks and Mitigations
Risk
- Cloud providers which do not allow VM deletion when the VM is referenced by other constructs, will block the cluster auto-scaler (CA) from deleting the VM upon downscale. This will result in reduced downscale performance by the CA, or completely block VM deletion from happening - this is because the service controller will never proceed to de-reference the VM from the load balancer node set until the Node is fully deleted in the API server, which will never occur until the VM is deleted. The three major cloud providers (GCP/AWS/Azure) do however support this, and it is not expected that other providers don’t.
- Cloud providers which do not configure their load balancer health checks to target the service proxy’s healthz, alternatively: constructs which validate the endpoint’s reachability across the data plane; risk experiencing regressions as a consequence of the removal of b). This would happen if a node is faced with a terminal error which does impact the Node’s network connectivity. Doing this is considered incorrect, and therefor not expected to be the case.
- By removing b) above we are delaying the removal of the Node from the load balancers’ node set until the Node is completely deleted in the API server. This might have an impact on CA downscaling. The reason for this is: the CA deletes the VM and expects the node controller in the KCCM to notice this and delete the Node in Kubernetes, as a consequence. If the node controller takes a while to sync that and other Node related events trigger load balancer reconciliation while this is happening, then the service controller will error until the cluster reaches steady-state (because it’s trying to sync Nodes for which the VM is non-existent). A mitigation to this is presented in Mitigations
Mitigations
- Cloud providers/workloads which do not support the behavior mentioned in Risk , have the possibility to set the feature flag to false, thus default back to the current mechanism.
- As to address point 3. we kept the taint
ToBeDeletedByClusterAutoscaleras a predicate for bothexternalTrafficPolicy: Cluster/Local. Updates to that taint will however not trigger a load balancer re-sync. This will ensure that whatever Nodes are included in the load balancer set, always have a corresponding VM. If the scenario detailed in 3. happens though, this also means we won’t connection drain the node which is terminating (since the node will be remvoed from the load balancer set). This means that we will have cases where we have a sub-optimal behavior (i.e: no connection draining), but we avoid errors syncing the state of load balancers. For reference, see: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/service/controller.go#L1009-L1014C26
Design Details
- Implement the change in the service controller and ensure it does not add / remove nodes from the load balancers’ node set for cases a) and b) mentioned in (Summary)[#Summary]
- Add the feature gate:
StableLoadBalancerNodeSet, set it to “on” by default and promote it directly to Beta.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
The service controller in the KCCM currently has a set of tests validating expected syncs caused by Node predicates, these will need to be updated.
k8s.io/cloud-provider/controllers/service:08/Feb/2023-67.7%
Integration tests
Kubernetes is mostly tested via unit tests and e2e, not integration, and this is not expected to change.
e2e tests
Kubernetes in general needs to extended its load balancing test suite with disruption tests, this might be the right effort we need to get that ball rolling. Testing would include:
- validation that an application running on a deleting VM benefits from graceful termination of its TCP connection.
- validation that Node readiness state changes do not result in load balancer re-syncs.
Graduation Criteria
Beta
This is addressing a sub-optimal solution currently existing in Kubernetes, so the feature gate will be moved to Beta and “on” by default from the start.
The feature flag should be kept available until we get sufficient evidence of people not being affected by anything mentioned in (Risks)[#Risks] or other.
GA
Given the lack of reported issues in Beta: the feature gate will be locked-in in GA.
Tentative timeline for this is in v1.30. Services of type: LoadBalancer are
sufficiently common on any given Kubernetes cluster, that any cloud provider
susceptible to the (Risks)[#Risks] will very likely report issues in Beta.
Upgrade / Downgrade Strategy
Any upgrade to a version enabling the feature, succeeded by a downgrade to a
version disabling it, is not expected to be impacted in any way. On upgrade: the
service controller will add all existing cluster nodes (bar excluded ones) to
the load balancer set. On downgrade: any nodes NotReady / tainted will get
reconciled by the service controller corresponding to the downgraded control
plane version and get removed from the load balancer set - as they should.
Version Skew Strategy
This change is contained to only the control plane and is therefor not impacted by any version skew.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in
kep.yaml)- Feature gate name:
StableLoadBalancerNodeSet - Components depending on the feature gate: Kubernetes cloud controller manager
- Feature gate name:
Does enabling the feature change any default behavior?
Yes, Kubernetes Nodes will remain in the load balancers’ node set until fully deleted in the API server, as opposed to the current behavior: which adds / removes the nodes from the set when the Node experience transient state changes. Cloud providers which do not support deleting VMs which are still referenced by load balancers, will be unable to do so upon downscaling by the cluster auto-scaler when it attempts to delete the VM.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, by resetting the feature gate back.
What happens if we reenable the feature if it was previously rolled back?
Behavior will be restored back immediately.
Are there any tests for feature enablement/disablement?
Not needed since the enablement/disablement is triggered by changing a in-memory boolean variable.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
If a cluster has a lot of Nodes which are currently NotReady (in the order of
hundreds) and a rollout is triggered, it is expected that all of these nodes
will be added at once to every load balancer. That might have cloud API rate
limiting impacts on the service controller.
What specific metrics should inform a rollback?
Performance degradation by the CA when downscaling / flat out inability to
delete VMs. - this should be informed by the metric nodesync_error_rate
mentioned in Monitoring requirements
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
The owner of the KEP missed running the manual test when promoting the feature to Beta. Since then it was implicitly tested by many users that upgraded their clusters to 1.27+ versions without any bug reports, so running additional tests now wouldn’t provide additional value.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Monitoring Requirements
The only mechanism currently implemented, is: events for syncing load balancers in the KCCM. The events are triggered any time a service is synced or Node change triggers a re-sync of all services. This will not change and can be used to monitor the implemented change. The implementation will result in less load balancer re-syncs.
A new metric load_balancer_sync_count has been added for explicitly monitoring
the amount of load balancer related syncs performed by the service controller.
This will include load balancer syncs caused by Service and Node changes. See:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/service/metrics.go#L44-L49
A new metric nodesync_error_count has been added for explicitly monitoring the
amount of errors produced by syncing Node related events for load balancers. The
goal is have an indicator of if the service controller is impacted by point 3.
mentioned in (Risk)[#Risk], and at which frequency. See:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/cloud-provider/controllers/service/metrics.go#L50-L55
How can an operator determine if the feature is in use by workloads?
Analyze events stemming from the API server, correlating node state changes
(readiness or addition / removal of the taint: ToBeDeletedByClusterAutoscaler)
to load balancer re-syncs. The events should show a clear reduction in re-syncs
post the implementation and rollout of the change.
How can someone using this feature know that it is working for their instance?
By observing no change for the metric
load_balancer_sync_countwhen a Node transitions betweenReady<->NotReadyor when a Node is tainted withToBeDeletedByClusterAutoscaler. This is because this KEP proposes that we stop syncing load balancer as a consequence of these events.By observing no change w.r.t any active ingress connections for an
externalTrafficPolicy: Clusterservice, which is passing through a Node which is transitioning betweenReady<->NotReady. I.e: no impact on new or established connections, given that Kube-proxy is healthy when the Node transitions state like this and isn’t impacted.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
No
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Total amount of load balancer re-syncs should be reduced, leading to less cloud
provider API calls. Also, and more subtle: connections will get a chance to
gracefully terminate when the CA downscales cluster nodes. For services of type
externalTrafficPolicy: Cluster “traversing” connections through a “nexthop”
node might not be impacted by that Node’s readiness state anymore.
- Metrics
- Events: The KCCM triggers events when syncing load balancers. The amount of these events should be reduced.
- Metrics:
load_balancer_sync_count
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A
Dependencies
Does this feature depend on any specific services running in the cluster?
No
Scalability
Will enabling / using this feature result in any new API calls?
No
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
No
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Not any different than today.
What are other known failure modes?
None
What steps should be taken if SLOs are not being met to determine the problem?
Validate that services of type: LoadBalancer exists on the cluster and that
Nodes are experiencing a transitioning readiness state, alternatively that the
CA downscales and deletes VMs.
Implementation History
- 2023-02-01: Initial proposal