KEP-1645: Multi-Cluster Services API
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Constraints and Conflict Resolution
- Implementation History
- Alternatives
- Infrastructure Needed
Release Signoff Checklist
- Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- KEP approvers have approved the KEP status as implementable
- Design details are appropriately documented
- Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- Graduation criteria is in place
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
There is currently no standard way to connect or even think about Kubernetes services beyond the cluster boundary, but we increasingly see users deploy applications across multiple clusters designed to work in concert. This KEP proposes a new API to extend the service concept across multiple clusters. It aims for minimal additional configuration, making multi-cluster services as easy to use as in-cluster services, and leaves room for multiple implementations.
Converted from this original proposal doc.
Motivation
There are many reasons why a K8s user may want to split their deployments across multiple clusters, but still retain mutual dependencies between workloads running in those clusters. Today the cluster is a hard boundary, and a service is opaque to a remote K8s consumer that would otherwise be able to make use of metadata (e.g. endpoint topology) to better direct traffic. To support failover or temporarily during migration, users may want to consume services spread across clusters, but today that requires non-trivial bespoke solutions.
The Multi-Cluster Services API aims to fix these problems.
Goals
- Define a minimal API to support service discovery and consumption across clusters.
- Consume a service in another cluster.
- Consume a service deployed in multiple clusters as a single service.
- When a service is consumed from another cluster its behavior should be predictable and consistent with how it would be consumed within its own cluster.
- Allow gradual rollout of changes in a multi-cluster environment.
- Create building blocks for multi-cluster tooling.
- Support multiple implementations.
- Leave room for future extension and new use cases.
Non-Goals
- Define specific implementation details beyond general API behavior.
- Change behavior of single cluster services in any way.
- Define what NetworkPolicy means for multi-cluster services.
- Solve mechanics of multi-cluster service orchestration.
Proposal
Terminology
clusterset - A placeholder name for a group of clusters with a high degree of mutual trust and shared ownership that share services amongst themselves. Membership in a clusterset is symmetric and transitive. The set of member clusters are mutually aware, and agree about their collective association. Within a clusterset, namespace sameness applies and all namespaces with a given name are considered to be the same namespace. Implementations of this API are responsible for defining and tracking membership in a clusterset. The specific mechanism is out of scope of this proposal.
mcs-controller - A controller that syncs services across clusters and makes them available for multi-cluster service discovery and connectivity. There may be multiple implementations; this doc describes the expected common behavior. The controller may be a single controller, multiple decentralized controllers, or a human using kubectl to create resources. This document aims to support any implementation that fulfills the behavioral expectations of this API.
cluster name - A unique identifier for a cluster, scoped to the implementation’s cluster registry. We do not attempt to define the registry. The cluster name must be a valid RFC 1123 DNS label.
The cluster name should be consistent for the life of a cluster and its membership in the clusterset. Implementations should treat name mutation as a delete of the membership followed by recreation with the new name.
cluster id - A unique identifier for a cluster, scoped to a clusterset. The cluster id must be either:
- equal to cluster name,
- or composed of two valid RFC 1123 DNS labels separated with a dot. The first label equals cluster name and the second one gives additional context, allowing the implementation to uniquely identify a cluster within a clusterset composed of clusters registered with multiple cluster registries.
The cluster id should be consistent for the life of a cluster and its membership in the clusterset. Implementations should treat id mutation as a delete of the membership followed by recreation with the new id.
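The cluster name and cluster id rules above can be checked mechanically. The following is a minimal sketch, not part of the proposed API: the helper names `isDNS1123Label` and `validClusterID` are hypothetical, and the regular expression is the commonly used pattern for a lowercase RFC 1123 label.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dns1123Label matches a single lowercase RFC 1123 DNS label: alphanumerics
// and '-', starting and ending with an alphanumeric character.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNS1123Label reports whether s is a valid label of at most 63 characters.
func isDNS1123Label(s string) bool {
	return len(s) <= 63 && dns1123Label.MatchString(s)
}

// validClusterID (hypothetical helper) accepts an id that is either equal to
// the cluster name, or "<cluster name>.<context>" where both parts are labels.
func validClusterID(id, clusterName string) bool {
	if !isDNS1123Label(clusterName) {
		return false
	}
	if id == clusterName {
		return true
	}
	first, second, found := strings.Cut(id, ".")
	return found && first == clusterName && isDNS1123Label(second)
}

func main() {
	fmt.Println(validClusterID("cluster-east", "cluster-east"))
	fmt.Println(validClusterID("cluster-east.registry-a", "cluster-east"))
	fmt.Println(validClusterID("cluster-west.registry-a", "cluster-east"))
}
```

An id with more than two labels, or whose first label differs from the cluster name, is rejected by this sketch.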
We propose a new CRD called ServiceExport, used to specify which services
should be exposed across all clusters in the clusterset. ServiceExports must
be created in each cluster that the underlying Service resides in. Creation of
a ServiceExport in a cluster will signify that the Service with the same
name and namespace as the export should be visible to other clusters in the
clusterset.
Another CRD called ServiceImport will be introduced to act as the in-cluster
representation of a multi-cluster service in each importing cluster. This is
analogous to the traditional Service type in Kubernetes. Importing clusters
will have a corresponding ServiceImport for each uniquely named Service that
has been exported within the clusterset, referenced by namespaced name.
ServiceImport resources will be managed by the MCS implementation’s
mcs-controller.
If multiple clusters export a Service with the same namespaced name, they will
be recognized as a single combined service. For example, if 5 clusters export
my-svc.my-ns, each importing cluster will have one ServiceImport named
my-svc in the my-ns namespace and it will be associated with endpoints from
all exporting clusters. Properties of the ServiceImport (e.g. ports, topology)
will be derived from a merger of component Service properties.
This specification is not prescriptive on exact implementation details. Existing
implementations of Kubernetes Service API (e.g. kube-proxy) can be extended to
present ServiceImports alongside traditional Services. One often discussed
implementation requiring no changes to kube-proxy is to have the mcs-controller
maintain ServiceImports and create “dummy” or “shadow” Service objects, named
after a mcs-controller managed EndpointSlice that aggregates all cross-cluster
backend IPs, so that kube-proxy programs those endpoints like a regular Service.
Other implementations are encouraged as long as the properties of the API described
in this document are maintained.
User Stories
Different ClusterIP Services Each Deployed to Separate Cluster
I have 2 clusters, each running different ClusterIP services managed by different teams, where services from one team depend on services from the other team. I want to ensure that a service from one team can discover a service from the other team (via DNS resolving to VIP), regardless of the cluster that they reside in. In addition, I want to make sure that if the dependent service is migrated to another cluster, the dependee is not impacted.
Single Service Deployed to Multiple Clusters
I have deployed my stateless service to multiple clusters for redundancy or scale. Now I want to propagate topologically-aware service endpoints (local, regional, global) to all clusters, so that other services in my clusters can access instances of this service in priority order based on availability and locality. Requests to my replicated service should seamlessly transition (within SLO for dropped requests) between instances of my service in case of failure or removal without action by or impact on the caller. Routing to my replicated service should optimize for cost metric (e.g. prioritize traffic local to zone, region).
Constraints
While standard Services traffic policies and traffic distribution have been integrated and work across clusters (for instance PreferSameZone across clusters sharing the same zone), we do not yet have multi-cluster specific traffic distribution control. This is planned to be addressed in its own KEP that will complement this specification.
Risks and Mitigations
Design Details
Exporting Services
Services will not be visible to other clusters in the clusterset by default. They must be explicitly marked for export by the user. This allows users to decide exactly which services should be visible outside of the local cluster.
Tooling may (and likely will, in the future) be built on top of this to simplify
the user experience. Some initial ideas are to allow users to specify that all
services in a given namespace or in a namespace selector or even a whole cluster
should be automatically exported by default. In that case, a ServiceExport
could be automatically created for all Services. This tooling will be designed
in a separate doc, and is secondary to the main API proposed here.
To mark a service for export to the clusterset, a user will create a ServiceExport CR:
// ServiceExport declares that the associated service should be exported to
// other clusters.
type ServiceExport struct {
	metav1.TypeMeta `json:",inline"`
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +optional
	Spec ServiceExportSpec `json:"spec,omitempty"`
	// +optional
	Status ServiceExportStatus `json:"status,omitempty"`
}

// ServiceExportSpec describes an exported service and extra exported information
type ServiceExportSpec struct {
	// +optional
	ExportedLabels map[string]string `json:"exportedLabels"`
	// +optional
	ExportedAnnotations map[string]string `json:"exportedAnnotations"`
}

// ServiceExportStatus contains the current status of an export.
type ServiceExportStatus struct {
	// +optional
	// +patchStrategy=merge
	// +patchMergeKey=type
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceExport
metadata:
  name: my-svc
  namespace: my-ns
status:
  conditions:
  - type: Valid
    status: "True"
    lastTransitionTime: "2020-03-30T01:33:51Z"
    reason: Valid
    message: "The ServiceExport and its Service is exportable."
  - type: Ready
    status: "True"
    lastTransitionTime: "2020-03-30T01:33:55Z"
    reason: Exported
    message: "The service has been exported"
  - type: Conflict
    status: "True"
    lastTransitionTime: "2020-03-30T01:33:55Z"
    reason: TypeConflict
    message: "Conflicting type. Using \"ClusterSetIP\" from oldest service export in \"cluster-1\". 2/5 clusters disagree."
To export a service, a ServiceExport should be created within the cluster and
namespace that the service resides in, name-mapped to the service for export -
that is, they reference the Service with the same name as the export. If
multiple clusters within the clusterset have ServiceExports with the same name
and namespace, these will be considered the same service and will be combined at
the clusterset level.
Note: A Service without a corresponding ServiceExport in its local cluster
will not be exported even if other clusters are exporting a Service with the
same namespaced name.
This requires that within a clusterset, a given namespace is governed by a single authority across all clusters. It is that authority’s responsibility to ensure that a name is shared by multiple services within the namespace if and only if they are instances of the same service.
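To illustrate how exports with the same namespaced name collapse into one clusterset service, here is a minimal sketch. The `export` type and `groupExports` function are hypothetical simplifications for illustration; a real mcs-controller would work from ServiceExport objects observed in each member cluster.

```go
package main

import (
	"fmt"
	"sort"
)

// export is a hypothetical, simplified view of one ServiceExport: the
// exporting cluster plus the namespaced name it shares with its Service.
type export struct {
	Cluster   string
	Namespace string
	Name      string
}

// groupExports merges exports that share a namespaced name into a single
// clusterset service and returns the exporting clusters for each one.
func groupExports(exports []export) map[string][]string {
	groups := make(map[string][]string)
	for _, e := range exports {
		key := e.Namespace + "/" + e.Name
		groups[key] = append(groups[key], e.Cluster)
	}
	for _, clusters := range groups {
		sort.Strings(clusters) // deterministic order for display
	}
	return groups
}

func main() {
	groups := groupExports([]export{
		{Cluster: "cluster-b", Namespace: "my-ns", Name: "my-svc"},
		{Cluster: "cluster-a", Namespace: "my-ns", Name: "my-svc"},
		{Cluster: "cluster-a", Namespace: "my-ns", Name: "other-svc"},
	})
	fmt.Println(groups["my-ns/my-svc"])
	fmt.Println(groups["my-ns/other-svc"])
}
```

A cluster that runs a Service named my-svc without a local ServiceExport never appears in the input, so it contributes no backends to the combined service.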
Most information about the service, including ports, backends, topology and
session affinity, internal traffic policy, and traffic distribution
will continue to be stored in the Service objects, which are each name
mapped to a ServiceExport. This does not apply for labels and
annotations which are stored in ServiceExport directly in spec.exportedLabels
and spec.exportedAnnotations. Exporting labels and annotations is optionally
supported by MCS-API implementations. If supported, annotations or labels must
not be exported from the metadata of the Service or ServiceExport resources.
An implementation may use the ipFamilies field from the exported Services as
a hint to influence the IPs and ipFamilies of the ServiceImport object.
The exact mechanism for determining those fields is implementation-defined.
If ipFamilies is set on the ServiceImport object, it must not have duplicated
families (for instance ipFamilies: [IPv4, IPv4] is not valid) and the IPs
should eventually be in the same order as what is defined in ipFamilies. If
conflicting ipFamilies are found among the constituent Services, implementations
must raise an IPFamilyConflict condition when this might result in network
traffic reaching only a subset of the backends depending on the IP protocol used.
Also note that even in a dual stack cluster regular Services are by default SingleStack
which might default to IPv4 or IPv6 depending on the cluster configuration and there
are various constraints when mutating ipFamilies and ipFamilyPolicy on a Service
(see ref).
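The constraints on ipFamilies and IPs can be sketched as a small validation routine. This is an illustration under stated assumptions rather than prescribed behavior: `validateIPFamilies` is a hypothetical helper, families are plain strings instead of `corev1.IPFamily`, and the ordering check is applied strictly, although the text only requires the IPs to converge to that order eventually.

```go
package main

import (
	"fmt"
	"net/netip"
)

// familyOf returns "IPv4" or "IPv6" for a parseable IP address string.
func familyOf(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	if addr.Is4() {
		return "IPv4", nil
	}
	return "IPv6", nil
}

// validateIPFamilies (hypothetical helper) rejects duplicated families and
// IPs whose family differs from the entry at the same index of families.
func validateIPFamilies(families, ips []string) error {
	seen := make(map[string]bool)
	for _, f := range families {
		if f != "IPv4" && f != "IPv6" {
			return fmt.Errorf("unknown IP family %q", f)
		}
		if seen[f] {
			return fmt.Errorf("duplicate IP family %q", f)
		}
		seen[f] = true
	}
	if len(families) == 0 {
		return nil // ipFamilies is optional; nothing to compare against
	}
	if len(ips) > len(families) {
		return fmt.Errorf("%d IPs for %d IP families", len(ips), len(families))
	}
	for i, ip := range ips {
		got, err := familyOf(ip)
		if err != nil {
			return fmt.Errorf("invalid IP %q: %w", ip, err)
		}
		if got != families[i] {
			return fmt.Errorf("IP %q is %s, want %s at index %d", ip, got, families[i], i)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateIPFamilies([]string{"IPv4", "IPv6"}, []string{"10.0.0.1", "fd00::1"}))
	fmt.Println(validateIPFamilies([]string{"IPv4", "IPv4"}, nil))
	fmt.Println(validateIPFamilies([]string{"IPv6", "IPv4"}, []string{"10.0.0.1", "fd00::1"}))
}
```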
Deleting a ServiceExport will stop exporting the name-mapped Service.
Restricting Exports
Cluster administrators may use RBAC rules to prevent creation of
ServiceExports in select namespaces. While there are no general restrictions
on which namespaces are allowed, administrators should be especially careful
about permitting exports from kube-system and default. As a best practice,
admins may want to tightly or completely prevent exports from these namespaces
unless there is a clear use case.
Importing Services
To consume a clusterset service, the domain name associated with the
multi-cluster service should be used (see DNS). When the mcs-controller
sees a ServiceExport, a ServiceImport will be introduced in each importing
cluster to represent the imported service. Users are primarily expected to
consume the service via domain name and clusterset VIP, but the ServiceImport
may be used for imported service discovery via the K8s API and will be used
internally as the source of truth for routing and DNS configuration.
A ServiceImport is a service that may have endpoints in other clusters. This
includes 3 scenarios:
- This service is running entirely in different cluster(s).
- This service has endpoints in other cluster(s) and in this cluster.
- This service is running entirely in this cluster, but is exported to other cluster(s) as well.
A multi-cluster service will be imported only by clusters in which the service’s namespace exists. All clusters containing the service’s namespace will import the service. This means that all exporting clusters will also import the multi-cluster service. An implementation may or may not decide to create missing namespaces automatically; that behavior is out of scope of this spec.
Because of the potential wide impact a ServiceImport may have within a
cluster, non-cluster-admin users should not be allowed to create or modify
ServiceImport resources. The mcs-controller should be solely responsible for
the lifecycle of a ServiceImport.
Some errors may occur during the ServiceImport’s lifecycle, such as IP protocol
incompatibilities (e.g. importing an IPv6-only service in an IPv4 cluster). These
errors and general status reporting of a ServiceImport should be reported
via its status conditions field.
For each exported service, one ServiceExport will exist in each cluster that
exports the service. The mcs-controller will create and maintain a derived
ServiceImport in each cluster within the clusterset so long as the service’s
namespace exists (see constraints and conflict resolution). If all ServiceExport
instances are deleted, each ServiceImport will also be deleted from all
clusters.
// ServiceImport describes a service imported from clusters in a clusterset.
type ServiceImport struct {
	metav1.TypeMeta `json:",inline"`
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// +optional
	Spec ServiceImportSpec `json:"spec,omitempty"`
	// +optional
	Status ServiceImportStatus `json:"status,omitempty"`
}

// ServiceImportType designates the type of a ServiceImport
type ServiceImportType string

const (
	// ClusterSetIP are only accessible via the ClusterSet IP.
	ClusterSetIP ServiceImportType = "ClusterSetIP"
	// Headless services allow backend pods to be addressed directly.
	Headless ServiceImportType = "Headless"
)

// ServiceImportSpec describes an imported service and the information necessary to consume it.
type ServiceImportSpec struct {
	// +listType=atomic
	Ports []ServicePort `json:"ports"`
	// +kubebuilder:validation:MaxItems:=2
	// +optional
	IPs []string `json:"ips,omitempty"`
	// +kubebuilder:validation:MaxItems:=2
	// +optional
	IPFamilies []corev1.IPFamily `json:"ipFamilies,omitempty"`
	// +optional
	Type ServiceImportType `json:"type"`
	// +optional
	SessionAffinity corev1.ServiceAffinity `json:"sessionAffinity"`
	// +optional
	SessionAffinityConfig *corev1.SessionAffinityConfig `json:"sessionAffinityConfig"`
	// +optional
	InternalTrafficPolicy *corev1.ServiceInternalTrafficPolicy `json:"internalTrafficPolicy,omitempty"`
	// The possible TrafficDistribution values should match what can be similarly
	// defined in a Service, see https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution
	// +optional
	TrafficDistribution *string `json:"trafficDistribution,omitempty"`
}

// ServicePort represents the port on which the service is exposed
type ServicePort struct {
	// The name of this port within the service. This must be a DNS_LABEL.
	// All ports within a ServiceSpec must have unique names. When considering
	// the endpoints for a Service, this must match the 'name' field in the
	// EndpointPort.
	// Optional if only one ServicePort is defined on this service.
	// +optional
	Name string `json:"name,omitempty"`
	// The IP protocol for this port. Supports "TCP", "UDP", and "SCTP".
	// Default is TCP.
	// +optional
	Protocol Protocol `json:"protocol,omitempty"`
	// The application protocol for this port.
	// This field follows standard Kubernetes label syntax.
	// Un-prefixed names are reserved for IANA standard service names (as per
	// RFC-6335 and http://www.iana.org/assignments/service-names).
	// Non-standard protocols should use prefixed names such as
	// mycompany.com/my-custom-protocol.
	// Field can be enabled with ServiceAppProtocol feature gate.
	// +optional
	AppProtocol *string `json:"appProtocol,omitempty"`
	// The port that will be exposed by this service.
	Port int32 `json:"port"`
}

// ServiceImportStatus describes derived state of an imported service.
type ServiceImportStatus struct {
	// +optional
	// +patchStrategy=merge
	// +patchMergeKey=cluster
	// +listType=map
	// +listMapKey=cluster
	Clusters []ClusterStatus `json:"clusters"`
	// +optional
	// +patchStrategy=merge
	// +patchMergeKey=type
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
}

// ClusterStatus contains service configuration mapped to a specific source cluster
type ClusterStatus struct {
	Cluster string `json:"cluster"`
}
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-svc
  namespace: my-ns
spec:
  ips:
  - 42.42.42.42
  ipFamilies:
  - IPv4
  type: "ClusterSetIP"
  ports:
  - name: http
    protocol: TCP
    port: 80
  sessionAffinity: None
status:
  conditions:
  - type: Ready
    reason: Ready
    status: "True"
    lastTransitionTime: "2020-03-30T01:33:51Z"
  clusters:
  - cluster: us-west2-a-my-cluster
The ServiceImport.Spec.IPs (VIPs) can be used to access this service from within
this cluster.
ClusterSet Service Behavior Expectations
Service Types
- ClusterIP: This is the straightforward case most of the proposal assumes. Each endpoint from a producing cluster associated with the exported service is aggregated with endpoints from other clusters to make up the clusterset service. They will be imported to the cluster behind the clusterset IP, with a ServiceImport of type ClusterSetIP. The details on how the clusterset IP is allocated or how the combined slices are maintained may vary by implementation; see also Tracking Endpoints.
- ClusterIP: none (Headless): Headless services are supported and will be imported with a ServiceImport like any other ClusterIP service, but do not configure a VIP and must be consumed via DNS. Their ServiceImports will be of type Headless. A multi-cluster service’s headlessness is derived from its constituent exported services according to the conflict resolution policy. Exporting a non-headless service to an otherwise headless service can dynamically change the clusterset service type when an old export is removed, potentially breaking existing consumers. This is likely the result of a deployment error. Conditions and events on the ServiceExport will be used to communicate conflicts to the user.
- NodePort and LoadBalancer: These create ClusterIP services that would sync as expected. For example if you export a NodePort service, the resulting cross-cluster service will still be a clusterset IP type. The local service will not be affected. Node ports can still be used to access the cluster-local service in the source cluster, and only the clusterset IP will route to endpoints in remote clusters.
- ExternalName: It doesn’t make sense to export an ExternalName service. They can’t be merged with other exports, and it seems like it would only complicate deployments by even attempting to stretch them across clusters. Instead, regular ExternalName type Services should be created in each cluster individually. If a ServiceExport is created for an ExternalName service, a condition type Valid with reason InvalidServiceType and status false will be set on the ServiceExport.
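The mapping from an exported Service’s type to the resulting ServiceImport type can be summarized in code. This is a minimal sketch for a single exported Service: `importTypeFor` is a hypothetical helper, types are plain strings rather than the `corev1` constants, and conflict resolution across several exporting clusters is not modeled.

```go
package main

import "fmt"

// importTypeFor (hypothetical helper) returns the ServiceImport type that a
// single exported Service would produce, or an error if it is not exportable.
// serviceType is the Service's spec.type; clusterIP is its spec.clusterIP.
func importTypeFor(serviceType, clusterIP string) (string, error) {
	switch serviceType {
	case "", "ClusterIP": // an empty type defaults to ClusterIP
		if clusterIP == "None" {
			return "Headless", nil
		}
		return "ClusterSetIP", nil
	case "NodePort", "LoadBalancer":
		// These also allocate a cluster IP, so they import as ClusterSetIP.
		return "ClusterSetIP", nil
	case "ExternalName":
		return "", fmt.Errorf("service type %q cannot be exported (reason InvalidServiceType)", serviceType)
	default:
		return "", fmt.Errorf("unknown service type %q", serviceType)
	}
}

func main() {
	for _, svc := range [][2]string{
		{"ClusterIP", "10.0.0.10"},
		{"ClusterIP", "None"},
		{"NodePort", "10.0.0.11"},
		{"ExternalName", ""},
	} {
		importType, err := importTypeFor(svc[0], svc[1])
		fmt.Println(svc[0], svc[1], importType, err)
	}
}
```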
ClusterSetIP
A non-headless ServiceImport is expected to have associated IP addresses, the
clusterset IPs, which may be accessed from within an importing cluster. These IPs
may be used clusterset-wide or assigned on a per-cluster basis, but are expected
to be consistent for the life of a ServiceImport from the perspective of the
importing cluster. Requests to these IPs from within a cluster will route to
backends for the aggregated Service. The IPs field must correspond to the
protocols defined in the ipFamilies field, if specified. How the ipFamilies
field is determined is implementation-defined, for instance it might correspond
to what IP protocols the constituent ServiceExports support or only the IP
protocols that the local cluster supports.
Note: this doc does not discuss NetworkPolicy, which cannot currently be used
to describe a selector based policy that applies to a multi-cluster service.
DNS
Optional, but recommended.
The full specification for Multicluster Service DNS is in this KEP’s specification.md. MCS aims to align with the existing service DNS spec. This section provides an overview of the multicluster DNS specification and its rationale, and assumes familiarity with in-cluster Service DNS behavior.
In short, when a ServiceExport is created, this will cause a domain name for
the multi-cluster service to become accessible from within the clusterset. The
domain name will be <service>.<ns>.svc.clusterset.local. This domain name
operates differently depending on whether the ServiceExport refers to a
ClusterSetIP or Headless service:
- ClusterSetIP services: Requests to this domain name from within an importing cluster will resolve to the clusterset IP. Requests to this IP will be spread across all endpoints exported with ServiceExports across the clusterset.
- Headless services: Within an importing cluster, the clusterset domain name will have multiple A/AAAA records, each containing the address of a ready endpoint of the headless service. <service>.<ns>.svc.clusterset.local will resolve to the entire set or the subset of ready pod IPs, depending on the implementation and endpoint count.
In addition, other resource records are included to conform to in-cluster
Service DNS behavior. SRV records are included to support known use cases such
as VOIP, Active Directory, and etcd cluster bootstrapping. Pods backing a
Headless service may be addressed individually using the
<hostname>.<clusterid>.<svc>.<ns>.svc.clusterset.local format; necessary
records will be created based on each ready endpoint’s hostname and the
multicluster.kubernetes.io/source-cluster label on the EndpointSlice. This
allows naming collisions to be avoided for headless services backed by identical
StatefulSets deployed in multiple clusters.
Note: the total length of a FQDN is limited to 253 characters. Each label is independently limited to 63 characters, so users must choose host/cluster/service names to avoid hitting this upper bound.
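The two name formats and the length limits can be sketched as follows. The helper names `serviceFQDN`, `podFQDN` and `checkLength` are hypothetical; only the name formats and the 253/63 character limits come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterSetZone is the suffix shared by all multi-cluster service names.
const clusterSetZone = "svc.clusterset.local"

// serviceFQDN builds <service>.<ns>.svc.clusterset.local.
func serviceFQDN(svc, ns string) string {
	return strings.Join([]string{svc, ns, clusterSetZone}, ".")
}

// podFQDN builds <hostname>.<clusterid>.<svc>.<ns>.svc.clusterset.local for
// a pod backing a Headless service. clusterID may itself contain one dot.
func podFQDN(hostname, clusterID, svc, ns string) string {
	return strings.Join([]string{hostname, clusterID, svc, ns, clusterSetZone}, ".")
}

// checkLength enforces the 253-character FQDN limit and the 63-character
// limit on each individual label.
func checkLength(fqdn string) error {
	if len(fqdn) > 253 {
		return fmt.Errorf("name is %d characters, limit is 253", len(fqdn))
	}
	for _, label := range strings.Split(fqdn, ".") {
		if len(label) == 0 || len(label) > 63 {
			return fmt.Errorf("label %q must be 1 to 63 characters", label)
		}
	}
	return nil
}

func main() {
	name := podFQDN("web-0", "cluster-east", "my-svc", "my-ns")
	fmt.Println(serviceFQDN("my-svc", "my-ns"))
	fmt.Println(name, checkLength(name))
}
```

Because the pod form adds two or three labels in front of the service name, long hostnames and cluster ids reduce the room left for service and namespace names.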
All service consumers must use the *.svc.clusterset.local name to enable
clusterset routing, even if there is a matching Service with the same
namespaced name in the local cluster. This name allows service consumers to
opt-in to multi-cluster behavior. There will be no change to existing behavior
of the cluster.local zone.
It is expected that the .clusterset.local zone is standard and available in
all implementations, but customization and/or aliasing can be explored if
there’s demand.
No PTR records necessary for multicluster DNS
This specification does not require PTR records be generated in the course of
implementing multicluster DNS. By definition, each IP must only have one PTR
record, to facilitate reverse DNS lookup. The cluster-local Kubernetes DNS
specification already requires a PTR record for the ready IPs for ClusterIP
and Headless Services. As this specification is currently written, by not
requiring any new PTR records and leaving the cluster-local PTR records as
the only ones, PTR record existence becomes potentially inconsistent for
multicluster DNS, especially between importing and exporting clusters (for
example, a Headless pod IP PTR record would exist on the exporting cluster,
but not necessarily on an importing cluster). On the other hand, some existing
MCS API implementations create a new “dummy” cluster-local Service object for
every ServiceImport, and due to the cluster-local DNS specification, they will
already have a PTR record generated due to the DNS resolution of the “dummy”
Service.
In cases where PTR records are not always set, if the specification did
require implementations to backfill a clusterset.local PTR record wherever one
is missing (i.e. for importing clusters), the result would be a patchwork of
cluster.local and clusterset.local PTR records, depending on which cluster in
the ClusterSet you are querying from, still resulting in an inconsistent
experience.
Alternatively, the multicluster DNS specification could have required
clusterset.local PTR records across the board, making the experience
consistent. This would require implementations to overwrite the cluster-local
behavior for MCS services since IPs can only have one PTR record. However, the
MCS API purposefully tries to avoid changing cluster-local behavior as much as
possible.
Fundamentally, PTR records are used for reverse DNS lookup from an IP to a DNS
name. Besides this, some potentially useful information (e.g. mapping pod IPs, if
you happen to have one out of context, to their related Service objects) would
be consistently surfaced through reverse DNS lookup if we required
clusterset.local PTR records. However, the k8s API server contains the same
metadata and is already potentially accessible to any MCS client since the
requests originate in-clusterset. Without a strong use case for requiring them
and given the desire to avoid changing cluster-local behavior, PTR records are
not required for multicluster DNS.
Not allowing cluster-specific targeting via DNS
While we reserve the form <clusterid>.<svc>.<ns>.svc.clusterset.local. for
possible future use, both ClusterSetIP Services and Multicluster Headless
Services are specified to explicitly disallow using this form to create DNS
records that target all 1+N backends in a specific cluster.
For ClusterSetIP services, this rationale is tied to the intent of its
underlying ClusterIP Service. In a single-cluster setup, the purpose of a
ClusterIP service is to reduce the context needed by the application to target
ready backends, especially if those backends disappear or change frequently, and
leverages kube-proxy to do this independent of the limitations of DNS.
(ref)
Similarly, users of exported ClusterIP services should depend on the single
<clusterset-ip> (or the single A/AAAA record mapped to it), instead of
targeting per cluster backends. If a user has a need to target backends in a
different way, they should use headless Services.
For Multicluster Headless Services, the rationale is tied to the intent of its underlying Headless Service to provide absolutely no load balancing capabilities on any stateful dimension of the backends (such as cluster locality), and provide routing to each single backend for the application’s purposes.
In both cases, this restriction seeks to preserve the MCS position on namespace sameness. Services of the same name/namespace exported in the multicluster environment are considered to be the same by definition, and thus their backends are safe to ‘merge’ at the clusterset level. If these backends need to be addressed differently based on other properties than name and namespace, they lose their fungible nature which the MCS API depends on. In these situations, those backends should instead be fronted by a Service with a different name and/or namespace.
For example, say an application wishes to target the backends for a
ClusterSetIP ServiceExport called special/prod in <clusterid>=cluster-east
separately from all backends in <clusterid>=cluster-west. Instead of depending
on the disallowed implementation of cluster-specific addressing, the Services in
each specific cluster should actually be considered non-fungible and be created
and exported by ServiceExports with different names that honor the boundaries
of their sameness, such as special-east/prod for all the backends in
<clusterid>=cluster-east and special-west/prod for the backends in
<clusterid>=cluster-west. In this situation, the resulting DNS names
special-east.prod.svc.clusterset.local and
special-west.prod.svc.clusterset.local encode the cluster-specific addressing
required by virtue of being two different ServiceExports.
Note that this puts the burden of enforcing the boundaries of a
ServiceExport’s fungibility on the name/namespace creator.
Individually addressing pods backing a Headless service is exempt from the rules
described in this section. Such a pod may be addressed using the
<hostname>.<clusterid>.<svc>.<ns>.svc.clusterset.local format, where clusterid
must uniquely identify a cluster within a clusterset. The implementation may use
cluster name as clusterid, and this is not ambiguous if all the clusters in
the clusterset are registered with the same cluster registry. In case a
clusterset contains clusters registered with multiple registries, cluster name
may be ambiguous. The implementation may in such case use clusterid composed
of cluster name and an additional DNS label, separated with a dot. The
additional label gives additional context, which is implementation-dependent and
may be used for instance to uniquely identify the cluster registry with which a
cluster is registered.
Tracking Endpoints
The specific mechanism by which the mcs-controller maintains references to the
individual backends for an aggregated service is an implementation detail not
fully prescribed by this specification. Implementations may depend on a higher
level (possibly vendor-specific) API, offload to a load balancer or xDS server
(like Envoy), or use Kubernetes networking APIs. If the implementation depends
on Kubernetes networking APIs, specifically EndpointSlice objects, they must
conform to the specification in the following section.
Using EndpointSlice objects to track endpoints
Optional to create, but specification defined if present.
If an implementation does create discovery.k8s.io/v1 EndpointSlices, they must
conform to the following structure. This structure was originally required as
part of this specification in alpha, and is the structure on which other
SIG-endorsed reference implementations and tooling, like the CoreDNS
multicluster plugin, depend.
When a ServiceExport is created, this will cause EndpointSlice objects for
the underlying Service to be created in each importing cluster within the
clusterset, associated with the derived ServiceImport. One or more
EndpointSlice resources will exist for the exported Service, with each
EndpointSlice containing only endpoints from a single source cluster. An
EndpointSlice created by an mcs-controller must be marked as managed by the
mcs-controller, not the default EndpointSlice controller, to avoid any conflicts
between the controllers.
When a service is un-exported, the associated EndpointSlices will be deleted. The specific mechanism by which they are deleted is an implementation detail.
Since a given ServiceImport may be backed by multiple EndpointSlices, a
given EndpointSlice will reference its ServiceImport using the label
multicluster.kubernetes.io/service-name similarly to how an EndpointSlice is
associated with its Service in a single cluster.
Each imported EndpointSlice will also have a
multicluster.kubernetes.io/source-cluster label with the cluster id, a
clusterset-scoped unique identifier for the cluster. The EndpointSlices
imported for a service are not guaranteed to exactly match the originally
exported EndpointSlices, but each slice is guaranteed to map only to a single
source cluster.
If the implementation is using EndpointSlices in this way, the mcs-controller
is responsible for managing the imported EndpointSlices and making sure they
are conformant with this section.
apiVersion: multicluster.k8s.io/v1alpha1
kind: ServiceImport
metadata:
  name: my-svc
  namespace: my-ns
spec:
  ips:
  - 42.42.42.42
  type: "ClusterSetIP"
  ports:
  - name: http
    protocol: TCP
    port: 80
  sessionAffinity: None
status:
  clusters:
  - cluster: us-west2-a-my-cluster
---
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: imported-my-svc-cluster-b-1
  namespace: my-ns
  labels:
    multicluster.kubernetes.io/source-cluster: us-west2-a-my-cluster
    multicluster.kubernetes.io/service-name: my-svc
  ownerReferences:
  - apiVersion: multicluster.k8s.io/v1alpha1
    controller: false
    kind: ServiceImport
    name: my-svc
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 80
endpoints:
  - addresses:
      - "10.1.2.3"
    conditions:
      ready: true
    topology:
      topology.kubernetes.io/zone: us-west2-a
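
The labelling rules in this section can be spot-checked mechanically. The sketch below is illustrative only: it operates on an EndpointSlice already decoded into a Python dict, and the function name is hypothetical. It assumes that "managed by" is expressed through the standard `endpointslice.kubernetes.io/managed-by` label, and it only reports a problem when that label explicitly names the default EndpointSlice controller.

```python
SERVICE_NAME_LABEL = "multicluster.kubernetes.io/service-name"
SOURCE_CLUSTER_LABEL = "multicluster.kubernetes.io/source-cluster"

# Assumption for this sketch: ownership is expressed with the standard
# managed-by label, and this is the value set by the default controller.
MANAGED_BY_LABEL = "endpointslice.kubernetes.io/managed-by"
DEFAULT_CONTROLLER = "endpointslice-controller.k8s.io"


def check_imported_slice(slice_obj: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    labels = slice_obj.get("metadata", {}).get("labels") or {}

    # The slice must reference its ServiceImport by name.
    if not labels.get(SERVICE_NAME_LABEL):
        problems.append(f"missing label {SERVICE_NAME_LABEL}")

    # The slice must name the single source cluster its endpoints come from.
    if not labels.get(SOURCE_CLUSTER_LABEL):
        problems.append(f"missing label {SOURCE_CLUSTER_LABEL}")

    # The slice must not be claimed by the default EndpointSlice controller.
    if labels.get(MANAGED_BY_LABEL) == DEFAULT_CONTROLLER:
        problems.append("slice is marked as managed by the default controller")

    return problems
```

A check like this cannot prove that every endpoint in a slice really originates from the labelled cluster; that guarantee remains the responsibility of the mcs-controller.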
<<[UNRESOLVED]>>
We have not yet sorted out scalability impact here. We hope the upper bound for
imported endpoints + in-cluster endpoints will be ~= the upper bound for
in-cluster endpoints today, but this remains to be determined.
<<[/UNRESOLVED]>>
Endpoint TTL
To prevent stale endpoints from persisting in the event that the mcs-controller is unable to reach a cluster, it is recommended that an implementation provide an in-cluster controller to monitor and remove stale endpoints. This may be the mcs-controller itself in distributed implementations.
We recommend creating leases to represent connectivity with source clusters.
These leases should be periodically renewed by the mcs-controller while the
connection with the source cluster is confirmed alive. When a lease expires, the
cluster id and multicluster.kubernetes.io/source-cluster label may be used
to find and remove all EndpointSlices containing endpoints from the
unreachable cluster.
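The lease-based cleanup recommended above could look roughly like the following sketch. It works on plain in-memory data rather than the Kubernetes API; the lease representation (renew time plus duration per cluster id) and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

SOURCE_CLUSTER_LABEL = "multicluster.kubernetes.io/source-cluster"


def expired_clusters(leases: dict, now: datetime) -> set:
    """Return the cluster ids whose lease has expired.

    leases maps cluster id -> (renew_time, lease_duration), a simplified
    stand-in for a real Lease object (assumption for this sketch).
    """
    return {
        cluster_id
        for cluster_id, (renew_time, duration) in leases.items()
        if renew_time + duration < now
    }


def stale_slices(slices: list, leases: dict, now: datetime) -> list:
    """Return names of EndpointSlices imported from clusters with expired leases.

    Slices without the source-cluster label (for example, slices for local
    services) and slices from clusters with no lease record are left alone.
    """
    expired = expired_clusters(leases, now)
    stale = []
    for s in slices:
        labels = s["metadata"].get("labels") or {}
        if labels.get(SOURCE_CLUSTER_LABEL) in expired:
            stale.append(s["metadata"]["name"])
    return stale
```

In a real controller the returned slices would be deleted, and the lease would be renewed by a `timedelta` shorter than its duration while the source cluster remains reachable.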
Constraints and Conflict Resolution
Exported services are derived from the properties of each component service and their respective endpoints. However, some properties combine across exports better than others.
Global Properties
These properties describe how the service should be consumed as a whole. They directly impact service consumption and must be consistent across all child services. If these properties are out of sync for a subset of exported services, there is no clear way to determine how a service should be accessed.
Conflict resolution policy: If any properties have conflicting values that
cannot simply be merged, a Conflict condition with a true status will be set
on all ServiceExports for the conflicted service with a description of the conflict.
The conflict will be resolved by assigning precedence based on each
ServiceExport’s creationTimestamp, from oldest to newest.
Note: When a ServiceExport’s conflict condition changes from False to True due to this resolution policy, runtime traffic remains unaffected. The oldest cluster will win the conflict and continue to be referenced in the ServiceImport, maintaining service continuity. Conversely, when the conflict condition transitions from True to False (for example, when the oldest cluster’s service is unexported), the ServiceImport may remain unchanged to avoid potentially disruptive changes to active traffic patterns.
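To make the precedence rule concrete, here is a minimal sketch of resolving a single global property across exports. The input shape and the function name are hypothetical. Breaking ties between equal timestamps by cluster name is an assumption of this sketch; this section does not define a tie-breaker.

```python
from datetime import datetime


def _parse_timestamp(ts: str) -> datetime:
    # creationTimestamp as serialized by Kubernetes, e.g. "2020-08-04T10:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def resolve_property(exports: list, prop: str):
    """Pick the value of prop from the oldest ServiceExport.

    exports is a list of dicts with "cluster", "creationTimestamp" and the
    property being resolved. Returns (winning_value, conflict), where
    conflict is True if any export disagrees with the winner; in that case
    a Conflict condition would be set on all ServiceExports for the service.
    """
    ordered = sorted(
        exports,
        # Tie-break on cluster name: an assumption of this sketch.
        key=lambda e: (_parse_timestamp(e["creationTimestamp"]), e["cluster"]),
    )
    winner = ordered[0][prop]
    conflict = any(e[prop] != winner for e in ordered[1:])
    return winner, conflict
```

With one export created on 2020-08-04 declaring `sessionAffinity: None` and another created on 2020-08-10 declaring `ClientIP`, the older export wins and the conflict flag is raised.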
Service Port
A derived service will be accessible with the clusterset IP at the ports
dictated by child services. If the external properties of service ports for a
set of exported services don’t match, the clusterset service will expose the
union of service ports declared on its constituent services and raise a PortConflict
conflict condition. In that case, network traffic to a conflicting port should
only be directed to endpoints from constituent services that actually expose the port.
Like regular services, the resulting ports must respect two rules:
- Have no duplicated names (including unnamed/empty name)
- Two ports must not have the same protocol and port number
As a result, MCS-API implementations should merge ports from constituent services first based on port name then by the protocol and port number pair. The conflict resolution policy will determine which of the duplicated ports are used by the ServiceImport.
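One possible reading of the merge rule above is sketched below: ports are visited from the oldest export to the newest, and a port is dropped if its name or its protocol and port number pair is already taken. The input shape and the function name are hypothetical, and this is an illustration rather than a reference algorithm.

```python
def merge_ports(exports: list):
    """Merge the ports of constituent services into one clusterset port list.

    exports is a list of (creation_timestamp, ports) tuples. Timestamps are
    strings in the uniform "YYYY-MM-DDTHH:MM:SSZ" form, which sort
    chronologically as plain strings. Each port is a dict with an optional
    "name", an optional "protocol" (defaulting to TCP) and a "port".
    Returns (merged_ports, port_conflict).
    """
    ordered = sorted(exports, key=lambda e: e[0])  # oldest export first
    merged, names, pairs = [], set(), set()
    for _, ports in ordered:
        for p in ports:
            name = p.get("name", "")
            pair = (p.get("protocol", "TCP"), p["port"])
            # Merge by name first, then by protocol/port pair; an older
            # export already holding either one takes precedence.
            if name in names or pair in pairs:
                continue
            names.add(name)
            pairs.add(pair)
            merged.append({"name": name, "protocol": pair[0], "port": pair[1]})

    # PortConflict is raised whenever the exports do not declare the same ports.
    declared = {
        frozenset((p.get("name", ""), p.get("protocol", "TCP"), p["port"]) for p in ports)
        for _, ports in ordered
    }
    return merged, len(declared) > 1
```

Traffic to a port present in only some constituent services would still need to be restricted to the endpoints that expose it, which this sketch does not model.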
Headlessness
Headlessness affects a service as a whole for a given consumer. Whether or not a derived service is headless will be decided according to the conflict resolution policy.
Session Affinity
Session affinity affects a service as a whole for a given consumer. The derived service’s session affinity will be decided according to the conflict resolution policy.
Internal Traffic Policy
Internal traffic policy affects a service as a whole for a given consumer. The derived service’s internal traffic policy will be decided according to the conflict resolution policy.
Traffic Distribution
Traffic distribution affects a service as a whole for a given consumer. The derived service’s traffic distribution will be decided according to the conflict resolution policy.
Labels and Annotations
If supported, exporting labels and annotations would affect a Service as a whole
for a given consumer. The derived service’s labels and annotations will be decided
according to the conflict resolution policy if the sets of name/value pairs are not
identical between the constituent clusters.
Test Plan
E2E tests can use kind to create multiple clusters to test various multi-cluster scenarios. To meet conditions required by MCS, cluster networks will be flattened by adding static routes between nodes in each cluster.
- Test cluster A can contact service imported from cluster B and route to expected endpoints.
- Test cluster A local service not impacted by same-name imported service.
- Test cluster A can contact service imported from cluster A and B and route to expected endpoints in both clusters.
Graduation Criteria
Alpha -> Beta Graduation
- A detailed DNS spec for multi-cluster services.
- NetworkPolicy either solved or explicitly ruled out.
- API group chosen and approved.
- E2E tests exist for MCS services.
- Beta -> GA Graduation criteria defined.
- At least one MCS DNS implementation.
- A formal plan for a standard Cluster ID.
- Finalize a name for the “supercluster” concept.
- Cluster ID KEP is in beta
Beta -> GA Graduation
- Scalability/performance testing, understanding impact on cluster-local service scalability.
- Cluster ID KEP is GA, with at least one other multi-cluster use case.
- A conformance report program for MCS-API has been created to document the conformance level of the various implementations.
Upgrade / Downgrade Strategy
Kube-proxy must be updated to a supported version before MCS services may be
used. To take advantage of MCS DNS, the DNS provider must be upgraded to a
version that implements the MCS spec. Kube-proxy MCS support will be guarded by
a MultiClusterServices feature gate. When enabled, kube-proxy will watch the
serviceimports.multicluster.k8s.io CRD. MCS support will be dynamically
enabled and disabled as the CRD is created and deleted.
Version Skew Strategy
Kube-proxy and DNS must be upgraded before new MCS API versions may be used. Backwards compatibility will be maintained in accordance with the deprecation policy.
Implementation History
- 2020-02-05 - Initial Proposal
- 2020-05-10 - Merged as provisional
- 2020-06-22 - Moved to implementable
- 2020-08-04 - ClusterSet name finalized
- 2020-08-10 - Alpha implementation available at sigs.k8s.io/mcs-api
Alternatives
ObjectReference in ServiceExport.Spec to directly map to a Service
Instead of name mapping, we could use an explicit ObjectReference in a
ServiceExport.Spec. This feels familiar and more explicit, but fundamentally
changes certain characteristics of the API. Name mapping means that the export
must be in the same namespace as the Service it exports, allowing existing
RBAC rules to restrict export rights to current namespace owners. We are
building on the concept that a namespace belongs to a single owner, and it
should be the Service owner who controls whether or not a given Service is
exported. Using ObjectReference instead would also open the possibility of
having multiple exports acting on a single service and would require more effort
to determine if a given service has been exported.
The above issues could also be solved via controller logic, but we would risk differing implementations. Name mapping enforces behavior at the API.
Export services via label selector
Instead of name mapping, ServiceExport could have a
ServiceExport.Spec.ServiceSelector to select matching services for export.
This approach would make it easy to simply export all services with a given
label applied and would still scope exports to a namespace, but shares other
issues with the ObjectReference approach above:
- Multiple ServiceExports may export a given Service, what would that mean?
- Determining whether or not a service is exported means searching ServiceExports for a matching selector.
Though multiple services may match a single export, the act of exporting would still be independent for individual services. A report of status for each export seems like it belongs on a service-specific resource.
With name mapping it should be relatively easy to build generic or custom logic
to automatically ensure a ServiceExport exists for each Service matching a
selector - perhaps by introducing something like a ServiceExportPolicy
resource (out of scope for this KEP). This would solve the above issues but
retain the flexibility of selectors.
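As a rough illustration of such generic logic, the sketch below computes which Services matching a selector still lack a same-named ServiceExport. It uses plain in-memory data and supports equality-based selectors only; the function names and input shapes are hypothetical, and no ServiceExportPolicy resource is defined by this KEP.

```python
def matches(selector: dict, labels: dict) -> bool:
    """Equality-based label matching: every selector pair must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


def missing_exports(services: list, exports: set, selector: dict) -> list:
    """Return (namespace, name) pairs that need a ServiceExport created.

    services is a list of dicts with "namespace", "name" and optional
    "labels". exports is the set of (namespace, name) pairs for which a
    ServiceExport already exists. Because of name mapping, the export to
    create has the same namespace and name as the Service.
    """
    wanted = [
        (s["namespace"], s["name"])
        for s in services
        if matches(selector, s.get("labels") or {})
    ]
    return sorted(key for key in wanted if key not in exports)
```

Each returned pair still maps to exactly one ServiceExport, so per-export status reporting is unchanged.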
Export via annotation
ServiceExport initially had no spec and seemed like it could just be
replaced with an annotation, e.g. multicluster.kubernetes.io/export. When a
service is found with the annotation, it would be considered marked for export
to the clusterset. The controller would then create EndpointSlices and a
ServiceImport in each cluster exactly as described above. Unfortunately,
Service does not have an extensible status and there is no way to represent
the state of the export on the annotated Service. We could extend
Service.Status to include Conditions and provide the flexibility we need,
but requiring changes to Service makes this a much more invasive proposal to
achieve the same result. As the use of a multi-cluster service implementation
would be an optional addon, it doesn’t warrant a change to such a fundamental
resource.
Other conflict resolution algorithms
When a service has a ServiceExport and a ServiceImport, we could have taken the approach of favoring a “local truth” by giving a higher precedence to the locally exported Service in the conflict resolution algorithm. This alternative approach was not adopted, as in this KEP we favored global consistency across the ClusterSet.
The conflict resolution algorithm could also have been based on majority instead of ServiceExport age. However, with this approach, we would have to consider a tie-breaking factor, which could itself have been based on age. This would complicate the implementation of MCS-API and, most importantly, might be more confusing for users. Having a single deciding factor based on ServiceExport age makes resolving conflicts straightforward, and a majority-based algorithm could hinder this ease of use.
Exporting labels/annotations from the Service/ServiceExport objects
Service and ServiceExport have labels and annotations which could be used during
export and propagated to the ServiceImport. However, various tools such as kubectl or
ArgoCD add some labels and annotations which would then need to be actively
filtered to avoid any conflict. Filtering those labels and annotations is not
easy, and we chose to avoid this problem entirely by not using the metadata
object and adding dedicated fields in the spec of the ServiceExport resource.
Also, if we were using the labels and annotations from the metadata of either the
ServiceExport or Service resources, it may be more confusing for users, as these
would be the only fields present in both resources. For instance, should an
implementation merge the labels/annotations from both objects? Should it favor one?
Should it take them only from the Service object? With dedicated fields for labels
and annotations in the spec of the ServiceExport resource, it becomes more
straightforward: each resource has its own labels and annotations in its
metadata, and the exported labels and annotations come from the dedicated
fields in the ServiceExport spec.
We also favored dedicated fields on the ServiceExport resource for greater
flexibility, as this allows exporting labels and annotations fully decoupled
from the Service and ServiceExport metadata. More flexibility could also be
achieved with CEL expressions on the ServiceExport, at the cost of greater
complexity (managing CEL expressions on potentially many ServiceExports across clusters).
Infrastructure Needed
To facilitate consumption by kube-proxy, the MCS CRDs need to live in kubernetes/staging. We will need a new k8s.io/multiclusterservices repo for published MCS code.