KEP-1900: CSI Snapshot Webhook

KEP-1900: Add additional validation to volume snapshot objects

Release Signoff Checklist
Summary
Motivation
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
(R) Graduation criteria is in place
(R) Production readiness review completed
Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Tighten validation on VolumeSnapshot and VolumeSnapshotContent by updating the CRD validation schema and providing a webhook server to enforce immutability.

This KEP lists the new validation rules, provides a release plan that preserves backwards compatibility, and outlines the deployment plan for the webhook server. The webhook server is deployed separately from the snapshot controller.

This tightening of validation is considered a change to the volume snapshot API. Clusters that do not install the webhook server and follow the phased release process may run into problems when upgrading from the v1beta1 to the v1 VolumeSnapshot API if persisted objects fail the new, stricter validation. Potential impacts include being unable to delete invalid snapshot objects. As a workaround, it should be possible to downgrade the CRD definition.

Motivation

The VolumeSnapshot feature has been in beta since Kubernetes 1.17. The community has identified a gap: custom resources (CRs) such as VolumeSnapshot and VolumeSnapshotContent are not sufficiently validated when they are created (see the linked issue). This gap must be resolved before the feature can graduate to GA.

Background on Admission webhooks

Admission webhooks are HTTP callbacks that intercept requests to the API server. They come in two kinds: validating webhooks and mutating webhooks (details). Admission webhooks have been in beta since Kubernetes v1.9 and GA since v1.16. The following prerequisites must be met to use this feature:

Kubernetes v1.9+ to use admissionregistration.k8s.io/v1beta1, or v1.16+ to use admissionregistration.k8s.io/v1. (Note that volume snapshots moved to beta in v1.17.)
The corresponding admission controllers (MutatingAdmissionWebhook, ValidatingAdmissionWebhook) are enabled. (In v1.18+, both are enabled by default, with mutating webhooks running before validating webhooks.)
The admissionregistration.k8s.io/v1beta1 or admissionregistration.k8s.io/v1 API is enabled. (Prefer v1 over v1beta1.)

Admission controllers have been in common use in Kubernetes for a long time. Admission webhooks are the newer, preferred way to control admission, especially for external (out-of-tree) components like the CSI external snapshotter.

A webhook server receives AdmissionReview requests from the API server and responds with an object of the same type that either admits or denies the request. The following simplified diagram shows the workflow. (Note that mutating webhooks are invoked before validating webhooks.)

Webhook workflow diagram

The webhook server exposes an HTTP endpoint to which the API server sends AdmissionReview requests. Webhook server providers can dynamically configure which resources and which types of admission webhooks apply by creating ValidatingWebhookConfiguration and/or MutatingWebhookConfiguration objects.

CRD validation is preferred over webhook validation because of its lower complexity; however, the CRD validation schema cannot enforce immutability or provide ratcheting validation.

Goals

Provide an updated CRD schema to validate fields
Prevent:
- Invalid VolumeSnapshot/VolumeSnapshotContent from creation and update
- Invalid updates on immutable fields, i.e., VolumeSnapshot.Spec.Source
Provide a pre-built image which can be used to deploy the webhook server
Provide a way to deploy the webhook server in cluster
Provide a way to authenticate the webhook server to the API server via TLS
Provide a release process to safely tighten the validation and move towards the ideal state of using builtin CRD validation while maintaining backwards compatibility

Non-Goals

Provide a way to authenticate the API server to the webhook server

Proposal

Tighten the validation on Volume Snapshot objects. Please see the tables below for detailed information.

Due to backwards compatibility concerns, the tightening will occur in three phases.

The first phase is webhook-only and will use ratcheting validation. It will be the user’s responsibility to clean up invalid objects that already existed before the webhook was enabled. Invalid objects are those which fail the new, stricter validation. The controller will not be able to fix invalid objects automatically; however, it will apply a label to invalid objects so that users can easily locate them.
The second phase can occur once all invalid objects are cleared from the cluster. It will be the cluster admin’s responsibility to check and detect when it is safe to move to the second phase. The CRD schema validation will be tightened, and the webhook will remain in place to enforce immutability until immutable fields come to CRDs (Custom Resource Definitions). This will be accompanied by a version change (from v1beta1 to v1) to make it clear that the CRD is using different validation. However, the storage version will be kept as v1beta1 to ensure that a rollback is possible in phase 2.
In the third phase, the storage version of the CRD will be changed from v1beta1 to v1.

The phases ship in separate releases to give users and cluster admins the opportunity to clean their clusters of any invalid objects. More details are in the Risks and Mitigations section.

The webhook server will validate Volume Snapshot objects when CREATE and UPDATE requests for VolumeSnapshot and VolumeSnapshotContent objects are made to the API server. Only validating webhooks, which are read-only, will be used. An image will be built, and example Deployment and Service YAML files will be provided. Example ValidatingWebhookConfiguration files will also be provided for registering the webhooks with the API server.

The webhook will be developed inside the external-snapshotter repository.

Validating Scenarios

The following is a list of fields which will get checked when a CREATE or UPDATE operation is sent to the API server. Some validation is already enforced by the CRD schema definition, for example some required fields and enums.

All of the desired field validation can be achieved by updating the CRDs to take advantage of OpenAPI v3 schema validation. In particular, the oneOf and minLength keywords can be used.
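As a rough illustration of how oneOf and minLength could express these rules, the fragment below sketches part of a schema for the VolumeSnapshot spec. It is an assumption about how such a schema might be written, not the published CRD; the property names are the JSON forms of the fields discussed in this section.

```yaml
# Hypothetical fragment of a VolumeSnapshot CRD schema (illustrative only).
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      required: ["source"]
      properties:
        source:
          type: object
          properties:
            persistentVolumeClaimName:
              type: string
            volumeSnapshotContentName:
              type: string
          # Exactly one of the two source fields must be present.
          oneOf:
          - required: ["persistentVolumeClaimName"]
          - required: ["volumeSnapshotContentName"]
        volumeSnapshotClassName:
          type: string
          # May be omitted, but must not be the empty string when set.
          minLength: 1
```

A schema like this can reject malformed objects at admission time, but it has no way to compare the old and new object, which is why immutability still needs the webhook.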

There is a desire for some fields to be immutable, which is not yet supported by CRDs. See the immutable fields KEP for the latest updates. As of August 2020, the KEP is provisional and has no clear timeline for when immutable fields will come to CRDs.

VolumeSnapshot

Operation	Field	Reason	HTTP response code
CREATE	spec.Source	Exactly one of PersistentVolumeClaimName (dynamic) or VolumeSnapshotContentName (pre-provisioned) should be specified.	400
UPDATE	spec.Source	Immutable; no updates allowed. If the user has specified an incorrect source, they must delete and recreate the snapshot. The webhook validation server cannot guarantee that only incorrect sources would be updated.	400
CREATE	spec.VolumeSnapshotClassName	Must not be the empty string. It can be unset, in which case the default snapshot class is used, or set to a non-empty string naming the snapshot class. If it is unset and there is no default snapshot class, or more than one, the webhook will allow the creation but the snapshot will fail.	400
UPDATE	spec.VolumeSnapshotClassName	Same restrictions as CREATE. This field is not made immutable (only the creation restrictions apply), but it should only be changed by those who know exactly what they are doing.	400
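To make the CREATE rule on spec.Source concrete, here is a sketch of a manifest that the webhook would be expected to reject. All object names are hypothetical placeholders.

```yaml
# Hypothetical manifest that should be rejected on CREATE:
# both source fields are set, so the "exactly one of" rule is violated.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: invalid-snapshot-example                    # illustrative name
spec:
  volumeSnapshotClassName: example-snapclass        # illustrative; must be non-empty if set
  source:
    persistentVolumeClaimName: example-pvc          # dynamic source
    volumeSnapshotContentName: example-snapcontent  # pre-provisioned source
```

Removing either one of the two source fields would make the object pass this particular check.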

VolumeSnapshotContent

Operation	Field	Reason	HTTP response code
CREATE	spec.Source	Exactly one of VolumeHandle (dynamic snapshot created by the controller) or SnapshotHandle (pre-provisioned snapshot created by the cluster admin) should be specified.	400
UPDATE	spec.Source	Immutable; no updates allowed.	400
CREATE	spec.VolumeSnapshotRef	Must have both the name and namespace fields set. Pre-provisioned: this is the reference to the yet-to-be-created VolumeSnapshot object which should bind to this VolumeSnapshotContent. https://github.com/kubernetes-csi/external-snapshotter/blob/097b1fc7d7cd6576182ca34512c14de1c84b2127/pkg/apis/volumesnapshot/v1beta1/types.go#L270 . Dynamic: this is the reference to the VolumeSnapshot object which triggered the creation of this VolumeSnapshotContent. The reference also has a UID field, but this is set by the controller.	400
UPDATE	spec.VolumeSnapshotRef	Immutable once its UID has been set; no updates allowed after that.	400
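For comparison, the following sketch shows a pre-provisioned VolumeSnapshotContent that would satisfy the rules in the table. The driver name, handle, and object names are placeholders chosen for illustration.

```yaml
# Hypothetical pre-provisioned VolumeSnapshotContent that passes the checks above.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  name: example-preprovisioned-content
spec:
  deletionPolicy: Retain
  driver: example.csi.driver.io          # placeholder driver name
  source:
    snapshotHandle: example-snapshot-id  # exactly one of snapshotHandle or volumeHandle
  volumeSnapshotRef:
    name: example-snapshot               # both name and namespace must be set
    namespace: example-namespace
```

The UID of volumeSnapshotRef is intentionally left out, since the table notes that it is set by the controller.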

Authentication

Authentication runs in two directions: authenticating the identity of the webhook server, and authenticating the identity of the Kubernetes API server.

The API server authenticates the webhook server through TLS certificates and HTTPS. This is required, and an example method of deploying the webhook server with HTTPS will be provided.

Authentication of incoming requests to the webhook server is configurable but out of scope for this document. In general, it is the user’s responsibility to configure the webhook service and the API server if authentication is required (details). The webhook server implementation, however, should allow users to configure whether authentication is required. If no authentication configuration is specified, the webhook server should default to “NoClientCert”, which means the identity of clients is not authenticated.

Timeout

Webhooks add latency to each API server call matched by the ValidatingWebhookConfiguration or MutatingWebhookConfiguration; for this KEP, that should only be CREATE and UPDATE requests on snapshot resources. Setting a reasonable timeout for each AdmissionReview request is therefore critical. The default timeout is 10 seconds if not specified. When an AdmissionReview request sent to the webhook server times out, the failurePolicy is triggered (it defaults to Fail, which rejects the request).

In the ValidatingWebhookConfiguration YAML example, a default timeout of two seconds is provided. Cluster admins who wish to change the timeout may change the value of timeoutSeconds.

To avoid migration pain, it is recommended to start with a failurePolicy of Ignore and change it to Fail only after the webhook is confirmed to have been installed successfully. Choosing Ignore means that invalid objects can still be created or updated whenever the webhook is unavailable or times out.
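As a small sketch of the end state after that switch, the fragment below shows only the two fields discussed here; the remaining fields of the webhook entry are omitted, and the webhook name is taken from the full example in the Design Details section.

```yaml
# Fragment of a ValidatingWebhookConfiguration; other required fields omitted.
webhooks:
- name: "snapshot.webhook-validation.storage.sigs.k8s.io"
  failurePolicy: Fail   # switched from Ignore once the webhook server is confirmed healthy
  timeoutSeconds: 2     # the API server waits at most this long before applying failurePolicy
```

With Fail, a webhook outage blocks CREATE and UPDATE requests on snapshot resources, so the timeout and the availability of the webhook server need to be weighed together.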

Idempotency/Deadlock

Since only validating webhooks will be introduced in this version, idempotency/deadlock are not relevant.

Automatic Labelling of Invalid Objects

The controller will apply the label snapshot.storage.kubernetes.io/invalid-snapshot-resource to VolumeSnapshot objects and snapshot.storage.kubernetes.io/invalid-snapshot-content-resource to VolumeSnapshotContent objects which fail strict validation. The label is present on invalid objects and absent from valid ones. The value of the label does not matter and is set to the empty string by default. The controller will use the same validation logic as the webhook.

For example here’s the yaml for an invalid VolumeSnapshot:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: snapshot-label-example
  labels:
    snapshot.storage.kubernetes.io/invalid-snapshot-resource: "" # Label applied for invalid VolumeSnapshot objects
...

Here’s an example for the yaml for an invalid VolumeSnapshotContent:

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-72d9a349-aacd-42d2-a240-d775650d2455
  labels:
    snapshot.storage.kubernetes.io/invalid-snapshot-content-resource: "" # Label applied for invalid VolumeSnapshotContent objects
...

Users and cluster admins MUST ensure there are NO objects with the labels applied before upgrading to phase 2. The labels are added by the controller, and there may be a delay after deployment. It is recommended to wait 48 hours after installing the webhook and new controller, as the controller does a full resync of each snapshot resource every 24 hours.

User Stories (Optional)

Story 1

A cluster admin can deploy the webhook server. Users can then create and update snapshot objects with confidence that invalid updates will be rejected.

The following are some typical scenarios we aim to prevent:

Creation of invalid CRs
- Reject if a VolumeSnapshot CR does not have a valid VolumeSnapshotSource, e.g., it is missing both PersistentVolumeClaimName and VolumeSnapshotContentName.
- Reject if a VolumeSnapshotContent CR does not have a valid VolumeSnapshotContentSource, e.g., both VolumeHandle and SnapshotHandle have been specified.
Updating immutable fields
- Reject updates to VolumeSnapshot’s VolumeSnapshotSource
- Reject updates to VolumeSnapshotContent’s VolumeSnapshotContentSource
- Reject updates to VolumeSnapshotContent’s volume snapshot ref after binding

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Backwards compatibility

Tightening the validation on snapshot objects raises a backwards compatibility issue. Since the feature is already in beta, we are committed to stronger backwards compatibility guarantees than in alpha.

Backwards compatibility violations which are okay:

create: users can no longer create objects which fail strict validation
update: users can no longer update objects to fail strict validation

Backwards compatibility violations which are not okay:

delete: users can no longer delete objects which fail strict validation

If the validation tightening prevented a no-op update, it would also prevent deletion of that object, because deletion requires removing finalizers, which is done via an update. Therefore we must support no-op updates, where previously invalid objects can be updated without changes. Making a previously optional field required in the schema blocks updates of previously persisted data that omitted the field (unless the update populates the newly required field, or a schema default is specified).

To tackle the backwards compatibility problem, this KEP proposes the following release process.

Begin with enforcement by the validating webhook only. The webhook will perform the following validation:

One release with ratcheting validation using the webhook server:
- the webhook is strict on create
- the webhook is strict on updates where the existing object passes strict validation
- the webhook is relaxed on updates where the existing object fails strict validation (allows finalizer removal, status update, deletion, etc.)
The user will need to delete or fix all invalid objects. The webhook and controllers will not take any automatic action to reconcile invalid objects. However, the controller will add a label.

For UPDATE operations, the webhook server receives both the existing object and the new, proposed object. We will use this to check whether the existing object passes or fails strict validation.

Once we are sure no invalid data is persisted, we can switch to CRD schema-enforced validation with validating webhooks for immutability in a subsequent release.

Rollback

If users do not completely remove their invalid objects before upgrading their CRD definition, it should be possible to downgrade the CRD definition to allow invalid objects to get deleted.

The rollback procedure would look like this:

The new v1 API with v1beta1 storage is causing problems and the user wants to roll back.
The control plane (including the CRDs and the snapshot controller) is rolled back and the v1 version is dropped. Objects still persist, since they are stored as v1beta1.
The user fixes their problems with invalid objects.
The user upgrades the control plane again.
In an n+2 release, once all the invalid objects are purged, we can switch the storage version to v1.

In phase 2, the storage version will be kept at v1beta1 in order to ensure the rollback is possible.

In phase 3, the storage version will be changed to v1.

v1 (served=true, storage=false)
v1beta1 (served=false, storage=true)

Current Controller validation of OneOf semantic

Handling VolumeSnapshot

See code here.

If the object violates oneOf semantic: Update the VolumeSnapshot status to “SnapshotValidationError” and issue an event.

Note:

If the VolumeSnapshot object has been updated AFTER binding to a VSC, binding from VolumeSnapshot->VSC will be lost.
Deletion of an invalid resource is not blocked by that check, as the deletion workflow happens before validation (code). This ensures that a user can delete an invalid VolumeSnapshot resource.

Handling VolumeSnapshotContent

See code here. If the object violates oneOf semantic: Update the VSC status to “ContentValidationError” and issue an event.

Design Details

Deployment

There are two main steps to set up validation for the snapshot objects. The Kubernetes API server must be configured to connect to the webhook server, and the webhook server must be deployed and reachable. Make sure to review the prerequisites before deploying.

A sample script will be provided to handle the deployment of TLS certificates. It is not considered production-ready, and users are encouraged to use their own certificate management process. The demo will create the certificates as a Secret in the cluster and mount them as a volume. The ValidatingWebhookConfiguration will need to be updated with the CA bundle (the caBundle field).
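As a sketch of what such a Secret might look like, the manifest below uses the secret name and the file names (cert.pem, key.pem) that the example Deployment later in this section expects. The Secret type and the placeholder values are assumptions; the sample script may produce something different.

```yaml
# Hypothetical Secret holding the webhook server's TLS material.
apiVersion: v1
kind: Secret
metadata:
  name: snapshot-validation-secret   # matches secretName in the example Deployment
  namespace: default                 # use the namespace the webhook server runs in
type: Opaque
data:
  cert.pem: <base64-encoded serving certificate>  # placeholder, not a real value
  key.pem: <base64-encoded private key>           # placeholder, not a real value
```

Each key under data becomes a file in the mounted volume, which is how the --tls-cert-file and --tls-private-key-file arguments of the example Deployment would find the certificate and key.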

Kubernetes API Server Configuration

The API server must be configured to connect to the webhook server for certain API requests. This is done by creating a ValidatingWebhookConfiguration object. For a more thorough explanation of each field refer to the documentation. An example yaml file is provided below. The value of timeoutSeconds will affect the latency of snapshot creation, and must be considered carefully as it might affect the time the application is frozen for.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: "webhook-validation.storage.sigs.k8s.io"
webhooks:
- name: "snapshot.webhook-validation.storage.sigs.k8s.io"
  rules:
  - apiGroups:   ["snapshot.storage.k8s.io"]
    apiVersions: ["v1beta1"]
    operations:  ["CREATE", "UPDATE"]
    resources:   ["volumesnapshots", "volumesnapshotcontents"]
    scope:       "*"
  clientConfig:
    service:
      namespace: "default"
      name: "snapshot-validation-service"
      path: "/volumesnapshots"
    caBundle: "LS0tLS...base64 encoded of public key...LS0K"
  admissionReviewVersions: ["v1", "v1beta1"]
  sideEffects: None
  failurePolicy: Ignore # We recommend switching to Fail only after successful installation of the server and webhook.
  timeoutSeconds: 2 # This will affect the latency and performance. Finetune this value based on your application's tolerance.

Webhook Server Deployment

The recommended deployment mode for the webhook server is on the same node as the API server, to minimize network latency. For high availability, we recommend using a Deployment and Service to deploy the validation server. Some example YAML files are provided and should be changed to suit the cluster admin’s needs.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: snapshot-validation-deployment
  labels:
    app: snapshot-validation
spec:
  replicas: 3
  selector:
    matchLabels:
      app: snapshot-validation
  template:
    metadata:
      labels:
        app: snapshot-validation
    spec:
      containers:
      - name: snapshot-validation
        image: image:xxx # change the image to released image or if you wish to use your own custom validation server image
        args: ['webhook', '--tls-cert-file=/etc/webhook/certs/cert.pem', '--tls-private-key-file=/etc/webhook/certs/key.pem'] # Change args as needed
        ports:
        - containerPort: 443 # change the port as needed
        volumeMounts:
          - name: webhook-certs
            mountPath: /etc/webhook/certs
            readOnly: true
      volumes:
        - name: webhook-certs
          secret:
            secretName: snapshot-validation-secret

apiVersion: v1
kind: Service
metadata:
  name: snapshot-validation-service
  namespace: default # Don't use the default namespace. Choose an appropriate one.
spec:
  selector:
    app: snapshot-validation
  ports:
    - protocol: TCP
      port: 443 # Change if needed
      targetPort: 443 # Change if the webserver image expects a different port

Test Plan

Unit tests for the webhook server will live in the same repository to ensure that the correct policy is enforced.

Since the webhook is developed in the external-snapshotter repository and does not test any CSI driver, its e2e tests would not be a good fit for the Kubernetes core repository. Hence the plan for e2e tests is to add a new test job in the external-snapshotter repo that brings up a kind cluster, installs the CRDs and the webhook, and then runs validation tests.

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: VolumeSnapshotDataSource (overall feature gate)
  - Components depending on the feature gate:
- Other
  - Describe the mechanism: Create or delete the ValidatingWebhookConfiguration object. Once we reach phase two of the release, with validation via CRDs, the feature cannot be disabled.
  - Will enabling / disabling the feature require downtime of the control plane? No (Phase 1)
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled). No
Does enabling the feature change any default behavior? Currently some validation is not fully enforced. This will tighten the validation to be in line with what is intended.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? In phase one, the feature can be disabled by removing the webhook. However, once the CRDs have been updated, users will not easily be able to disable it.
What happens if we reenable the feature if it was previously rolled back? Nothing special.
Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can’t do that now.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
What are the reasonable SLOs (Service Level Objectives) for the above SLIs? At a high level, this usually will be in the form of “high percentile of SLI per day <= X”. It’s impossible to provide comprehensive guidance, but at the very high level (needs more precise definitions) those may be things like:
- per-day percentage of API calls finishing with 5XX errors <= 1%
- 99% percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons why they weren’t added (e.g., cost, implementation difficulties, etc.).

Dependencies

This section must be completed when targeting beta graduation to a release.

Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.
For each of these, fill in the following—thinking about running existing user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]
  - Usage description:
    - Impact of its outage on the feature:
    - Impact of its degraded performance or high-error rates on the feature:

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

Will enabling / using this feature result in any new API calls? Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller) focusing mostly on:
- components listing and/or watching resources they didn’t before
- API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)
Will enabling / using this feature result in introducing new API types? Describe them, providing:
- API type
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them, providing:
- API type(s):
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes? For each of them, fill in the following information by copying the below template:
- [Failure mode brief description]
  - Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
  - Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
  - Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
  - Testing: Are there any tests for failure mode? If not, describe why.
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Alternatives

A SIG Storage-wide webhook design. This was not accepted because the scope would be too big.

Wait until immutable fields for CRDs are implemented. The immutable fields KEP tracks this feature.
