KEP-3926: Handling undecryptable resources

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Encryption at rest for API resources has been a stable part of Kubernetes for a long time. Every now and then, whether through improper handling or external system failures, cluster encryption has ended up in a broken state.

If a single object of a resource type cannot be decrypted, listing resources of that type in a path prefix containing the object always fails, even if the rest of the resource instances are accessible.

Currently, removing a resource that causes such failures is not possible. A cluster administrator must access etcd directly and remove the malformed data manually.

This KEP proposes a way to identify resources that fail to decrypt or fail to be decoded into an object, and introduces a new delete option that ignores storage checks when such a read error occurs. The goal is to make it possible to delete such a failing resource using only the Kubernetes API.

Motivation

Goals

provide a way to identify persisted resources that failed decryption or that cannot be decoded
provide an option to delete a resource independently of its contents, if those cannot be reached due to data transformation or data corruption

Non-Goals

implementing a system for ignoring different types of storage errors
giving clients control over skipping any part of the delete request flow other than decoding errors

Proposal

Improve resource retrieval errors to include more information about the object that failed transformation while it was being retrieved from the storage.

Introduce a new DeleteOption that would allow deleting a resource even if we cannot retrieve its data.

User Stories (Optional)

Story 1

I accidentally removed my encryption key but only a few resources were encrypted with it. I know that these will either be recreated by a controller, or I can manually recreate them. I would like a simple way to figure out which resources fail decryption and I would like a way to remove them via Kubernetes API.

Story 2

I would like to remove a namespace I no longer need. However, some of the resources inside of the namespace were encrypted before the encryption at rest configuration broke, which blocks a successful namespace delete.

Notes/Constraints/Caveats (Optional)

An unconditional delete of a malformed resource may break garbage collection, would ignore finalizers and would disregard any underlying system processes that might be tied to the given resource (e.g. pods).

Risks and Mitigations

We need to make sure that a user that is trying to perform an unconditional delete of a malformed resource is well informed about the impact of what they are doing. This should be handled by one or more prompts from the kubectl client when the DeleteOption from this enhancement is set.

Gate the deletion with an additional admission layer on the server.

Design Details

Background

The encryption/decryption for encryption at rest is implemented via transformers that get applied to a resource in code that handles resource read/write from etcd3 databases.

The storage handling does not change with KMSv2; a resource transformer is provided in that case, too.

References:

The code example 2. above shows that currently, when reading a resource fails, we lose all the context about the resource and a non-wrapping, generic internal error is returned.

Proposed Solution

New Error Status for Read Failures

The current API errors don’t appear to include an error status specific to storage. Therefore a new status should be introduced - StatusReasonStoreReadError.

  // StatusReasonStoreReadError means that the server encountered an error while
  // retrieving resources from the backend object store.
  // This may be due to backend database error, or because processing of the read
  // resource failed.
  // Details:
  //   "kind" string - the kind attribute of the resource being acted on.
  //   "name" string - the prefix where the reading error(s) occurred
  //   "causes" []StatusCause
  //      - (optional):
  //        - "type" CauseType - CauseTypeUnexpectedServerResponse
  //        - "message" string - the error message from the store backend
  //        - "field" string - the full path with the key of the resource that failed reading
  //
  // Status code 500
  StatusReasonStoreReadError StatusReason = "StorageReadError"

This error will also include full paths to the resources that cannot be read in an unstructured, human-readable message.

In cases where the number of malformed resources would be too great (> 100), only the first 100 will be shown in the causes slice. The 101st element of the slice takes the following form:

StatusCause{
  Type:    CauseTypeTooMany,
  Message: "too many errors, the list is truncated",
}

New Delete Option for Corrupt Objects

Deleting a resource is a rather complicated process:

1. a resource might represent an actual process running on a host (Pod)
2. there might be other resources with owner references to the resource that’s being deleted
3. a resource might contain finalizers that safeguard the deletion of the given resource before other dependent resources are deleted (typically - namespaces and the kubernetes finalizer)

An unconditional deletion should make a best effort on all of the above, but in the case of an undecryptable resource, all of the above would be ignored.

For case 1, ignoring an underlying process may not be an issue, as the kubelet is supposed to take care of unused containers.

In case 2, there might be issues with setting related objects as orphans, which could potentially cause an unwanted cascade deletion of objects.

Case 3 has the potential to become rather serious. Finalizers are typically set to safeguard other objects, so if, for example, an aggregated API server is removed, its API objects might be left scattered around the etcd database without an API to remove them.

To allow unconditional deletion, a new DeleteOption should be introduced - IgnoreStoreReadErrorWithClusterBreakingPotential

type DeleteOptions struct {
  ...
  // IgnoreStoreReadErrorWithClusterBreakingPotential will try to perform the normal
  // deletion flow but if the data of the resource being deleted cannot be read from
  // the store, either because it failed to be decrypted or the data is
  // otherwise corrupted and cannot be decoded, it will disregard these errors
  // and still perform the deletion.
  // WARNING: This will break the cluster if the resource has dependencies beyond
  //          the caller's comprehension. Use only if you REALLY know what you are
  //          doing.
  // WARNING: Vendors will most likely consider using this option to be breaking the
  //          support of their product.
  IgnoreStoreReadErrorWithClusterBreakingPotential bool
}
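To show that the option is strictly opt-in, the sketch below serializes a trimmed, local mirror of the struct into a DELETE request body. The JSON field name and the `omitempty` behavior are assumptions made for this illustration, and `deleteBody` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DeleteOptions is a trimmed, local mirror of the struct above.
// The JSON tag is an assumption made for this sketch.
type DeleteOptions struct {
	IgnoreStoreReadErrorWithClusterBreakingPotential bool `json:"ignoreStoreReadErrorWithClusterBreakingPotential,omitempty"`
}

// deleteBody renders the options as they might appear in a request body.
func deleteBody(ignoreReadErrors bool) (string, error) {
	b, err := json.Marshal(DeleteOptions{
		IgnoreStoreReadErrorWithClusterBreakingPotential: ignoreReadErrors,
	})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	normal, _ := deleteBody(false)
	unsafeBody, _ := deleteBody(true)
	fmt.Println(normal)     // prints {}
	fmt.Println(unsafeBody) // prints {"ignoreStoreReadErrorWithClusterBreakingPotential":true}
}
```

Under these assumptions, an ordinary delete carries no trace of the option, so existing clients are unaffected unless they set it explicitly.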

Admission Control for Unconditional Deletion

A “delete” verb on a resource is not usually considered a privileged action. As the previous section explains, deletion of a resource might carry unexpected consequences. Unconditional deletions should therefore have their own extra admission.

The unconditional deletion admission:

checks if a “delete” request contains the IgnoreStoreReadErrorWithClusterBreakingPotential option
if it does, it checks the RBAC of the request’s user for the unsafe-delete-ignore-read-errors verb of the given resource

Implementation Considerations

Watch Event Propagation and Client Recovery

When a corrupt object is deleted from etcd, the kube-apiserver’s watch cache cannot transform or decode the object’s previous value. This triggers a deliberate recovery sequence:

Error Detection: The etcd3 watcher fails to transform/decode the deleted object’s data and generates a watch.Error event with StatusReasonStoreReadError.
Cacher Reset: The Cacher’s internal Reflector receives this error, causing ListAndWatch() to stop. After a brief delay, the Cacher reinitializes by calling terminateAllWatchers() followed by a fresh LIST from etcd.
Client Disconnection: All active watch connections for that resource type are terminated. Clients see their watch channels close without receiving the original error event.

sequenceDiagram
    participant etcd
    participant Watcher as etcd3/watcher
    participant Cacher
    participant CacheWatcher as cacheWatcher
    participant HTTP as HTTP Handler

    etcd->>Watcher: DELETE event (corrupt prevValue)
    Watcher->>Watcher: transform() fails on prevValue
    Watcher->>Cacher: watch.Error (StoreReadError)
    Note over Cacher: Reflector returns, waits 1s
    Cacher->>CacheWatcher: terminateAllWatchers()
    CacheWatcher->>HTTP: close(result)
    HTTP-->>HTTP: return (connection closes)
    Cacher->>etcd: LIST + WATCH
    Note over Cacher: Cache rebuilt, new RV window

Client Recovery: Disconnected clients attempt to resume watching from their last known resourceVersion. The server rejects this with a “too old resource version” error, forcing clients to perform a fresh LIST and rebuild their local caches.
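The client side of this recovery can be sketched as a single resume attempt that falls back to a fresh LIST. The `Source` interface, `fakeServer`, and `syncOnce` are hypothetical; real clients use informers, and the fake server below simplifies by rejecting any resourceVersion other than its current one.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooOld models the server's "too old resource version" rejection.
var errTooOld = errors.New("too old resource version")

// Source is a hypothetical abstraction over LIST and WATCH.
type Source interface {
	List() (items []string, resourceVersion string, err error)
	Watch(resourceVersion string) error
}

// fakeServer stands in for an apiserver whose cache was just rebuilt.
// Simplification: only the current resourceVersion is accepted.
type fakeServer struct {
	currentRV string
	items     []string
}

func (s *fakeServer) List() ([]string, string, error) {
	return s.items, s.currentRV, nil
}

func (s *fakeServer) Watch(rv string) error {
	if rv != s.currentRV {
		return errTooOld
	}
	return nil
}

// syncOnce tries to resume the watch from lastRV. If the server says the
// version is too old, it performs a fresh LIST and returns the new version.
func syncOnce(src Source, lastRV string) (rv string, relisted bool, err error) {
	err = src.Watch(lastRV)
	if errors.Is(err, errTooOld) {
		_, rv, err = src.List()
		if err != nil {
			return "", false, err
		}
		return rv, true, nil
	}
	return lastRV, false, err
}

func main() {
	srv := &fakeServer{currentRV: "200", items: []string{"a", "b"}}

	rv, relisted, err := syncOnce(srv, "100") // stale version from before the reset
	fmt.Println(rv, relisted, err)

	rv, relisted, err = syncOnce(srv, rv) // resumes cleanly afterwards
	fmt.Println(rv, relisted, err)
}
```

Every watching client goes through the re-list branch once, which is the source of the temporary load increase discussed later in this document.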

Design Principles

The following principles, agreed upon by SIG API Machinery, guide this enhancement:

Watch history cannot be preserved when a corrupt object exists. Since the object’s data cannot be decrypted or decoded, we have no access to the correct previous object state required for a semantically valid DELETE event.
Performance degradation is acceptable during the remediation window. The temporary increase in API server load from client re-lists is an accepted tradeoff for restoring cluster health.
Enable admin remediation: The admin must be able to identify corrupt objects and delete them, even if one by one. Once all corrupt objects are removed, the kube-apiserver and client informers recover automatically.

This approach favors eventual consistency and cluster recovery over preserving individual watch streams during an inherently abnormal situation.

Alternative Approaches Considered

We considered using shallow object representations to enhance error or delete events, enabling targeted removal of the corrupt object from client caches without triggering a full re-list:

DeletedFinalStateUnknown: A client-go type used when the final state of a deleted object is unknown. This approach failed because DeletedFinalStateUnknown does not implement runtime.Object, which is required by the watch cache.
PartialObjectMetadata: A Kubernetes type containing only object metadata. This failed because the watch cache’s getAttrsFunc performs type assertions to the specific resource type (e.g., *api.Secret), which PartialObjectMetadata cannot satisfy.
Type Identity Object: Creating an empty object of the correct type via newFunc() and copying only essential metadata (namespace, name, resourceVersion, UID). While technically feasible, the added complexity was not justified given the design principles outlined above.
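For reference, the "Type Identity Object" idea can be sketched in a few lines. All types here are hypothetical stand-ins: `Object` replaces runtime.Object, `Secret` replaces the concrete resource type, and `shallowObject` is an illustrative helper, not code from the watch cache.

```go
package main

import "fmt"

// ObjectMeta holds only the essential metadata named above.
type ObjectMeta struct {
	Namespace       string
	Name            string
	ResourceVersion string
	UID             string
}

// Object is a hypothetical stand-in for runtime.Object.
type Object interface {
	GetMeta() *ObjectMeta
}

// Secret is a stand-in for a concrete resource type.
type Secret struct {
	Meta ObjectMeta
	Data map[string][]byte
}

func (s *Secret) GetMeta() *ObjectMeta { return &s.Meta }

// shallowObject creates an empty object of the correct type via newFunc
// and copies only the metadata; the payload stays empty.
func shallowObject(newFunc func() Object, meta ObjectMeta) Object {
	obj := newFunc()
	*obj.GetMeta() = meta
	return obj
}

func main() {
	obj := shallowObject(
		func() Object { return &Secret{} },
		ObjectMeta{Namespace: "default", Name: "example", ResourceVersion: "42", UID: "hypothetical-uid"},
	)

	// Unlike a generic metadata-only type, the result still satisfies a
	// type assertion to the concrete resource type.
	s, ok := obj.(*Secret)
	fmt.Println(ok, s.Meta.Namespace, s.Meta.Name, len(s.Data))
}
```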

Test Plan

I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/apiserver/pkg/storage/etcd3: 28.9.2023 - 77%
k8s.io/apimachinery/pkg/apis/meta/v1: 28.9.2023 - 48.1%

Integration tests

Alpha:

TestAllowUnsafeMalformedObjectDeletionFeature: testgrid, triage
- Verifies corrupt secrets can be deleted with feature enabled, the new option set and proper RBAC
- Verifies that normal deletion fails with new StorageError: corrupt object
- Verifies that normal secrets can still be deleted with the feature enabled, even with corrupt objects in the database
- Verifies deletion of corrupt objects is blocked when feature is disabled and there is a lack of option and RBAC.
TestListCorruptObjects: testgrid, triage
- Verifies LIST returns errors for corrupt objects when feature is enabled
- Verifies error truncation when too many corrupt objects exist

Beta:

test that LIST operation is capable of returning multiple corrupt objects
test delete handler with unsafe deletion flow
test deletion of bit-flip corrupted objects (deserialization failure, not transformer failure)
test deletion of corrupt CRs
validate kube-apiserver transition to healthy state after cleanup

e2e tests

Integration tests are functionally equivalent to e2e tests for this feature. They exercise the full kube-apiserver stack with a real etcd backend. The integration test framework is preferred because it allows direct manipulation of etcd contents and of the encryption configuration during test execution, and because it handles such manipulation more reliably.

Graduation Criteria

Alpha

Error type is implemented
Deletion of malformed etcd objects and its admission can be enabled via a feature flag

Beta

Feature enabled by default
Dry-run support for unsafe corrupt object deletion
Comprehensive test coverage as outlined in the Integration tests > Beta section.

Upgrade / Downgrade Strategy

This feature is contained entirely within kube-apiserver with no persistent state changes:

Upgrade: Enabling the feature gate makes the IgnoreStoreReadErrorWithClusterBreakingPotential delete option functional. No configuration migration required.
Downgrade: Disabling the feature gate makes the delete option non-functional. The option is silently ignored. No cleanup required.
Mixed version clusters: During rolling updates, some apiservers may have the feature enabled while others don’t. Requests with the unsafe delete option will only succeed on apiservers with the feature enabled. This is acceptable for an emergency recovery feature.

No special upgrade or downgrade procedures are required.

Version Skew Strategy

This feature is entirely within kube-apiserver with no node component interaction:

API server to API server: In HA setups, some apiservers may have the feature enabled while others don’t during rollout. The unsafe delete option only works on apiservers with the feature enabled. This is acceptable behavior.
Kubelet: No interaction. This feature doesn’t affect pod lifecycle or node operations.
Other components: No interaction. The feature only affects DELETE requests with the specific option set.

No version skew concerns exist because:

The feature doesn’t introduce new API fields that need coordination
The DeleteOption is ignored by apiservers without the feature
No persistent state changes that could cause inconsistency

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: AllowUnsafeMalformedObjectDeletion
- Components depending on the feature gate: kube-apiserver
Other
- Describe the mechanism: The new error type will always be present once implemented
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No

Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be safely enabled and disabled at will.

What happens if we reenable the feature if it was previously rolled back?

There should be no side-effects.

Are there any tests for feature enablement/disablement?

Yes, the integration tests explicitly toggle the feature gate to verify enablement/disablement:

TestAllowUnsafeMalformedObjectDeletionFeature - feature gate toggle at L198: Parametrized test running with featureEnabled: true and featureEnabled: false. Verifies deletion is blocked when disabled, works when enabled with proper RBAC.
TestListCorruptObjects - feature gate toggle at L512: Parametrized test verifying LIST returns StatusReasonStoreReadError when enabled, StatusReasonInternalError when disabled.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout and rollback cannot fail because:

No persistent state changes: The feature doesn’t write new data to etcd or modify existing objects (except deleting them when explicitly requested).
Contained within kube-apiserver: No coordination with kubelet, controllers, or other components required.
Opt-in behavior: The feature only activates when a client explicitly sets the IgnoreStoreReadErrorWithClusterBreakingPotential option AND has RBAC permission for the unsafe-delete-ignore-read-errors verb.

Impact on running workloads: None. The feature doesn’t affect normal cluster operations.

What specific metrics should inform a rollback?

Important context: This feature is for emergency cluster recovery. During remediation, temporary performance degradation is expected and acceptable. The following metrics will spike when corrupt objects are deleted - this is the feature working correctly, not a problem.

Rollback should only be considered if:

Unexpected cache resets — apiserver_watch_cache_initializations_total spikes occur when no corrupt object deletion was performed. This would indicate the feature gate enablement itself is causing unintended side effects.
Recovery does not complete — After corrupt object deletion, the system should stabilize within minutes. If apiserver_storage_list_total remains elevated for an extended period (>10 minutes for typical clusters), clients may be stuck in reconnection loops.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No testing of upgrade->downgrade->upgrade necessary because:

No new persisted state changes: Either the corrupt object is deleted or not.
“Atomic” behavior: Either the feature is enabled and the user can perform unsafe deletes (with proper RBAC), or it’s disabled and they cannot.
Version skew is handled gracefully: The interpretation of a deletion event of a corrupt object was added in Kubernetes v1.32.
Rollback is trivial: Disabling the feature gate simply makes the DeleteOption non-functional. No cleanup or migration required.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No deprecations.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

This feature is for cluster administrators performing emergency recovery, not for workload automation.

To detect actual usage (i.e., unsafe deletions being performed):

Audit logs: Search for annotation: apiserver.k8s.io/unsafe-delete-ignore-read-error.
RBAC: Check RoleBindings/ClusterRoleBindings granting unsafe-delete-ignore-read-errors verb.

How can someone using this feature know that it is working for their instance?

Events
- Event Reason:
API .status
- Condition name:
- Other field:
Other (treat as last resort)
Details:
1. Attempt to delete a corrupt object with the delete option set but without RBAC permission for unsafe-delete-ignore-read-errors verb. Receiving 403 Forbidden (instead of 500 StorageReadError) confirms the feature is enabled and recognizing the option.
2. Without the delete option, attempting to delete a corrupt object returns the original 500 StorageReadError.
3. Use dry-run to safely verify the behavior with various combinations of option and RBAC permissions.
4. With proper RBAC permission and the option set, the corrupt object deletion succeeds.
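The checks above, together with the "Feature gate disabled" failure mode described under Troubleshooting, amount to a small decision table for deleting a corrupt object. The function below is a hypothetical summary for verification scripts, not server code, and it covers corrupt objects only.

```go
package main

import (
	"fmt"
	"net/http"
)

// expectedStatus returns the HTTP status an administrator should expect
// when deleting a corrupt object, given the feature gate state, whether
// the unsafe delete option is set, and whether the caller holds the
// unsafe-delete-ignore-read-errors verb.
func expectedStatus(featureEnabled, optionSet, hasRBAC bool) int {
	// Without the option, or with the gate off (option silently ignored),
	// the original read error is returned.
	if !optionSet || !featureEnabled {
		return http.StatusInternalServerError
	}
	// The option is recognized, so the extra RBAC check applies.
	if !hasRBAC {
		return http.StatusForbidden
	}
	return http.StatusOK
}

func main() {
	fmt.Println(expectedStatus(true, true, false))  // option recognized, RBAC missing
	fmt.Println(expectedStatus(true, false, true))  // option not set
	fmt.Println(expectedStatus(false, true, true))  // feature gate disabled
	fmt.Println(expectedStatus(true, true, true))   // unsafe delete succeeds
}
```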

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature targets emergency cluster recovery scenarios where corrupt objects are blocking normal operations. Temporary performance degradation during remediation is acceptable - the priority is restoring cluster functionality.

The deletion itself is faster as it bypasses preconditions and finalizers, but there are cache resets at the kube-apiserver and its watching clients (informers) that may cause performance degradation.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
Note: During corrupt object deletion remediation, temporary metric spikes are expected and acceptable. The priority is restoring cluster functionality, not maintaining SLOs.
- Metric name: apiserver_watch_cache_initializations_total
  - Labels: group, resource
  - Components exposing the metric: kube-apiserver
  - Details: Increments when watch cache rebuilds. A spike correlating with corrupt object deletion confirms the expected recovery flow triggered. After remediation completes, this should return to baseline (typically zero or very low).
- Metric name: apiserver_storage_list_total
  - Labels: group, resource
  - Components exposing the metric: kube-apiserver
  - Details: Tracks LIST operations hitting etcd storage. Expect a transient spike as clients reconnect and rebuild caches. Recovery is complete when this returns to pre-remediation levels.
Other (treat as last resort)
- Details:

Are there any missing metrics that would be useful to have to improve observability of this feature?

The existing metrics provide sufficient observability for tracking cache rebuilds and recovery:

apiserver_watch_cache_initializations_total — confirms cache rebuild occurred
apiserver_storage_list_total — tracks recovery progress (client re-lists)

Known gap: apiserver_storage_decode_errors_total only covers decode errors in store.go operations (GET, LIST, etc.), not in watcher.go transform/decode failures. This means the metric won’t increment specifically for the corrupt object deletion watch flow. This is acceptable because:

The feature is for emergency recovery where detailed decode error counts are less critical than successful deletion.
The cache rebuild metrics above provide sufficient signal that the flow completed.
Adding watcher-specific decode error metrics would require broader consensus in sig-instrumentation.

For tracking actual feature usage (unsafe deletions performed), operators should use audit logs and search for the apiserver.k8s.io/unsafe-delete-ignore-read-error annotation.

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

The feature itself should not raise any concerns about performance at scale, particularly because it is meant to be used on potentially broken clusters.

An issue in terms of scaling comes with the error that attempts to list all resources that appeared to be malformed while reading from the storage. A limit of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.

Another scaling issue arises when the corrupt objects are deleted. Client reflectors re-list to recover, which temporarily increases load on both the client side and the kube-apiserver.

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

DeleteOptions gets a new boolean field, but it is transient: no persistence in etcd.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

DELETE operations:

Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
Decreases latency for the unsafe delete itself

LIST operations:

Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
Temporarily increases LIST request volume to apiserver
Latency increase depends on: number of watching clients × object count × apiserver resources

Expected impact:

Negligible, given that the cluster is already in a potentially broken state.
Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Temporary increase during cleanup, dependent on object and resource type popularity:

apiserver: CPU / network during re-lists
client-side: CPU / memory / network during re-lists / rebuilding cache

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server is unavailable, no DELETE requests can be processed (including unsafe deletes). This is standard Kubernetes behavior.

If etcd is unavailable, DELETE requests fail with storage errors, including the unsafe delete feature.

What are other known failure modes?

Missing RBAC permission
- Detection: 403 Forbidden responses when using the unsafe delete option
- Mitigation: Grant unsafe-delete-ignore-read-errors verb permission to the user
- Diagnostics: Audit logs show RBAC denial; API server logs show “forbidden” at verbosity 3+
- Testing: Covered by TestAllowUnsafeMalformedObjectDeletionFeature
Feature gate disabled
- Detection: Unsafe delete option silently ignored; corrupt object still returns 500 StorageReadError
- Mitigation: Enable AllowUnsafeMalformedObjectDeletion feature gate
- Diagnostics: Check feature gate status via /healthz or metrics
- Testing: Covered by TestAllowUnsafeMalformedObjectDeletionFeature
Object not actually corrupt
- Detection: Normal delete succeeds without needing the option
- Mitigation: None needed - use normal delete
- Diagnostics: Object is readable via GET
- Testing: Covered by integration tests

What steps should be taken if SLOs are not being met to determine the problem?

During corrupt object deletion, temporary SLO degradation is expected (see Monitoring Requirements section). If degradation persists:

Check apiserver_watch_cache_initializations_total - should return to baseline within minutes
Check apiserver_storage_list_total - elevated counts indicate clients are still rebuilding caches
Review audit logs - confirm the unsafe delete completed successfully
If recovery doesn’t complete - restart kube-apiserver to force fresh state

Implementation History

2023-03-27: KEP created
2023-10-05: KEP merged as provisional
v1.32: Alpha implementation:
- Deletion of corrupt objects, with client option and RBAC.
- Extended listing of corrupt objects
- Integration tests
v1.36: Targeting beta
- Cache reset deemed acceptable in sig-api-machinery bi-weekly meeting
- Dry-Run
- Additional integration tests for CRs and serialization failures.

Drawbacks

Potential for misuse: The unsafe delete option bypasses safety mechanisms (finalizers, garbage collection). Misuse could orphan resources or break cluster state.
Vendor support concerns: Using this feature may void support from Kubernetes distributions/vendors, as it allows bypassing normal API guarantees.
No undo: Unsafe deletion is permanent. If used incorrectly, the only recovery is restoring from etcd backup.

These drawbacks are intentional - the feature is designed for emergency recovery where the alternative (direct etcd manipulation) is worse.

Alternatives

Direct etcd manipulation (status quo)

Requires etcd access, bypasses all Kubernetes abstractions, risky, not audited.
