KEP-3926: Handling undecryptable resources
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
Encryption at rest for API resources has been a stable part of Kubernetes for a long time. Every now and then, whether through improper handling or external system failures, a cluster's encryption has ended up in a broken state.
If a single object of a resource type cannot be decrypted, listing resources of that type in a path prefix containing the object always fails, even if the rest of the resource instances are accessible.
Currently, removing a resource that causes such failures is not possible. A cluster administrator must access etcd directly and remove the malformed data manually.
This KEP proposes a way to identify resources that fail to decrypt or fail to be decoded into an object, and introduces a new delete option that ignores storage checks when such a read error occurs. The goal is to make it possible to delete such a failing resource using only the Kubernetes API.
Motivation
Goals
- provide a way to identify persisted resources that failed decryption or that cannot be decoded
- provide an option to delete a resource independently of its contents, if those cannot be reached due to data transformation or data corruption
Non-Goals
- implementing a system for ignoring different types of storage errors
- giving clients control over skipping steps of the delete request flow other than decoding errors
Proposal
Improve resource retrieval errors to include more information about the object that failed transformation while it was being retrieved from the storage.
Introduce a new DeleteOption that would allow deleting a resource even
if we cannot retrieve its data.
User Stories (Optional)
Story 1
I accidentally removed my encryption key but only a few resources were encrypted with it. I know that these will either be recreated by a controller, or I can manually recreate them. I would like a simple way to figure out which resources fail decryption and I would like a way to remove them via Kubernetes API.
Story 2
I would like to remove a namespace I no longer need. However, some of the resources inside of the namespace were encrypted before the encryption at rest configuration broke, which blocks a successful namespace delete.
Notes/Constraints/Caveats (Optional)
An unconditional delete of a malformed resource may break garbage collection, would ignore finalizers and would disregard any underlying system processes that might be tied to the given resource (e.g. pods).
Risks and Mitigations
We need to make sure that a user that is trying to perform an unconditional delete of
a malformed resource is well informed about the impact of what they are doing. This should be handled
by one or more prompts from the kubectl client when the DeleteOption from this enhancement is set.
Gate the deletion with an additional admission layer on the server.
Design Details
Background
The encryption/decryption for encryption at rest is implemented via transformers that get applied to a resource in code that handles resource read/write from etcd3 databases.
The storage handling does not change with KMSv2; a resource transformer is provided in that case, too.
References:
The code example 2. above shows that currently, when reading a resource fails, we lose all the context about the resource and a non-wrapping, generic internal error is returned.
Proposed Solution
New Error Status for Read Failures
The current API errors don’t appear to include an error status specific to storage. Therefore
a new status should be introduced - StatusReasonStoreReadError.
// StatusReasonStoreReadError means that the server encountered an error while
// retrieving resources from the backend object store.
// This may be due to backend database error, or because processing of the read
// resource failed.
// Details:
// "kind" string - the kind attribute of the resource being acted on.
// "name" string - the prefix where the reading error(s) occurred
// "causes" []StatusCause
// - (optional):
// - "type" CauseType - CauseTypeUnexpectedServerResponse
// - "message" string - the error message from the store backend
// - "field" string - the full path with the key of the resource that failed reading
//
// Status code 500
StatusReasonStoreReadError StatusReason = "StorageReadError"
This error will also include full paths to the resources that cannot be read in an unstructured, human-readable message.
In cases where the number of malformed resources would be too great (> 100), only
the first 100 will be shown in the causes slice. The 101st element of the slice
takes the following form:
StatusCause{
	Type:    CauseTypeTooMany,
	Message: "too many errors, the list is truncated",
}
New Delete Option for Corrupt Objects
Deleting a resource is a rather complicated process:
1. a resource might represent an actual process running on a host (Pod)
2. there might be other resources with owner references to the resource that’s being deleted
3. a resource might contain finalizers that safeguard the deletion of the given resource before other dependent resources are deleted (typically namespaces and the kubernetes finalizer)
An unconditional deletion should make a best effort on all of the above, but in the case of an undecryptable resource, all of the above would be ignored.
For case 1, ignoring an underlying process may not be an issue, as kubelet is supposed to take care of unused containers.
In case 2, there might be issues with setting related objects as orphans, which could potentially cause an unwanted cascade deletion of objects.
Case 3 has the potential of becoming rather serious. Finalizers are typically set to safeguard other objects, and so if e.g. an aggregated API server is removed, its API objects might be scattered around the etcd database without an API to remove them.
To allow unconditional deletion, a new DeleteOption should be introduced - IgnoreStoreReadErrorWithClusterBreakingPotential
type DeleteOptions struct {
...
// IgnoreStoreReadErrorWithClusterBreakingPotential will try to perform the normal
// deletion flow but if the data of the resource being deleted cannot be read from
// the store, either because it failed to be decrypted or the data is
// otherwise corrupted and cannot be decoded, it will disregard these errors
// and still perform the deletion.
// WARNING: This will break the cluster if the resource has dependencies beyond
// the caller's comprehension. Use only if you REALLY know what you are
// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
IgnoreStoreReadErrorWithClusterBreakingPotential bool
}
Admission Control for Unconditional Deletion
A “delete” verb on a resource is not usually considered a privileged action. As the previous section explains, deletion of a resource might carry unexpected consequences. Unconditional deletions should therefore have their own extra admission.
The unconditional deletion admission:
- checks if a “delete” request contains the IgnoreStoreReadErrorWithClusterBreakingPotential option
- if it does, it checks the RBAC of the request’s user for the unsafe-delete-ignore-read-errors verb of the given resource
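The two admission steps can be sketched as below. The `request` struct, the `authorizer` interface, and the `staticRBAC` map are hypothetical simplifications of the real admission and authorization machinery. The verb name is the `unsafe-delete-ignore-read-errors` spelling used in the later sections of this document.

```go
package main

import "fmt"

// unsafeDeleteVerb is the extra verb a user must hold on the resource.
const unsafeDeleteVerb = "unsafe-delete-ignore-read-errors"

// request is a hypothetical, simplified view of an incoming API request.
type request struct {
	User     string
	Verb     string
	Resource string
	// IgnoreStoreReadError mirrors the new DeleteOptions field.
	IgnoreStoreReadError bool
}

// authorizer is a hypothetical stand-in for an RBAC check.
type authorizer interface {
	Allowed(user, verb, resource string) bool
}

// staticRBAC is a toy authorizer keyed by "user/verb/resource".
type staticRBAC map[string]bool

func (r staticRBAC) Allowed(user, verb, resource string) bool {
	return r[user+"/"+verb+"/"+resource]
}

// admitUnsafeDelete only interferes with delete requests that set the
// option. Those are rejected unless the user holds the extra verb.
func admitUnsafeDelete(req request, authz authorizer) error {
	if req.Verb != "delete" || !req.IgnoreStoreReadError {
		return nil
	}
	if !authz.Allowed(req.User, unsafeDeleteVerb, req.Resource) {
		return fmt.Errorf("forbidden: user %q lacks verb %q on %q", req.User, unsafeDeleteVerb, req.Resource)
	}
	return nil
}

func main() {
	rbac := staticRBAC{"admin/" + unsafeDeleteVerb + "/secrets": true}
	for _, user := range []string{"admin", "dev"} {
		req := request{User: user, Verb: "delete", Resource: "secrets", IgnoreStoreReadError: true}
		fmt.Println(user, admitUnsafeDelete(req, rbac))
	}
}
```

An ordinary delete without the option passes through untouched, so the extra check adds no cost to the normal path.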
Implementation Considerations
Watch Event Propagation and Client Recovery
When a corrupt object is deleted from etcd, the kube-apiserver’s watch cache cannot transform or decode the object’s previous value. This triggers a deliberate recovery sequence:
- Error Detection: The etcd3 watcher fails to transform/decode the deleted object’s data and generates a watch.Error event with StatusReasonStoreReadError.
- Cacher Reset: The Cacher’s internal Reflector receives this error, causing ListAndWatch() to stop. After a brief delay, the Cacher reinitializes by calling terminateAllWatchers() followed by a fresh LIST from etcd.
- Client Disconnection: All active watch connections for that resource type are terminated. Clients see their watch channels close without receiving the original error event.
sequenceDiagram
participant etcd
participant Watcher as etcd3/watcher
participant Cacher
participant CacheWatcher as cacheWatcher
participant HTTP as HTTP Handler
etcd->>Watcher: DELETE event (corrupt prevValue)
Watcher->>Watcher: transform() fails on prevValue
Watcher->>Cacher: watch.Error (StoreReadError)
Note over Cacher: Reflector returns, waits 1s
Cacher->>CacheWatcher: terminateAllWatchers()
CacheWatcher->>HTTP: close(result)
HTTP-->>HTTP: return (connection closes)
Cacher->>etcd: LIST + WATCH
Note over Cacher: Cache rebuilt, new RV window
- Client Recovery: Disconnected clients attempt to resume watching from their last known resourceVersion. The server rejects this with a “too old resource version” error, forcing clients to perform a fresh LIST and rebuild their local caches.
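The client side of this recovery can be sketched with a simulated server. Nothing here uses client-go: `fakeServer`, `client`, and `resume` are hypothetical, and the resourceVersion window is modeled as a plain integer range purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errTooOld stands in for the "too old resource version" rejection.
var errTooOld = errors.New("too old resource version")

// fakeServer models a watch cache that only serves watches starting at
// or after oldestRV, the start of its rebuilt window.
type fakeServer struct {
	oldestRV  int
	currentRV int
	objects   []string
}

func (s *fakeServer) Watch(fromRV int) error {
	if fromRV < s.oldestRV {
		return errTooOld
	}
	return nil
}

func (s *fakeServer) List() ([]string, int) {
	return append([]string(nil), s.objects...), s.currentRV
}

// client holds a local cache and the last resourceVersion it observed.
type client struct {
	cache   []string
	lastRV  int
	relists int
}

// resume first tries to continue from lastRV. If the server rejects it
// as too old, the client re-lists, replaces its cache, and watches anew.
func (c *client) resume(s *fakeServer) error {
	err := s.Watch(c.lastRV)
	if errors.Is(err, errTooOld) {
		c.cache, c.lastRV = s.List()
		c.relists++
		return s.Watch(c.lastRV)
	}
	return err
}

func main() {
	// After the cache reset the server window starts at RV 50.
	srv := &fakeServer{oldestRV: 50, currentRV: 50, objects: []string{"ns/ok"}}
	c := &client{cache: []string{"ns/ok", "ns/broken"}, lastRV: 42}
	fmt.Println(c.resume(srv), c.cache, c.lastRV, c.relists)
}
```

Every watcher of the affected resource type goes through this path once, which is the source of the temporary LIST load discussed elsewhere in this document.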
Design Principles
The following principles, agreed upon by SIG API Machinery, guide this enhancement:
Watch history cannot be preserved when a corrupt object exists. Since the object’s data cannot be decrypted or decoded, we have no access to the correct previous object state required for a semantically valid DELETE event.
Performance degradation is acceptable during the remediation window. The temporary increase in API server load from client re-lists is an accepted tradeoff for restoring cluster health.
Enable admin remediation: The admin must be able to identify corrupt objects and delete them, even if one by one. Once all corrupt objects are removed, the kube-apiserver and client informers recover automatically.
This approach favors eventual consistency and cluster recovery over preserving individual watch streams during an inherently abnormal situation.
Alternative Approaches Considered
We considered using shallow object representations to enhance error or delete events, enabling targeted removal of the corrupt object from client caches without triggering a full re-list:
- DeletedFinalStateUnknown: A client-go type used when the final state of a deleted object is unknown. This approach failed because DeletedFinalStateUnknown does not implement runtime.Object, which is required by the watch cache.
- PartialObjectMetadata: A Kubernetes type containing only object metadata. This failed because the watch cache’s getAttrsFunc performs type assertions to the specific resource type (e.g., *api.Secret), which PartialObjectMetadata cannot satisfy.
- Type Identity Object: Creating an empty object of the correct type via newFunc() and copying only essential metadata (namespace, name, resourceVersion, UID). While technically feasible, the added complexity was not justified given the design principles outlined above.
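For illustration only, the rejected Type Identity Object idea might look roughly like the sketch below. The `object` interface, `ObjectMeta`, `Secret`, and `identityObject` are simplified local stand-ins rather than the real apimachinery types, and how the metadata would be obtained for an undecodable object is left open here.

```go
package main

import "fmt"

// ObjectMeta holds the essential metadata named in the text.
type ObjectMeta struct {
	Namespace, Name, ResourceVersion, UID string
}

// object is a hypothetical minimal interface giving access to metadata.
type object interface {
	meta() *ObjectMeta
}

// Secret is a local stand-in for a concrete resource type.
type Secret struct {
	ObjectMeta
	Data map[string][]byte
}

func (s *Secret) meta() *ObjectMeta { return &s.ObjectMeta }

// identityObject creates an empty object of the correct concrete type
// via newFunc and copies only the essential metadata into it.
func identityObject(newFunc func() object, m ObjectMeta) object {
	obj := newFunc()
	*obj.meta() = m
	return obj
}

func main() {
	obj := identityObject(
		func() object { return &Secret{} },
		ObjectMeta{Namespace: "ns", Name: "broken", ResourceVersion: "42", UID: "u-1"},
	)
	// The type assertion succeeds because the concrete type is preserved,
	// while the payload stays empty.
	secret, ok := obj.(*Secret)
	fmt.Println(ok, secret.Name, secret.Data == nil)
}
```

Unlike a metadata-only type, such an object satisfies assertions to the specific resource type, which is the property the second alternative lacked.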
Test Plan
- I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- k8s.io/apiserver/pkg/storage/etcd3: 28.9.2023 - 77%
- k8s.io/apimachinery/pkg/apis/meta/v1: 28.9.2023 - 48.1%
Integration tests
Alpha:
- TestAllowUnsafeMalformedObjectDeletionFeature: testgrid, triage
  - Verifies corrupt secrets can be deleted with the feature enabled, the new option set and proper RBAC
  - Verifies that normal deletion fails with the new StorageError: corrupt object
  - Verifies that normal secrets can still be deleted with the feature enabled, even with corrupt objects in the database
  - Verifies deletion of corrupt objects is blocked when the feature is disabled and the option and RBAC permission are missing
- TestListCorruptObjects: testgrid, triage
  - Verifies LIST returns errors for corrupt objects when the feature is enabled
  - Verifies error truncation when too many corrupt objects exist
Beta:
- test that LIST operation is capable of returning multiple corrupt objects
- test delete handler with unsafe deletion flow
- test deletion of bit-flip corrupted objects (deserialization failure, not transformer failure)
- test deletion of corrupt CRs
- validate kube-apiserver transition to healthy state after cleanup
e2e tests
Integration tests are functionally equivalent to e2e tests for this feature. They exercise the full kube-apiserver stack with a real etcd backend. The integration test framework is preferred because it allows direct manipulation of etcd contents and of the encryption configuration during test execution, and it handles such manipulation more reliably.
Graduation Criteria
Alpha
- Error type is implemented
- Deletion of malformed etcd objects and its admission can be enabled via a feature flag
Beta
- Feature enabled by default
- Dry-run support for unsafe corrupt object deletion
- Comprehensive test coverage as outlined in the Integration tests > Beta section.
Upgrade / Downgrade Strategy
This feature is contained entirely within kube-apiserver with no persistent state changes:
- Upgrade: Enabling the feature gate makes the IgnoreStoreReadErrorWithClusterBreakingPotential delete option functional. No configuration migration required.
- Downgrade: Disabling the feature gate makes the delete option non-functional. The option is silently ignored. No cleanup required.
- Mixed version clusters: During rolling updates, some apiservers may have the feature enabled while others don’t. Requests with the unsafe delete option will only succeed on apiservers with the feature enabled. This is acceptable for an emergency recovery feature.
No special upgrade or downgrade procedures are required.
Version Skew Strategy
This feature is entirely within kube-apiserver with no node component interaction:
- API server to API server: In HA setups, some apiservers may have the feature enabled while others don’t during rollout. The unsafe delete option only works on apiservers with the feature enabled. This is acceptable behavior.
- Kubelet: No interaction. This feature doesn’t affect pod lifecycle or node operations.
- Other components: No interaction. The feature only affects DELETE requests with the specific option set.
No version skew concerns exist because:
- The feature doesn’t introduce new API fields that need coordination
- The DeleteOption is ignored by apiservers without the feature
- No persistent state changes that could cause inconsistency
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: AllowUnsafeMalformedObjectDeletion
  - Components depending on the feature gate: kube-apiserver
- Other
  - Describe the mechanism: The new error type will always be present once implemented
  - Will enabling / disabling the feature require downtime of the control plane? No
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? No
Does enabling the feature change any default behavior?
No.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
The feature can be safely enabled and disabled at will.
What happens if we reenable the feature if it was previously rolled back?
There should be no side-effects.
Are there any tests for feature enablement/disablement?
Yes, the integration tests explicitly toggle the feature gate to verify enablement/disablement:
- TestAllowUnsafeMalformedObjectDeletionFeature (feature gate toggle at L198): Parametrized test running with featureEnabled: true and featureEnabled: false. Verifies deletion is blocked when disabled, works when enabled with proper RBAC.
- TestListCorruptObjects (feature gate toggle at L512): Parametrized test verifying LIST returns StatusReasonStoreReadError when enabled, StatusReasonInternalError when disabled.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout and rollback cannot fail because:
- No persistent state changes: The feature doesn’t write new data to etcd or modify existing objects (except deleting them when explicitly requested).
- Contained within kube-apiserver: No coordination with kubelet, controllers, or other components required.
- Opt-in behavior: The feature only activates when a client explicitly sets the IgnoreStoreReadErrorWithClusterBreakingPotential option AND has RBAC permission for the unsafe-delete-ignore-read-errors verb.
Impact on running workloads: None. The feature doesn’t affect normal cluster operations.
What specific metrics should inform a rollback?
Important context: This feature is for emergency cluster recovery. During remediation, temporary performance degradation is expected and acceptable. The following metrics will spike when corrupt objects are deleted - this is the feature working correctly, not a problem.
Rollback should only be considered if:
- Unexpected cache resets: apiserver_watch_cache_initializations_total spikes occur when no corrupt object deletion was performed. This would indicate the feature gate enablement itself is causing unintended side effects.
- Recovery does not complete: After corrupt object deletion, the system should stabilize within minutes. If apiserver_storage_list_total remains elevated for an extended period (>10 minutes for typical clusters), clients may be stuck in reconnection loops.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
No testing of upgrade->downgrade->upgrade necessary because:
- No new persisted state changes: Either the corrupt object is deleted or not.
- “Atomic” behavior: Either the feature is enabled and the user can perform unsafe deletes (with proper RBAC), or it’s disabled and they cannot.
- Version skew is handled gracefully: The interpretation of a deletion event of a corrupt object was added in k8s 1.32.
- Rollback is trivial: Disabling the feature gate simply makes the DeleteOption non-functional. No cleanup or migration required.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No deprecations.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
This feature is for cluster administrators performing emergency recovery, not for workload automation.
To detect actual usage (i.e., unsafe deletions being performed):
- Audit logs: Search for annotation: apiserver.k8s.io/unsafe-delete-ignore-read-error.
- RBAC: Check RoleBindings/ClusterRoleBindings granting the unsafe-delete-ignore-read-errors verb.
How can someone using this feature know that it is working for their instance?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
  - Attempt to delete a corrupt object with the delete option set but without RBAC permission for the unsafe-delete-ignore-read-errors verb. Receiving 403 Forbidden (instead of 500 StorageReadError) confirms the feature is enabled and recognizing the option.
  - Without the delete option, attempting to delete a corrupt object returns the original 500 StorageReadError.
  - Use dry-run to safely verify the behavior with various combinations of option and RBAC permissions.
  - With proper RBAC permission and the option set, the corrupt object deletion succeeds.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This feature targets emergency cluster recovery scenarios where corrupt objects are blocking normal operations. Temporary performance degradation during remediation is acceptable - the priority is restoring cluster functionality.
The deletion itself is faster as it bypasses preconditions and finalizers, but there are cache resets at the kube-apiserver and its watching clients (informers) that may cause performance degradation.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Metrics
Note: During corrupt object deletion remediation, temporary metric spikes are expected and acceptable. The priority is restoring cluster functionality, not maintaining SLOs.
Metric name: apiserver_watch_cache_initializations_total
- Labels: group, resource
- Components exposing the metric: kube-apiserver
- Details: Increments when watch cache rebuilds. A spike correlating with corrupt object deletion confirms the expected recovery flow triggered. After remediation completes, this should return to baseline (typically zero or very low).
Metric name: apiserver_storage_list_total
- Labels: group, resource
- Components exposing the metric: kube-apiserver
- Details: Tracks LIST operations hitting etcd storage. Expect a transient spike as clients reconnect and rebuild caches. Recovery is complete when this returns to pre-remediation levels.
Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
The existing metrics provide sufficient observability for tracking cache rebuilds and recovery:
- apiserver_watch_cache_initializations_total: confirms cache rebuild occurred
- apiserver_storage_list_total: tracks recovery progress (client re-lists)
Known gap: apiserver_storage_decode_errors_total only covers decode errors in store.go
operations (GET, LIST, etc.), not in watcher.go transform/decode failures. This means the
metric won’t increment specifically for the corrupt object deletion watch flow. This is
acceptable because:
- The feature is for emergency recovery where detailed decode error counts are less critical than successful deletion.
- The cache rebuild metrics above provide sufficient signal that the flow completed.
- Adding watcher-specific decode error metrics would require broader consensus in sig-instrumentation.
For tracking actual feature usage (unsafe deletions performed), operators should use audit logs
and search for the apiserver.k8s.io/unsafe-delete-ignore-read-error annotation.
Dependencies
Does this feature depend on any specific services running in the cluster?
No
Scalability
The feature itself should not raise concerns about performance at scale, particularly because it is meant to be used on potentially broken clusters.
One scaling concern is the error that attempts to list all resources that appeared to be malformed while reading from the storage. A limit of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
Another scaling concern arises when the corrupt objects are deleted. Client reflectors re-list to recover, which temporarily increases load on the client side and the kube-apiserver.
Will enabling / using this feature result in any new API calls?
No.
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
DeleteOptions gets a new boolean field, but it is transient: no persistence in etcd.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
DELETE operations:
- Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
- Decreases latency for the unsafe delete itself
LIST operations:
- Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
- Temporarily increases LIST request volume to apiserver
- Latency increase depends on: number of watching clients × object count × apiserver resources
Expected impact:
- Negligible under the circumstance that the cluster is in a potentially broken state.
- Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Temporary increase during cleanup, dependent on object and resource type popularity:
- apiserver: CPU / network during re-lists
- client-side: CPU / memory / network during re-lists / rebuilding cache
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server is unavailable, no DELETE requests can be processed (including unsafe deletes). This is standard Kubernetes behavior.
If etcd is unavailable, DELETE requests fail with storage errors, including the unsafe delete feature.
What are other known failure modes?
Missing RBAC permission
- Detection: 403 Forbidden responses when using the unsafe delete option
- Mitigation: Grant unsafe-delete-ignore-read-errors verb permission to the user
- Diagnostics: Audit logs show RBAC denial; API server logs show “forbidden” at verbosity 3+
- Testing: Covered by TestAllowUnsafeMalformedObjectDeletionFeature
Feature gate disabled
- Detection: Unsafe delete option silently ignored; corrupt object still returns 500 StorageReadError
- Mitigation: Enable AllowUnsafeMalformedObjectDeletion feature gate
- Diagnostics: Check feature gate status via the kubernetes_feature_enabled metric on the kube-apiserver /metrics endpoint
- Testing: Covered by TestAllowUnsafeMalformedObjectDeletionFeature
Object not actually corrupt
- Detection: Normal delete succeeds without needing the option
- Mitigation: None needed - use normal delete
- Diagnostics: Object is readable via GET
- Testing: Covered by integration tests
What steps should be taken if SLOs are not being met to determine the problem?
During corrupt object deletion, temporary SLO degradation is expected (see Monitoring Requirements section). If degradation persists:
- Check apiserver_watch_cache_initializations_total - should return to baseline within minutes
- Check apiserver_storage_list_total - elevated counts indicate clients are still rebuilding caches
- Review audit logs - confirm the unsafe delete completed successfully
- If recovery doesn’t complete - restart kube-apiserver to force fresh state
Implementation History
- 2023-03-27: KEP created
- 2023-10-05: KEP merged as provisional
- v1.32: Alpha implementation:
- Deletion of corrupt objects, with client option and RBAC.
- Extended listing of corrupt objects
- Integration tests
- v1.36: Targeting beta
- Cache reset deemed acceptable in sig-api-machinery bi-weekly meeting
- Dry-Run
- Additional integration tests for CRs and serialization failures.
Drawbacks
Potential for misuse: The unsafe delete option bypasses safety mechanisms (finalizers, garbage collection). Misuse could orphan resources or break cluster state.
Vendor support concerns: Using this feature may void support from Kubernetes distributions/vendors, as it allows bypassing normal API guarantees.
No undo: Unsafe deletion is permanent. If used incorrectly, the only recovery is restoring from etcd backup.
These drawbacks are intentional - the feature is designed for emergency recovery where the alternative (direct etcd manipulation) is worse.
Alternatives
Direct etcd manipulation (status quo)
Requires etcd access, bypasses all Kubernetes abstractions, risky, not audited.