KEP-5080: Ordered Namespace Deletion

Implementation History
STABLE Implementable
Created 2024-01-27
Latest v1.34
Milestones
Beta 1.30
Stable 1.34

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP introduces an opinionated deletion order into Kubernetes namespace deletion to ensure that resources within a namespace are deleted securely. The current deletion process is semi-random, which may lead to security gaps or unintended behavior, such as Pods persisting after the deletion of their associated NetworkPolicies. With an opinionated deletion mechanism, Pods are deleted before other resources, respecting logical and security dependencies. This design enhances the security and reliability of Kubernetes by mitigating risks arising from the non-deterministic deletion order.

Motivation

The existing random deletion process for resources in a namespace poses significant challenges, particularly in environments with strict security requirements. One critical issue is the potential for a Pod to remain active after its associated NetworkPolicy has been deleted, leaving it exposed to unrestricted network access. This creates a security vulnerability during the cleanup process.

Additionally, the lack of a defined deletion order can lead to operational inconsistencies, where safety-guard resources of any kind (not just NetworkPolicies) are deleted before the resources they guard (e.g., Pods), resulting in unnecessary disruptions or errors.

By introducing an opinionated deletion process, this proposal aims to:

  • Enhance Security: Ensure resources like NetworkPolicies remain in effect until all dependent resources have been safely terminated.
  • Increase Predictability: Provide a consistent and logical cleanup process for namespace deletion, reducing unintended side effects.

This opinionated deletion approach aligns with Kubernetes’ principles of reliability, security, and extensibility, providing a solid foundation for managing resource cleanup in complex environments.

Goals

  1. Introduce an Opinionated Deletion Order: Implement a mechanism during namespace deletion to prioritize the deletion of certain resource types before others based on logical dependencies and security considerations (e.g., Pods deleted before NetworkPolicies).

  2. Maintain Predictability and Consistency: Provide a more deterministic deletion process to improve user confidence and debugging during namespace cleanup.

  3. Integrate with Existing Kubernetes Concepts: Build on the namespace deletion’s current architecture without introducing breaking changes to existing APIs or workflows.

  4. Be safe - don’t introduce unresolvable deadlocks.

  5. Make the most common dependency - workloads and the policies that govern them - safe by default for all types of policies, including CRDs, unless specifically opted out.

Non-Goals

  1. Reordering Deletion Across Namespaces: This design focuses on resource deletion within a single namespace. It does not attempt to enforce or prioritize deletion order across multiple namespaces.

  2. Introducing Custom Per-Resource Deletion Order: While the proposal aims for opinionated ordering, it does not cover fine-grained customization by end-users for specific resources or workloads.

  3. Guaranteeing Real-Time Enforcement: The proposal does not aim to guarantee real-time deletion of resources; the Kubernetes control plane’s reconciliation loop remains the underlying driver.

  4. Replacing Finalizers or Current Garbage Collection Mechanisms: The design does not intend to replace or bypass the existing Finalizer mechanism but works alongside it to enhance the resource cleanup process.

  5. Handling Non-Standard or External Resources: This design does not apply to external or non-Kubernetes resources managed outside the cluster (e.g., external databases, cloud resources).

  6. Global Enforcement of Security Policies: While the proposed changes improve security during deletion, it is not a substitute for broader, cluster-wide security policies or mechanisms.

  7. Implement a full graph: This proposal does not aim to model the dependency relationship between specific objects (instances) or between types as an arbitrary graph.

Proposal

When the feature gate OrderedNamespaceDeletion is enabled, the resources in a namespace that is being deleted are removed in the following order:

  • Delete all pods in the namespace (in an undefined order).
  • Wait for all the pods to be stopped or deleted.
  • Delete all the other resources in the namespace (in an undefined order).
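The ordering above can be pictured as a split of the namespace's resource types into two phases. The following is a minimal sketch under stated assumptions, not the namespace controller's actual code: the `resource` type and the `deletionPhases` function are hypothetical names introduced for illustration, and sorting inside a phase is done only to make the output stable, since the order within a phase is undefined.

```go
package main

import (
	"fmt"
	"sort"
)

// resource is a hypothetical stand-in for a namespaced resource type.
// Group is empty for the core API group.
type resource struct {
	Group string
	Name  string
}

// isPod reports whether r is the core "pods" resource.
func isPod(r resource) bool {
	return r.Group == "" && r.Name == "pods"
}

// deletionPhases splits resources into two phases: phase 0 holds pods,
// phase 1 holds everything else. The order inside a phase carries no
// meaning; it is sorted here only so the result is deterministic.
func deletionPhases(all []resource) [2][]resource {
	var phases [2][]resource
	for _, r := range all {
		if isPod(r) {
			phases[0] = append(phases[0], r)
		} else {
			phases[1] = append(phases[1], r)
		}
	}
	for _, p := range phases {
		sort.Slice(p, func(a, b int) bool {
			if p[a].Group != p[b].Group {
				return p[a].Group < p[b].Group
			}
			return p[a].Name < p[b].Name
		})
	}
	return phases
}

func main() {
	phases := deletionPhases([]resource{
		{Group: "networking.k8s.io", Name: "networkpolicies"},
		{Group: "", Name: "pods"},
		{Group: "", Name: "configmaps"},
	})
	fmt.Println("phase 0:", phases[0])
	fmt.Println("phase 1:", phases[1])
}
```

Keeping the split to two fixed phases, rather than a general dependency graph, matches the non-goal of modeling arbitrary relationships between types.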

Feature Gate handling

Because this KEP addresses a security concern, and we want to give users an option to close the security gap in earlier releases, the feature gate will be introduced as beta and on by default in the 1.33 release. We will backport the feature gate, off by default, to all supported releases. See the detailed discussion on Slack.

User Stories (Optional)

Story 1 - Pod VS NetworkPolicy

A user has pods which listen on the network and network policies which help protect those pods. During namespace deletion, a NetworkPolicy may be deleted while the pods it protects are still running, which raises the security concern of Pods running unprotected.

With this feature, NetworkPolicies are always deleted after the Pods, which avoids the above security concern.

Story 2 - having finalizer conflicts with deletion order

For example, if a pod has a finalizer that waits for network policies to be deleted (a dependency that is opaque to Kubernetes), it will cause a dependency loop and block the deletion process.

Refer to the section Handling Cyclic Dependencies.

Story 3 - having policy set up with parameter resources

When ValidatingAdmissionPolicy (VAP) is used in the cluster with parameterization, it is possible to use Pods as the parameter resources. In this case, the parameter resources will be deleted before the VAP, leaving the VAP ineffective. Worse, if the ValidatingAdmissionPolicyBinding is configured with .spec.paramRef.parameterNotFoundAction: Deny, it could block certain resource operations and also hang the termination process. A similar concern applies to webhooks with parameter resources.

This is an existing issue with the current namespace deletion as well. As long as we do not plan to build a dependency graph, this case will rely on best practices and the user's configuration.

Notes/Constraints/Caveats (Optional)

Having ownerReference conflicts with deletion order

When deciding the deletion priority for resources, ownerReference should be taken into consideration, e.g. Deployment vs. Pod. However, it should not matter much for namespace deletion. Namespace deletion specifically uses metav1.DeletePropagationBackground, so all resources are deleted and the ownerReference dependencies are handled by garbage collection.

In Kubernetes, ownerReferences define a parent-child relationship in which child resources are automatically deleted when the parent is removed. This is mostly handled by garbage collection. During namespace deletion, ownerReferences are not part of the ordering decision, and the garbage collector controller makes sure that no child resources remain after the parent resource is deleted.
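To illustrate why ownerReferences do not need to influence the namespace deletion order, here is a simplified model of background garbage collection. It is a sketch under assumptions, not the garbage collector's implementation: the `object` type and `garbageCollect` function are hypothetical, and each object is given at most one owner for brevity.

```go
package main

import "fmt"

// object is a hypothetical, simplified API object with at most one owner.
type object struct {
	Name  string
	Owner string // empty when the object has no owner
}

// garbageCollect repeatedly drops objects whose owner is no longer present,
// until a pass removes nothing. This mimics, very loosely, how dependents
// are cleaned up after their owner has been deleted in the background.
func garbageCollect(objs []object) []object {
	for {
		present := make(map[string]bool, len(objs))
		for _, o := range objs {
			present[o.Name] = true
		}
		var kept []object
		for _, o := range objs {
			if o.Owner == "" || present[o.Owner] {
				kept = append(kept, o)
			}
		}
		if len(kept) == len(objs) {
			return kept
		}
		objs = kept
	}
}

func main() {
	// The Deployment "web" has already been deleted, so it is absent here.
	remaining := garbageCollect([]object{
		{Name: "web-rs", Owner: "web"},
		{Name: "web-rs-1", Owner: "web-rs"},
		{Name: "cfg"},
	})
	fmt.Println(remaining)
}
```

In this model the ReplicaSet and its Pod disappear regardless of the order in which the namespace deletion issued its own delete calls.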

Risks and Mitigations

Dependency cycle

The introduction of a deletion order could cause dependency loops, especially when finalizers are specified in a way that contradicts the deletion priority.

When a lack of progress is detected (possibly caused by the dependency cycle described above), the deletion process can hang, the same as with the current behavior.

Mitigation: Delete the blocking finalizer to proceed.

Design Details

DeletionOrderPriority Mechanism

For the namespace deletion process, we would like the resources associated with the namespace to be deleted as follows:

  • Delete all pods in the namespace (in an undefined order).
  • Wait for all the pods to be stopped or deleted.
  • Delete all the other resources in the namespace (in an undefined order).

The above order will be strictly enforced as long as the feature gate is turned on.
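The "wait" step is what makes the order strict: a reconcile pass may not touch non-pod resources while any pod still exists. A minimal sketch of that gate follows, assuming a hypothetical `nextDeletions` helper and a plain map of resource name to remaining object count in place of real API calls.

```go
package main

import (
	"fmt"
	"sort"
)

// nextDeletions returns the resource names that one reconcile pass is
// allowed to delete, given how many objects of each resource remain.
// While any pod remains, only pods are eligible.
func nextDeletions(remaining map[string]int) []string {
	if remaining["pods"] > 0 {
		return []string{"pods"}
	}
	var out []string
	for name, n := range remaining {
		if n > 0 {
			out = append(out, name)
		}
	}
	sort.Strings(out) // stable output only; the real order is undefined
	return out
}

func main() {
	remaining := map[string]int{"pods": 2, "networkpolicies": 1, "configmaps": 3}
	fmt.Println(nextDeletions(remaining)) // pods still present

	remaining["pods"] = 0 // all pods have stopped or been deleted
	fmt.Println(nextDeletions(remaining))
}
```

If a pod never goes away, for example because of a finalizer, this gate never opens, which is the hang described under Handling Cyclic Dependencies.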

Handling Cyclic Dependencies

Cyclic dependencies can occur if resources within the namespace have finalizers that conflict with the DeletionOrderPriority. For example, consider the following scenario:

  • Pod A has a finalizer that depends on the deletion of Resource B.

  • Pod A is supposed to be deleted before Resource B.

In this case, the finalizers conflict with the deletion order, which leads to a cyclic dependency and causes the namespace deletion process to hang.

To mitigate the issue, the user has to resolve the deadlock manually, either by removing the finalizer or by force-deleting the blocking resources, which is the same as the current mechanism.
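When diagnosing such a hang, the pods of interest are those already marked for deletion that still carry finalizers. The sketch below shows that filter over a hypothetical, simplified `pod` type; the `blockedPods` name and the finalizer string are illustrative assumptions rather than anything defined by this KEP.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a Pod's deletion state.
type pod struct {
	Name       string
	Deleting   bool     // true once a deletion timestamp has been set
	Finalizers []string // finalizers still present on the object
}

// blockedPods returns the pods that are marked for deletion but cannot be
// removed yet because at least one finalizer is still set.
func blockedPods(pods []pod) []pod {
	var out []pod
	for _, p := range pods {
		if p.Deleting && len(p.Finalizers) > 0 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "a", Deleting: true, Finalizers: []string{"example.com/wait-for-policy"}},
		{Name: "b", Deleting: true},
		{Name: "c", Finalizers: []string{"example.com/other"}},
	}
	for _, p := range blockedPods(pods) {
		fmt.Printf("%s is blocked by %v\n", p.Name, p.Finalizers)
	}
}
```

Removing the listed finalizer from each reported pod corresponds to the manual mitigation described above.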

Test Plan

[ X ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • :
e2e tests
  • :

Graduation Criteria

Beta

  • Feature implemented behind a feature flag
  • Initial e2e tests completed and enabled
  • Complete features specified in the KEP
  • Proper metrics added
  • Additional tests are in Testgrid and linked in KEP

GA

  • Related CVE has been mitigated
  • Conformance tests

Note: Generally we also wait at least two releases between beta and GA/stable, because there’s no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Upgrade / Downgrade Strategy

In alpha, no changes are required to maintain previous behavior. The feature gate can be turned on to make use of the enhancement.

Version Skew Strategy

Not applicable

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: OrderedNamespaceDeletion
    • Components depending on the feature gate:
      • kube-apiserver
Does enabling the feature change any default behavior?

No, default behavior is the same.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, through the feature gate.

What happens if we reenable the feature if it was previously rolled back?

The namespace deletion will respect the order specified again.

Are there any tests for feature enablement/disablement?

Yes. Unit tests and integration tests will be introduced in the alpha implementation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

This feature should not impact rollout.

What specific metrics should inform a rollback?

N/A

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

N/A

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Check if the feature gate is enabled. The feature is a security fix which should not be user detectable.

How can someone using this feature know that it is working for their instance?

N/A

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The feature only affects namespace deletion and should not affect existing SLOs.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Errors and blockers will be reported in the namespace status subresource, following the existing pattern.
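As an illustration of reading such status, the sketch below filters namespace conditions down to those that indicate a deletion problem. It uses a hypothetical local `condition` type instead of the real API types. The three condition type strings are ones that, to the best of our knowledge, namespace deletion already reports today; the message text is invented for the example.

```go
package main

import "fmt"

// condition is a hypothetical, simplified namespace status condition.
type condition struct {
	Type    string
	Status  string
	Message string
}

// deletionConditionTypes lists condition types assumed to signal that
// namespace deletion is blocked or failing.
var deletionConditionTypes = map[string]bool{
	"NamespaceDeletionContentFailure": true,
	"NamespaceContentRemaining":       true,
	"NamespaceFinalizersRemaining":    true,
}

// deletionBlockers returns the conditions that are currently true and
// belong to the deletion-related types listed above.
func deletionBlockers(conds []condition) []condition {
	var out []condition
	for _, c := range conds {
		if c.Status == "True" && deletionConditionTypes[c.Type] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	conds := []condition{
		{Type: "NamespaceContentRemaining", Status: "True", Message: "illustrative: pods still present"},
		{Type: "NamespaceFinalizersRemaining", Status: "False"},
	}
	for _, c := range deletionBlockers(conds) {
		fmt.Printf("%s: %s\n", c.Type, c.Message)
	}
}
```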

Are there any missing metrics that would be useful to have to improve observability of this feature?

Namespace status will be used to capture possible errors or blockers during deletion.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

The namespace controller will act exactly the same with/without this feature.

What are other known failure modes?

With the feature gate enabled, namespace deletion might hang if the deletion of pod resources runs into issues.

What steps should be taken if SLOs are not being met to determine the problem?

Delete the blocking resources manually.

Implementation History

Drawbacks

Alternatives

Using finalizers to define the deletion ordering

Finalizers could potentially solve this problem or serve as a workaround. The alternative solution is to run a controller that watches NetworkPolicies and adds a finalizer to make sure Pods are always deleted before their NetworkPolicy. However, it is not the best way to go because:

  • Users would always need to introduce a customized controller, and it is hard to educate everyone to follow the best practices
  • It could not completely address the previous behavior
  • It is not generic enough if requests for deletion ordering of other resources come in later

Infrastructure Needed (Optional)