KEP-3716: Admission Webhook Match Conditions

Release Signoff Checklist
Summary
Motivation
Proposal
- API
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Future Work
- Cross-webhook match conditions
Alternatives

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests for meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding “match conditions” to admission webhooks, as an extension to the existing rules to define the scope of a webhook. A matchCondition is a CEL expression that must evaluate to true for the admission request to be sent to the webhook. If a matchCondition evaluates to false, the webhook is skipped for that request (implicitly allowed).

Motivation

Reliability: Admission webhooks continue to be an operational sore spot for many Kubernetes users. Webhooks that target cluster critical resources put the admission controller backing the webhook in the critical path of cluster stability. Even if tools like namespace scoping are used to avoid circular-dependencies and exclude critical system resources, a webhook outage can still have a major impact on cluster availability. This proposal aims to mitigate (but not eliminate) these issues by allowing webhooks to be more narrowly scoped and targeted.

Performance: Admission webhooks sit in the critical request path for write-requests. Validating webhooks can be run in parallel, but Mutating webhooks must be run in serial (up to 2 times!). This makes webhooks extremely latency sensitive, and even a webhook that doesn’t do any work still needs to pay the network round-trip cost.

Supportability: For hosted or managed Kubernetes distributions, webhooks can be a problem when they interfere with requests by managed components. The existing criteria for filtering out requests are insufficient for many use cases, and aren’t easily appended with provider rules.

What about CEL for Admission Control ? ValidatingAdmissionPolicy is an exciting new feature which we hope will greatly reduce the need for admission webhooks, but it is intentionally not attempting to cover every possible use case. This proposal aims to improve the situation for those webhooks that cannot be migrated.

User Stories

Exclude resources from a wildcard rule

I want to enforce metadata policy through an admission webhook without adding latency & risk to high QPS system requests.

Currently, if a webhook uses wildcard match rules, there is no way to filter out a subset of resources or requests from matching the wildcard. If the webhook instead enumerates every resource that should match, it must be kept up-to-date with every CRD that’s added.

With CEL match conditions, the webhook could specify wildcard match rules, and add match conditions to filter out the desired resources:

rules:
  # Match CREATE & UPDATE on all resources:
  - operations:
    - CREATE
    - UPDATE
    apiGroups: '*'
    apiVersions: '*'
    resources: '*'
matchConditions:
  - name: 'exclude-leases'
    expression: '!(request.resource.group == "coordination.k8s.io" && resource.resource == "leases")'

Exempt system users from security policy

As a managed cluster provider, I want to prevent user webhooks from intercepting critical system requests.

System resources can currently be exempted through a namespace or label selector, but requests by system components against non-system resources cannot be. For example, update pod status requests by Kubelets cannot be excluded from user webhooks intercepting all pod requests.

With matchConditions, a managed cluster could append system-exclusion rules to each webhook. For example:

matchConditions:
  - name: 'exclude-kubelet-requests'
    expression: '!("system:nodes" in request.userInfo.groups)'

Since the expression will be evaluated using a common Kubernetes CEL library, these expressions should also get automatic access to the secondary authorization check mechanism described in KEP-3488: CEL for Admission Control . In practice, this means that RBAC bindings can be used to opt-out privileged users from security policy:

matchConditions:
  # Requests by users without breakglass should be included.
  - name: 'breakglass'
    expression: 'authorizer.resource('admissionregistration.k8s.io', 'validatingwebhookconfigurations', '*').name('security-policy').check('breakglass').denied()'

Scope an NFS access management webhook to Pods mounting NFS volumes

I want to narrowly scope my webhook to only the relevant requests, in order to reduce load on the webhook and reduce latency in irrelevant requests.

Concrete example:

A NFS deployment uses an third-party access management system. I have an admission webhook that performs an access check for against the external system for pods that mount NFS volumes. Only pods with NFS volumes need to be checked.

Currently, there is no way to achieve this. Many webhook implementations today start by checking that the request is within scope, and return early if it’s not. This adds latency and an additional failure point to irrelevant requests. This example requires an external integration, and thus is not a candidate for migration to CEL ValidatingAdmissionPolicy.

With match conditions, the expressions can check whether the request object is in-scope for the webhook:

rules:
  - operations: ['CREATE']
    apiGroups: '' # core
    apiVersions: '*'
    resources: 'pods'
matchConditions:
  # Only include pods with an NFS volume.
  - name: 'nfs-volume-present'
    expression: 'object.spec.volumes.exists(v, v.has(nfs))'

Goals

Provide a filtering mechanism for excluding requests from an admission webhook
Maintain consistency with ValidatingAdmissionPolicy

Non-Goals

Provide a mechanism to exclude requests from all webhooks.

Proposal

API

Both ValidatingWebhook and MutatingWebhook (in admissionregistration.k8s.io) will be updated with a new MatchConditions field:


type ValidatingWebhook struct {
  // ...

  // MatchConditions is a list of conditions on the AdmissionRequest ('request') that must be met
  // for a request to be sent to this webhook. All conditions in the list must evaluate to TRUE for
  // the request to be matched.
  // +optional
  // +patchMergeKey=name
  // +patchStrategy=merge
  MatchConditions []MatchCondition `json:"matchConditions,omitempty"`
}

type MutatingWebhook struct {
  // ...
  MatchConditions []MatchCondition `json:"matchConditions,omitempty"`
}

// MatchCondition represents a condition which must by fulfilled for a request to be sent to a webhook.
type MatchCondition struct {
  // Name is an identifier for this match condition, used for strategic merging of MatchConditions,
  // as well as providing an identifier for logging purposes. A good name should be descriptive of
  // the associated expression.
  // Name must be a valid RFC 1123 DNS subdomain, and unique in a set of MatchConditions.
  // Required.
  Name string `json:"name"`
  // NOTE: Placeholder documentation, to be replaced by https://github.com/kubernetes/website/issues/39089.
  //
	// Expression represents the expression which will be evaluated by CEL.
	// ref: https://github.com/google/cel-spec
	// CEL expressions have access to the contents of the AdmissionRequest, organized into CEL variables:
	//
	// 'object' - The object from the incoming request. The value is null for DELETE requests.
	// 'oldObject' - The existing object. The value is null for CREATE requests.
	// 'request' - Attributes of the admission request([ref](https://raw.githubusercontent.com/kubernetes/enhancements/master/pkg/apis/admission/types.go#AdmissionRequest)).
	//
	// Required.
	Expression string `json:"expression"`
}

The match condition expression is evaluated by the same libraries as those used for CEL ValidatingAdmissionPolicy. The only difference in expressions is the availability of the params variable. Expressions requiring access to additional information outside the AdmissionRequest must be performed in the webhook, and are out of scope for this proposal.

Risks and Mitigations

Security

Risk: Attacker adds or changes a match condition to weaken an admission policy.

This does not represent a new threat, as doing so would require update access to the admission registration object, and with that permission an attacker could already disable the policy through manipulating match rules, namespace selector, or object selector (or reroute the webhook entirely).

Risk: Logic error in match condition expression.

Currently the match conditions must be encoded in the webhook backend itself. Moving the logic into a CEL expression adds a potential failure point. This can be mitigated by testing, but the CEL ecosystem currently lacks some of the tools that would make this easier.

Of particular significance are match conditions tied to non-functional properties of an object, such as using labels to decide whether to opt an object out of a policy. Without additional admission controls on who can set those non-functional aspects, exempting the policy based on that could be a security vulnerability. In contrast, the NFS example usecase exempts the policy on a functional aspect - whether an NFS volume is mounted, and thus whether the policy is relevant.

These risks are inherent to the feature being proposed and cannot be mitigated through technical means, but should be highlighted in the documentation.

Debuggability

We do not normally log, audit, or emit an event when a webhook is out-of-scope for a request, and the same will mostly be true for match conditions.

At log level V(5) , we will emit a log when a request that would otherwise be in-scope for a webhook is excluded for a non-matching match condition.

Short of increasing log verbosity, the recommended debug strategy is to capture or reproduce a relevant AdmissionRequest (for example, in a non-prod cluster disable all match conditions and log the requests from a webhook). Then, manually test the match conditions against the request, and iterate as necessary.

Performance

The CEL expression evaluation will leverage the same Resource Constraints used by CEL CRD Validation & CEL Admission Control. The runtime cost budgets are defined here CEL Runtime Cost .

The per call limit is shared with Validating Admission Policy CEL expressions and set at roughly 0.1 second for each expression evaluation call. The total budget per object (i.e. per ValidatingWebhook) for CEL match conditions is roughly .25 seconds and 1/4 the budget of Validating Admission Policy limit.

Design Details

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

Integration tests

Test cases to add:

Feature gate enablement / disablement is a no-op when no matchConditions are set (until graduation to GA as feature gate will go away)
Feature gate enablement / disablement works as expected when matchConditions are set (until graduation to GA as feature gate will go away)
Single match condition:
- Request out of scope without matchConditions
- Request in scope without matchConditions, but not matching
- Request in scope without matchConditions, and also matching
Multiple match conditions, covering the same cases as the single-condition case

e2e tests

We will test the edge cases mostly in integration tests and unit tests.

Once the feature is default enabled in beta, a single E2E test covering the single-match-condition cases outlined above will be added.

Graduation Criteria

Alpha

Feature implemented behind AdmissionWebhookMatchConditions feature flag
Integration tests implemented

Beta

Add E2E test coverage
Resolve resource constraints validation
Smart reload/recompile of Webhook Accessors, see issue
ValidatingAdmissionPolicy is promoted to Beta.

GA

Promote appropriate E2E tests to conformance
- https://github.com/kubernetes/kubernetes/blob/master/test/e2e/apimachinery/webhook.go
- “should be able to create and update validating webhook configurations with match conditions”
- “should be able to create and update mutating webhook configurations with match conditions”
- “should reject validating webhook configurations with invalid match conditions”
- “should reject mutating webhook configurations with invalid match conditions”
- “should mutate everything except ‘skip-me’ configmaps”
Cover any missing test coverage

Upgrade / Downgrade Strategy

Downgrading in a way that disables match conditions after it is already in use can increase the scope of requests evaluated by a webhook. See Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? for more details

Version Skew Strategy

The new field is only evaluated by the apiserver, so only HA apiserver version skew is relevant. In this case, if the feature is enabled in one apiserver and not another, a request could non-deterministically be sent to a webhook. Enabling match conditions without setting matchConditions on an webhooks is a no-op, so the version skew non-determinism is best avoided by waiting until it has been enabled in all apiservers before starting to use the new field.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: AdmissionWebhookMatchConditions
- Components depending on the feature gate: kube-apiserver

Does enabling the feature change any default behavior?

No. If the feature is enabled, but the matchConditions field is unset, the default behavior remains unchanged.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Disabling the feature gate will ignore any matchConditions set, and return to the default behavior. Disabling AdmissionWebhookMatchConditions could increase the traffic to the webhook, and potentially increase the error rate if the webhook fails to process the additional requests correctly.

What happens if we reenable the feature if it was previously rolled back?

Any matchConditions that were already stored on existing webhooks will be enforced.

Note: enabling matchConditions can only reduce the number of requests being sent to a webhook (or remain unchanged). Enabling it will never increase the number of requests.

Are there any tests for feature enablement/disablement?

We will add tests that verify the functionality is turned off when feature gate is toggled off and turned on when toggled on.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

In general, rollout / rollback should not fail since the feature is not enabled by default. However, there are risks on rollback if webhook preconditions was enabled and then unexpectedly disabled on rollback.

What specific metrics should inform a rollback?

webhook_admission_match_condition_evaluation_errors_total is high
webhook_admission_match_condition_exclusions_total is too high or too low
webhook_admission_match_condition_evaluation_seconds is high

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet, but manual testing should be completed and documented prior to beta.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

A new per-webhook metric will measure the number of requests excluded by match conditions:

Metric name: webhook_admission_match_condition_exclusions_total Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

Metric name: webhook_admission_match_condition_evaluation_errors_total Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

Metric name: webhook_admission_match_condition_evaluation_seconds Labels:

name: webhook name
type: validate or admit
kind: match condition on a webhook or policy
operation: the admission operation

How can an operator determine if the feature is in use by workloads?

The metric webhook_admission_match_condition_evaluation_seconds should indicate if the match conditions are being used and being evaluated for invoking webhooks.

How can someone using this feature know that it is working for their instance?

Other (treat as last resort)
- Details:
  - Check the preconditions field in the webhook object and check the webhook_admission_match_condition_exclusions_total metric for exclusions
  - Check webhook_admission_match_condition_evaluation_errors_total

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Only negligible impact to admission latency due to evaluation of CEL rules
CEL evaluation time (webhook_admission_match_condition_evaluation_seconds)
CEL evaluation errors (webhook_admission_match_condition_evaluation_errors_total)

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric name:
  - webhook_admission_match_condition_evaluation_seconds
  - webhook_admission_match_condition_evaluation_errors_total
- [Optional] Aggregation method:
- Components exposing the metric: kube-apiserver

Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes, the following metrics will be considered for Beta which will improve observability of this feature:

webhook_admission_match_condition_evaluation_seconds
webhook_admission_match_condition_exclusions_total

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Will enabling / using this feature result in introducing new API types?

No, this feature only adds new fields to existing webhook APIs

Will enabling / using this feature result in any new calls to the cloud provider?

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, it can increase size of webhook configuration objects. There is a limit in place for the number of preconditions a webhook can have, however, webhook objects can still increase in size significantly if large expressions are used.

API types(s): ValidatingWebhookConfiguration, MutatingWebhookConfiguration Estimated increase in size: depends on size of CEL expressions, but should be negligible in most cases Estimated amount of new objects: none

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, it can impact latency SLI/SLO if evaluating CEL expressions add significant latency.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes, it has potential to increase CPU usage in kube-apiserver if there is a webhook intercepting many requests with many precondition rules.

Troubleshooting

See Debuggability .

How does this feature react if the API server and/or etcd is unavailable?

N/A – since the feature is part of kube-apiserver.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Feature can be disabled per webhook by removing preconditions or for all webhooks by disabling the feature gate in kube-apiserver.

Implementation History

Drawbacks

Future Work

Cross-webhook match conditions

In the future, we should explore ways to apply common match conditions across multiple webhooks.

Example use cases:

Apply a break-glass exemption across many (or all) webhooks.
Managed cluster provider wants to exempt provider-managed resources from user-managed webhooks.

Considerations:

Access by managed cluster provider vs. cluster admin
Side effects & mutations

Alternatives

Exclusion Expressions

The matchCondition expression could be inverted, so that requests that match are excluded rather than included. In this case, we would probably also want to change from requiring all expressions to match, to excluding the request if any match.

Although this approach would simplify some usecases, such as excluding resources from a wildcard rule or exempting system users from a security policy , it means other expressions would become double-negatives, which generally goes against API design best-practices.

Resource Exclusions

KEP-3693 Proposes an alternative approach using a more structured format for expressing resource exclusions. This approach may be more approachable to users who are not comfortable writing CEL expressions, but it is significantly less powerful. This would address Exclude resources from a wildcard rule , and could be extended with subject exclusions to address Exempt system users from security policy , but would not be sufficient to address Scope an NFS access management webhook to Pods mounting NFS volumes .

These two approaches are not mutually exclusive.

CEL Admission Control

KEP-3488: CEL for Admission Control adds the ability for admission webhooks to be replaced entirely by CEL expressions, but this is not intended to cover 100% of webhook use cases. For example, the user story described in Scope an NFS access management webhook to Pods mounting NFS volumes requires integrating with a third-party system, and is not implementable through a CEL ValidatingAdmissionPolicy.

With a mutating CEL admission policy (not yet implemented), a combination of mutating & validating policies could ensure that objects have a designated scoping label applied, which could be filtered using the ObjectSelector on the webhook. However, such an approach adds a lot of overhead and complexity beyond this proposal.