KEP-3962: Mutating Admission Policies

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
- Alternative 2: Introduce new syntax
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This enhancement adds mutating admission policies, declared using CEL expressions, as an alternative to mutating admission webhooks. This continues the work started by KEP-3488 API for validating admission policies.

This enhancement proposes an approach where mutations are declared in CEL by combining CEL’s object instantiation, JSON Patch, and Server Side Apply’s merge algorithms.

Motivation

A large proportion of mutating admission needs are for relatively simple operations such as setting a label, setting a field, or adding a sidecar container to a pod. These mutations can be expressed trivially in only a few lines of CEL, eliminating the developmental and operational complexity of a webhook.

Offering CEL-based mutation also has other fundamental advantages over webhooks. CEL mutations can be declared in a way that allows the kube-apiserver to introspect the mutation and extract useful information about which fields the mutation operation reads and writes. This information can be leveraged to do things like finding an order for mutating admission policies that minimizes the need for reinvocation. Also, in-process mutation is sufficiently fast, especially when compared with webhooks, that it is reasonable to re-run mutations to do things like validate that multiple mutation policies are still applied after all other mutating admission operations have been applied.

Goals

Provide an alternative to mutating webhooks for the vast majority of mutating admission use cases.
Provide the in-tree extensions needed to build policy frameworks for Kubernetes, again without requiring webhooks for the vast majority of use cases.
Provide an out-of-tree implementation of this enhancement (using a webhook) that is supported by the Kubernetes org to provide this enhancement functionality to Kubernetes versions where this enhancement is not available.
Provide core functionality as a library so that use cases like GitOps, CI/CD pipelines, and auditing can run the same CEL validation checks that the API server does.

Non-Goals

Build a comprehensive in-tree policy framework. We believe the ecosystem is best equipped to explore and develop policy frameworks.
Full feature parity with mutating admission webhooks. For example, this enhancement is not expected to ever support making requests to external systems.
Replace the admission controllers compiled into the API server.
Static or on-initialization specification of admission config. This is a needed feature but should be solved in a general way and not in this KEP (xref: https://github.com/kubernetes/enhancements/issues/1872).

Proposal

Summary

Before getting into all the individual fields and capabilities, let’s look at the general “shape” of the API. Very similar to what we have in ValidatingAdmissionPolicy, this API separates policy definition from policy configuration by splitting responsibilities across resources. The resources involved are:

Policy definitions (MutatingAdmissionPolicy)
Policy bindings (MutatingAdmissionPolicyBinding)
Policy param resources (custom resources or config maps)

The idea is to leverage CEL’s object construction and allow users to define how they want to mutate the admission request through CEL expressions. This proposal aims to allow mutations to be expressed using JSON Patch, or the “apply configuration” introduced by Server Side Apply. Users define only the fields they care about inside the MutatingAdmissionPolicy. The object is constructed using CEL, similar to a Server Side Apply configuration patch, and is then merged into the request object using the structural merge strategy. See sigs.k8s.io/structured-merge-diff for more details.

Note: See the alternative consideration section for the alternatives.

Pros:

JSON Patch provides a migration path from mutating admission webhooks, which must use JSON Patch.
Also builds on Server Side Apply, so that we continue investing in SSA as the best way to do patch updates to resources;
- Does not require the users to learn a new syntax;
- Inherit the declarative nature;
- Leverages existing merging strategy, markers and openapi extensions.

Cons:

Lack of deletion support (see the unsetting values section for the details and workaround);
Migration effort from Mutation Webhook

Phase 1

API Shape

Similar to the validations field in ValidatingAdmissionPolicy, a mutations field will be defined inside MutatingAdmissionPolicy which allows users to define a list of mutations that apply to the specific resources. Each mutation field contains a CEL expression which evaluates to a partially populated object representing a Server Side Apply “apply configuration”. The apply configuration is then merged into the request object.

Here is an example of injecting an initContainer.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingAdmissionPolicy
metadata:
  name: "sidecar-policy.example.com"
spec:
  paramKind:
    group: mutations.example.com
    kind: Sidecar
    version: v1
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE"]
      resources:   ["pods"]
  matchConditions:
    - name: does-not-already-have-sidecar
      expression: "!object.spec.initContainers.exists(ic, ic.name == params.name)"
  failurePolicy: Fail
  reinvocationPolicy: IfNeeded
  mutations:
    - patchType: "ApplyConfiguration" # "ApplyConfiguration" and "JSONPatch" are supported.
      expression: >
        Object{
          spec: Object.spec{
            initContainers: [
              Object.spec.initContainers{
                name: params.name,
                image: params.image,
                args: params.args,
                restartPolicy: params.restartPolicy
                // ... other container fields injected here ...
              }
            ]
          }
        }

The field patchType specifies which strategy is used for the mutation. Supported values are “ApplyConfiguration” and “JSONPatch”. The “ApplyConfiguration” strategy prevents users from performing ambiguous actions such as manipulating an atomic list. The detailed definition of an ambiguous action should be reviewed before beta. Any mutation that requires such an ambiguous action must use the “JSONPatch” strategy.

The “JSONPatch” strategy uses JSON Patch, as mutating webhooks do.

Example JSON Patch:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingAdmissionPolicy
metadata:
  name: "sidecar-policy.example.com"
spec:
  paramKind:
    group: mutations.example.com
    kind: Sidecar
    version: v1
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE"]
      resources:   ["pods"]
  matchConditions:
    - name: does-not-already-have-sidecar
      expression: "!object.spec.initContainers.exists(ic, ic.name == params.name)"
  failurePolicy: Fail
  reinvocationPolicy: IfNeeded
  mutations:
    - patchType: "JSONPatch"
      expression: >
        JSONPatch{op: "add", path: "/spec/initContainers/-", value: Object.spec.initContainers{
                name: params.name,
                image: params.image,
                args: params.args,
                restartPolicy: params.restartPolicy
                // ... other container fields injected here ...
        }}
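
The effect of the `add` operation with a path ending in `/-` can be illustrated outside the API server. The following is a minimal Python sketch of a single RFC 6902 `add` operation on plain dicts and lists; the function name `apply_json_patch_add` is hypothetical, the path is assumed to be non-empty, and this is not the code path the kube-apiserver uses.

```python
def apply_json_patch_add(doc, path, value):
    """Apply one RFC 6902 "add" operation to doc in place and return doc."""
    # Split the JSON Pointer and undo its escaping (~1 -> "/", then ~0 -> "~").
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    target = doc
    # Walk to the parent of the location named by the last token.
    for token in tokens[:-1]:
        target = target[int(token)] if isinstance(target, list) else target[token]
    last = tokens[-1]
    if isinstance(target, list):
        if last == "-":
            target.append(value)             # "-" appends to the end of the array
        else:
            target.insert(int(last), value)  # a numeric index inserts before that element
    else:
        target[last] = value                 # adds or replaces an object member
    return doc


pod = {"spec": {"initContainers": [{"name": "myapp-initializer"}]}}
apply_json_patch_add(pod, "/spec/initContainers/-", {"name": "mesh-proxy"})
```

Note that the parent array must already exist: in this sketch, a pod without `spec.initContainers` raises a `KeyError` instead of creating the list.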

When “ApplyConfiguration” is specified, the expression evaluates to an object that has the same type as the incoming object, and the type alias Object refers to the type (see Type Handling for details).

By using Server Side Apply merge algorithms, schema declarations like x-kubernetes-list-type: map, that control how a merge is performed, will be respected.
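
To make the effect of `x-kubernetes-list-type: map` concrete, here is a simplified Python sketch of a keyed list merge. It is an illustration only: the function name `merge_list_map` is hypothetical, entries are overlaid shallowly, and appending unmatched entries at the end is an assumption of the sketch rather than a statement about the ordering produced by structured-merge-diff.

```python
def merge_list_map(existing, patch, key="name"):
    """Merge two lists of dicts whose entries are identified by `key`."""
    merged = [dict(item) for item in existing]           # copy so the input is untouched
    index = {item[key]: i for i, item in enumerate(merged)}
    for item in patch:
        if item[key] in index:
            merged[index[item[key]]].update(item)        # same key: overlay the patch fields
        else:
            index[item[key]] = len(merged)
            merged.append(dict(item))                    # new key: add as a new entry
    return merged


containers = [{"name": "myapp", "image": "example/myapp:v1.0.0"}]
patch = [{"name": "myapp", "imagePullPolicy": "Always"}]
result = merge_list_map(containers, patch)
```

Because entries are matched by `name`, the patch only needs to carry the key and the fields it wants to set; the existing `image` field is preserved.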

However, unlike with server side apply, these mutations will not have a field manager specified. This has important implications in how the merge is performed that will be discussed in more detail in the below “Unsetting values” section.

Note: Mutation policies will generally handle field managers the same way mutating webhooks do.

In this example, note that:

Object{}, Object.spec{} and similar are CEL object instantiations, and are used to create a subset of the fields of a Pod.
object refers to the state of the object before the mutation policy is applied.
oldObject refers to the state of the object currently in etcd.
params refers to the param resource.

To use this MutatingAdmissionPolicy we first must create a policy binding and Sidecar parameter resource:

# Policy Binding
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingAdmissionPolicyBinding
metadata:
  name: "sidecar-binding-test.example.com"
spec:
  policyName: "sidecar-policy.example.com"
  paramRef:
   name: "meshproxy-test.example.com"
   namespace: "default"
---
# Sidecar parameter resource
apiVersion: mutations.example.com/v1
kind: Sidecar
metadata: 
  name: meshproxy-test.example.com
spec:
  name: mesh-proxy
  image: mesh/proxy:v1.0.0
  args: ["proxy", "sidecar"]
  restartPolicy: Always

Next, we can test the policy with a pod:

kind: Pod
spec:
  initContainers:
  - name: myapp-initializer
    image: example/initializer:v1.0.0
  containers:
  - name: myapp
    image: example/myapp:v1.0.0

Since the API request to create the pod matches the MutatingAdmissionPolicy the pod will be mutated, resulting in:

kind: Pod
spec:
  initContainers:
  - name: mesh-proxy
    image: mesh/proxy:v1.0.0
    args: ["proxy", "sidecar"]
    restartPolicy: Always
  - name: myapp-initializer
    image: example/initializer:v1.0.0
  containers:
  - name: myapp
    image: example/myapp:v1.0.0

Old object state

Sometimes the current state of the object will be needed. This is available via the object variable. For example, to update all containers in a pod to use the “Always” imagePullPolicy:

Object{
  spec: Object.spec{
    containers: object.spec.containers.map(c,
      Object.spec.containers.item{
          name: c.name,
          imagePullPolicy: "Always"
      }
    )
  }
}

Construct Typed Object

In the mutation expressions, CEL supports constructing the object with a named type. At the language level, the named type can be anything that the CEL library registers with the environment. For MutatingAdmissionPolicy, Object is the alias of the type that the incoming object conforms to.

With the object construction syntax, field names are no longer quoted because they are no longer map keys. If any of the object construction violates the defined schema, the expression compilation will error and the user can retrieve the error from the rejection message. For list types, i.e. with the OpenAPI type of “array”, the special item field resolves the type of its items. In Alpha 1, the CEL environment and its type providers compile the constructed object, but make no effort to check if the field names and types match those of the schemas (i.e. everything is still Dyn). See Construct Type Enforcement in Alpha 2 for future plans.

Bindings

Bindings will be almost the same as ValidatingAdmissionPolicyBinding, but with the following difference:

No validationActions field (unless anyone can think of any useful way to offer a dry-run type option)

Parameterization

Similar to ValidatingAdmissionPolicy, the mutation admission policy can refer to a param, and the param object can be specified per-namespace. We expect to fully reuse existing params handling logic from ValidatingAdmissionPolicy.

Reinvocation

The existing reinvocation policy established between webhooks and admission controllers will be extended to also handle admission policies (see the current reinvocation mechanism for webhooks). Admission policies will be reinvoked after admission controllers and before webhooks.

With mutating admission policies added, the mutating admission plugin order will become:

Mutating admission controllers (e.g. DefaultIngressClass, DefaultStorageClass, etc)
Mutating admission policies (introduced within this enhancement and the order will be discussed below)
Mutating admission webhooks (ordered lexicographically by webhook name)

To allow mutating admission plugins to observe changes made by other plugins, built-in mutating admission plugins are re-run if a mutating webhook modifies an object. The same applies to mutating policies: they are re-run if a mutating webhook or another mutating policy modifies an object.

For the running order within mutating admission policies, two options are proposed:

option 1 (suggested by @deads2k): random order, keeping the same random order during reinvocation.
- Pros:
  - Encourages users to write order-independent mutations
- Cons:
  - The final state of the request is not deterministic
  - Mutations must not depend on one another
option 2: lexicographical order of the resource names
- Pros:
  - Aligns with the behavior of mutating webhooks
- Cons:
  - Users have to be mindful of the order if dependencies exist
  - Users get a hacky way to enforce the order (by choosing resource names)

Considering that it would be easier to start with random order and later switch to a particular order, random order (option 1) is the assumed starting point.

Note: If the mutations run in random order and are not written to be idempotent, the result might differ between two admission requests. Please refer to the Safety section for ways to check idempotence.
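
Option 1 can be sketched as follows: the order is shuffled once per request and the same order is reused if the policies are reinvoked. This Python sketch is an illustration under simplifying assumptions, not the planned implementation. The function name `run_policies` is hypothetical, policies are modeled as plain callables, and reinvocation is reduced to "rerun the whole list at most once if the first pass changed the object".

```python
import copy
import random


def run_policies(obj, policies, seed=None):
    """Run mutation callables in one random order, reinvoking once if anything changed."""
    order = list(policies)
    random.Random(seed).shuffle(order)   # the order is chosen once per request
    invoked = []                         # names of policies in the order they ran
    for _ in range(2):                   # first pass plus at most one reinvocation
        before = copy.deepcopy(obj)
        for policy in order:             # the same order is reused on reinvocation
            obj = policy(obj)
            invoked.append(policy.__name__)
        if obj == before:
            break                        # nothing changed, so no reinvocation is needed
    return obj, invoked
```

With idempotent, order-independent policies, the second pass leaves the object unchanged and the final state does not depend on the shuffle.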

Metrics

Goals:

Parity with validating admission policy metrics
- Should include counter of deny, success violations
- Label by {policy, policy binding, mutation expression} identifiers
Counters for number of policy definitions and policy bindings in cluster
- Label by state (active vs. error), enforcement action (deny, warn)
Counters for Variable Composition
- Should include a counter of variable resolutions to measure time saved.
- Label by policy identifier

Phase 2

All these capabilities are required and should be discussed thoughtfully before Beta, but will not be implemented in the first alpha release of this enhancement due to the size and complexity of this enhancement.

Construct Type Enforcement

The type alias Object and its descendants, in Phase 2, are now real types that derive from the resolved OpenAPI schemas. If any type violations happen in the constructed object, the CEL checker will raise the errors before the expressions evaluate.

For bigger schemas, the construction of CEL types can be expensive. Resolving and parsing apps/v1.Deployment has been measured to take ~100ms. Optimizations like caching or lazy sub-schema resolution can be candidates for beta/GA graduation criteria.

Unsetting values

Since there is no field manager used for the merge, the server side apply merge algorithm will only add and replace values. To unset values, JSON Patch mutations must be used.
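
The difference can be illustrated with a small Python sketch on plain dicts. It is a simplification, not the structured-merge-diff algorithm: `merge_add_replace` and `json_patch_remove` are hypothetical names, and the remove helper only handles paths made of object member names.

```python
def merge_add_replace(obj, patch):
    """Recursively overlay `patch` onto `obj`; fields absent from `patch` are kept."""
    merged = dict(obj)
    for field, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(field), dict):
            merged[field] = merge_add_replace(merged[field], value)
        else:
            merged[field] = value        # add a new field or replace an existing one
    return merged


def json_patch_remove(obj, path):
    """Apply an RFC 6902 "remove" in place for a path of object member names."""
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    target = obj
    for token in tokens[:-1]:
        target = target[token]
    del target[tokens[-1]]
    return obj


pod = {"metadata": {"annotations": {"keep": "1", "annotation-to-unset": "2"}}}
# Omitting a field from the patch does not unset it.
merged = merge_add_replace(pod, {"metadata": {"annotations": {"keep": "3"}}})
# Removing it requires an explicit JSON Patch "remove" operation.
json_patch_remove(merged, "/metadata/annotations/annotation-to-unset")
```

After the merge, `annotation-to-unset` is still present; only the explicit remove operation deletes it.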

Safety

To ensure mutations are not “broken” by other mutations (overwritten, undone, or otherwise invalidated), and to keep the final state deterministic despite the random running order, we provide an option to check whether rerunning a mutation policy changes the object. This also helps ensure that mutations are written in an idempotent way, given the random running order.

These validation checks will be declared using a mutationValidationPolicy field, which is an enum of the following values:

Fail - Replaying the mutation on the mutated object should result in an identical object, if not, fail the request.
Warn - Replaying the mutation on the mutated object should result in an identical object, if not, pass the request with a warning message.
Skip - Don’t replay the mutation.

For example:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingAdmissionPolicy
metadata:
  name: "sidecar-policy.example.com"
spec:
  # ...
  mutations:
    - patchType: "ApplyConfiguration"
      expression: >
        Object{
          spec: Object.spec{
            initContainers: [
              Object.spec.initContainers{
                name: params.name,
                image: params.image,
                args: params.args,
                restartPolicy: params.restartPolicy
                // ... other container fields injected here ...
              }
            ] + object.spec.initContainers
          }
        }
  mutationValidationPolicy: Fail

If rerunning the mutation policy changes the object, the request fails.
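
The replay check behind the three enum values can be sketched in Python as follows. This is a minimal illustration on plain dicts; `check_replay`, `prepend_sidecar`, and `set_pull_policy` are hypothetical names, and the mutations are ordinary functions rather than CEL expressions merged via Server Side Apply.

```python
import copy


def check_replay(mutate, mutated_obj, policy="Fail"):
    """Replay `mutate` on an already-mutated object; return (allowed, warning)."""
    if policy == "Skip":
        return True, None                        # the mutation is not replayed at all
    replayed = mutate(copy.deepcopy(mutated_obj))
    if replayed == mutated_obj:
        return True, None                        # the replay was a no-op: idempotent
    message = "replaying the mutation changed the object; it is not idempotent"
    if policy == "Warn":
        return True, message                     # admit the request with a warning
    return False, message                        # "Fail": reject the request


def prepend_sidecar(pod):
    # Not idempotent: prepends unconditionally, so a replay adds a second copy.
    pod["initContainers"] = [{"name": "mesh-proxy"}] + pod.get("initContainers", [])
    return pod


def set_pull_policy(pod):
    # Idempotent: setting the same value twice gives the same object.
    pod["imagePullPolicy"] = "Always"
    return pod
```

For example, `check_replay(prepend_sidecar, prepend_sidecar({}), "Fail")` rejects the request, while the same call with `set_pull_policy` admits it.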

To validate an object after all mutations are guaranteed complete, we highly recommend using a validating admission policy to validate the final state of the object.

CEL Library Change

We expect this feature to require minimal changes to the core or the Kubernetes-specific CEL library. However, this feature uses the optional library in a way that the library was not designed for. We acknowledge the risk that not all current or future features of the optional library will be available.

We will evaluate the existing CEL library to see if any specific functions should be added for mutation use cases. A potential candidate is a hashing function, which might be helpful given the recent discussion of controller sharding. Ref: https://github.com/timebertt/kubernetes-controller-sharding/blob/main/docs/design.md#the-clusterring-resource-and-sharder-webhook (the kubernetes-controller-sharding project could also eliminate their webhook if we supported this).

For expressions over deeply nested lists and maps, library functions that flatten lists, or accumulation-like functions, might be useful to add.

The suggested best practice for using MutatingAdmissionPolicy is to also set up a ValidatingAdmissionPolicy to validate that the request matches the desired result. However, having both MutatingAdmissionPolicy and ValidatingAdmissionPolicy with parameterization would result in 6 new resources (policy, binding, and params for both mutating and validating). A possible path would be to allow one binding to bind both a MutatingAdmissionPolicy and a ValidatingAdmissionPolicy, which could be discussed further before going to Beta.

Type Handling

The type system works differently than ValidatingAdmissionPolicy in the following aspects.

1. Schema Enforcement of Structural Merge Diff

The resulting object will be converted into typed objects as understood by SMD, with existing SMD schema validations still effective. Should SMD return an error during the conversion, the error handling follows the failure policy. Note that it is possible to pass CEL expression compilation but still fail the schema validation if variable composition is used; see Composition variables below.

2. Type Checking Before Runtime

Similar to ValidatingAdmissionPolicy, a controller running in kube-controller-manager compiles the expressions against the types defined in matchConstraints. The number of types to check is also heavily limited to prevent the checks from taking up too much computing time in the KCM.

Composition variables

To control the size of mutation expression, and to better reuse parts of the expressions, sub-expressions can be extracted into a separate variables section, similar to variables of ValidatingAdmissionPolicy.

variables:
  - name: targetContainers
    expression: >-
      object.spec.template.containers.filter(c,
      c.image.contains("example.com"))
  - name: transformedContainers
    expression: >
      variables.targetContainers.map(c, {"name": c.name, "env": {"name": "FOO",
      "value": "foo"}})
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
          spec: Object.spec{
              template: Object.spec.template{
                  containers: variables.transformedContainers
              }
          }
      }

With variable composition, it is possible to escape from compile-time type checking. For example:

variables:
  - name: definitelyNotAContainer # resulting type is Dyn
    expression: >-
      params.foo == "bar" ? true : "true"
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
          spec: Object.spec{
              template: Object.spec.template{
                  containers: [variables.definitelyNotAContainer] // will pass, but error at runtime.
              }
          }
      }

Risk

Ensuring the final state matches expectations. Multiple mutating admission policies, mutating webhooks, and other controllers may try to mutate the incoming request, each acting separately, and they might mutate the same part of the object. It might be hard to ensure that the final state matches expectations. As a best practice, validation is highly recommended whenever a mutation is set up: a validating admission policy should accompany each mutating admission policy to verify that the final state of the data matches expectations. Also refer to the Safety section for further details.
Failures in a MutatingAdmissionPolicy will fail the request in the admission chain. If the failure policy is set to Fail and the mutating admission policy matches all resources, a failure or error in the MAP might affect control plane availability.

User Stories

Use case: Set a label

Object{
  metadata: Object.metadata{
    labels: {
      "label-to-set": "label-value"
    }
  }
}

Use case: AlwaysPullImages

Force the image pull policy of all containers to Always.

Object{
    spec: Object.spec{
        containers: object.spec.containers.map(c,
            Object.spec.containers.item{
                name: c.name,
                imagePullPolicy: "Always"
            }
        )
        // ... same for initContainers and ephemeralContainers ...
    }
}

Use case: DefaultIngressClass

On creation of Ingress objects that do not request a specific ingress class, add a default ingress class to them.

matchConditions:
  - name: 'need-default-ingress-class'
    expression: '!has(object.spec.ingressClassName)'
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
        spec: Object.spec{
          ingressClassName: "defaultIngressClass"
        }
      }

Use case: DefaultStorageClass

On creation of PersistentVolumeClaim objects that do not request a specific storage class, add a default storage class to them.

matchConditions:
  - name: 'need-default-storage-class'
    expression: '!has(object.spec.storageClassName)'
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
        spec: Object.spec{
          storageClassName: "defaultStorageClass"
        }
      }

Use case: DefaultTolerationSeconds

Sets the default forgiveness toleration for pods to tolerate the taints notready:NoExecute.

Should be supported through JSONPatch.

Use case: if-conditional based on value contained in nested map-list

If a volumeMount specified in a container does not have an associated volume, add a volume.

variables:
  - name: volumeMountsList
    expression: "object.spec.containers.map(c, c.volumeMounts.map(v, v.name))"
  - name: volumesList
    expression: "object.spec.volumes.map(v, v.name)"
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
        spec: Object.spec{
          volumes: variables.volumeMountsList.filter(n, !(n in variables.volumesList)).map(v, {
              "name": v,
              "configMap": params.addFields
          })
        }
      }

It could be simplified with composition variables. I have a gist example here.

Use case: LimitRanger

Apply default resource requests to Pods that don’t specify any.

mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
        spec: Object.spec{
          containers: object.spec.containers.filter(c, !has(c.resources)).map(c,
            {
                "name": c.name,
                "resources": {} // ... default resource settings go here ...
            }
          )
        }
      }

Use case: priority class

Add a default priority class if it is not set in pod

matchConditions:
  - name: 'no-priority-class'
    expression: '!has(object.spec.priorityClassName)'
mutations:
  - patchType: "ApplyConfiguration"
    expression: |
      Object{
        spec: Object.spec{
          priorityClassName: params.defaultPriorityClass
        }
      }

Use case: Sidecar injection

Object{
  spec: Object.spec{
    initContainers: [
      Object.spec.initContainers{
        name: params.name,
        image: params.image,
        args: params.args,
        restartPolicy: params.restartPolicy
        // ... other container fields injected here ...
      }
    ] + object.spec.initContainers
  }
}

Use case: Remove an annotation

JSONPatch{
    op: "remove",
    path: "/metadata/annotations/annotation-to-unset"
}

Use case: If an annotation is set, set a field instead

Object{
  metadata: Object.metadata{
    annotations: {
      ?"some-annotation": optional.none()
    }
  },
  spec: Object.spec{
    someField: object.metadata.annotations["some-annotation"]
  }
}

Use case: modify deprecated field under CRD versions

Support atomic list modification through JSON Patch

Use Case - mutation VS controller fight

https://github.com/open-policy-agent/gatekeeper/issues/2963#issuecomment-1683971371 Out of scope. The proposed feature is going to be added as an admission plugin which has no control over other controllers potentially being added.

Use Case - limitation

The current design will not support the following use cases:

1. Use cases that involve the creation of additional resources
2. Use cases that reference additional resources that are not fixed
3. Expressions over deeply nested lists/maps with conditional checks, which are tricky to write

For 1, creating additional resources is not supported since the current design focuses on updating the incoming request.

For 2, parameter resources could potentially be used for any operation that requires an additional resource. However, if the additional resource has to be looked up based on the incoming request, it will not be supported.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Risk: Enabling CEL object instantiation enables users to allocate memory in a more direct way than previously available.

Mitigation/Justification: List and map literals were already possible, so this doesn’t fundamentally change the memory allocation situation. We could further mitigate by statically estimating the “memory cost” of CEL expressions that include any form of data literal (list, map, object or scalar).

Risk: Final state of the object might not match the output of mutation policies. There might be multiple mutating admission policies, mutating webhooks, other controllers trying to mutate the incoming request and each happens separately, and they might mutate the same part of the object. It might be hard to ensure that the final state matches expectations.

Mitigation/Justification: As a best practice, validation is highly recommended whenever a mutation is set up. A validating admission policy should be set up alongside each mutating admission policy to verify that the final state of the data matches expectations. Also refer to the Safety section for further details.

Design Details

Object type names

As part of this enhancement, we are enabling CEL object instantiation, which we have left disabled in previous CEL features.

When enabling CEL object instantiation we need to decide:

How object type names will be represented in CEL. This KEP shows an “Object.spec.container” naming system. Is this what we will use? Or will we use actual schema type names, e.g. “v1.Pod.spec.container”?
Will validating admission features also gain the ability to instantiate CEL types? This has memory consumption implications.

SSA Merge algorithm reuse

Reusing Server Side Apply merge algorithms is complicated by the presence of numerous different representations of schema types in Kubernetes (Structural schemas, multiple OpenAPI schema representations, SMD schemas, …). For an initial alpha, we may simply perform the needed conversions, but in the longer term, memory consumption and runtime performance may demand that we minimize the conversions needed.

Test Plan

[ x ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

<package>: <date> - <test coverage>

Integration tests

In the alpha phase, the integration tests are expected to be added for:

The behavior with feature gate and API turned on/off and mix match
The happy path with everything configured and mutation proceeded successfully
Mutation with different failure policies
Mutation with different Match Criteria
Mutation violations for different reasons including type checking failures, misconfiguration, failed mutation, etc and formatted messages

e2e tests

We will test edge cases mostly in integration and unit tests. We may add e2e tests to spot check that the feature is present.

Graduation Criteria

Alpha

Feature implemented behind a feature flag
Support both JSONPatch and ApplyConfiguration for patchType
Composition variable support is needed before going to beta
Initial e2e tests completed and enabled

Beta

Have proper monitoring for MAP admission plugin
Fix any blocking issues/bugs surfaced before code freeze
Additional tests are in Testgrid and linked in KEP
More rigorous forms of testing—e.g., downgrade tests and scalability tests
Include all needed functionality, with performance and security taken into consideration

GA

N examples of real-world usage
N installs
Allowing time for feedback

Note: Generally we also wait at least two releases between beta and GA/stable, because there’s no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Deprecation

Announce deprecation and support policy of the existing flag
Two versions passed since introducing the functionality that deprecates the flag (to address version skew)
Address feedback on usage/changed behavior, provided on GitHub issues
Deprecate the flag

Upgrade / Downgrade Strategy

No changes are required for a cluster to upgrade and maintain existing behavior. The new API does not affect the cluster during upgrade; it only has effects if it is used after the upgrade.

If a cluster is downgraded, no changes are required. The cluster continues to work as expected: since the alpha version will have functionality compatible with the beta and stable releases, any downgrade will be to a version that also contains the feature.

Version Skew Strategy

This feature is implemented in the kube-apiserver component; skew with other Kubernetes components does not require coordinated behavior.

Clients should ensure the kube-apiserver is fully rolled out before using the feature.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

[x] Feature gate (also fill in values in kep.yaml)
- Feature gate name: MutatingAdmissionPolicy
- Components depending on the feature gate: kube-apiserver
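As a minimal sketch, assuming a local kind cluster and assuming the alpha API is served under admissionregistration.k8s.io/v1alpha1, the feature gate and the API version could be enabled together with a cluster config like the one below. The API version string is an assumption for the alpha stage and would differ in later stages.

```yaml
# kind cluster config (sketch): turns on the feature gate and the alpha API.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # Feature gate name as listed above.
  MutatingAdmissionPolicy: true
runtimeConfig:
  # Assumption: the alpha types are served from this group/version.
  "admissionregistration.k8s.io/v1alpha1": "true"
```

On a cluster not managed by kind, the equivalent would be to pass --feature-gates=MutatingAdmissionPolicy=true and the matching --runtime-config value to kube-apiserver.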

Does enabling the feature change any default behavior?

No, default behavior is the same.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, disabling the feature will result in mutation expressions being ignored.

What happens if we reenable the feature if it was previously rolled back?

The MutatingAdmissionPolicy will be enforced again.

Are there any tests for feature enablement/disablement?

Unit tests and integration tests will be introduced in the alpha implementation.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

If a mutation is misconfigured, existing workloads could be mutated unexpectedly, causing unexpected data to be stored.

During rollout, the cluster administrator can use mutationValidationPolicy to reduce the risk of unexpected mutations. The failurePolicy can be configured to decide whether a failure should reject the admission request. This minimizes the effect on running workloads.

What specific metrics should inform a rollback?

On a cluster that has not yet opted into MutatingAdmissionPolicy, non-zero counts for either of the following metrics mean the feature is not working as expected:

cel_admission_mutation_total
cel_admission_mutation_errors

On a cluster that has opted into MutatingAdmissionPolicy, consider a rollback if you observe elevated API server errors or excessive apiserver_cel_evaluation_duration_seconds / apiserver_cel_compilation_duration_seconds.
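The first check above could be automated with a Prometheus alerting rule. The following is only a sketch: the group name, alert name, duration, and severity label are hypothetical, and it assumes the two metrics are exposed under exactly the names listed above.

```yaml
# Prometheus rule file (sketch) for a cluster that has NOT opted into the feature.
groups:
- name: mutating-admission-policy-rollback   # hypothetical group name
  rules:
  - alert: UnexpectedAdmissionMutationActivity   # hypothetical alert name
    # Fires if either metric reports a non-zero count.
    expr: sum(cel_admission_mutation_total) > 0 or sum(cel_admission_mutation_errors) > 0
    for: 5m   # assumed duration; tune to the environment
    labels:
      severity: warning
    annotations:
      summary: Mutation metrics are non-zero on a cluster that has not opted into MutatingAdmissionPolicy.
```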

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Upgrade and rollback will be tested manually in a kind cluster:

Enabled feature gate, created a MutatingAdmissionPolicy and MutatingAdmissionPolicyBinding with mutation to add a label to a pod.
Disabled feature gate, restarted apiserver, confirmed that the MutatingAdmissionPolicy and MutatingAdmissionPolicyBinding still exist. Added another Pod to verify that the mutation would not happen.
Re-enabled the feature gate, restarted apiserver, confirmed that the mutation occurs for new incoming Pod creation requests.
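The policy and binding used in a test like this might look as follows. This is a sketch, not the manifests actually used: the object names and the label key are hypothetical, and the field names assume the v1alpha1 API shape with the ApplyConfiguration patch type.

```yaml
# Sketch: add a label to every newly created Pod.
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: MutatingAdmissionPolicy
metadata:
  name: "add-label.example.com"        # hypothetical name
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  failurePolicy: Fail
  reinvocationPolicy: Never
  mutations:
  - patchType: "ApplyConfiguration"
    applyConfiguration:
      # The label key/value below is illustrative only.
      expression: >
        Object{
          metadata: Object.metadata{
            labels: {"example.com/mutated": "true"}
          }
        }
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: MutatingAdmissionPolicyBinding
metadata:
  name: "add-label-binding.example.com"  # hypothetical name
spec:
  policyName: "add-label.example.com"
```

With the feature gate enabled, a newly created Pod would be expected to carry the label; with the gate disabled, both objects remain stored but the label is not added.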

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The following metrics could be used to see if the feature is in use:

mutating_admission_policy/check_total
mutating_admission_policy/definition_total

How can someone using this feature know that it is working for their instance?

Metrics such as mutating_admission_policy/check_total can be used to check how many mutations have been applied in total.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

No impact on admission request latency when no MutatingAdmissionPolicy objects are present. Performance when MutatingAdmissionPolicy objects are in use will need to be measured and optimized before GA.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

Metrics
- Metric names (the metrics below could be used):
  - mutating_admission_policy/check_total
  - mutating_admission_policy/definition_total
  - mutating_admission_policy/check_duration_seconds

Are there any missing metrics that would be useful to have to improve observability of this feature?

No.

Dependencies

No.

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

Yes. A new API group is introduced which will be used for this feature.

Will enabling / using this feature result in introducing new API types?

Yes. We introduced two new kinds for this feature: MutatingAdmissionPolicy and MutatingAdmissionPolicyBinding as described in this doc.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

The existing admission request latency might be affected when the feature is used. We expect this to be negligible and will measure it before GA.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

We don’t expect it to. In particular, compared with the existing method of achieving the same goal, using this feature will not result in a non-negligible increase in resource usage.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

No change from existing behavior. The feature behaves the same as if it were disabled.

What are other known failure modes?

Same as without this feature.

What steps should be taken if SLOs are not being met to determine the problem?

The feature can be disabled by disabling the API or setting the feature gate to false if its performance impact is not tolerable. Try running the policies separately to see which rule is slow, then remove the problematic rules or update them to meet the requirement.

Implementation History

v1.32: Alpha
v1.34: Beta
v1.36: Stable

Drawbacks

Alternatives

Here are the alternative considerations compared to using JSON Patch and the apply configurations introduced by Server Side Apply.

Alternative 2: Introduce new syntax

Another alternative would be to write our own merge algorithm, which would be a lot of duplicated effort.

Pros:
- More flexibility on how merging works
- Support most of the existing use cases
Cons:
- Duplicated effort
- Introduces a new language model into k8s, which increases the maintenance effort
