KEP-4292: Custom profiling support in kubectl debug command
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This proposal adds a custom profiling feature to the kubectl debug command, layered on top of its predefined profiles.
Motivation
The kubectl debug command provides a set of predefined profiles, and users can pick the appropriate one
according to their needs and roles. However, in some cases (maybe even in most cases), users might want
to customize these profiles by adding:
- environment variables https://github.com/kubernetes/kubectl/issues/1486
- replicated volume mounts https://github.com/kubernetes/kubectl/issues/1071
- security contexts https://github.com/kubernetes/kubernetes/pull/113009
- labels https://github.com/kubernetes/kubectl/issues/1364, https://github.com/kubernetes/kubernetes/issues/115679
- image pull secrets https://github.com/kubernetes/kubectl/issues/1506
and many others. Users can work around these problems by manually patching pod specs (unless the required intervention prevents the pod from becoming available), but this is impractical, especially when it has to be done frequently. As a result, users’ first reaction is to open an issue, or a pull request proposing a new flag that manages only one particular field in the pod spec. Adding a new flag for every field risks leaving the debug command in an unmanageable and unmaintainable state, from the users’ point of view as well as the maintainers’.
For all these reasons, this proposal adds custom profiling support on top of the predefined profiles of the debug command. Custom profiling relieves the pressure to add new flags.
Goals
- Make kubectl debug pod/node or ephemeral container spec configurable.
Non-Goals
- Change how kubectl debug works
Proposal
There will be a new flag in kubectl debug, named custom, which is used to pass
a JSON file whose fields are compatible with a partial container spec, for example:
{
  "ports": [
    {
      "containerPort": 80
    }
  ],
  "resources": {
    "limits": {
      "cpu": "0.5",
      "memory": "512Mi"
    },
    "requests": {
      "cpu": "0.2",
      "memory": "256Mi"
    }
  },
  "env": [
    {
      "name": "ENV_VAR1",
      "value": "value1"
    },
    {
      "name": "ENV_VAR2",
      "value": "value2"
    }
  ]
}
The file passed to the custom flag is expected to be decodable to corev1.Container
(note that it does not have to be a complete corev1.Container spec, but every
field must map to a field in corev1.Container). Otherwise, kubectl debug returns an error
stating that only corev1.Container-compatible JSON files are accepted.
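The "every field must map" rule above can be illustrated with strict JSON decoding. The following is a minimal sketch using only the Go standard library, not necessarily how kubectl implements the check. The containerProfile type and the decodeProfile function are hypothetical stand-ins introduced for illustration; the real target type would be corev1.Container.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// containerProfile is a hypothetical, trimmed-down stand-in for
// corev1.Container so that the example has no external dependencies.
type containerProfile struct {
	Ports []struct {
		ContainerPort int32 `json:"containerPort"`
	} `json:"ports,omitempty"`
	Env []struct {
		Name  string `json:"name"`
		Value string `json:"value,omitempty"`
	} `json:"env,omitempty"`
}

// decodeProfile (hypothetical helper) rejects any JSON key that does not
// map to a field of containerProfile.
func decodeProfile(data []byte) (*containerProfile, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var p containerProfile
	if err := dec.Decode(&p); err != nil {
		return nil, fmt.Errorf("custom profile is not a container-compatible JSON file: %w", err)
	}
	return &p, nil
}

func main() {
	valid := []byte(`{"env":[{"name":"ENV_VAR1","value":"value1"}]}`)
	if p, err := decodeProfile(valid); err == nil {
		fmt.Println("accepted, env entries:", len(p.Env))
	}

	// "replicas" is not a container field, so decoding fails.
	invalid := []byte(`{"replicas":3}`)
	if _, err := decodeProfile(invalid); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A partial document such as the valid input is accepted because absent fields simply keep their zero values; only keys with no matching field cause an error.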
Users can still use the current profiles (general, restricted, baseline, etc.). When a custom profile is passed, its JSON is patched onto the latest version of the container spec generated by the predefined profile. As a result, the custom profile always takes precedence over the properties set by the predefined profile when the two conflict. For example, a user may pass a security context via the custom profile while the netadmin profile sets its own security context properties; in that case the custom profile overrides them.
To achieve this patching (and overriding) mechanism, custom profiling uses StrategicMergePatch (SMP), which is already used elsewhere in the code base
and has proven to cover such cases. The following code excerpt demonstrates how SMP will be used:
patchedContainer, err := strategicpatch.StrategicMergePatch(debugContainerJS, customJS, corev1.Container{})
if err != nil {
	return fmt.Errorf("error creating three way patch to add debug container: %v", err)
}
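To make the override behaviour more concrete: for corev1.Container, strategic merge patch treats the env list as a keyed list merged on name, so a custom entry replaces the predefined entry with the same name instead of replacing the whole list. The sketch below imitates only that one rule with the standard library. The envVar type and the mergeEnv function are hypothetical, and the element ordering produced by the real StrategicMergePatch may differ.

```go
package main

import "fmt"

// envVar is a hypothetical stand-in for corev1.EnvVar.
type envVar struct {
	Name  string
	Value string
}

// mergeEnv overlays custom onto base, keyed by Name: an entry whose name
// already exists is replaced in place, and a new name is appended.
// This is a simplification of what a strategic merge patch does for "env".
func mergeEnv(base, custom []envVar) []envVar {
	merged := make([]envVar, len(base))
	copy(merged, base)

	index := make(map[string]int, len(merged))
	for i, e := range merged {
		index[e.Name] = i
	}
	for _, e := range custom {
		if i, ok := index[e.Name]; ok {
			merged[i] = e // the custom profile wins on conflict
			continue
		}
		index[e.Name] = len(merged)
		merged = append(merged, e)
	}
	return merged
}

func main() {
	fromProfile := []envVar{{Name: "LOG_LEVEL", Value: "info"}, {Name: "REGION", Value: "eu"}}
	fromCustom := []envVar{{Name: "LOG_LEVEL", Value: "debug"}, {Name: "ENV_VAR1", Value: "value1"}}
	for _, e := range mergeEnv(fromProfile, fromCustom) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Scalar fields and nested objects follow the simpler rule that the value from the custom profile replaces the value generated by the predefined profile.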
This feature focuses only on container spec customization and does not attempt to change other fields in the pod spec. The reason for this decision is that there are three debugging methods (node debugging, pod copying, and ephemeral containers), and the largest type they have in common is corev1.Container. Node debugging and pod copying both work with the corev1.PodSpec type, whereas an ephemeral container manifests itself only as a corev1.Container. Therefore, pod spec related changes can be managed by flags, of which presumably only a few are needed (annotations, labels, etc.).
In addition, the container spec has many fields. To prevent confusion and to start out restrictive, the custom profile does not allow some fields to be overwritten: Name, Command, Image, Lifecycle, and VolumeDevices (the first three already have their own flags). This list of disallowed fields is a starting point; it can be extended, and enabling more fields can be considered per request in the future.
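The restriction on overwritable fields could be enforced before the patch is applied. The following is a minimal standard-library sketch under stated assumptions: checkDisallowed and the disallowed map are hypothetical names, and the check looks only at the top-level JSON keys of the custom profile (name, command, image, lifecycle, volumeDevices, which are the JSON names of the fields listed above).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// disallowed holds the lower-cased JSON keys of the container fields that a
// custom profile must not overwrite.
var disallowed = map[string]bool{
	"name":          true,
	"command":       true,
	"image":         true,
	"lifecycle":     true,
	"volumedevices": true,
}

// checkDisallowed (hypothetical helper) returns an error naming every
// disallowed top-level key found in the custom profile JSON.
func checkDisallowed(profile []byte) error {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(profile, &fields); err != nil {
		return fmt.Errorf("invalid custom profile: %w", err)
	}
	var found []string
	for key := range fields {
		// Compare case-insensitively, because encoding/json also matches
		// struct fields case-insensitively when decoding.
		if disallowed[strings.ToLower(key)] {
			found = append(found, key)
		}
	}
	if len(found) > 0 {
		sort.Strings(found)
		return fmt.Errorf("custom profile must not set: %v", found)
	}
	return nil
}

func main() {
	fmt.Println(checkDisallowed([]byte(`{"env":[{"name":"ENV_VAR1","value":"value1"}]}`)))
	fmt.Println(checkDisallowed([]byte(`{"image":"busybox","command":["sh"]}`)))
}
```

Note that the nested name key inside an env entry is not rejected, because only the container's own top-level fields are checked.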
User Stories (Optional)
Story 1
As a cluster administrator, I’d like to debug a pod that requires an environment variable to work regardless of the profile I choose.
Story 2
As a network administrator, I’d like to debug a node with the netadmin profile when the debugging requires mounting a specific volume.
Story 3
As a restricted user, I’d like to debug a pod in a way that simulates the exact environment of the problematic pod, including its resource requests and limits.
Notes/Constraints/Caveats (Optional)
- When a user mounts a persistent volume claim to the debug pod, this must not affect the actual pod’s functionality. If the storage has a feature or property that would cause such interference, kubectl debug should handle this and prevent the mount.
- kubectl debug pod --copy-to already copies the fields of the original pod spec to the copied pod spec, which may seem to make custom profiling redundant. However, when the user wants the copied pod to carry values that differ from the original, custom profiling serves that purpose as well.
Risks and Mitigations
- Unauthorized users may probe their privileges by trying to mount a volume, use a port different from the application’s, and so on. This should not be a problem because the API server rejects such requests. For example, if a user has no permission to mount a PV, the request is rejected.
Design Details
This is the prospective code snippet of a function that applies the custom profile.
func (o *DebugOptions) applyCustomProfile(debugPod *corev1.Pod, containerName string, ephemeral bool) error {
	// Find container in pod and encode it as json
	patchedContainer, err := strategicpatch.StrategicMergePatch(debugContainerJS, customJS, corev1.Container{})
	if err != nil {
		return fmt.Errorf("error creating three way patch to add debug container: %v", err)
	}
	err = json.Unmarshal(patchedContainer, &debugPod.Spec.Containers[index])
	if err != nil {
		return fmt.Errorf("unable to unmarshall patched container to ephemeral container: %v", err)
	}
	...
}
After the predefined profile is applied, there is a generated podSpec that is pending creation.
Before a request is sent to the API server to create this pod or ephemeral container, the applyCustomProfile function is called
with the required parameters. This function modifies the podSpec in place, and the modified podSpec is what gets created or updated.
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
Unit tests
- k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/debug: 13-10-2023 - 62.7%
New unit test cases will be added for the following scenarios:
- When custom profile is set, custom profile does not modify container name, command, image
- During ephemeral container debugging, custom profile is set and --profile is netadmin and the pod output is expected
- During ephemeral container debugging, custom profile is set and --profile is general and the output is expected
- During pod copying, custom profile is set and --profile is general and the output is expected
- During node debugging, custom profile is set and --profile is general and the output is expected
- k8s.io/kubernetes/vendor/k8s.io/kubectl/pkg/cmd/debug: 30-09-2024 - 67.3%
Integration tests
- Prepare a custom profile JSON that sets fields also set by predefined profiles to different values, and ensure that the custom profile’s value supersedes the value in the predefined profile.
- Prepare a custom profile JSON that sets fields also present in the pod spec to different values, and ensure that the custom profile’s value supersedes the value in the original pod spec.
- Prepare a custom profile JSON that includes a new field not present in the pod spec or the predefined profiles, and ensure that this value appears in the pod spec.
- Send an invalid custom profile JSON (not of corev1.Container type, or completely invalid JSON) and ensure that the error message is correct.
Integration tests (defined in https://k8s.io/kubernetes/test/cmd/debug.sh#L571-L661 ) are running in https://storage.googleapis.com/k8s-triage/index.html?pr=1&job=pull-kubernetes-integration
e2e tests
- Prepare a custom profile and, after applying it to a live cluster, ensure that the pod spec’s values are as expected and debugging works properly.
Graduation Criteria
Alpha
- Feature hidden behind an environment variable in kubectl.
- Unit tests are implemented and enabled.
Beta
- Gather feedback from developers and surveys.
- Environment variable is enabled by default and feature can be disabled explicitly.
- Integration tests are implemented and enabled.
- YAML formatted custom profile support is added
GA
- Feature gate (i.e. KUBECTL_DEBUG_CUSTOM_PROFILE) is locked to true and will be removed in 1.34.
- e2e tests are implemented and enabled.
Upgrade / Downgrade Strategy
NA
Version Skew Strategy
Pod copying and node debugging use built-in API endpoints. Ephemeral container functionality was
promoted to stable in 1.25. Beyond what kubectl debug already requires, the custom profiling feature relies only on
the existence of the corev1.Container type, which is also built in.
There is therefore no risk that this feature touches unavailable API endpoints under
version skew.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
In the alpha phase, this feature is hidden behind the KUBECTL_DEBUG_CUSTOM_PROFILE
environment variable, and the new flag is visible only when this variable is exported
(export KUBECTL_DEBUG_CUSTOM_PROFILE=true). Users can easily disable the feature
by unsetting this environment variable.
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism: export KUBECTL_DEBUG_CUSTOM_PROFILE=true
  - Will enabling / disabling the feature require downtime of the control plane? No
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? No
Does enabling the feature change any default behavior?
Setting the KUBECTL_DEBUG_CUSTOM_PROFILE environment variable enables a new custom flag
in kubectl debug.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. The feature is enabled by exporting KUBECTL_DEBUG_CUSTOM_PROFILE=true
and disabled by simply unsetting this environment variable.
What happens if we reenable the feature if it was previously rolled back?
The flag becomes visible again, and there is no risk for the cluster.
Are there any tests for feature enablement/disablement?
Enablement and disablement of this feature are managed by the KUBECTL_DEBUG_CUSTOM_PROFILE environment
variable. When a user sets this environment variable, the new custom flag becomes visible. We can
add a basic enablement/disablement test that checks whether the flag is visible.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
A user may reference a configmap (or any other resource that is used by other components) in a custom profile and, during debugging, may modify these resources (intentionally or unintentionally) in a way the workloads do not expect. However, this scenario is no different from any simple pod creation. Since no scenario specific to this feature impacts workloads, the answer to this question is no.
What specific metrics should inform a rollback?
NA
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
NA
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
NA
Monitoring Requirements
NA
How can an operator determine if the feature is in use by workloads?
Users can determine this by checking the value of the KUBECTL_DEBUG_CUSTOM_PROFILE environment variable.
However, from the cluster admin’s point of view, all the APIs that this feature uses are already GA,
so it is hard to tell whether users have enabled this feature on their local machines by exporting
the feature environment variable.
How can someone using this feature know that it is working for their instance?
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details: Checking the value of the KUBECTL_DEBUG_CUSTOM_PROFILE environment variable, or they can run a test debug and see if their profile is respected in the resulting container.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
NA
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details: Not applicable
Are there any missing metrics that would be useful to have to improve observability of this feature?
No
Dependencies
Does this feature depend on any specific services running in the cluster?
No
Scalability
Will enabling / using this feature result in any new API calls?
No, SMP patching only happens on client side and there is no additional request to API server.
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
Depending on the user’s custom profile spec, this may slightly increase the size of the copied debugging pod. If debugging is performed via an ephemeral container, it slightly increases the size of the original pod.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
Custom profiling happens on the client side, and the functionality of kubectl debug is unchanged.
If the API server and/or etcd is unavailable, debugging fails because the pod cannot be created.
What are other known failure modes?
- Invalid debug pod after an invalid custom profile
  - Detection: The debug pod is not in the running state or does not have the privileges required to debug smoothly
  - Mitigations: The pod can be deleted and re-run after modifying the custom profile
  - Diagnostics: The pod can’t be attached to or is not in the running state
  - Testing: Several unit tests are implemented in https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/debug/debug_test.go
What steps should be taken if SLOs are not being met to determine the problem?
Not applicable
Implementation History
- 2023-10-13: KEP is proposed as alpha feature
- 2024-06-04: KEP is promoted to beta
- 2024-09-05: KEP is promoted to stable
- 2025-01-20: KEP is marked as implemented
Drawbacks
Alternatives
Flags for all fields
An alternative solution would be to provide a flag in kubectl debug for each field. For example,
if a user wants to mount a volume, there would be a --mount-volume flag with which the user explicitly
specifies the volume. However, this has a major drawback: the kubectl debug command
becomes unmanageable, with so many flags that users cannot tell which one to
use, and every new field in the pod spec creates pressure to add yet another flag to kubectl debug.
Customizable Pod Spec instead of corev1.Container
Another alternative would be for the custom profile JSON to accept a PodSpec template instead of the corev1.Container type, which would cover the need to customize all fields in the pod spec, including labels and annotations. However, this isn’t applicable to ephemeral containers, because they reside in the original pod rather than in a copied pod.