KEP-2799: Reduction of Secret-based Service Account Tokens
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP proposes actions to reduce the surface area of secret-based service account tokens.
Motivation
Because BoundServiceAccountTokenVolume is GA in 1.22, pods obtain their service account tokens through the TokenRequest API and store them in a projected volume. This obviates the need to auto-generate secret-based service account tokens, which are less secure than bound tokens.
Goals
- No auto-generation of secret-based service account tokens.
- Removal of unused auto-generated secret-based service account tokens.
Non-Goals
Proposal
- Change the service account control loop in the Token Controller so that it no longer auto-creates secrets for service accounts. At the same time, warn on use of auto-created secret-based service account tokens and encourage users to switch to the TokenRequest API or manually created secret-based service account tokens.
- Purge unused auto-generated secret-based service account tokens.
User Stories (Optional)
Notes/Constraints/Caveats
- A warning mechanism should be implemented to help users migrate.
- Auto-generated secret-based service account tokens are those requested by the Token Controller.
- Only clean up auto-generated tokens that:
  - are not referenced by pods
  - have not been used to authenticate for some duration (a time duration or a number of releases)
- To check for active usage of secret-based tokens, the metric serviceaccount_legacy_tokens_total or the audit annotation authentication.k8s.io/legacy-token could be used.
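As an illustration of consulting the audit annotation mentioned above, the following sketch counts audited requests that carry authentication.k8s.io/legacy-token, grouped by user. It assumes the audit log is available as one JSON event per line with the standard user.username and annotations fields; the helper names and the sample log are hypothetical, and the annotation value is deliberately ignored.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// legacyTokenAnnotation is the audit annotation key named in this KEP.
const legacyTokenAnnotation = "authentication.k8s.io/legacy-token"

// auditEvent keeps only the audit event fields this sketch reads.
type auditEvent struct {
	User struct {
		Username string `json:"username"`
	} `json:"user"`
	Annotations map[string]string `json:"annotations"`
}

// countLegacyTokenUse reads one JSON audit event per line and counts, per
// user, the events that carry the legacy-token annotation. The annotation
// value is ignored because its format is not specified here.
func countLegacyTokenUse(r io.Reader) (map[string]int, error) {
	counts := map[string]int{}
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 4*1024*1024) // audit lines can be long
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev auditEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return nil, fmt.Errorf("parse audit event: %w", err)
		}
		if _, ok := ev.Annotations[legacyTokenAnnotation]; ok {
			counts[ev.User.Username]++
		}
	}
	return counts, sc.Err()
}

// countInString is a convenience wrapper for small in-memory logs.
func countInString(log string) (map[string]int, error) {
	return countLegacyTokenUse(strings.NewReader(log))
}

func main() {
	// Hypothetical two-event log; real logs come from the audit backend.
	log := `{"user":{"username":"system:serviceaccount:default:builder"},"annotations":{"authentication.k8s.io/legacy-token":"example"}}
{"user":{"username":"alice"},"annotations":{}}`
	counts, err := countInString(log)
	if err != nil {
		panic(err)
	}
	for user, n := range counts {
		fmt.Printf("%s used a legacy token in %d audited request(s)\n", user, n)
	}
}
```

A per-user count like this only shows which service accounts still authenticate with legacy tokens during the sampled window; it does not identify clients that merely read the secrets.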
Risks and Mitigations
- When feature LegacyServiceAccountTokenNoAutoGeneration is Beta, consumers that depend directly on waiting for and reading tokens out of auto-generated secrets might stop working. To mitigate:
  - Emit warnings when auto-generated token secrets are used.
  - Publish pointers to TokenRequest or the manual secret request flow.
- When LegacyServiceAccountTokenCleanUp is Beta, use of auto-generated secret-based tokens might stop working. To mitigate:
  - While the feature is Alpha, announce that the cleanup starts at Beta.
  - Emit warnings when auto-generated token secrets are used.
  - Add pointers to the TokenRequest API and manually created tokens in the validation result.
  - Mark auto-generated tokens as invalid if they have not been used for longer than the duration configured by --legacy-service-account-token-clean-up-period (one year by default), and allow users to re-activate invalid auto-generated tokens within a further --legacy-service-account-token-clean-up-period before the tokens are finally deleted.
Design Details
LegacyServiceAccountTokenNoAutoGeneration
The Token Controller stops auto-creating secrets for service accounts. This feature is enabled as soon as it is implemented, since no new code is added and this ensures that new clusters start in a good state.
LegacyServiceAccountTokenTracking
To facilitate LegacyServiceAccountTokenCleanUp, we implement a simple controller
in kube-apiserver that maintains a configmap, kube-apiserver-legacy-service-account-token-tracking in kube-system, as a boolean
indicator of whether tracking is enabled in the cluster. It is similar to the existing
ClusterAuthenticationTrustController that maintains configmap/extension-apiserver-authentication.
When LegacyServiceAccountTokenTracking is enabled in all apiservers,
- the controller creates/updates the configmap kube-apiserver-legacy-service-account-token-tracking in the kube-system namespace, which stores the current date as since.
- when a legacy token is used, issue a warning, update the label kubernetes.io/legacy-token-last-used on the secret at date granularity, and record the use in a metric.
When LegacyServiceAccountTokenTracking is disabled in any apiserver,
- the controller periodically ensures that the configmap in the kube-system namespace is deleted.
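Updating the last-used label at date granularity means the secret only needs to be written when the recorded date differs from today's date. A minimal sketch of that check follows; the function names and the YYYY-MM-DD label format are assumptions for illustration, not the actual kube-apiserver code.

```go
package main

import (
	"fmt"
	"time"
)

// lastUsedLabel is the label named in this KEP.
const lastUsedLabel = "kubernetes.io/legacy-token-last-used"

// dateLayout is an assumed date-granularity format for the label value.
const dateLayout = "2006-01-02"

// lastUsedPatch returns the new label value and true when the secret's
// last-used label must be rewritten, or "" and false when the label
// already carries today's date.
func lastUsedPatch(labels map[string]string, now time.Time) (string, bool) {
	today := now.UTC().Format(dateLayout)
	if labels[lastUsedLabel] == today {
		return "", false
	}
	return today, true
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
func mustTime(s string) time.Time {
	ts, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return ts
}

func main() {
	labels := map[string]string{}
	uses := []string{"2023-05-01T08:00:00Z", "2023-05-01T17:30:00Z", "2023-05-02T00:05:00Z"}
	for _, ts := range uses {
		if v, ok := lastUsedPatch(labels, mustTime(ts)); ok {
			labels[lastUsedLabel] = v // stands in for the API write
			fmt.Println("write label", v)
		} else {
			fmt.Println("skip write for", ts)
		}
	}
}
```

Skipping the write when the date is unchanged is what keeps the extra load to at most one write per secret per day, however often the token is used.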
LegacyServiceAccountTokenCleanUp
The Token Controller starts to remove auto-generated secrets (secrets bi-directionally referenced by the service account) that are unused and not mounted by pods.
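The "bi-directionally referenced" test can be sketched as follows: the secret names its service account through the kubernetes.io/service-account.name annotation, and the service account lists the secret in its secrets field. The types below are simplified, hypothetical stand-ins for the real API objects, and the "unused" condition is not covered here.

```go
package main

import "fmt"

const (
	tokenSecretType  = "kubernetes.io/service-account-token"
	saNameAnnotation = "kubernetes.io/service-account.name"
)

// Simplified stand-ins for the API objects (hypothetical types).
type Secret struct {
	Name, Namespace, Type string
	Annotations           map[string]string
}

type ServiceAccount struct {
	Name, Namespace string
	SecretNames     []string // names listed in the service account's secrets field
}

type Pod struct {
	Namespace      string
	MountedSecrets []string // secret names referenced by the pod's volumes
}

// isAutoGenerated reports whether the secret and the service account
// reference each other.
func isAutoGenerated(s Secret, sa ServiceAccount) bool {
	if s.Type != tokenSecretType || s.Namespace != sa.Namespace {
		return false
	}
	if s.Annotations[saNameAnnotation] != sa.Name {
		return false
	}
	for _, name := range sa.SecretNames {
		if name == s.Name {
			return true
		}
	}
	return false
}

// isMounted reports whether any pod in the secret's namespace mounts it.
func isMounted(s Secret, pods []Pod) bool {
	for _, p := range pods {
		if p.Namespace != s.Namespace {
			continue
		}
		for _, name := range p.MountedSecrets {
			if name == s.Name {
				return true
			}
		}
	}
	return false
}

// isCleanupCandidate combines both conditions; whether the token is also
// unused is decided separately from its last-used date.
func isCleanupCandidate(s Secret, sa ServiceAccount, pods []Pod) bool {
	return isAutoGenerated(s, sa) && !isMounted(s, pods)
}

func main() {
	sa := ServiceAccount{Name: "builder", Namespace: "default", SecretNames: []string{"builder-token-abc12"}}
	s := Secret{
		Name: "builder-token-abc12", Namespace: "default", Type: tokenSecretType,
		Annotations: map[string]string{saNameAnnotation: "builder"},
	}
	fmt.Println("cleanup candidate:", isCleanupCandidate(s, sa, nil))
}
```

A manually created token secret carries the same annotation but is not listed by the service account, so it fails the second half of the check and is left alone.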
When this feature is Beta and enabled by default, a secret is marked as invalid if and only if a sufficient period of time (one year by default) has passed since it was last used. Cluster admins can configure the period.
Determine the date that a given secret was last used:
- kubernetes.io/legacy-token-last-used, if it exists and is after the since date stored in the configmap kube-apiserver-legacy-service-account-token-tracking.
- defaults to since
If kube-apiserver-legacy-service-account-token-tracking is unavailable, no secret would be removed.
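The rule above can be sketched as a small function. The YYYY-MM-DD date format, the function name, and the choice to treat an unparsable since value the same as a missing configmap are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// lastUsedLabel is the label named in this KEP.
const lastUsedLabel = "kubernetes.io/legacy-token-last-used"

// dateLayout is an assumed date-granularity format for label and configmap values.
const dateLayout = "2006-01-02"

// lastUsed returns the date a secret is considered last used. sinceRaw is
// the since value from the tracking configmap; when it is missing or
// unreadable the second result is false and no cleanup should happen.
func lastUsed(labels map[string]string, sinceRaw string) (time.Time, bool) {
	since, err := time.Parse(dateLayout, sinceRaw)
	if err != nil {
		return time.Time{}, false
	}
	if raw, ok := labels[lastUsedLabel]; ok {
		// The label only counts when it is readable and later than since.
		if used, err := time.Parse(dateLayout, raw); err == nil && used.After(since) {
			return used, true
		}
	}
	return since, true
}

func main() {
	labels := map[string]string{lastUsedLabel: "2023-06-10"}
	if used, ok := lastUsed(labels, "2023-01-01"); ok {
		fmt.Println("last used:", used.Format(dateLayout))
	}
	if _, ok := lastUsed(labels, ""); !ok {
		fmt.Println("tracking data unavailable, skipping cleanup")
	}
}
```

Falling back to since rather than to the secret's creation time means a token is never judged unused for a period during which usage was not being tracked.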
Mark the secrets as invalid and recover:
- The label kubernetes.io/legacy-token-invalid-since is added to the secrets, with the date as its value.
- If users use the invalid tokens, the Validate() function in “kubernetes/pkg/serviceaccount/legacy.go” detects the usage and returns an error telling the users to re-activate the token by updating the label value or to use TokenRequest instead. At the same time, the tokens are updated with the new kubernetes.io/legacy-token-last-used date.
- If users do not use the invalid tokens, the tokens are finally deleted once the duration configured through --legacy-service-account-token-clean-up-period (one year by default) has passed since they were marked as invalid.
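The two-stage lifecycle above (mark as invalid first, delete one clean-up period later) can be sketched as a pure decision function. This is a simplified model, not the Token Controller's code: the action names, the YYYY-MM-DD label format, and the assumption that re-activation removes the invalid-since label are all illustrative.

```go
package main

import (
	"fmt"
	"time"
)

const (
	invalidSinceLabel = "kubernetes.io/legacy-token-invalid-since"
	dateLayout        = "2006-01-02" // assumed label date format

	// defaultCleanUpPeriod mirrors the one-year default of
	// --legacy-service-account-token-clean-up-period.
	defaultCleanUpPeriod = 365 * 24 * time.Hour
)

type action string

const (
	keep         action = "keep"
	markInvalid  action = "mark-invalid"
	deleteSecret action = "delete"
)

// day parses a YYYY-MM-DD date and panics on malformed input.
func day(s string) time.Time {
	d, err := time.Parse(dateLayout, s)
	if err != nil {
		panic(err)
	}
	return d
}

// decide picks the next step for one auto-generated, unmounted secret.
func decide(labels map[string]string, lastUsed, now time.Time, period time.Duration) action {
	if raw, ok := labels[invalidSinceLabel]; ok {
		invalidSince, err := time.Parse(dateLayout, raw)
		if err != nil {
			return keep // unreadable label: take no destructive action
		}
		if now.Sub(invalidSince) > period {
			return deleteSecret // invalid for a full period without re-activation
		}
		return keep // still inside the re-activation window
	}
	if now.Sub(lastUsed) > period {
		return markInvalid
	}
	return keep
}

func main() {
	stale := day("2022-01-01")
	fmt.Println(decide(map[string]string{}, stale, day("2023-06-01"), defaultCleanUpPeriod))
	marked := map[string]string{invalidSinceLabel: "2023-06-01"}
	fmt.Println(decide(marked, stale, day("2024-01-01"), defaultCleanUpPeriod))
	fmt.Println(decide(marked, stale, day("2024-07-01"), defaultCleanUpPeriod))
}
```

With the default period, a token therefore survives roughly two years of inactivity before deletion: one year until it is marked invalid, then one more year in which it can be re-activated.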
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None
Unit tests
k8s.io/kubernetes/pkg/controller/serviceaccount: 2022-06-13 - 67.5%
Integration tests
- A previously auto-generated secret-based token that is used within the configurable cleanup duration continues to work.
- A previously auto-generated secret-based token that has not been used within the configurable cleanup duration is deleted.
e2e tests
- Secret-based tokens would not be auto-generated.
- Still able to explicitly request a secret-based token.
- The explicitly requested token would not be deleted.
Graduation Criteria
LegacyServiceAccountTokenNoAutoGeneration
| Alpha | Beta | GA |
|---|---|---|
| - | 1.24 | 1.26 |
By 1.24, all pods should have been admitted in 1.22+ and should be using bound tokens. Enabling this feature one release ahead would help to reduce legacy tokens as a security practice.
Beta -> GA Graduation
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
Alpha -> Beta Graduation
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
- Document and communicate the available actions that consumers of auto-generated secret-based tokens should take. (migrate to either use tokenrequest or explicitly request secret-based tokens)
LegacyServiceAccountTokenTracking
| Alpha | Beta | GA |
|---|---|---|
| 1.26 | 1.27 | 1.28 |
Beta -> GA Graduation
- In use by multiple distributions
  - RedHat
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
Alpha -> Beta Graduation
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
LegacyServiceAccountTokenCleanUp
| Alpha | Beta | GA |
|---|---|---|
| 1.28 | 1.29 | 1.30 |
Beta -> GA Graduation
- In use by multiple distributions
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
Alpha -> Beta Graduation
- Approved by PRR and scalability
- Any known bugs fixed
- Tests passing
Upgrade / Downgrade Strategy
The features can be enabled or disabled via their feature gates during upgrade or downgrade. What changes is described in the “Feature Enablement and Rollback” section.
Version Skew Strategy
This only touches the control plane, so a version skew strategy is not applicable.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: LegacyServiceAccountTokenNoAutoGeneration
  - Components depending on the feature gate: kube-controller-manager
  - Feature gate name: LegacyServiceAccountTokenTracking
  - Components depending on the feature gate: kube-apiserver
  - Feature gate name: LegacyServiceAccountTokenCleanUp
  - Components depending on the feature gate: kube-controller-manager
Does enabling the feature change any default behavior?
- LegacyServiceAccountTokenNoAutoGeneration: no legacy tokens are auto-generated.
- LegacyServiceAccountTokenTracking: legacy tokens get a new label and a configmap is created in kube-system.
- LegacyServiceAccountTokenCleanUp: unused auto-generated legacy tokens will be removed.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
yes for all feature gates.
What happens if we reenable the feature if it was previously rolled back?
- LegacyServiceAccountTokenNoAutoGeneration: the same as enabling the feature. Before the re-enablement, the Token Controller would have created tokens for service accounts while the feature was off.
- LegacyServiceAccountTokenTracking: during this sequence of operations, only the label kubernetes.io/legacy-token-last-used is persisted, but there is no impact on the functionality of this feature.
- LegacyServiceAccountTokenCleanUp: the same as enabling the feature.
Are there any tests for feature enablement/disablement?
yes for all feature gates, covered by integration tests.
Rollout, Upgrade and Rollback Planning
How can a rollout fail? Can it impact already running workloads?
- LegacyServiceAccountTokenNoAutoGeneration: workloads that expect new auto-created secrets and extract tokens from them would fail.
- LegacyServiceAccountTokenTracking: no impact.
- LegacyServiceAccountTokenCleanUp: workloads that read auto-generated secrets after those secrets have been considered unused by this feature and removed would fail.
What specific metrics should inform a rollback?
serviceaccount_legacy_tokens_total: cumulative count of stale service account tokens used.
this metric is only informational and cannot deterministically tell whether a rollback is needed. there is no good way for us to detect scrapers of auto-generated secrets.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
no, since there is not much difference between an upgrade and upgrade->downgrade->upgrade.
see the section “What happens if we reenable the feature if it was previously rolled back?”.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
no
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
check if there is a configmap kube-apiserver-legacy-service-account-token-tracking in namespace kube-system.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: serviceaccount_legacy_tokens_total
  - [Optional] Aggregation method:
  - Components exposing the metric: kube-apiserver
LegacyServiceAccountTokenNoAutoGeneration and LegacyServiceAccountTokenCleanUp might cause a few workloads to fail, but there is no way for us to inject a metric into workloads to detect this.
What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
none. we expect the number recorded in the above metric to go down in the long term.
Are there any missing metrics that would be useful to have to improve observability of this feature?
none.
Dependencies
Does this feature depend on any specific services running in the cluster?
no.
Scalability
Will enabling / using this feature result in any new API calls?
up to one additional write request per day could be made to auto-generated secrets still in use.
Will enabling / using this feature result in introducing new API types?
no.
Will enabling / using this feature result in any new calls to the cloud provider?
no.
Will enabling / using this feature result in increasing size or count of the existing API objects?
no. instead, use of the feature reduces the number of API objects.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
no.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
no.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
- the kube-apiserver-legacy-service-account-token-tracking configmap cannot be created.
- unused auto-generated secrets cannot be removed.
What are other known failure modes?
- failure to create the kube-apiserver-legacy-service-account-token-tracking configmap
  - Detection: check if kube-apiserver-legacy-service-account-token-tracking exists in kube-system
  - Mitigations: there is no impact on existing systems.
  - Diagnostics: check kube-apiserver log.
  - Testing: TBD.
What steps should be taken if SLOs are not being met to determine the problem?
n/a.