KEP-5325: Improve pod selection accuracy across workload types
KEP-5325: HPA - Improve pod selection accuracy across workload types
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
The Horizontal Pod Autoscaler (HPA) has a critical limitation in its pod selection mechanism: it collects metrics from all pods that match the target workload’s label selector. This can lead to incorrect scaling decisions when unrelated pods (such as Jobs, CronJobs, or other Deployments) happen to share the same labels.
This often results in unexpected behavior such as:
- HPAs stuck at maxReplicas despite low actual usage in the target workload
- Unnecessary scaling events triggered by temporary workloads
- Unpredictable scaling behavior that’s difficult to diagnose
This proposal adds a parameter to HPAs which ensures the HPA only considers pods that are actually owned by the target workload, through owner references, rather than all pods matching the label selector.
Motivation
Consider this example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m
---
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    metadata:
      labels:
        app: test-app # Same label as deployment
        workload: scraper
    spec:
      containers:
      - name: cpu-load
        image: busybox
        command: ["dd", "if=/dev/zero", "of=/dev/null"]
        resources:
          requests:
            cpu: 100m
      restartPolicy: Never
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
In this case, the HPA will factor in CPU consumption from the Job’s pod, despite it not being part of the Deployment, potentially causing incorrect scaling decisions.
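To make the effect concrete, the sketch below applies the replica formula from the HPA documentation, desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization), to the manifests above. The usage figures (10m for the nginx pod, 1000m for the busybox pod) are assumptions chosen for illustration, the helper names are hypothetical, and tolerance and readiness handling are omitted.

```go
package main

import (
	"fmt"
	"math"
)

// podSample holds CPU usage and request in millicores for one pod.
// The usage values in main are illustrative assumptions, not measurements.
type podSample struct {
	name     string
	usageM   int64
	requestM int64
}

// averageUtilization returns total usage as a percentage of total requests.
func averageUtilization(pods []podSample) float64 {
	var usage, request int64
	for _, p := range pods {
		usage += p.usageM
		request += p.requestM
	}
	return 100 * float64(usage) / float64(request)
}

// desiredReplicas applies ceil(current * utilization / target) and clamps
// the result to [minReplicas, maxReplicas]. This is a simplified model.
func desiredReplicas(current int32, utilization, target float64, minReplicas, maxReplicas int32) int32 {
	d := int32(math.Ceil(float64(current) * utilization / target))
	if d < minReplicas {
		return minReplicas
	}
	if d > maxReplicas {
		return maxReplicas
	}
	return d
}

func main() {
	deploymentPod := podSample{name: "test-app-pod", usageM: 10, requestM: 100}
	jobPod := podSample{name: "test-job-pod", usageM: 1000, requestM: 100}

	owned := []podSample{deploymentPod}
	matched := []podSample{deploymentPod, jobPod}

	for _, set := range [][]podSample{owned, matched} {
		u := averageUtilization(set)
		fmt.Printf("pods=%d utilization=%.0f%% desired=%d\n",
			len(set), u, desiredReplicas(1, u, 50, 1, 5))
	}
}
```

Under these assumed figures, the Deployment's pod alone keeps the recommendation at 1 replica, while including the Job's pod raises the computed utilization enough to hit maxReplicas (5).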
Goals
- Improve the accuracy of HPA’s pod selection to only include pods directly managed by the target workload
- Maintain backward compatibility with existing HPA configurations
- Provide clear visibility into which pods are being considered for scaling decisions
- Allow users to choose between selection strategies based on their needs
Non-Goals
- Modifying how metrics are collected from pods
- Changing the scaling algorithm itself
- Addressing other HPA limitations not related to pod selection
Proposal
We propose adding a new field to the HPA specification called SelectionStrategy that allows users to specify how pods should be selected for metric collection:
- If set to LabelSelector (default): Uses the current behavior of selecting all pods that match the target workload’s label selector.
- If set to OwnerReference: Only selects pods that are owned by the target workload through owner references.
This enumerated type allows for future extension with additional selection strategies if needed, for example one based on annotations.
Risks and Mitigations
- Backward compatibility: Mitigated by making the new behavior opt-in with the current behavior as default.
- User confusion: We’ll provide clear documentation on when and how to use each strategy.
Design Details
The HPA specification (v2) will be extended with a new field to control additional filtering of pods after the initial label selector matching:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  # Existing fields...
  SelectionStrategy: OwnerReference # Default: LabelSelector
Since the added field is optional and its omission does not change the existing
autoscaling behavior, this feature will only be added to the latest stable API
version pkg/apis/autoscaling/v2. Older versions (i.e. v1, v2beta1,
v2beta2) will not include the new field, but converters will be updated where
needed to comply with round-trip requirements.
Pod Selection Process:
- Initial Label Selection (Always happens):
  - The HPA first selects pods using the target workload’s label selector
  - This is the fundamental selection mechanism and remains unchanged
- Additional Filtering (Based on SelectionStrategy):
  - LabelSelector (default):
    - No additional filtering
    - All pods that matched the label selector are used for metrics
    - Maintains current behavior for backward compatibility
  - OwnerReference:
    - Further filters the label-selected pods
    - Only keeps pods that are owned by the target workload through owner references
    - Follows the ownership chain (e.g., Pods -> ReplicaSet -> Deployment)
    - Excludes pods that matched labels but aren’t in the ownership chain
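The ownership-chain walk described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `ref` and `object` types are simplified stand-ins for Kubernetes objects and their owner references, the `store` map replaces API server lookups, and `ownedBy` is a hypothetical helper name. A real implementation would likely also compare UIDs and API groups.

```go
package main

import "fmt"

// ref identifies an object by kind and name (assumption: UIDs and API
// groups are left out to keep the sketch small).
type ref struct{ Kind, Name string }

// object is a simplified stand-in for a Kubernetes object with owner references.
type object struct {
	Ref    ref
	Owners []ref
}

// ownedBy reports whether obj reaches target by following owner references.
// Intermediate owners are looked up in store; depth bounds the walk so a
// malformed chain cannot recurse forever.
func ownedBy(obj object, target ref, store map[ref]object, depth int) bool {
	if depth == 0 {
		return false
	}
	for _, o := range obj.Owners {
		if o == target {
			return true
		}
		if parent, ok := store[o]; ok && ownedBy(parent, target, store, depth-1) {
			return true
		}
	}
	return false
}

func main() {
	deploy := ref{"Deployment", "test-app"}
	rs := object{Ref: ref{"ReplicaSet", "test-app-abc"}, Owners: []ref{deploy}}
	store := map[ref]object{rs.Ref: rs}

	appPod := object{Ref: ref{"Pod", "test-app-abc-1"}, Owners: []ref{rs.Ref}}
	jobPod := object{Ref: ref{"Pod", "test-job-1"}, Owners: []ref{{"Job", "test-job"}}}

	fmt.Println("deployment pod owned:", ownedBy(appPod, deploy, store, 3))
	fmt.Println("job pod owned:", ownedBy(jobPod, deploy, store, 3))
}
```

The Deployment's pod reaches the target through its ReplicaSet, while the Job's pod does not, even though both carry the same labels.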
The HorizontalPodAutoscaler API is updated to add a new SelectionStrategy field to the HorizontalPodAutoscalerSpec object:
// SelectionStrategy defines how pods are selected for metrics collection
type SelectionStrategy string

const (
	// LabelSelector selects all pods matching the target's label selector
	LabelSelector SelectionStrategy = "LabelSelector"
	// OwnerReference only selects pods owned by the target workload
	OwnerReference SelectionStrategy = "OwnerReference"
)

// In HorizontalPodAutoscalerSpec:
type HorizontalPodAutoscalerSpec struct {
	// existing fields...

	// SelectionStrategy determines how pods are selected for metrics collection.
	// Valid values are "LabelSelector" and "OwnerReference".
	// If not set, defaults to "LabelSelector" which is the legacy behavior.
	// +optional
	SelectionStrategy *SelectionStrategy `json:"SelectionStrategy,omitempty"`
}
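Because the field is an optional pointer, a consumer has to handle three cases: unset, a known value, and an unknown value. The sketch below shows one way to do that. `resolveStrategy` is a hypothetical helper introduced here for illustration, not something this KEP defines, and the type and constants are repeated only so the example is self-contained.

```go
package main

import "fmt"

// SelectionStrategy and its constants are copied from the API sketch above.
type SelectionStrategy string

const (
	LabelSelector  SelectionStrategy = "LabelSelector"
	OwnerReference SelectionStrategy = "OwnerReference"
)

// resolveStrategy is a hypothetical helper: it treats an unset field as
// LabelSelector and rejects values outside the enumerated set.
func resolveStrategy(s *SelectionStrategy) (SelectionStrategy, error) {
	if s == nil {
		return LabelSelector, nil
	}
	switch *s {
	case LabelSelector, OwnerReference:
		return *s, nil
	default:
		return "", fmt.Errorf("unsupported SelectionStrategy %q", *s)
	}
}

func main() {
	owner := OwnerReference
	unknown := SelectionStrategy("Annotations") // not a valid value today
	for _, s := range []*SelectionStrategy{nil, &owner, &unknown} {
		got, err := resolveStrategy(s)
		fmt.Printf("strategy=%q err=%v\n", got, err)
	}
}
```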
Pluggable Pod Filtering
The HPA controller introduces a pluggable PodFilter interface to encapsulate different filtering strategies:
// PodFilter defines an interface for filtering pods based on various strategies
type PodFilter interface {
	// Filter returns the subset of pods that should be considered for metrics calculation,
	// along with the pods that were filtered out
	Filter(pods []*v1.Pod) (filtered []*v1.Pod, unfiltered []*v1.Pod, err error)
	// Name returns the name of the filter strategy for logging purposes
	Name() string
}
Two implementations are provided:
LabelSelectorFilter:
- Default implementation
- Passes through all pods that match the label selector
- Maintains existing behavior for backward compatibility
OwnerReferenceFilter:
- Validates pod ownership through reference chain
- Only includes pods that are owned by the target workload
- Handles different workload types (Deployments, StatefulSets, etc.)
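The two implementations can be sketched on a simplified pod type as shown below. This is an illustration, not the controller code: `pod` stands in for *v1.Pod, its `Owned` flag is precomputed (a real OwnerReferenceFilter would resolve it by walking owner references), and the result names `kept` and `dropped` correspond to `filtered` and `unfiltered` in the interface above.

```go
package main

import "fmt"

// pod is a simplified stand-in for *v1.Pod. Owned is an assumption made for
// this sketch; the real filter would derive it from owner references.
type pod struct {
	Name  string
	Owned bool
}

// podFilter mirrors the shape of the PodFilter interface on the simplified type.
type podFilter interface {
	Filter(pods []pod) (kept, dropped []pod, err error)
	Name() string
}

// labelSelectorFilter passes every label-matched pod through unchanged.
type labelSelectorFilter struct{}

func (labelSelectorFilter) Filter(pods []pod) ([]pod, []pod, error) {
	return pods, nil, nil
}

func (labelSelectorFilter) Name() string { return "LabelSelector" }

// ownerReferenceFilter keeps only pods owned by the target workload.
type ownerReferenceFilter struct{}

func (ownerReferenceFilter) Filter(pods []pod) (kept, dropped []pod, err error) {
	for _, p := range pods {
		if p.Owned {
			kept = append(kept, p)
		} else {
			dropped = append(dropped, p)
		}
	}
	return kept, dropped, nil
}

func (ownerReferenceFilter) Name() string { return "OwnerReference" }

func main() {
	pods := []pod{{"test-app-abc-1", true}, {"test-job-1", false}}
	for _, f := range []podFilter{labelSelectorFilter{}, ownerReferenceFilter{}} {
		kept, dropped, _ := f.Filter(pods)
		fmt.Printf("%s: kept=%d dropped=%d\n", f.Name(), len(kept), len(dropped))
	}
}
```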
Controller Enhancements
The HPA controller caches filters for improved performance:
type HorizontalController struct {
	// ... existing fields ...
	podFilterCache map[string]PodFilter
	podFilterMux   sync.RWMutex
}
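One possible access pattern for such a cache is a read-locked lookup followed by a write-locked re-check before building the filter. The sketch below assumes the cache key is the HPA's "namespace/name"; the `filter` placeholder type and the method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// filter is a placeholder for a PodFilter implementation.
type filter struct{ strategy string }

// filterCache maps an HPA key (assumed to be "namespace/name") to its filter.
type filterCache struct {
	mu      sync.RWMutex
	filters map[string]*filter
}

func newFilterCache() *filterCache {
	return &filterCache{filters: make(map[string]*filter)}
}

// getOrCreate returns the cached filter for key, building it once if absent.
func (c *filterCache) getOrCreate(key string, build func() *filter) *filter {
	c.mu.RLock()
	f, ok := c.filters[key]
	c.mu.RUnlock()
	if ok {
		return f
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check under the write lock: another goroutine may have stored it.
	if existing, ok := c.filters[key]; ok {
		return existing
	}
	f = build()
	c.filters[key] = f
	return f
}

// invalidate drops the cached filter, for example after a strategy change.
func (c *filterCache) invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.filters, key)
}

func main() {
	cache := newFilterCache()
	builds := 0
	build := func() *filter {
		builds++
		return &filter{strategy: "OwnerReference"}
	}

	cache.getOrCreate("default/test-app-hpa", build)
	cache.getOrCreate("default/test-app-hpa", build)
	fmt.Println("builds after two lookups:", builds)

	cache.invalidate("default/test-app-hpa")
	cache.getOrCreate("default/test-app-hpa", build)
	fmt.Println("builds after invalidation:", builds)
}
```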
All metrics collection methods (e.g., GetResourceReplicas) are updated to accept a PodFilter:
// GetResourceReplicas calculates the desired replica count based on a target resource utilization percentage
// of the given resource for pods matching the given selector in the given namespace, and the current replica count.
// The calculation follows these steps:
// 1. Gets resource metrics for pods in the namespace matching the selector
// 2. Lists all pods matching the selector
// 3. Applies the podFilter to select pods that should be considered for scaling
// 4. Groups considered pods into ready, unready, missing, and ignored pods
// 5. Removes metrics for ignored and unready pods
// 6. Calculates the desired replica count based on the resource utilization of considered pods
//
// Returns:
// - replicaCount: the recommended number of replicas
// - utilization: the current utilization percentage
// - rawUtilization: the raw resource utilization value
// - timestamp: when the metrics were collected
// - err: any error encountered during calculation
func (c *ReplicaCalculator) GetResourceReplicas(ctx context.Context, currentReplicas int32, targetUtilization int32, resource v1.ResourceName, tolerances Tolerances, namespace string, selector labels.Selector, container string, podFilter PodFilter) (replicaCount int32, utilization int32, rawUtilization int64, timestamp time.Time, err error) {
Filtered pods are then used as the basis for replica calculations:
if len(podList) == 0 {
	return 0, 0, 0, time.Time{}, fmt.Errorf("no pods returned by selector while calculating replica count")
}

filteredPods, unfilteredPods, err := podFilter.Filter(podList)
if err != nil {
	// Fall back to default behavior: use all pods
	filteredPods = podList
	unfilteredPods = []*v1.Pod{} // empty slice since we're not filtering out any pods
}

unfilteredPodNames := sets.New[string]()
for _, pod := range unfilteredPods {
	unfilteredPodNames.Insert(pod.Name)
}
removeMetricsForPods(metrics, unfilteredPodNames)

readyPodCount, unreadyPods, missingPods, ignoredPods := groupPods(filteredPods, metrics, resource, c.cpuInitializationPeriod, c.delayOfInitialReadinessStatus)
removeMetricsForPods(metrics, ignoredPods)
removeMetricsForPods(metrics, unreadyPods)
If filtering fails (e.g., due to RBAC issues), the system defaults to using all pods, ensuring robust behavior.
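The fallback can be illustrated in isolation with pod names and a plain metrics map. The sketch below is a simplified model: `filterFunc` stands in for PodFilter.Filter, and `selectMetrics` and `errFilter` are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errFilter is an example error, such as a failed ownership lookup.
var errFilter = errors.New("filter failed")

// filterFunc is a simplified stand-in for PodFilter.Filter on pod names.
type filterFunc func(pods []string) (kept, dropped []string, err error)

// selectMetrics returns a copy of metrics without the pods the filter excluded.
// On a filter error it keeps every pod, mirroring the fallback described above.
func selectMetrics(metrics map[string]int64, pods []string, filter filterFunc) map[string]int64 {
	_, dropped, err := filter(pods)
	if err != nil {
		dropped = nil // fall back: nothing is filtered out
	}
	out := make(map[string]int64, len(metrics))
	for name, v := range metrics {
		out[name] = v
	}
	for _, name := range dropped {
		delete(out, name)
	}
	return out
}

func main() {
	metrics := map[string]int64{"test-app-pod": 10, "test-job-pod": 1000}
	pods := []string{"test-app-pod", "test-job-pod"}

	ownedOnly := func(p []string) ([]string, []string, error) {
		return p[:1], p[1:], nil // assumption: only the first pod is owned
	}
	failing := func(p []string) ([]string, []string, error) {
		return nil, nil, errFilter
	}

	fmt.Println("metrics kept with working filter:", len(selectMetrics(metrics, pods, ownedOnly)))
	fmt.Println("metrics kept with failing filter:", len(selectMetrics(metrics, pods, failing)))
}
```

Note that when the fallback triggers, the calculation for that cycle again includes every label-matched pod, which is the behavior the OwnerReference strategy is meant to avoid.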
The HPA controller implements caching to optimize API server queries when checking pod ownership:
type ControllerCache struct {
	mutex         sync.RWMutex
	resources     map[string]*ControllerCacheEntry
	dynamicClient dynamic.Interface
	restMapper    apimeta.RESTMapper
	cacheTTL      time.Duration
}

type ControllerCacheEntry struct {
	Resource    *unstructured.Unstructured
	Error       error
	LastFetched time.Time
}
The cache system provides several benefits:
- Reduced API Server Load: Caches controller resources to minimize API server queries
- Improved Performance: Faster pod ownership validation through in-memory lookups
- Configurable TTL: Allows tuning of cache freshness vs performance trade-off
- Automatic Cleanup: Background goroutine removes expired entries
When validating pod ownership:
- The system first checks the cache
- If a valid (non-expired) entry exists, it’s returned immediately
- Otherwise, the controller fetches from the API server and updates the cache
- Expired entries are automatically cleaned up by a background goroutine
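The lookup and expiry flow can be sketched with a small TTL cache. The example below is a simplified model under stated assumptions: values are strings instead of unstructured objects, `fetch` stands in for an API server GET, the clock is injectable so expiry can be shown deterministically, and all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry mirrors the idea of ControllerCacheEntry with a string payload.
type entry struct {
	value       string
	lastFetched time.Time
}

// ttlCache is a simplified sketch, not the controller's cache.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time
	fetch   func(key string) string // stand-in for an API server GET
	entries map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time, fetch func(string) string) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, fetch: fetch, entries: map[string]entry{}}
}

// get returns the value for key and whether it was served from the cache.
func (c *ttlCache) get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[key]; ok && c.now().Sub(e.lastFetched) < c.ttl {
		return e.value, true
	}
	v := c.fetch(key) // simplification: the lock is held during the fetch
	c.entries[key] = entry{value: v, lastFetched: c.now()}
	return v, false
}

// cleanup removes expired entries and returns how many were removed.
// A background goroutine could call it periodically from a time.Ticker.
func (c *ttlCache) cleanup() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, e := range c.entries {
		if c.now().Sub(e.lastFetched) >= c.ttl {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

// fakeClock is a manually advanced clock used only for the demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time          { return f.t }
func (f *fakeClock) advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clk := &fakeClock{}
	c := newTTLCache(time.Minute, clk.now, func(key string) string { return "fetched:" + key })

	_, hit := c.get("ReplicaSet/test-app-abc")
	fmt.Println("first lookup served from cache:", hit)
	_, hit = c.get("ReplicaSet/test-app-abc")
	fmt.Println("second lookup served from cache:", hit)

	clk.advance(2 * time.Minute)
	fmt.Println("expired entries removed:", c.cleanup())
}
```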
Scope of Support
This enhancement applies consistently across the following supported metric types in the HorizontalPodAutoscaler:
- Resource metrics (e.g., CPU, memory)
- Pods metrics
- Container resource metrics
- Object metrics (only when the AverageValue type is selected with spec.metrics.object.target.type)
- External metrics (only when the AverageValue type is selected with spec.metrics.external.target.type)
Reference: Kubernetes HPA metric types
When a user updates an HPA to change its pod selection strategy:
- The controller detects strategy changes during HPA updates
- The pod filter cache is cleared for the modified HPA
- A new filter is created using the updated strategy
- An event is recorded to notify users of the strategy change:
Normal StrategyChanged Pod selection strategy changed from LabelSelector to OwnerReference
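The change detection step can be sketched as a comparison of the effective strategy before and after the update. In the example below, an unset field counts as LabelSelector, so moving from unset to an explicit LabelSelector is not reported as a change. The function names are hypothetical, and strings are used in place of the SelectionStrategy type to keep the example short.

```go
package main

import "fmt"

// strategyOrDefault treats an unset field as "LabelSelector".
func strategyOrDefault(s *string) string {
	if s == nil {
		return "LabelSelector"
	}
	return *s
}

// strategyChangeEvent compares the old and new field values and returns the
// event message to record, or changed=false when the effective strategy is
// the same. A caller would also clear the HPA's cached filter on a change.
func strategyChangeEvent(oldField, newField *string) (msg string, changed bool) {
	from, to := strategyOrDefault(oldField), strategyOrDefault(newField)
	if from == to {
		return "", false
	}
	return fmt.Sprintf("Pod selection strategy changed from %s to %s", from, to), true
}

func main() {
	owner := "OwnerReference"
	label := "LabelSelector"

	if msg, changed := strategyChangeEvent(nil, &owner); changed {
		fmt.Println("Normal StrategyChanged", msg)
	}
	if _, changed := strategyChangeEvent(nil, &label); !changed {
		fmt.Println("unset -> LabelSelector: no event")
	}
}
```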
Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
None required.
Unit tests
Tests for Pod Filters:
- Verify LabelSelectorFilter includes all pods matching labels
- Verify OwnerReferenceFilter includes only pods owned by target workload
- Verify filters handle edge cases (no owners, broken chains, multiple owners)
Tests for Replica Calculator:
- Verify calculations with LabelSelectorFilter match current behavior
- Verify calculations with OwnerReferenceFilter only include owned pods
- Verify correct behavior with mixed owned/unowned pods
Coverage:
- /pkg/controller/podautoscaler: 16 June 2025 - 88.0%
- /pkg/controller/podautoscaler/metrics: 16 June 2025 - 90.0%
Integration tests
N/A, the feature is tested using unit tests and e2e tests.
e2e tests
We will add the following e2e autoscaling tests:
- For owner references strategy:
- Workload should not scale up when CPU/Memory usage comes from pods not owned by the target
- HPA ignores metrics from pods with matching labels but no owner reference to the target
- For label selector strategy:
- Workload scales up when CPU/Memory usage comes from any pods matching labels (current behavior)
- HPA considers metrics from all pods with matching labels regardless of ownership
  - Verify backward compatibility when SelectionStrategy is not set
Graduation Criteria
Alpha
- Feature implemented behind a feature flag: HPASelectionStrategy
- Unit and e2e tests passed as designed in TestPlan.
Beta
- Unit and e2e tests passed as designed in TestPlan.
- Gather feedback from developers and surveys
- All functionality completed
- All security enforcement completed
- All monitoring requirements completed
- All testing requirements completed
- All known pre-release issues and gaps resolved
GA
- No negative feedback.
- All issues and gaps identified as feedback during beta are resolved
Upgrade / Downgrade Strategy
Upgrade
Existing HPAs will continue to work as they do today, using the default LabelSelector strategy. Users can use the new feature by enabling the Feature Gate (alpha only) and setting the SelectionStrategy field to OwnerReference on an HPA.
Downgrade
On downgrade, all HPAs will revert to using the LabelSelector strategy, regardless of any configured SelectionStrategy value on the HPA itself.
Version Skew Strategy
- kube-apiserver: More recent instances will accept the new SelectionStrategy field, while older instances will ignore it during validation and persist it as part of the HPA object.
- kube-controller-manager: An older version could receive an HPA containing the new SelectionStrategy field from a more recent API server, in which case it would ignore it (i.e., continue to use the default LabelSelector strategy regardless of the field’s value).
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: HPASelectionStrategy
  - Components depending on the feature gate: kube-controller-manager and kube-apiserver.
Does enabling the feature change any default behavior?
No. By default, HPAs will continue to use the LabelSelector strategy unless the new SelectionStrategy field is explicitly set to OwnerReference.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. If the feature gate is disabled, all HPAs will revert to using the LabelSelector strategy regardless of the value of the SelectionStrategy field.
What happens if we reenable the feature if it was previously rolled back?
When the feature is re-enabled, any HPAs with SelectionStrategy: OwnerReference will resume using the ownership-based pod selection rather than label-based selection.
The HPA controller will immediately begin considering only pods directly owned by the target workload for scaling decisions on these HPAs, potentially changing scaling behavior compared to when the feature was disabled.
Existing HPAs that don’t have SelectionStrategy explicitly set will continue using the default LabelSelector strategy and won’t be affected by re-enabling the feature.
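The enablement, rollback, and re-enablement answers above reduce to a small decision rule, sketched below. `effectiveStrategy` is a hypothetical helper; `gateEnabled` models the HPASelectionStrategy feature gate and an empty `field` models an unset SelectionStrategy.

```go
package main

import "fmt"

// effectiveStrategy returns the strategy the controller would apply.
// gateEnabled models the HPASelectionStrategy feature gate; field is the
// SelectionStrategy value on the HPA, or "" when it is unset (assumption:
// the field has already been validated).
func effectiveStrategy(gateEnabled bool, field string) string {
	if !gateEnabled || field == "" {
		return "LabelSelector"
	}
	return field
}

func main() {
	for _, gate := range []bool{false, true} {
		for _, field := range []string{"", "LabelSelector", "OwnerReference"} {
			fmt.Printf("gate=%-5v field=%-16q -> %s\n",
				gate, field, effectiveStrategy(gate, field))
		}
	}
}
```

Only the combination of an enabled gate and an explicit OwnerReference value selects ownership-based filtering; every other combination keeps label-based selection.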
Are there any tests for feature enablement/disablement?
We will add a unit test verifying that HPAs with and without the new SelectionStrategy field are properly validated, both when the feature gate is enabled and when it is disabled.
This will ensure the HPA controller correctly applies the pod selection strategy based on the feature gate status and presence of the field.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Rollout failures in this feature are unlikely to impact running workloads significantly, but there are edge cases to consider:
- If the feature is enabled during a high-traffic period, HPAs with SelectionStrategy: OwnerReference might suddenly change their scaling decisions based on the reduced pod set. However, this is mitigated by:
  - The HPA’s existing behavior specs (minReplicas/maxReplicas) which prevent extreme scaling events
  - The gradual nature of HPA scaling decisions
- If a kube-controller-manager restarts mid-rollout, some HPAs might temporarily revert to the LabelSelector strategy until the controller fully initializes with the new feature enabled. This is mitigated by:
  - The HPA’s behavior specs which limit the scale of any potential changes
  - Normal operation resumes after controller initialization
These issues would only affect HPAs that have explicitly set SelectionStrategy: OwnerReference. Existing HPAs will continue to function with the default LabelSelector strategy.
What specific metrics should inform a rollback?
Operators should monitor these signals that might indicate problems:
- Unexpected scaling events shortly after enabling the feature
- Significant changes in the number of replicas for workloads using HPAs with SelectionStrategy: OwnerReference
- Increased latency in the horizontal_pod_autoscaler_controller_metric_computation_duration_seconds metric
- Increased error rate in horizontal_pod_autoscaler_controller_metric_computation_total with error status
If these metrics show unusual patterns after enabling the feature, operators should consider rolling back.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No. This feature only adds a new optional field to the HPA API and doesn’t deprecate or remove any existing functionality. All current HPA behaviors remain unchanged unless users explicitly opt into the new selection mode.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
The presence of the SelectionStrategy field in HPA specifications indicates that the feature is in use.
How can someone using this feature know that it is working for their instance?
Users can confirm that the feature is active and functioning by inspecting the conditions exposed by the controller. Specifically, they can verify the value of .spec.SelectionStrategy to ensure the expected behavior is enabled.
Moreover, users can verify the feature is working properly through events on the HPA object:
- When creating or updating an HPA with SelectionStrategy: OwnerReference, an event will be emitted, similar to this:
  Normal SelectionStrategyActive "Pod selection strategy 'OwnerReference' is active"
- When switching strategies, an event will indicate the change, similar to this:
  Normal StrategyChanged "Pod selection strategy changed from 'LabelSelector' to 'OwnerReference'"
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
This feature utilizes the existing HPA controller metrics:
- horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
  - The new pod filtering should not significantly impact these durations
- horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
  - Measures time taken to calculate metrics with labels for action, error, and metric_type
  - The pod filtering logic should work within existing computation time buckets (exponential buckets from 0.001s to ~16s)
- horizontal_pod_autoscaler_controller_metric_computation_total
  - Counts number of metric computations with labels for action, error, and metric_type
  - The pod filtering should not introduce new error cases in metric computation
The feature should maintain the current performance characteristics of these metrics.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
This feature doesn’t fundamentally change how the HPA controller operates; it refines which pods are included in metric calculations.
Therefore, existing metrics for monitoring HPA controller health remain applicable.
Standard HPA metrics (e.g. horizontal_pod_autoscaler_controller_metric_computation_duration_seconds) can be used to verify the HPA controller health.
Are there any missing metrics that would be useful to have to improve observability of this feature?
The following metrics should be added to improve cache observability:
- Cache hit counter: Tracks when the controller successfully retrieves data from cache
- Cache miss counter: Tracks when the controller needs to query the API server
These metrics are essential for:
- Monitoring cache effectiveness
- Optimizing cache TTL settings
- Identifying potential performance issues
- Understanding API server query patterns
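As a rough illustration of what these counters would capture, the sketch below tracks hits and misses with atomic integers and derives a hit ratio. The type and method names are hypothetical; a real controller would more likely register counters with its metrics library, and this KEP does not yet name those metrics.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheStats holds hypothetical hit and miss counters for the ownership cache.
type cacheStats struct {
	hits   int64
	misses int64
}

// record increments the hit or miss counter; safe for concurrent use.
func (s *cacheStats) record(hit bool) {
	if hit {
		atomic.AddInt64(&s.hits, 1)
	} else {
		atomic.AddInt64(&s.misses, 1)
	}
}

// hitRatio returns hits / (hits + misses), or 0 when nothing was recorded.
func (s *cacheStats) hitRatio() float64 {
	h, m := atomic.LoadInt64(&s.hits), atomic.LoadInt64(&s.misses)
	if h+m == 0 {
		return 0
	}
	return float64(h) / float64(h+m)
}

func main() {
	stats := &cacheStats{}
	// Assumed sequence: one miss to populate the cache, then three hits.
	for _, hit := range []bool{false, true, true, true} {
		stats.record(hit)
	}
	fmt.Printf("hit ratio: %.2f\n", stats.hitRatio())
}
```

A persistently low ratio would suggest that the TTL is too short for the reconciliation interval, which is the kind of tuning signal the list above describes.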
Dependencies
Does this feature depend on any specific services running in the cluster?
Scalability
Will enabling / using this feature result in any new API calls?
Yes. Enabling or using this feature will result in new API calls, specifically:
- API Call Type: GET (read) operations
- Resources Involved: Deployments, ReplicaSets, and potentially other workload-related resources
Will enabling / using this feature result in introducing new API types?
No.
Will enabling / using this feature result in any new calls to the cloud provider?
No.
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes. HorizontalPodAutoscaler objects will increase in size by approximately 39 bytes for the string field when it is specified.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes, enabling this feature may introduce a slight increase in latency due to additional resource checks. For example, in the case of a Deployment, the system may need to perform two extra ownership checks (e.g., Pod → ReplicaSet → Deployment). While this added processing could have some impact, it is expected to be negligible in most scenarios.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
Yes, caching will be implemented for each PodFilter strategy, as well as for other resources, to reduce the number of calls to the API server (as described above).
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to:
- Retrieve HPA objects
- Get pod metrics
- Access workload information
- Update HPA status
Therefore, no autoscaling decisions can be made during this period, regardless of the configured selection strategy. The feature itself doesn’t introduce any new failure modes with respect to API server or etcd availability - it’s dependent on these components being available just like the rest of the HPA controller’s functionality. Once API server and etcd access is restored, the HPA controller will resume normal operation, including the pod selection strategy specified in the HPA.
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?
Check horizontal_pod_autoscaler_controller_metric_computation_duration_seconds to identify if the increased latency correlates with HPAs using the OwnerReference selection strategy.
If latency issues are observed:
- Check if the problem only affects HPAs with SelectionStrategy: OwnerReference
- Verify if the latency increases with deeper ownership chains (e.g., Pod → ReplicaSet → Deployment)
For problematic HPAs, you can:
- Temporarily revert to the default label-based selection by removing the SelectionStrategy field
- Or explicitly set SelectionStrategy: LabelSelector to maintain backward compatibility
Implementation History
KEP Published: 05/22/2025