KEP-5325: HPA - Improve pod selection accuracy across workload types

Release Signoff Checklist
Summary
Motivation
- Goals
- Non-Goals
Proposal
- Risks and Mitigations
Design Details
Production Readiness Review Questionnaire
Implementation History
Drawbacks
Alternatives
Infrastructure Needed (Optional)

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

(R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
(R) KEP approvers have approved the KEP status as implementable
(R) Design details are appropriately documented
(R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
(R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
(R) Production readiness review completed
(R) Production readiness review approved
“Implementation History” section is up-to-date for milestone
User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The Horizontal Pod Autoscaler (HPA) has a critical limitation in its pod selection mechanism: it collects metrics from all pods that match the target workload’s label selector. This can lead to incorrect scaling decisions when unrelated pods (such as Jobs, CronJobs, or other Deployments) happen to share the same labels.

This often results in unexpected behavior such as:

HPAs stuck at maxReplicas despite low actual usage in the target workload
Unnecessary scaling events triggered by temporary workloads
Unpredictable scaling behavior that’s difficult to diagnose

This proposal adds a parameter to HPAs which ensures the HPA only considers pods that are actually owned by the target workload, through owner references, rather than all pods matching the label selector.

Motivation

Consider this example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 100m
---
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    metadata:
      labels:
        app: test-app  # Same label as deployment
        workload: scraper
    spec:
      containers:
      - name: cpu-load
        image: busybox
        command: ["dd", "if=/dev/zero", "of=/dev/null"]
        resources:
          requests:
            cpu: 100m
      restartPolicy: Never
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: test-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

In this case, the HPA will factor in CPU consumption from the Job’s pod, despite it not being part of the Deployment, potentially causing incorrect scaling decisions.
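The effect on the scaling decision can be worked through numerically. The sketch below is illustrative only: the usage figures (10m for the Deployment pod, 900m for the Job pod) are hypothetical, the helper names are not part of the HPA controller, and the tolerance band is ignored. It applies the documented HPA formula, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), to the manifests above.

```go
package main

import (
	"fmt"
	"math"
)

// podCPU is one pod's CPU usage and request in millicores.
type podCPU struct {
	name     string
	usageM   int64 // hypothetical observed usage
	requestM int64 // from resources.requests.cpu
}

// averageUtilization returns total usage as a percentage of total requests
// across the given pods.
func averageUtilization(pods []podCPU) float64 {
	var usage, request int64
	for _, p := range pods {
		usage += p.usageM
		request += p.requestM
	}
	if request == 0 {
		return 0
	}
	return 100 * float64(usage) / float64(request)
}

// desiredReplicas applies ceil(current * utilization / target) and clamps the
// result to [minReplicas, maxReplicas]. The tolerance band is ignored here.
func desiredReplicas(current int32, utilization, target float64, minReplicas, maxReplicas int32) int32 {
	d := int32(math.Ceil(float64(current) * utilization / target))
	if d < minReplicas {
		return minReplicas
	}
	if d > maxReplicas {
		return maxReplicas
	}
	return d
}

func main() {
	deployPod := podCPU{name: "test-app-pod", usageM: 10, requestM: 100}
	jobPod := podCPU{name: "test-job-pod", usageM: 900, requestM: 100}

	all := averageUtilization([]podCPU{deployPod, jobPod})
	owned := averageUtilization([]podCPU{deployPod})

	fmt.Printf("label selector: %.0f%% -> %d replicas\n", all, desiredReplicas(1, all, 50, 1, 5))
	fmt.Printf("owner reference: %.0f%% -> %d replicas\n", owned, desiredReplicas(1, owned, 50, 1, 5))
}
```

With these assumed numbers, counting the Job pod yields 455% utilization and pins the HPA at maxReplicas (5), while counting only the Deployment pod yields 10% and keeps it at 1 replica.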

Goals

Improve the accuracy of HPA’s pod selection to only include pods directly managed by the target workload
Maintain backward compatibility with existing HPA configurations
Provide clear visibility into which pods are being considered for scaling decisions
Allow users to choose between selection strategies based on their needs

Non-Goals

Modifying how metrics are collected from pods
Changing the scaling algorithm itself
Addressing other HPA limitations not related to pod selection

Proposal

We propose adding a new field to the HPA specification called SelectionStrategy that allows users to specify how pods should be selected for metric collection:

If set to LabelSelector (default): Uses the current behavior of selecting all pods that match the target workload’s label selector.
If set to OwnerReference: Only selects pods that are owned by the target workload through owner references.

This enumerated type allows for future extension with additional selection strategies if needed, such as annotation-based selection.

Risks and Mitigations

Backward compatibility: Mitigated by making the new behavior opt-in with the current behavior as default.
User confusion: We’ll provide clear documentation on when and how to use each strategy.

Design Details

The HPA specification (v2) will be extended with a new field to control additional filtering of pods after the initial label selector matching:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  # Existing fields...
  SelectionStrategy: OwnerReference  # Default: LabelSelector

Since the added field is optional and its omission does not change the existing autoscaling behavior, this feature will only be added to the latest stable API version, pkg/apis/autoscaling/v2. Older versions (i.e., v1, v2beta1, v2beta2) will not include the new field, but converters will be updated where needed to comply with round-trip requirements.

Pod Selection Process:

Initial Label Selection (Always happens):
- The HPA first selects pods using the target workload’s label selector
- This is the fundamental selection mechanism and remains unchanged
Additional Filtering (Based on SelectionStrategy):
- LabelSelector(default):
  - No additional filtering
  - All pods that matched the label selector are used for metrics
  - Maintains current behavior for backward compatibility
- OwnerReference:
  - Further filters the label-selected pods
  - Only keeps pods that are owned by the target workload through owner references
  - Follows the ownership chain (e.g., Pods -> ReplicaSet -> Deployment)
  - Excludes pods that matched labels but aren’t in the ownership chain
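The ownership-chain check in the OwnerReference step can be sketched as a bounded upward walk. This is a minimal illustration, not the controller's implementation: the ref and object types, the lookup callback, and the maxDepth bound are all hypothetical simplifications, and a real implementation would compare owner UIDs rather than kind and name.

```go
package main

import "fmt"

// ref identifies an object by kind and name. A real implementation would
// compare UIDs; kind and name keep this sketch short.
type ref struct{ Kind, Name string }

// object is a simplified stand-in for a Kubernetes object.
type object struct {
	Self   ref
	Owners []ref
}

// isOwnedBy walks owner references upward from obj, one level per iteration,
// and reports whether target appears within maxDepth levels.
func isOwnedBy(obj object, target ref, lookup func(ref) (object, bool), maxDepth int) bool {
	frontier := obj.Owners
	for depth := 0; depth < maxDepth && len(frontier) > 0; depth++ {
		var next []ref
		for _, owner := range frontier {
			if owner == target {
				return true
			}
			// A missing owner is a broken chain: stop following that branch.
			if parent, ok := lookup(owner); ok {
				next = append(next, parent.Owners...)
			}
		}
		frontier = next
	}
	return false
}

func main() {
	deployment := ref{"Deployment", "test-app"}
	rs := object{Self: ref{"ReplicaSet", "test-app-rs"}, Owners: []ref{deployment}}
	store := map[ref]object{rs.Self: rs}
	lookup := func(r ref) (object, bool) { o, ok := store[r]; return o, ok }

	deployPod := object{Self: ref{"Pod", "test-app-pod"}, Owners: []ref{rs.Self}}
	jobPod := object{Self: ref{"Pod", "test-job-pod"}, Owners: []ref{{"Job", "test-job"}}}

	fmt.Println(isOwnedBy(deployPod, deployment, lookup, 2)) // true
	fmt.Println(isOwnedBy(jobPod, deployment, lookup, 2))    // false
}
```

Bounding the walk keeps a malformed or cyclic owner graph from looping forever; a depth of 2 covers Pod -> ReplicaSet -> Deployment.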

The HorizontalPodAutoscaler API is updated to add a new SelectionStrategy field to the HorizontalPodAutoscalerSpec object:

// SelectionStrategy defines how pods are selected for metrics collection
type SelectionStrategy string

const (
    // LabelSelector selects all pods matching the target's label selector
    LabelSelector SelectionStrategy = "LabelSelector"
    
    // OwnerReference only selects pods owned by the target workload
    OwnerReference SelectionStrategy = "OwnerReference"
)

// In HorizontalPodAutoscalerSpec:
type HorizontalPodAutoscalerSpec struct {
    // existing fields...

    // SelectionStrategy determines how pods are selected for metrics collection.
    // Valid values are "LabelSelector" and "OwnerReference".
    // If not set, defaults to "LabelSelector" which is the legacy behavior.
    // +optional
    SelectionStrategy *SelectionStrategy `json:"SelectionStrategy,omitempty"`
}

Pluggable Pod Filtering

The HPA controller introduces a pluggable PodFilter interface to encapsulate different filtering strategies:

// PodFilter defines an interface for filtering pods based on various strategies
type PodFilter interface {
	// Filter returns the subset of pods that should be considered for metrics calculation,
	// along with the pods that were filtered out
	Filter(pods []*v1.Pod) (filtered []*v1.Pod, unfiltered []*v1.Pod, err error)
	// Name returns the name of the filter strategy for logging purposes
	Name() string
}

Two implementations are provided:

LabelSelectorFilter:

Default implementation
Passes through all pods that match the label selector
Maintains existing behavior for backward compatibility

OwnerReferenceFilter:

Validates pod ownership through reference chain
Only includes pods that are owned by the target workload
Handles different workload types (Deployments, StatefulSets, etc.)
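The two implementations might look roughly like the following. This is a sketch under stated assumptions: the pod type is reduced to a name and a single owner, the ownership check is injected as a callback, and newPodFilter is a hypothetical factory that is not named in this KEP. The return values kept and dropped correspond to filtered and unfiltered in the PodFilter interface above.

```go
package main

import "fmt"

// pod is a simplified stand-in for *v1.Pod.
type pod struct {
	Name  string
	Owner string // name of the resolved top-level owner, simplified
}

// podFilter mirrors the shape of the PodFilter interface.
type podFilter interface {
	Filter(pods []pod) (kept, dropped []pod, err error)
	Name() string
}

// labelSelectorFilter passes every label-selected pod through unchanged.
type labelSelectorFilter struct{}

func (labelSelectorFilter) Filter(pods []pod) ([]pod, []pod, error) { return pods, nil, nil }
func (labelSelectorFilter) Name() string                            { return "LabelSelector" }

// ownerReferenceFilter keeps only pods for which owned reports true.
type ownerReferenceFilter struct {
	owned func(pod) (bool, error)
}

func (f ownerReferenceFilter) Filter(pods []pod) ([]pod, []pod, error) {
	var kept, dropped []pod
	for _, p := range pods {
		ok, err := f.owned(p)
		if err != nil {
			return nil, nil, fmt.Errorf("checking ownership of pod %q: %w", p.Name, err)
		}
		if ok {
			kept = append(kept, p)
		} else {
			dropped = append(dropped, p)
		}
	}
	return kept, dropped, nil
}

func (ownerReferenceFilter) Name() string { return "OwnerReference" }

// newPodFilter (hypothetical) maps a strategy value to a filter; an unset
// strategy falls back to the default LabelSelector behavior.
func newPodFilter(strategy string, owned func(pod) (bool, error)) (podFilter, error) {
	switch strategy {
	case "", "LabelSelector":
		return labelSelectorFilter{}, nil
	case "OwnerReference":
		return ownerReferenceFilter{owned: owned}, nil
	default:
		return nil, fmt.Errorf("unknown selection strategy %q", strategy)
	}
}

func main() {
	owned := func(p pod) (bool, error) { return p.Owner == "test-app", nil }
	pods := []pod{{"test-app-pod", "test-app"}, {"test-job-pod", "test-job"}}

	f, _ := newPodFilter("OwnerReference", owned)
	kept, dropped, _ := f.Filter(pods)
	fmt.Println(f.Name(), len(kept), len(dropped))
}
```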

Controller Enhancements

The HPA controller caches filters for improved performance:

type HorizontalController struct {
    // ... existing fields ...
    podFilterCache map[string]PodFilter
    podFilterMux   sync.RWMutex
}
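A read-mostly cache guarded by a sync.RWMutex is typically accessed with a get-or-create helper. The sketch below is one possible shape, not the controller's code: the "namespace/name" key format, the reduced podFilter interface, and the getOrCreate and invalidate names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// podFilter is reduced to the one method this sketch needs.
type podFilter interface{ Name() string }

type namedFilter string

func (n namedFilter) Name() string { return string(n) }

// filterCache maps an HPA key (assumed here to be "namespace/name") to its filter.
type filterCache struct {
	mu      sync.RWMutex
	filters map[string]podFilter
}

func newFilterCache() *filterCache {
	return &filterCache{filters: make(map[string]podFilter)}
}

// getOrCreate returns the cached filter for key, building it on first use.
func (c *filterCache) getOrCreate(key string, build func() podFilter) podFilter {
	c.mu.RLock()
	f, ok := c.filters[key]
	c.mu.RUnlock()
	if ok {
		return f
	}

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another goroutine may have built the filter between the locks.
	if f, ok := c.filters[key]; ok {
		return f
	}
	f = build()
	c.filters[key] = f
	return f
}

// invalidate drops the cached filter, e.g. after a strategy change.
func (c *filterCache) invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.filters, key)
}

func main() {
	cache := newFilterCache()
	builds := 0
	build := func(name string) func() podFilter {
		return func() podFilter { builds++; return namedFilter(name) }
	}

	cache.getOrCreate("default/test-app-hpa", build("LabelSelector"))
	cache.getOrCreate("default/test-app-hpa", build("LabelSelector")) // served from cache
	cache.invalidate("default/test-app-hpa")
	f := cache.getOrCreate("default/test-app-hpa", build("OwnerReference"))
	fmt.Println(f.Name(), builds)
}
```

The read lock covers the common case where the filter already exists; the write lock is taken only on a miss, with a second lookup so that concurrent reconciles do not build the same filter twice.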

All metrics collection methods (e.g., GetResourceReplicas) are updated to accept a PodFilter:

// GetResourceReplicas calculates the desired replica count based on a target resource utilization percentage
// of the given resource for pods matching the given selector in the given namespace, and the current replica count.
// The calculation follows these steps:
// 1. Gets resource metrics for pods in the namespace matching the selector
// 2. Lists all pods matching the selector
// 3. Applies the podFilter to select pods that should be considered for scaling
// 4. Groups considered pods into ready, unready, missing, and ignored pods
// 5. Removes metrics for ignored and unready pods
// 6. Calculates the desired replica count based on the resource utilization of considered pods
//
// Returns:
// - replicaCount: the recommended number of replicas
// - utilization: the current utilization percentage
// - rawUtilization: the raw resource utilization value
// - timestamp: when the metrics were collected
// - err: any error encountered during calculation
func (c *ReplicaCalculator) GetResourceReplicas(ctx context.Context, currentReplicas int32, targetUtilization int32, resource v1.ResourceName, tolerances Tolerances, namespace string, selector labels.Selector, container string, podFilter PodFilter) (replicaCount int32, utilization int32, rawUtilization int64, timestamp time.Time, err error) {

Filtered pods are then used as the basis for replica calculations:

	if len(podList) == 0 {
		return 0, 0, 0, time.Time{}, fmt.Errorf("no pods returned by selector while calculating replica count")
	}

	// Split the label-selected pods into those kept for scaling (filteredPods)
	// and those excluded by the selection strategy (unfilteredPods).
	filteredPods, unfilteredPods, err := podFilter.Filter(podList)
	if err != nil {
		// Fall back to default behavior: use all pods
		filteredPods = podList
		unfilteredPods = []*v1.Pod{} // empty slice since we're not filtering out any pods
	}

	// Drop metrics for excluded pods so they do not influence the calculation.
	unfilteredPodNames := sets.New[string]()
	for _, pod := range unfilteredPods {
		unfilteredPodNames.Insert(pod.Name)
	}
	removeMetricsForPods(metrics, unfilteredPodNames)

	readyPodCount, unreadyPods, missingPods, ignoredPods := groupPods(filteredPods, metrics, resource, c.cpuInitializationPeriod, c.delayOfInitialReadinessStatus)
	removeMetricsForPods(metrics, ignoredPods)
	removeMetricsForPods(metrics, unreadyPods)

If filtering fails (e.g., due to RBAC issues), the controller falls back to using all label-selected pods, so scaling continues with the legacy behavior instead of failing the reconciliation.

The HPA controller implements caching to optimize API server queries when checking pod ownership:

type ControllerCache struct {
    mutex         sync.RWMutex
    resources     map[string]*ControllerCacheEntry
    dynamicClient dynamic.Interface
    restMapper    apimeta.RESTMapper
    cacheTTL      time.Duration
}

type ControllerCacheEntry struct {
    Resource    *unstructured.Unstructured
    Error       error
    LastFetched time.Time
}

The cache system provides several benefits:

Reduced API Server Load: Caches controller resources to minimize API server queries
Improved Performance: Faster pod ownership validation through in-memory lookups
Configurable TTL: Allows tuning of cache freshness vs performance trade-off
Automatic Cleanup: Background goroutine removes expired entries

When validating pod ownership:

The system first checks the cache
If a valid (non-expired) entry exists, it’s returned immediately
Otherwise, the controller fetches from the API server and updates the cache
Expired entries are automatically cleaned up by a background goroutine
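The lookup and cleanup flow can be sketched as a small TTL cache. This is an illustration under several assumptions: values are plain strings instead of unstructured objects, fetch errors are not cached (unlike the Error field in ControllerCacheEntry), the clock is injected so expiry can be shown without sleeping, and all names here (ttlCache, sweep, fakeClock, newDemoCache) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type cacheEntry struct {
	value       string // stands in for *unstructured.Unstructured
	lastFetched time.Time
}

type ttlCache struct {
	mu      sync.RWMutex
	entries map[string]cacheEntry
	ttl     time.Duration
	now     func() time.Time                 // injectable clock
	fetch   func(key string) (string, error) // stands in for an API server GET
}

// get returns a non-expired cached value, or fetches and stores a new one.
// Two concurrent misses may both fetch; the later write simply wins.
func (c *ttlCache) get(key string) (string, error) {
	c.mu.RLock()
	e, ok := c.entries[key]
	c.mu.RUnlock()
	if ok && c.now().Sub(e.lastFetched) < c.ttl {
		return e.value, nil
	}
	v, err := c.fetch(key)
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.entries[key] = cacheEntry{value: v, lastFetched: c.now()}
	c.mu.Unlock()
	return v, nil
}

// sweep deletes expired entries and returns how many were removed.
// A background goroutine would call it on a ticker.
func (c *ttlCache) sweep() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	now := c.now()
	for k, e := range c.entries {
		if now.Sub(e.lastFetched) >= c.ttl {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

// fakeClock is a manually advanced clock for the demo.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time       { return f.t }
func (f *fakeClock) advanceSeconds(n int) { f.t = f.t.Add(time.Duration(n) * time.Second) }

// newDemoCache wires a cache with a 60-second TTL to a fake clock.
func newDemoCache(fetch func(string) (string, error)) (*ttlCache, *fakeClock) {
	clock := &fakeClock{t: time.Unix(0, 0)}
	return &ttlCache{entries: map[string]cacheEntry{}, ttl: 60 * time.Second, now: clock.now, fetch: fetch}, clock
}

func main() {
	fetches := 0
	cache, clock := newDemoCache(func(key string) (string, error) {
		fetches++
		return "object for " + key, nil
	})
	cache.get("ReplicaSet/test-app-rs") // miss: fetched
	cache.get("ReplicaSet/test-app-rs") // hit: served from cache
	clock.advanceSeconds(61)
	removed := cache.sweep()
	cache.get("ReplicaSet/test-app-rs") // miss again after expiry
	fmt.Println(fetches, removed)
}
```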

Scope of Support

This enhancement applies consistently across the following supported metric types in the HorizontalPodAutoscaler:

Resource metrics (e.g., CPU, memory)
Pods metrics
Container resource metrics
Object metrics (only when AverageValue¹ type is selected with spec.metrics.object.target.type)
External metrics (only when AverageValue¹ type is selected with spec.metrics.external.target.type)

Reference: Kubernetes HPA metric types
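For the AverageValue cases, the pod count used as the divisor directly changes the value compared against the target. The arithmetic below is a hypothetical illustration (a metric total of 600, a target of 180, three owned pods plus one unrelated pod with matching labels); perPodAverage is not a controller function.

```go
package main

import "fmt"

// perPodAverage divides a metric total by the number of pods considered.
func perPodAverage(total float64, podCount int) (float64, error) {
	if podCount <= 0 {
		return 0, fmt.Errorf("no pods to average over")
	}
	return total / float64(podCount), nil
}

func main() {
	const total = 600.0  // hypothetical metric value, e.g. queued requests
	const target = 180.0 // hypothetical AverageValue target

	withLabelSelector, _ := perPodAverage(total, 4) // 3 owned pods + 1 unrelated pod
	withOwnerReference, _ := perPodAverage(total, 3)

	fmt.Printf("LabelSelector: %.0f per pod, above target: %t\n", withLabelSelector, withLabelSelector > target)
	fmt.Printf("OwnerReference: %.0f per pod, above target: %t\n", withOwnerReference, withOwnerReference > target)
}
```

In this example the unrelated pod dilutes the average to 150, which hides load that the three owned pods actually carry at 200 per pod. With the Value target type the total is compared to the target as-is, so the pod count plays no role.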

When a user updates an HPA to change its pod selection strategy:

The controller detects strategy changes during HPA updates
The pod filter cache is cleared for the modified HPA
A new filter is created using the updated strategy
An event is recorded to notify users of the strategy change:

Normal  StrategyChanged  Pod selection strategy changed from LabelSelector to OwnerReference

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

None required.

Unit tests

Tests for Pod Filters:

Verify LabelSelectorFilter includes all pods matching labels
Verify OwnerReferenceFilter includes only pods owned by target workload
Verify filters handle edge cases (no owners, broken chains, multiple owners)

Tests for Replica Calculator:

Verify calculations with LabelSelectorFilter match current behavior
Verify calculations with OwnerReferenceFilter only include owned pods
Verify correct behavior with mixed owned/unowned pods

/pkg/controller/podautoscaler: 16 June 2025 - 88.0%
/pkg/controller/podautoscaler/metrics: 16 June 2025 - 90.0%

Integration tests

N/A, the feature is tested using unit tests and e2e tests.

e2e tests

We will add the following e2e autoscaling tests:

For owner references strategy:
- Workload should not scale up when CPU/Memory usage comes from pods not owned by the target
- HPA ignores metrics from pods with matching labels but no owner reference to the target
For label selector strategy:
- Workload scales up when CPU/Memory usage comes from any pods matching labels (current behavior)
- HPA considers metrics from all pods with matching labels regardless of ownership
- Verify backward compatibility when SelectionStrategy is not set

Graduation Criteria

Alpha

Feature implemented behind a feature flag: HPASelectionStrategy
Unit and e2e tests pass as described in the Test Plan.

Beta

Unit and e2e tests pass as described in the Test Plan.
Gather feedback from developers and surveys
All functionality completed
All security enforcement completed
All monitoring requirements completed
All testing requirements completed
All known pre-release issues and gaps resolved

GA

No negative feedback.
All issues and gaps identified as feedback during beta are resolved

Upgrade / Downgrade Strategy

Upgrade

Existing HPAs will continue to work as they do today, using the default LabelSelector strategy. Users can use the new feature by enabling the Feature Gate (alpha only) and setting the SelectionStrategy field to OwnerReference on an HPA.

Downgrade

On downgrade, all HPAs will revert to using the LabelSelector strategy, regardless of any configured SelectionStrategy value on the HPA itself.

Version Skew Strategy

kube-apiserver: More recent instances will accept the new SelectionStrategy field, while older instances will ignore it during validation and persist it as part of the HPA object.
kube-controller-manager: An older version could receive an HPA containing the new SelectionStrategy field from a more recent API server, in which case it would ignore it (i.e., continue to use the default LabelSelector strategy regardless of the field’s value).

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: HPASelectionStrategy
- Components depending on the feature gate: kube-controller-manager and kube-apiserver.

Does enabling the feature change any default behavior?

No. By default, HPAs will continue to use the LabelSelector strategy unless the new SelectionStrategy field is explicitly set to OwnerReference.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. If the feature gate is disabled, all HPAs will revert to using the LabelSelector strategy regardless of the value of the SelectionStrategy field.

What happens if we reenable the feature if it was previously rolled back?

When the feature is re-enabled, any HPAs with SelectionStrategy: OwnerReference will resume using the ownership-based pod selection rather than label-based selection. The HPA controller will immediately begin considering only pods directly owned by the target workload for scaling decisions on these HPAs, potentially changing scaling behavior compared to when the feature was disabled.

Existing HPAs that don’t have SelectionStrategy explicitly set will continue using the default LabelSelector strategy and won’t be affected by re-enabling the feature.
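The enablement, rollback, and re-enablement answers above reduce to one rule: the OwnerReference strategy takes effect only when the gate is on and the field is set. A minimal sketch of that rule follows; effectiveStrategy and ptr are hypothetical helpers, and the gate is passed as a plain boolean rather than read from the feature-gate machinery.

```go
package main

import "fmt"

type SelectionStrategy string

const (
	LabelSelector  SelectionStrategy = "LabelSelector"
	OwnerReference SelectionStrategy = "OwnerReference"
)

// effectiveStrategy returns the strategy the controller would apply.
// With the gate off, or with the field unset, the legacy LabelSelector
// behavior is used regardless of what is stored on the HPA.
func effectiveStrategy(gateEnabled bool, field *SelectionStrategy) SelectionStrategy {
	if !gateEnabled || field == nil {
		return LabelSelector
	}
	return *field
}

// ptr returns a pointer to s, mirroring the optional pointer field in the spec.
func ptr(s SelectionStrategy) *SelectionStrategy { return &s }

func main() {
	fmt.Println(effectiveStrategy(true, ptr(OwnerReference)))  // gate on, field set
	fmt.Println(effectiveStrategy(false, ptr(OwnerReference))) // gate off: rolled back
	fmt.Println(effectiveStrategy(true, nil))                  // field unset: default
}
```

Because the stored field value is left untouched while the gate is off, re-enabling the gate restores OwnerReference behavior without any change to the HPA object.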

Are there any tests for feature enablement/disablement?

We will add a unit test verifying that HPAs with and without the new SelectionStrategy field are properly validated, both when the feature gate is enabled and when it is disabled. This will ensure the HPA controller correctly applies the pod selection strategy based on the feature gate status and presence of the field.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Rollout failures in this feature are unlikely to impact running workloads significantly, but there are edge cases to consider:

If the feature is enabled during a high-traffic period, HPAs with SelectionStrategy: OwnerReference might suddenly change their scaling decisions based on the reduced pod set. However, this is mitigated by:

The HPA’s existing behavior specs (minReplicas/maxReplicas) which prevent extreme scaling events
The gradual nature of HPA scaling decisions

If a kube-controller-manager restarts mid-rollout, some HPAs might temporarily revert to the LabelSelector strategy until the controller fully initializes with the new feature enabled. This is mitigated by:

The HPA’s behavior specs which limit the scale of any potential changes
Normal operation resumes after controller initialization

These issues would only affect HPAs that have explicitly set SelectionStrategy: OwnerReference. Existing HPAs will continue to function with the default LabelSelector strategy.

What specific metrics should inform a rollback?

Operators should monitor these signals that might indicate problems:

Unexpected scaling events shortly after enabling the feature
Significant changes in the number of replicas for workloads using HPAs with SelectionStrategy: OwnerReference
Increased latency in the horizontal_pod_autoscaler_controller_metric_computation_duration_seconds metric
Increased error rate in horizontal_pod_autoscaler_controller_metric_computation_total with error status

If these metrics show unusual patterns after enabling the feature, operators should consider rolling back.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No. This feature only adds a new optional field to the HPA API and doesn’t deprecate or remove any existing functionality. All current HPA behaviors remain unchanged unless users explicitly opt into the new selection mode.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The presence of the SelectionStrategy field in HPA specifications indicates that the feature is in use.

How can someone using this feature know that it is working for their instance?

Users can confirm that the feature is active and functioning by inspecting the conditions exposed by the controller. Specifically, they can verify the value of .spec.SelectionStrategy to ensure the expected behavior is enabled. Moreover, users can verify the feature is working properly through events on the HPA object:

When creating or updating an HPA with SelectionStrategy: OwnerReference, an event will be emitted, similar to this: Normal SelectionStrategyActive "Pod selection strategy 'OwnerReference' is active"
When switching strategies, an event will indicate the change, similar to this: Normal StrategyChanged "Pod selection strategy changed from 'LabelSelector' to 'OwnerReference'"

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This feature utilizes the existing HPA controller metrics:

horizontal_pod_autoscaler_controller_reconciliation_duration_seconds
- The new pod filtering should not significantly impact these durations
horizontal_pod_autoscaler_controller_metric_computation_duration_seconds
- Measures time taken to calculate metrics with labels for action, error, and metric_type
- The pod filtering logic should work within existing computation time buckets (exponential buckets from 0.001s to ~16s)
horizontal_pod_autoscaler_controller_metric_computation_total
- Counts number of metric computations with labels for action, error, and metric_type
- The pod filtering should not introduce new error cases in metric computation

The feature should maintain the current performance characteristics of these metrics.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

This feature doesn’t fundamentally change how the HPA controller operates; it refines which pods are included in metric calculations. Therefore, existing metrics for monitoring HPA controller health remain applicable. Standard HPA metrics (e.g. horizontal_pod_autoscaler_controller_metric_computation_duration_seconds) can be used to verify the HPA controller health.

Are there any missing metrics that would be useful to have to improve observability of this feature?

The following metrics should be added to improve cache observability:

Cache hit counter: Tracks when the controller successfully retrieves data from cache
Cache miss counter: Tracks when the controller needs to query the API server

These metrics are essential for:

Monitoring cache effectiveness
Optimizing cache TTL settings
Identifying potential performance issues
Understanding API server query patterns

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?

Yes. Enabling or using this feature will result in new API calls, specifically:

API Call Type: GET (read) operations
Resources Involved: Deployments, ReplicaSets, and potentially other workload-related resources

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

Yes, HorizontalPodAutoscaler objects will increase in size by approximately 39 bytes for the string field when it is specified.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, enabling this feature may introduce a slight increase in latency due to additional resource checks. For example, in the case of a Deployment, the system may need to perform two extra ownership checks (e.g., Pod → ReplicaSet → Deployment). While this added processing could have some impact, it is expected to be negligible in most scenarios.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes. Caching will be implemented for each PodFilter strategy, as well as for controller resources, which adds some memory use in kube-controller-manager in exchange for fewer calls to the API server (as described above).

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to:

Retrieve HPA objects
Get pod metrics
Access workload information
Update HPA status

Therefore, no autoscaling decisions can be made during this period, regardless of the configured selection strategy. The feature itself doesn’t introduce any new failure modes with respect to API server or etcd availability - it’s dependent on these components being available just like the rest of the HPA controller’s functionality. Once API server and etcd access is restored, the HPA controller will resume normal operation, including the pod selection strategy specified in the HPA.

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Check horizontal_pod_autoscaler_controller_metric_computation_duration_seconds to identify if the increased latency correlates with HPAs using the OwnerReference selection strategy. If latency issues are observed:

Check if the problem only affects HPAs with SelectionStrategy: OwnerReference
Verify if the latency increases with deeper ownership chains (e.g., Pod → ReplicaSet → Deployment)

For problematic HPAs, you can:

Temporarily revert to the default label-based selection by removing the SelectionStrategy field
Or explicitly set SelectionStrategy: LabelSelector to maintain backward compatibility

Implementation History

KEP Published: 05/22/2025

Drawbacks

Alternatives

Infrastructure Needed (Optional)

¹ With AverageValue, the value returned from the custom metrics API is divided by the number of Pods before being compared to the target, thus requiring improved pod selection. With Value, however, the target is compared directly to the metric returned from the API.