KEP-4346: Add Informer Metrics

Status: Alpha, implementable
Created: 2023-11-27
Latest milestone: v1.30
Alpha milestone: v1.30

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • “Implementation History” section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Informers are a base component of most Kubernetes controllers, so it is important to have a way to check whether they are healthy. This enhancement proposal adds metrics to the client-go informer, exposing internal reflector/queue/eventHandler metrics to Prometheus. These metrics are useful for developers and reliability engineers, who can rely on them to monitor informers.

Motivation

A Kubernetes controller watches objects to compare the desired state with the actual state, then sends instructions to bring the actual state closer to the desired state. Most controllers use an informer to watch for object changes, then add the work items that require reconciliation to the workqueue.

The workqueue already exposes metrics such as queueLatency and workDuration, which are useful for finding issues in the reconcile routine. However, when many objects need to be reconciled but no new work items reach the workqueue, the informer is most likely blocked. An informer is composed of a reflector, a queue, and eventHandlers; to find the root cause today, users have to add debug logging and change the log level.

If the informer exposed reflector/queue/eventHandler metrics, it would be easy to find out why an informer is blocked. For example, the metrics would show how long, in seconds, an eventHandler takes to process an item.

Reflector metrics existed previously but were removed in https://github.com/kubernetes/kubernetes/pull/74636 because of a memory leak. Reintroducing them requires that the memory leak issue be fixed.

Goals

  • Add metrics for informer
  • Expose informer reflector/queue/eventHandler metrics

Non-Goals

  • It does not introduce breaking changes for controllers which use informer.
  • It does not modify core Kubernetes components which use informer.
  • It does not enumerate every possible informer metric; more can be added as needed.

Proposal

  • Introduce the informer metrics struct informerMetrics, which contains the queue/eventHandler metrics
  • Introduce the informer metrics provider interface informerMetricsProvider, implemented in k8s.io/component-base/metrics
  • Restore the previously deleted reflectorMetrics
  • Add a feature gate InformerMetrics to enable informer/reflector metrics

User Stories (Optional)

Story 1

The client-go informer creates a RingGrowing buffer, pendingNotifications, for every eventHandler. This RingGrowing buffer grows but never shrinks. Because an informer can have several eventHandlers, it is hard to tell which pendingNotifications buffer is holding a large number of objects. The pendingNotifications metric will help developers identify the slow eventHandler.

Story 2

Users want to know how often the reflector performs a LIST.

Story 3

It is hard to know how many items are in the informer queue/store. Metrics for the queue/store will help developers find the number of pending deltas.

Notes/Constraints/Caveats (Optional)

N/A

Risks and Mitigations

Informer metrics are disabled by default. When they are enabled, the newly added metrics increase CPU/memory usage.

If the metrics cause a memory leak, users can disable the informer metrics.

Design Details

Add a feature gate InformerMetrics in client-go. It is disabled by default while in the Alpha state.

Informer metrics

Introduce the informer metrics structs informerMetrics and eventHandlerMetrics. They are similar to the existing workqueue metrics.

type informerMetrics struct {
	clock clock.Clock

	// total number of items in the store
	numberOfStoredItem GaugeMetric
	// total number of items in the queue
	numberOfQueuedItem GaugeMetric

	// per-eventHandler metrics
	eventHandlerMetrics map[string]eventHandlerMetrics
}

type eventHandlerMetrics struct {
	// number of pending notifications
	numberOfPendingNotifications GaugeMetric

	// size of the RingGrowing buffer
	sizeOfRingGrowing GaugeMetric

	// how long it takes to process an item from the informer reflector
	processDuration HistogramMetric
}

// MetricsProvider generates the various metrics used by the informer.
type MetricsProvider interface {
	// name is the informer name
	NewStoredItemMetric(name string) GaugeMetric
	NewQueuedItemMetric(name string) GaugeMetric

	// name is the eventHandler name
	NewPendingNotificationsMetric(name string) GaugeMetric
	NewRingGrowingMetric(name string) GaugeMetric
	NewProcessDurationMetric(name string) HistogramMetric
}
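
As a rough illustration of how a provider and the feature gate could fit together, the following standard-library-only sketch wires two informer-level gauges through a provider and falls back to no-op gauges when metrics are disabled. It is not the proposed implementation: memGauge, noopGauge, gaugeProvider, memProvider, informerGauges, and newInformerGauges are hypothetical names, the provider interface is reduced to two methods, and the enabled flag stands in for the InformerMetrics feature gate check.

```go
package main

import (
	"fmt"
	"sync"
)

// GaugeMetric is the minimal gauge interface assumed by this sketch.
type GaugeMetric interface {
	Set(float64)
}

// memGauge is a hypothetical in-memory gauge used only for illustration.
type memGauge struct {
	mu  sync.Mutex
	val float64
}

func (g *memGauge) Set(v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.val = v
}

func (g *memGauge) Value() float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.val
}

// noopGauge discards every update; it is used when metrics are disabled.
type noopGauge struct{}

func (noopGauge) Set(float64) {}

// gaugeProvider is a reduced, hypothetical version of the provider interface.
type gaugeProvider interface {
	NewStoredItemMetric(name string) GaugeMetric
	NewQueuedItemMetric(name string) GaugeMetric
}

// memProvider hands out in-memory gauges keyed by "<metric>/<informer name>".
type memProvider struct {
	mu     sync.Mutex
	gauges map[string]*memGauge
}

func newMemProvider() *memProvider {
	return &memProvider{gauges: map[string]*memGauge{}}
}

// gauge returns the gauge for key, creating it on first use.
func (p *memProvider) gauge(key string) *memGauge {
	p.mu.Lock()
	defer p.mu.Unlock()
	g, ok := p.gauges[key]
	if !ok {
		g = &memGauge{}
		p.gauges[key] = g
	}
	return g
}

func (p *memProvider) NewStoredItemMetric(name string) GaugeMetric { return p.gauge("store/" + name) }
func (p *memProvider) NewQueuedItemMetric(name string) GaugeMetric { return p.gauge("queue/" + name) }

// informerGauges holds the two informer-level gauges of this sketch.
type informerGauges struct {
	stored GaugeMetric
	queued GaugeMetric
}

// newInformerGauges returns no-op gauges when enabled is false (standing in
// for a disabled feature gate) or when no provider is given.
func newInformerGauges(name string, p gaugeProvider, enabled bool) informerGauges {
	if !enabled || p == nil {
		return informerGauges{stored: noopGauge{}, queued: noopGauge{}}
	}
	return informerGauges{
		stored: p.NewStoredItemMetric(name),
		queued: p.NewQueuedItemMetric(name),
	}
}

func main() {
	p := newMemProvider()
	m := newInformerGauges("pods", p, true)
	m.stored.Set(42)
	m.queued.Set(3)
	fmt.Println(p.gauge("store/pods").Value(), p.gauge("queue/pods").Value())
}
```

With this shape, informer code can update its gauges unconditionally, and the disabled case costs only a call to an empty method.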

Add the following Prometheus metrics under the informer subsystem:

| name | labels | description |
|---|---|---|
| store_item_total | informer name | Total number of items in the store |
| queued_item_total | informer name | Total number of items in the queue |
| pending_notifications_total | eventHandler name | Total number of pending notifications in the eventHandler RingGrowing |
| ring_growing_capacity | eventHandler name | Capacity of the eventHandler RingGrowing |
| event_process_duration | eventHandler name | How long, in seconds, the eventHandler takes to process an item from the RingGrowing |
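
To show why pending_notifications_total and ring_growing_capacity are worth tracking separately, here is a simplified sketch of a ring buffer that grows but never shrinks. The ring type and its methods are hypothetical stand-ins written for this example, not the client-go RingGrowing code; the sketch only assumes the grow-without-shrink behavior described in Story 1.

```go
package main

import "fmt"

// ring is a simplified, hypothetical growing ring buffer.
type ring struct {
	data     []int
	beg      int // index of the oldest buffered element
	readable int // number of buffered elements
}

// newRing creates a ring; initialSize must be greater than zero.
func newRing(initialSize int) *ring {
	return &ring{data: make([]int, initialSize)}
}

// write appends v, doubling the capacity when the buffer is full.
func (r *ring) write(v int) {
	if r.readable == len(r.data) {
		grown := make([]int, 2*len(r.data))
		for i := 0; i < r.readable; i++ {
			grown[i] = r.data[(r.beg+i)%len(r.data)]
		}
		r.data, r.beg = grown, 0
	}
	r.data[(r.beg+r.readable)%len(r.data)] = v
	r.readable++
}

// read removes and returns the oldest element; the capacity never shrinks.
func (r *ring) read() (int, bool) {
	if r.readable == 0 {
		return 0, false
	}
	v := r.data[r.beg]
	r.beg = (r.beg + 1) % len(r.data)
	r.readable--
	return v, true
}

// pending is the value a pending-notifications gauge would report.
func (r *ring) pending() int { return r.readable }

// capacity is the value a ring-capacity gauge would report.
func (r *ring) capacity() int { return len(r.data) }

func main() {
	r := newRing(2)
	for i := 0; i < 100; i++ {
		r.write(i) // a burst of notifications while the handler is slow
	}
	fmt.Println("after burst:", r.pending(), r.capacity())

	for {
		if _, ok := r.read(); !ok {
			break
		}
	}
	fmt.Println("after drain:", r.pending(), r.capacity())
}
```

After the burst is drained, the pending count returns to zero while the capacity stays at its high-water mark, so the two gauges together distinguish a handler that is currently slow from one that was slow in the past.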

Reflector metrics

The change in https://github.com/kubernetes/kubernetes/pull/74636, which removed the reflector metrics, will be reverted.

Each reflector's metrics consist of 3 counters, 4 histograms (replacing the summaries used previously), and 1 gauge.

type reflectorMetrics struct {
	numberOfLists       CounterMetric
	listDuration        HistogramMetric
	numberOfItemsInList HistogramMetric

	numberOfWatches      CounterMetric
	numberOfShortWatches CounterMetric
	watchDuration        HistogramMetric
	numberOfItemsInWatch HistogramMetric

	lastResourceVersion GaugeMetric
}

According to kubernetes/kubernetes#73587, the memory leak was caused by the summary metrics. It is better to use histograms instead: histogram metrics are aggregatable, and they reduce memory usage.
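
The aggregation and memory argument can be sketched with a fixed-bucket histogram. The histogram type below is a hypothetical, standard-library-only illustration, not the Prometheus or component-base implementation: its memory footprint is fixed by the number of buckets regardless of how many values are observed, and two histograms with identical bucket bounds can be combined by adding their counts.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a hypothetical fixed-bucket histogram for illustration.
type histogram struct {
	bounds []float64 // sorted upper bounds; an implicit +Inf bucket follows
	counts []uint64  // per-bucket (non-cumulative) counts, len(bounds)+1
	sum    float64
	total  uint64
}

func newHistogram(bounds ...float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records v in the first bucket whose upper bound is >= v.
func (h *histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.sum += v
	h.total++
}

// merge adds other's counts into h; both must use the same bucket bounds.
func (h *histogram) merge(other *histogram) error {
	if len(h.bounds) != len(other.bounds) {
		return fmt.Errorf("bucket count mismatch: %d vs %d", len(h.bounds), len(other.bounds))
	}
	for i := range h.bounds {
		if h.bounds[i] != other.bounds[i] {
			return fmt.Errorf("bucket bound mismatch at index %d", i)
		}
	}
	for i := range h.counts {
		h.counts[i] += other.counts[i]
	}
	h.sum += other.sum
	h.total += other.total
	return nil
}

func main() {
	// Two reflectors record LIST durations (seconds) with the same buckets.
	a := newHistogram(0.1, 1, 10)
	b := newHistogram(0.1, 1, 10)
	a.Observe(0.05)
	a.Observe(0.5)
	b.Observe(0.5)
	b.Observe(20)

	if err := a.merge(b); err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	fmt.Println(a.counts, a.total)
}
```

The bucket bounds used here are arbitrary; choosing suitable bounds for list and watch durations is left to the implementation.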

Remove Metrics

When informers and reflectors are stopped, their associated metrics will be removed.

The Kubernetes component-base metrics library supports deleting metrics by matching labels.
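
The cleanup step can be pictured as deleting every series whose labels include a given subset. The sketch below uses a hypothetical in-memory gaugeVec with a deletePartialMatch helper; the type, the helper, and the label keys "name" and "handler" are assumptions made for illustration and are not the component-base API. It is also not safe for concurrent use.

```go
package main

import "fmt"

// labels is a hypothetical label set, e.g. {"name": "pods", "handler": "h1"}.
type labels map[string]string

type series struct {
	labels labels
	value  float64
}

// gaugeVec is a minimal in-memory stand-in for a labeled gauge family.
type gaugeVec struct {
	series []series
}

// contains reports whether every key/value pair in want is present in have.
func contains(have, want labels) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// set creates or updates the series identified by exactly the labels l.
func (g *gaugeVec) set(l labels, v float64) {
	for i := range g.series {
		if len(g.series[i].labels) == len(l) && contains(g.series[i].labels, l) {
			g.series[i].value = v
			return
		}
	}
	cp := make(labels, len(l))
	for k, val := range l {
		cp[k] = val
	}
	g.series = append(g.series, series{labels: cp, value: v})
}

// deletePartialMatch removes every series whose labels include all pairs in
// match and returns how many series were removed.
func (g *gaugeVec) deletePartialMatch(match labels) int {
	kept := g.series[:0]
	removed := 0
	for _, s := range g.series {
		if contains(s.labels, match) {
			removed++
			continue
		}
		kept = append(kept, s)
	}
	g.series = kept
	return removed
}

func main() {
	pending := &gaugeVec{}
	pending.set(labels{"name": "pods", "handler": "h1"}, 10)
	pending.set(labels{"name": "pods", "handler": "h2"}, 0)
	pending.set(labels{"name": "nodes", "handler": "h1"}, 2)

	// The "pods" informer stops: drop all of its eventHandler series.
	removed := pending.deletePartialMatch(labels{"name": "pods"})
	fmt.Println(removed, len(pending.series))
}
```

Removing series when an informer stops is what keeps the number of exported series bounded in processes that create and discard informers repeatedly.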

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • Unit tests to ensure that the metrics output meets expectations.

  • Unit tests to ensure that the metrics deletion is functioning properly.

Integration tests

We will add extensive integration tests for the informer metrics code in the test/integration/metrics package.

  • When the InformerMetrics feature gate is enabled, ensure that the metrics are exposed and that their subsystem, labels, and granularity are correct.
  • When informers and reflectors are stopped, ensure that their associated metrics are removed.
e2e tests

Graduation Criteria

Alpha

  • Feature implemented behind a feature gate flag
  • Add related integration and unit tests to ensure functionality and make sure there is no memory leak in existing behavior

Beta

  • Gather feedback from developers and surveys
  • Work on feedback and add additional tests as needed

GA

  • Decision on GA will be made based on beta feedback

Upgrade / Downgrade Strategy

N/A

Version Skew Strategy

N/A

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: InformerMetrics
    • Components depending on the feature gate:
      • components via client-go library
Does enabling the feature change any default behavior?

No. It does not change any default behavior. When this feature is enabled, it will increase memory usage in client-go.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by disabling the InformerMetrics feature gate for components that use the client-go library. Informers will then no longer expose these metrics.

What happens if we reenable the feature if it was previously rolled back?

The expected behavior of the feature will be restored.

Are there any tests for feature enablement/disablement?

For now, there are no tests for feature enablement/disablement. Unit and integration tests will be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

Feature has no impact on rollout/rollback, and no impact on running workloads.

What specific metrics should inform a rollback?

Memory used by these metrics that keeps growing and reaches a significant amount.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not yet. This can be tested during the alpha release.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

This feature does not deprecate or remove any features/APIs/fields/flags/etc.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
  • Informer / reflector metrics (e.g., lists_total, watches_total) exposed by the component are populated
How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details:
      • The following metrics are available when InformerMetrics is enabled:
        • lists_total
        • watches_total
        • last_resource_version
        • etc.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Enabling the feature gate will increase memory usage, but memory usage should not grow continuously: the memory consumed by informerMetrics / eventHandlerMetrics / reflectorMetrics should remain stable.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: Memory usage
    • [Optional] Aggregation method:
    • Components exposing the metric: Operating System/golang pprof
Are there any missing metrics that would be useful to have to improve observability of this feature?

Not at the moment.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?

Yes. The informer metrics will increase CPU/RAM usage.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Yes, but only CPU/RAM. When informer metrics are enabled, the kubelet's CPU/RAM usage increases; other node resources (PIDs, sockets, inodes) are not affected.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • 2023-11-29: Initial draft KEP

Drawbacks

N/A

Alternatives

N/A

Infrastructure Needed (Optional)