Queue metrics broken when using the controller builder #436

Closed
mrIncompetent opened this issue May 18, 2019 · 6 comments · Fixed by #437

Comments

@mrIncompetent

When starting a controller the following messages get logged:

E0518 20:13:50.847341   12082 client_go_adapter.go:318] descriptor Desc{fqName: "node-application_depth", help: "Current depth of workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_depth" is not a valid metric name
E0518 20:13:50.847688   12082 client_go_adapter.go:328] descriptor Desc{fqName: "node-application_adds", help: "Total number of adds handled by workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_adds" is not a valid metric name
E0518 20:13:50.848958   12082 client_go_adapter.go:339] descriptor Desc{fqName: "node-application_queue_latency", help: "How long an item stays in workqueuenode-application before being requested.", constLabels: {}, variableLabels: []} is invalid: "node-application_queue_latency" is not a valid metric name
E0518 20:13:50.849992   12082 client_go_adapter.go:350] descriptor Desc{fqName: "node-application_work_duration", help: "How long processing an item from workqueuenode-application takes.", constLabels: {}, variableLabels: []} is invalid: "node-application_work_duration" is not a valid metric name
E0518 20:13:50.851079   12082 client_go_adapter.go:363] descriptor Desc{fqName: "node-application_unfinished_work_seconds", help: "How many seconds of work node-application has done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.", constLabels: {}, variableLabels: []} is invalid: "node-application_unfinished_work_seconds" is not a valid metric name
E0518 20:13:50.852133   12082 client_go_adapter.go:374] descriptor Desc{fqName: "node-application_longest_running_processor_microseconds", help: "How many microseconds has the longest running processor for node-application been running.", constLabels: {}, variableLabels: []} is invalid: "node-application_longest_running_processor_microseconds" is not a valid metric name
E0518 20:13:50.853298   12082 client_go_adapter.go:384] descriptor Desc{fqName: "node-application_retries", help: "Total number of retries handled by workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_retries" is not a valid metric name

controller-runtime version used:

[[constraint]]
  name = "sigs.k8s.io/controller-runtime"
  version = "v0.2.0-beta.1"

Code:

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal("Unable to start manager", zap.Error(err))
	}

	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Owns(&corev1.ConfigMap{}).
		Complete(&controller.Reconciler{
			Client: mgr.GetClient(),
			Log:    log.Named("some_controller"),
		})
	if err != nil {
		log.Fatal("Unable to add some_controller", zap.Error(err))
	}

	log.Info("starting manager")
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal("problem running manager", zap.Error(err))
	}

The problematic code: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/builder/build.go#L234

func (blder *Builder) getControllerName() (string, error) {
	gvk, err := getGvk(blder.apiType, blder.mgr.GetScheme())
	if err != nil {
		return "", err
	}
	name := fmt.Sprintf("%s-application", strings.ToLower(gvk.Kind))
	return name, nil
}

This leads to invalid metric names like node-application_depth.
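
For reference, the registration error can be reproduced with client_golang alone (a minimal sketch outside controller-runtime, using the metric name from the log output above):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// The dash in the queue name ends up in the fully-qualified metric name,
	// which the Prometheus client rejects at registration time.
	depth := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "node-application_depth",
		Help: "Current depth of workqueue: node-application",
	})
	if err := prometheus.Register(depth); err != nil {
		fmt.Println(err) // ... "node-application_depth" is not a valid metric name
	}
}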

Potential solutions IMHO:

  • Remove the -application suffix
  • Add a Named(name string) function to the builder so controllers can have more meaningful names (a rough sketch follows below)

WDYT?
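
A rough sketch of what such a Named option could look like, based on the getControllerName code quoted above (the name field and the fallback are hypothetical, not existing builder API):

// Hypothetical: let callers override the generated controller name.
func (blder *Builder) Named(name string) *Builder {
	blder.name = name
	return blder
}

func (blder *Builder) getControllerName() (string, error) {
	if blder.name != "" {
		return blder.name, nil
	}
	gvk, err := getGvk(blder.apiType, blder.mgr.GetScheme())
	if err != nil {
		return "", err
	}
	// Fall back to the lowercased kind without the "-application" suffix,
	// which keeps the name usable as a Prometheus metric name prefix.
	return strings.ToLower(gvk.Kind), nil
}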

@mrIncompetent
Author

cc @DirectXMan12

@abursavich
Contributor

abursavich commented May 19, 2019

I ran into this issue too. For added clarity, the immediate problem is the dash.

This is the regex for valid metric names: ^[a-zA-Z_:][a-zA-Z0-9_:]*$
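
A quick way to check a candidate name against that pattern (sketch; the Prometheus client performs an equivalent check internally when a descriptor is created):

package main

import (
	"fmt"
	"regexp"
)

var validMetricName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

func main() {
	fmt.Println(validMetricName.MatchString("node-application_depth")) // false: '-' is not allowed
	fmt.Println(validMetricName.MatchString("node_application_depth")) // true
}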

@adohe-zz

adohe-zz commented May 19, 2019

The controller name is used in tracing, logging, eventing, and monitoring. As @abursavich mentioned, when the controller name is used as a metric name (monitoring) it has to match the Prometheus metric name regex, but the other uses have no such restriction. Just as KCM does, the queue metrics could use only the API kind as the name, and the -application suffix could be used in the other scenarios.

@abursavich
Contributor

Having read more of the code, it looks like all of the workqueue metrics were "deprecated" and replaced with new metrics; however, only a few of them have material changes:

  • Latency changed from a Summary in microseconds to a Histogram in seconds.
  • WorkDuration changed from a Summary in microseconds to a Histogram in seconds.
  • LongestRunningProcessor changed from microseconds to seconds.

AFAICT, the other metrics didn't change even though they were "deprecated".

The reason for deprecating metrics and not immediately replacing them is generally to provide continuity for dashboards, alerting, etc. The changed metrics in controller-runtime don't provide this continuity, since the "deprecated" metrics are given new names (which are invalid).

My proposal:

  1. Move the old metrics for Latency, WorkDuration, and LongestRunningProcessor to the Deprecated methods of the interface without changing their names, labels, etc.
  2. Add new metrics for Latency, WorkDuration, and LongestRunningProcessor that follow Prometheus naming conventions, include the seconds unit in their names, and have the queue name as a label (sketched below).
  3. Keep the old metrics for everything else and add no-op implementations for their Deprecated methods of the interface. However, I haven't verified the naming conventions on all of these metrics. If there are issues, now would be the time to rename them similarly to points 1 and 2.

I can put together a PR for this.
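
As a concrete illustration of point 2, a queue-labeled latency histogram in seconds could look roughly like this (metric name, help text, and buckets are illustrative, not the final implementation):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: base unit (seconds) in the name, queue name as a label.
var queueLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "workqueue",
	Name:      "queue_duration_seconds",
	Help:      "How long in seconds an item stays in workqueue before being requested.",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
}, []string{"name"})

// Recording for a given queue:
//   queueLatency.WithLabelValues("node").Observe(time.Since(enqueued).Seconds())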

@abursavich
Contributor

Well... It looks like the initial workqueue metrics in controller-runtime used kubernetes/client-go v1.13, but they were half implemented in terms of v1.14:

  • It used Histograms instead of Summaries for Latency and WorkDuration and put seconds in the names, even though the values were still being recorded in microseconds (illustrated below).
  • LongestRunningProcessor can be deprecated successfully as it was named with microseconds, but Latency and WorkDuration will have to keep the same name and change from microseconds to seconds.
  • All other deprecated metrics can be noops without losing anything.
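
To make the unit mismatch in the first bullet concrete: a metric whose name ends in _seconds has to observe seconds, not microseconds. A minimal sketch, not the actual adapter code:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var workDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "workqueue_work_duration_seconds",
	Help: "How long in seconds processing an item from workqueue takes.",
})

func main() {
	prometheus.MustRegister(workDuration)
	start := time.Now()
	// ... process an item ...

	// Wrong for a *_seconds metric: this would observe microseconds.
	// workDuration.Observe(float64(time.Since(start).Microseconds()))

	// Right: observe the elapsed time in seconds.
	workDuration.Observe(time.Since(start).Seconds())
}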

@DirectXMan12
Contributor

I fixed the name bits in #458 so that we don't have dashes in the name. As for the other stuff, yeah, a PR would be awesome (will take a look at the linked PR -- going through a large notification backlog post kubecon)
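
For completeness, a name sanitizer along these lines would also avoid the problem (illustrative only; #458 has the actual change):

package builder

import "strings"

// metricsSafeName replaces dashes, which Prometheus rejects in metric names,
// with underscores. Illustrative sketch; see #458 for the actual fix.
func metricsSafeName(name string) string {
	return strings.ReplaceAll(name, "-", "_")
}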

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020