Queue metrics broken when using the controller builder #436

Closed
mrIncompetent opened this issue May 18, 2019 · 6 comments · Fixed by #437

Comments

@mrIncompetent

When starting a controller the following messages get logged:

E0518 20:13:50.847341   12082 client_go_adapter.go:318] descriptor Desc{fqName: "node-application_depth", help: "Current depth of workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_depth" is not a valid metric name
E0518 20:13:50.847688   12082 client_go_adapter.go:328] descriptor Desc{fqName: "node-application_adds", help: "Total number of adds handled by workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_adds" is not a valid metric name
E0518 20:13:50.848958   12082 client_go_adapter.go:339] descriptor Desc{fqName: "node-application_queue_latency", help: "How long an item stays in workqueuenode-application before being requested.", constLabels: {}, variableLabels: []} is invalid: "node-application_queue_latency" is not a valid metric name
E0518 20:13:50.849992   12082 client_go_adapter.go:350] descriptor Desc{fqName: "node-application_work_duration", help: "How long processing an item from workqueuenode-application takes.", constLabels: {}, variableLabels: []} is invalid: "node-application_work_duration" is not a valid metric name
E0518 20:13:50.851079   12082 client_go_adapter.go:363] descriptor Desc{fqName: "node-application_unfinished_work_seconds", help: "How many seconds of work node-application has done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.", constLabels: {}, variableLabels: []} is invalid: "node-application_unfinished_work_seconds" is not a valid metric name
E0518 20:13:50.852133   12082 client_go_adapter.go:374] descriptor Desc{fqName: "node-application_longest_running_processor_microseconds", help: "How many microseconds has the longest running processor for node-application been running.", constLabels: {}, variableLabels: []} is invalid: "node-application_longest_running_processor_microseconds" is not a valid metric name
E0518 20:13:50.853298   12082 client_go_adapter.go:384] descriptor Desc{fqName: "node-application_retries", help: "Total number of retries handled by workqueue: node-application", constLabels: {}, variableLabels: []} is invalid: "node-application_retries" is not a valid metric name

controller-runtime version used:

[[constraint]]
  name = "sigs.k8s.io/controller-runtime"
  version = "v0.2.0-beta.1"

Code:

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal("Unable to start manager", zap.Error(err))
	}

	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Owns(&corev1.ConfigMap{}).
		Complete(&controller.Reconciler{
			Client: mgr.GetClient(),
			Log:    log.Named("some_controller"),
		})
	if err != nil {
		log.Fatal("Unable to add some_controller", zap.Error(err))
	}

	log.Info("starting manager")
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal("problem running manager", zap.Error(err))
	}

The problematic code: https://github.com/kubernetes-sigs/controller-runtime/blob/master/pkg/builder/build.go#L234

func (blder *Builder) getControllerName() (string, error) {
	gvk, err := getGvk(blder.apiType, blder.mgr.GetScheme())
	if err != nil {
		return "", err
	}
	name := fmt.Sprintf("%s-application", strings.ToLower(gvk.Kind))
	return name, nil
}

This leads to invalid metric names like node-application_depth.
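
For reference, the registration error can be reproduced with client_golang alone (a minimal sketch outside controller-runtime, using the metric name from the log output above):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// The dash in the queue name ends up in the fully-qualified metric name,
	// which the Prometheus client rejects at registration time.
	depth := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "node-application_depth",
		Help: "Current depth of workqueue: node-application",
	})
	if err := prometheus.Register(depth); err != nil {
		fmt.Println(err) // ... "node-application_depth" is not a valid metric name
	}
}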

Potential solutions IMHO:

  • Remove the -application suffix
  • Add a Named(name string) function to the builder so controllers can have more meaningful names (a rough sketch follows below)

WDYT?
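
A rough sketch of what such a Named option could look like, based on the getControllerName code quoted above (the name field and the fallback are hypothetical, not existing builder API):

// Hypothetical: let callers override the generated controller name.
func (blder *Builder) Named(name string) *Builder {
	blder.name = name
	return blder
}

func (blder *Builder) getControllerName() (string, error) {
	if blder.name != "" {
		return blder.name, nil
	}
	gvk, err := getGvk(blder.apiType, blder.mgr.GetScheme())
	if err != nil {
		return "", err
	}
	// Fall back to the lowercased kind without the "-application" suffix,
	// which keeps the name usable as a Prometheus metric name prefix.
	return strings.ToLower(gvk.Kind), nil
}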

@mrIncompetent
Author

cc @DirectXMan12

@abursavich
Contributor

abursavich commented May 19, 2019

I ran into this issue too. For added clarity, the immediate problem is the dash.

This is the regex for valid metric names: ^[a-zA-Z_:][a-zA-Z0-9_:]*$
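
A quick way to check a candidate name against that pattern (sketch; the Prometheus client performs an equivalent check internally when a descriptor is created):

package main

import (
	"fmt"
	"regexp"
)

var validMetricName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

func main() {
	fmt.Println(validMetricName.MatchString("node-application_depth")) // false: '-' is not allowed
	fmt.Println(validMetricName.MatchString("node_application_depth")) // true
}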

@adohe-zz

adohe-zz commented May 19, 2019

The controller name is used in tracing, logging, eventing, and monitoring. As @abursavich mentioned, when the controller name is used as a metric name (monitoring) it has to match the Prometheus metric name regex, but the other uses have no such restriction. Just as KCM does, the queue metrics could use only the API kind as the name, and the -application suffix could be used in the other scenarios.

@abursavich
Contributor

Having read more of the code, it looks like all of the workqueue metrics were "deprecated" and replaced with new metrics; however, only a few of them have material changes:

  • Latency changed from a Summary in microseconds to a Histogram in seconds.
  • WorkDuration changed from a Summary in microseconds to a Histogram in seconds.
  • LongestRunningProcessor changed from microseconds to seconds.

AFAICT, the other metrics didn't change even though they were "deprecated".

The reason for deprecating metrics and not immediately replacing them is generally to provide continuity for dashboards, alerting, etc. The changed metrics in controller-runtime don't provide this continuity, since the "deprecated" metrics are given new names (which are invalid).

My proposal:

  1. Move the old metrics for Latency, WorkDuration, and LongestRunningProcessor to the Deprecated methods of the interface without changing their names, labels, etc.
  2. Add new metrics for Latency, WorkDuration, and LongestRunningProcessor that follow Prometheus naming conventions, include the seconds unit in their names, and have the queue name as a label (sketched below).
  3. Keep the old metrics for everything else and add no-op implementations for their Deprecated methods of the interface. However, I haven't verified the naming conventions on all of these metrics. If there are issues, now would be the time to rename them similarly to points 1 and 2.

I can put together a PR for this.
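
As a concrete illustration of point 2, a queue-labeled latency histogram in seconds could look roughly like this (metric name, help text, and buckets are illustrative, not the final implementation):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative only: base unit (seconds) in the name, queue name as a label.
var queueLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "workqueue",
	Name:      "queue_duration_seconds",
	Help:      "How long in seconds an item stays in workqueue before being requested.",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
}, []string{"name"})

// Recording for a given queue:
//   queueLatency.WithLabelValues("node").Observe(time.Since(enqueued).Seconds())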

@abursavich
Contributor

Well... It looks like the initial workqueue metrics in controller-runtime used kubernetes/client-go v1.13, but they were half implemented in terms of v1.14:

  • It used Histograms instead of Summaries for Latency and WorkDuration and put seconds in the names, even though the values were still being recorded in microseconds (illustrated below).
  • LongestRunningProcessor can be deprecated successfully as it was named with microseconds, but Latency and WorkDuration will have to keep the same name and change from microseconds to seconds.
  • All other deprecated metrics can be noops without losing anything.
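
To make the unit mismatch in the first bullet concrete: a metric whose name ends in _seconds has to observe seconds, not microseconds. A minimal sketch, not the actual adapter code:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var workDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "workqueue_work_duration_seconds",
	Help: "How long in seconds processing an item from workqueue takes.",
})

func main() {
	prometheus.MustRegister(workDuration)
	start := time.Now()
	// ... process an item ...

	// Wrong for a *_seconds metric: this would observe microseconds.
	// workDuration.Observe(float64(time.Since(start).Microseconds()))

	// Right: observe the elapsed time in seconds.
	workDuration.Observe(time.Since(start).Seconds())
}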

@DirectXMan12
Contributor

I fixed the name bits in #458 so that we don't have dashes in the name. As for the other stuff, yeah, a PR would be awesome (will take a look at the linked PR -- going through a large notification backlog post kubecon)
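
For completeness, a name sanitizer along these lines would also avoid the problem (illustrative only; #458 has the actual change):

package builder

import "strings"

// metricsSafeName replaces dashes, which Prometheus rejects in metric names,
// with underscores. Illustrative sketch; see #458 for the actual fix.
func metricsSafeName(name string) string {
	return strings.ReplaceAll(name, "-", "_")
}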

DirectXMan12 pushed a commit that referenced this issue Jan 31, 2020