
Metric workflow_task_execution_failed is used with different sets of labels causing errors with Prometheus #1450

Closed
wsny opened this issue Apr 25, 2024 · 4 comments · Fixed by #1658

Comments

@wsny

wsny commented Apr 25, 2024

Expected Behavior

Should not receive errors about metrics being registered incorrectly.

Actual Behavior

Receiving a Prometheus error about the workflow_task_execution_failed metric being registered with different labels.

a previously registered descriptor with the same fully-qualified name as Desc{fqName: "temporal_workflow_task_execution_failed_total", help: "temporal_workflow_task_execution_failed_total counter", constLabels: {}, variableLabels: {app,client_name,namespace,environment,workflow_type,failure_reason,task_queue,worker_type,deployment,activity_type,env}} has different label names or a different help string

Steps to Reproduce the Problem

I am still working on exact steps to reproduce (I don't yet have test code that triggers all of the call sites below), but I can see that this metric is emitted with different tag sets in different places.
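For context, Prometheus requires that a given metric name always be registered with the same set of label names; registering it again with a different set is rejected. A minimal stdlib sketch of that constraint (my own illustration of the rule, not client_golang code):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labelSets records, per metric name, the label-key signature first seen.
var labelSets = map[string]string{}

// register mimics the Prometheus rule: a metric name may only ever be
// used with one fixed set of label names.
func register(name string, labels map[string]string) error {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	sig := strings.Join(keys, ",")
	if prev, ok := labelSets[name]; ok && prev != sig {
		return fmt.Errorf("metric %q already registered with labels {%s}, got {%s}", name, prev, sig)
	}
	labelSets[name] = sig
	return nil
}

func main() {
	// First use: no extra labels, as in the handler calls below.
	fmt.Println(register("workflow_task_execution_failed", nil))
	// Second use: with failure_reason, as added by #1295 — rejected.
	fmt.Println(register("workflow_task_execution_failed",
		map[string]string{"failure_reason": "NonDeterminismError"}))
}
```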

#1295 added a tag here

metricsHandler.WithTags(metrics.WorkflowTaskFailedTags(failureReason)).Counter(metrics.WorkflowTaskExecutionFailureCounter).Inc(1)

But it is not being added in these uses of the metric.

weh.metricsHandler.Counter(metrics.WorkflowTaskExecutionFailureCounter).Inc(1)

weh.metricsHandler.Counter(metrics.WorkflowTaskExecutionFailureCounter).Inc(1)

This mismatch causes errors with Prometheus, as the following example demonstrates.

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/uber-go/tally/v4"
	tallyprom "github.com/uber-go/tally/v4/prometheus"

	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

func main() {
	metricsHandler := setupPrometheus()
	// These lines in any order will cause an error where the metrics have different labels
	metricsHandler.Counter("testing").Inc(1)
	metricsHandler.WithTags(map[string]string{"someName": "someValue"}).Counter("testing").Inc(1)
}

// Just setup stuff for Prometheus below here
var PrometheusSanitizeOptions = tally.SanitizeOptions{
	NameCharacters:       tally.ValidCharacters{Ranges: tally.AlphanumericRange, Characters: []rune{'_'}},
	KeyCharacters:        tally.ValidCharacters{Ranges: tally.AlphanumericRange, Characters: []rune{'_'}},
	ValueCharacters:      tally.ValidCharacters{Ranges: tally.AlphanumericRange, Characters: []rune{'_'}},
	ReplacementCharacter: tally.DefaultReplacementCharacter,
}

func setupPrometheus() client.MetricsHandler {
	cfg := tallyprom.Configuration{
		ListenAddress: "localhost:9090",
	}
	reporter, _ := cfg.NewReporter(tallyprom.ConfigurationOptions{
		Registry: prometheus.NewRegistry(),
		OnError: func(err error) {
			fmt.Printf("BOOM: %+v\n", err)
		},
	})
	scopeOpts := tally.ScopeOptions{CachedReporter: reporter, SanitizeOptions: &PrometheusSanitizeOptions}
	scope, _ := tally.NewRootScope(scopeOpts, time.Second)
	scope = sdktally.NewPrometheusNamingScope(scope)
	return sdktally.NewMetricsHandler(scope)
}

Running this program produces the error:

a previously registered descriptor with the same fully-qualified name as Desc{fqName: "testing_total", help: "testing_total counter", constLabels: {}, variableLabels: {someName}} has different label names or a different help string

Specifications

  • Version: sdk-go@v1.26.1
@cretz

cretz commented Apr 25, 2024

👍 Ideally we should only be recording this metric in one place, and it should always carry the same tag/label set. Thanks for reporting! This is on our backlog and we will see about fixing it.
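The "one place, same tag set" approach can be sketched as a single choke point that every call site goes through, with an explicit default when no failure reason applies (names here are illustrative, not SDK API):

```go
package main

import "fmt"

// recordWorkflowTaskFailure is an illustrative single emission point for
// the metric: every call site passes through here, so the label set is
// always {failure_reason}, with an explicit "unknown" default when no
// reason is available.
func recordWorkflowTaskFailure(inc func(name string, tags map[string]string), reason string) {
	if reason == "" {
		reason = "unknown"
	}
	inc("workflow_task_execution_failed", map[string]string{"failure_reason": reason})
}

func main() {
	emit := func(name string, tags map[string]string) {
		fmt.Printf("%s %v\n", name, tags)
	}
	recordWorkflowTaskFailure(emit, "NonDeterminismError")
	// Call sites with no reason still emit the same label set.
	recordWorkflowTaskFailure(emit, "")
}
```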

@AmirSoleimani

I noticed that the issue has been closed and addressed. I've been facing the same problem, and the merged changes fix it (I had to pull the master branch to test). However, I haven't seen a new release with the fix yet. Can you provide any update on when the fix might be included in a release? Thank you for your help! - cc @yuandrew

@yuandrew

Hey Amir, sorry for the delay. There are a few more changes we are hoping to get into the next release; I expect we can cut a new release in the next two weeks!

@yuandrew

v1.30.0 has just been released.
