panic: duplicate metrics collector registration attempted #1449
Comments
Another note: after fixing the TLS path in my case, the metrics generator wasn't able to send metrics properly. I had to delete the WAL files on all my pods to recover it. Maybe calling …
Nice find. I agree with your thoughts about exiting the process on failure to load the TLS file. We will also look into the issue with WAL corruption.
Okay, I think we are not cleaning up resources properly after running into an error. The remote write config is loaded after the WAL has been created: tempo/modules/generator/storage/instance.go, lines 66 to 69 at f883e84.
If this errors, we just exit the function but do not destroy the newly created WAL. When a second batch is pushed, we try to create the WAL again and this causes the duplicate registration. I'll check whether we can validate the remote write config in advance. If not, we should make sure we clean up the new resources before returning the error.

About the corrupted WAL: creating the WAL multiple times should be fine, since the next invocations will reuse the same directory. I'm thinking the panic might have disrupted some async process, leaving the WAL in an invalid state.
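For illustration, here is a minimal sketch of the cleanup-on-error idea described above. The names (`createWAL`, `loadRemoteWriteConfig`, `newInstance`) are hypothetical stand-ins rather than Tempo's actual API; the point is the ordering: if the remote write config fails to load after the WAL was created, tear the WAL down before returning so a retry starts from a clean state instead of re-registering the WAL's metrics.

```go
package storage

import (
	"fmt"
	"os"
)

// wal is a hypothetical stand-in for the real WAL type; this sketch is about
// the cleanup ordering, not the exact API.
type wal struct{ dir string }

func createWAL(dir string) (*wal, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	return &wal{dir: dir}, nil
}

// destroy removes everything the WAL created on disk.
func (w *wal) destroy() error { return os.RemoveAll(w.dir) }

// loadRemoteWriteConfig is a hypothetical stand-in for building the remote
// write storage, which can fail e.g. on an unreadable TLS cert path.
func loadRemoteWriteConfig(tlsCertPath string) error {
	if _, err := os.Stat(tlsCertPath); err != nil {
		return fmt.Errorf("loading remote write config: %w", err)
	}
	return nil
}

// newInstance sketches the proposed fix: if the remote write config fails to
// load after the WAL was created, tear the WAL down before returning the
// error, so the next push starts from scratch instead of hitting the
// duplicate metrics collector registration panic.
func newInstance(walDir, tlsCertPath string) (*wal, error) {
	w, err := createWAL(walDir)
	if err != nil {
		return nil, err
	}

	if err := loadRemoteWriteConfig(tlsCertPath); err != nil {
		_ = w.destroy() // clean up the half-initialized WAL
		return nil, err
	}

	return w, nil
}
```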
So this is an issue upstream in Prometheus: if the first attempt to create the remote write structure fails, not all resources are cleaned up correctly and the second attempt will panic when registering metrics. We can't fix this from Tempo's code.
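As a standalone illustration of that upstream failure mode (not Tempo's code), the panic message comes from the client_golang registry: registering two collectors that describe the same metric makes MustRegister panic with exactly this error.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	newCounter := func() prometheus.Counter {
		return prometheus.NewCounter(prometheus.CounterOpts{
			Name: "example_total",
			Help: "An example counter.",
		})
	}

	// The first registration succeeds.
	reg.MustRegister(newCounter())

	// Registering a second collector that describes the same metric panics:
	//   panic: duplicate metrics collector registration attempted
	reg.MustRegister(newCounter())
}
```

Creating the WAL (and its metrics) a second time against the same registry is the equivalent situation in the crash above.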
@kvrhdn, in the event that we can't create the WAL or whichever remote write structure is having issues, should we just exit cleanly instead of trying again? Could this impact us during normal operations, or only on startup?
We create these structures when we first receive data for that tenant. Since we don't know in advance which tenants are active, we can't create them at startup. We could deliberately exit the process when creating the remote write structures fails, but I don't know if we can do that in a clean way. Btw, an error in the remote write config will most likely impact every tenant, so in practice the first tenant that sends data will trigger this error.
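A hedged sketch of the "deliberately exit" option under discussion, with placeholder function names and no claim about how Tempo would actually wire it up: instead of leaving half-initialized storage around and retrying on the next push, the process logs the error and exits so the misconfiguration surfaces immediately.

```go
package main

import (
	"errors"
	"log"
	"os"
)

// createTenantStorage is a placeholder for creating the WAL and remote write
// structures when the first data for a tenant arrives.
func createTenantStorage(tenant string) error {
	return errors.New("remote write: unable to read TLS certificate")
}

func main() {
	if err := createTenantStorage("tenant-1"); err != nil {
		// Fail fast: exiting makes the bad remote write config visible right
		// away rather than panicking on the next push for the same tenant.
		log.Printf("creating metrics-generator storage failed: %v", err)
		os.Exit(1)
	}
}
```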
I have configured the metrics generator to write to Prometheus remote-write storage. However, no metrics are received by Prometheus, although some data is written to the Prometheus WAL directory. As in the panic log above, I am seeing a message similar to the one below when starting the metrics generator. Could the issue that produces this message cause the metrics generator not to send metrics to Prometheus?

level=info ts=2022-05-25T19:24:59.700977765Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=ivapp14... ring=metrics-generator
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Describe the bug
When I configured the metrics generator in Grafana Tempo, I had a misconfigured TLS path. The metrics generator pod boots up but crashes immediately. The full crash log shows "creating WAL" multiple times, which leads to this crash. I think it would be helpful if we could exit the program on any misconfigured remote write endpoint. Feel free to suggest other error handling approaches.
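For context, the configuration involved looks roughly like the following sketch; the URL and file paths are placeholders, and the remote_write entries follow the Prometheus remote write schema. A cert_file or key_file pointing at a path that doesn't exist in the pod is the kind of misconfiguration that triggered the crash.

```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: https://prometheus.example.com/api/v1/write
        tls_config:
          # If these paths do not exist in the pod, loading the remote write
          # config fails after the WAL has already been created.
          cert_file: /etc/tempo/tls/tls.crt
          key_file: /etc/tempo/tls/tls.key
```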
To Reproduce
Steps to reproduce the behavior:
panic: duplicate metrics collector registration attempted
Expected behavior
Environment:
Additional Context
Sample configuration
Full Panic log