panic: duplicate metrics collector registration attempted #1449
Comments
Another note: after fixing the TLS path in my case, the metrics generator wasn't able to send metrics properly. I had to delete the WAL files on all my pods to recover it. Maybe calling …
Nice find. I agree with your thoughts about exiting the process on failure to load the TLS file. We will also look into the issue with WAL corruption.
Okay, I think we are not cleaning up resources properly after running into an error. The remote write config is loaded after the WAL has been created: tempo/modules/generator/storage/instance.go, lines 66 to 69 at f883e84.
If this errors, we just exit the function but do not destroy the newly created WAL. When a second batch is pushed, we try to create the WAL again and this causes the duplicate registration. I'll check whether we can validate the remote write config in advance. If not, we should make sure we clean up the new resources before returning the error.

About the corrupted WAL: creating the WAL multiple times should be fine, since the next invocations will reuse the same directory. I'm thinking the panic might have disrupted some async process, leaving the WAL in an invalid state.
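For illustration, here is a minimal sketch of the cleanup-on-error idea described above. The names (`createWAL`, `loadRemoteWriteConfig`, `newInstance`) are hypothetical stand-ins rather than Tempo's actual API; the point is the ordering: if the remote write config fails to load after the WAL was created, tear the WAL down before returning so a retry starts from a clean state instead of re-registering the WAL's metrics.

```go
package storage

import (
	"fmt"
	"os"
)

// wal is a hypothetical stand-in for the real WAL type; this sketch is about
// the cleanup ordering, not the exact API.
type wal struct{ dir string }

func createWAL(dir string) (*wal, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	return &wal{dir: dir}, nil
}

// destroy removes everything the WAL created on disk.
func (w *wal) destroy() error { return os.RemoveAll(w.dir) }

// loadRemoteWriteConfig is a hypothetical stand-in for building the remote
// write storage, which can fail e.g. on an unreadable TLS cert path.
func loadRemoteWriteConfig(tlsCertPath string) error {
	if _, err := os.Stat(tlsCertPath); err != nil {
		return fmt.Errorf("loading remote write config: %w", err)
	}
	return nil
}

// newInstance sketches the proposed fix: if the remote write config fails to
// load after the WAL was created, tear the WAL down before returning the
// error, so the next push starts from scratch instead of hitting the
// duplicate metrics collector registration panic.
func newInstance(walDir, tlsCertPath string) (*wal, error) {
	w, err := createWAL(walDir)
	if err != nil {
		return nil, err
	}

	if err := loadRemoteWriteConfig(tlsCertPath); err != nil {
		_ = w.destroy() // clean up the half-initialized WAL
		return nil, err
	}

	return w, nil
}
```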
So this is an issue upstream in Prometheus: if the first attempt to create the remote write structure fails, not all resources are cleaned up correctly and the second attempt will panic when registering metrics. We can't fix this from Tempo's code.
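As a standalone illustration of that upstream failure mode (not Tempo's code), the panic message comes from the client_golang registry: registering two collectors that describe the same metric makes MustRegister panic with exactly this error.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	reg := prometheus.NewRegistry()

	newCounter := func() prometheus.Counter {
		return prometheus.NewCounter(prometheus.CounterOpts{
			Name: "example_total",
			Help: "An example counter.",
		})
	}

	// The first registration succeeds.
	reg.MustRegister(newCounter())

	// Registering a second collector that describes the same metric panics:
	//   panic: duplicate metrics collector registration attempted
	reg.MustRegister(newCounter())
}
```

Creating the WAL (and its metrics) a second time against the same registry is the equivalent situation in the crash above.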
@kvrhdn, in the event that we can't create the WAL or whichever remote write structure is having issues, should we just exit cleanly instead of trying again? Could this impact us during normal operations, or only on startup?
We create these structures when we first receive data for that tenant. Since we don't know in advance which tenants are active, we can't create them at startup. We could deliberately exit the process when creating the remote write structures fails, but I don't know if we can do that in a clean way. Btw, an error in the remote write config will most likely impact every tenant, so in practice the first tenant that sends data will trigger this error.
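A hedged sketch of the "deliberately exit" option under discussion, with placeholder function names and no claim about how Tempo would actually wire it up: instead of leaving half-initialized storage around and retrying on the next push, the process logs the error and exits so the misconfiguration surfaces immediately.

```go
package main

import (
	"errors"
	"log"
	"os"
)

// createTenantStorage is a placeholder for creating the WAL and remote write
// structures when the first data for a tenant arrives.
func createTenantStorage(tenant string) error {
	return errors.New("remote write: unable to read TLS certificate")
}

func main() {
	if err := createTenantStorage("tenant-1"); err != nil {
		// Fail fast: exiting makes the bad remote write config visible right
		// away rather than panicking on the next push for the same tenant.
		log.Printf("creating metrics-generator storage failed: %v", err)
		os.Exit(1)
	}
}
```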
I have configured the metrics generator to write to Prometheus remote-write storage. However, no metrics are received by Prometheus, although some data is written to the Prometheus WAL directory. As in the panic log above, I am seeing a message similar to the one below when starting the metrics generator. Could the issue that produces this message cause the metrics generator not to send metrics to Prometheus?

level=info ts=2022-05-25T19:24:59.700977765Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=ivapp14... ring=metrics-generator
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Describe the bug
When I configured the metrics generator in Grafana Tempo, I had a misconfigured TLS path. The metrics generator pod boots up but crashes immediately. The full crash log shows "creating WAL" multiple times, which leads to this crash. I think it would be helpful if we could exit the program on any misconfigured remote write endpoint. Feel free to suggest other error handling approaches.
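For context, the configuration involved looks roughly like the following sketch; the URL and file paths are placeholders, and the remote_write entries follow the Prometheus remote write schema. A cert_file or key_file pointing at a path that doesn't exist in the pod is the kind of misconfiguration that triggered the crash.

```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: https://prometheus.example.com/api/v1/write
        tls_config:
          # If these paths do not exist in the pod, loading the remote write
          # config fails after the WAL has already been created.
          cert_file: /etc/tempo/tls/tls.crt
          key_file: /etc/tempo/tls/tls.key
```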
To Reproduce
Steps to reproduce the behavior:
panic: duplicate metrics collector registration attempted
Expected behavior
Environment:
Additional Context
Sample configuration
Full Panic log