-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometneus_client error in telegraf 0.2.3 #405
Comments
This actually appears to be a different problem. The previous panic was because of a NaN value trying to get sent to InfluxDB. This panic is coming from the prometheus client code here: https://github.com/prometheus/client_golang/blob/master/prometheus/vec.go#L143 It is getting an error on this function call: https://godoc.org/github.com/prometheus/client_golang/prometheus#MetricVec.GetMetricWith, which states:
This doesn't really mean anything to me as I'm not familiar with Prometheus, does it mean anything to you? Do you know why this function would be returning an error? |
@oldmantaiter @brian-brazil Any ideas on the above panic and error? |
This means that some metric has say two labels sometimes and three at other times, which is not allowed (it's an anti-pattern). I'd guess that there's a clash on some metric name. |
I ran into this while writing/testing with the aerospike plugin (which is why there was a "namespace" called "_service"). @logicall could you post a copy of your config (removing any sensitive information)? |
@oldmantaiter Could you explain the TBH, panic'ing seems like a very extreme thing for the prometheus client library to do in this situation. Most golang libraries try to avoid panic'ing at all costs and prefer to return an error. |
@oldmantaiter
|
I think I at least have a solution for the panic, we can use There will still be an error and the metric will have to be dropped, but at least it won't crash the entire telegraf process. |
Deals with part of #405, although it doesn't deal with the underlying label error that is occuring.
Deals with part of #405, although it doesn't deal with the underlying label error that is occuring.
Any steps to reproduce? I haven't been able to on darwin or linux (ubuntu) |
@sparrc There are measurements that are for both the global service level and the namespace level statistics for an aerospike server. For example, the |
Hello!
|
That's exactly what the Prometheus client is trying to avoid. Dealing with multiple levels of aggregation mixed into a single metric is quite tricky (e.g. a naive sum will give twice the real answer) and thus we strongly discourage this in Prometheus. |
@logicall can you run in debug mode with only the io plugin enabled and attach all of the output? I'm still not sure I understand how you are getting measurements with different labels. |
@logicall nevermind I think I understand the problem now. InfluxDB doesn't care if two measurements have the same name, it can separate measurements based on different tags, ie, these measurements are OK in InfluxDB:
But from what I am understanding, this is not OK for prometheus? @brian-brazil @oldmantaiter is that correct? |
@brian-brazil InfluxDB is designed to handle filtering on multiple tags and fields within the same metric. Currently it is limited to 255 fields within a single metric but with the new storage engine this limit will be removed. In fact Telegraf is soon going to move to a model where we put most plugin metrics into a single measurement. |
These measurements are also fine with Prometheus, as it has a very similar data model. What's technically okay but strongly discouraged (and the Prometheus client libraries will try and stop you from doing as you've discovered) is:
This is hard to deal with for users, as a
In Prometheus we'd generally advise only adding a aggregation like this on the client side for performance reasons, and otherwise just exposing the more granular version of the data. I presume the same would apply to InfluxDB and suggest that ye should copy this best practice. |
@brian-brazil I see what you mean, looks like the |
@logicall Can you run |
|
@logicall Seems like that should be working, could you provide the full debug output leading up to the error messages? |
|
@logicall actually I need to see all of the output please! (the part you replaced with |
|
thank you @logicall! I believe you can workaround this temporarily by setting
I'll have a fix up for it soon |
@sparrc |
No problem, thank you for following up and providing debug output, it helps a lot! 👍 |
@sparrc My apologies for not checking all possibilities from plugins when adding prometheus_client. I should have done some more in depth testing to prevent extra work. Thanks for fixing it up! |
@oldmantaiter no problem 😄, it's actually a bit of a corner case where there is some sort of "disk" showing up in /proc/diskstats that doesn't have a readable serial number |
Hello!
I've updated to version 0.2.3 Telegraph, but the problem has remained.
Telegraph falls as before
The text was updated successfully, but these errors were encountered: