Conversation
The lastUpdate timestamp of items in the index is now updated every time a new point is seen, so we don't need to add an offset to the lastSeen param passed to metricIndex.Find().
see also #119
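To make the described behavior concrete, here is a minimal Go sketch; the IndexEntry type and the find function here are illustrative stand-ins, not metrictank's actual code:

```go
package main

import "fmt"

// IndexEntry is an illustrative stand-in for an item in the metric index.
type IndexEntry struct {
	Path       string
	LastUpdate int64 // unix timestamp, bumped every time a new point is seen
}

// find returns entries that may have data at or after `from`. Because
// LastUpdate is refreshed on every incoming point, it can be compared to
// `from` directly, with no extra offset to compensate for staleness.
func find(index []IndexEntry, from int64) []IndexEntry {
	var out []IndexEntry
	for _, e := range index {
		if e.LastUpdate >= from {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	idx := []IndexEntry{
		{Path: "servers.a.cpu", LastUpdate: 1000},
		{Path: "servers.b.cpu", LastUpdate: 500},
	}
	// only series updated at or after from=800 are returned
	fmt.Println(find(idx, 800)) // [{servers.a.cpu 1000}]
}
```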
I am unsure if this is the right approach. The primary reason for filtering the search results is so that we don't try to hit Cassandra for series we know have no data. So perhaps a better approach is to just add an additional lastUpdate field to api.Models.Req. We could then remove the lastSeen offset and always return all series in the index, relying instead on the max-stale settings in cassandraIdx to clean out metrics that are no longer being updated. However, in dynamic environments like k8s, it would be preferable to use this change and update Grafana to pass the time range for metrics/find calls (in the query editor and in template vars). That would allow us to increase the max-stale setting to a few months, so that we could still see metrics from deleted containers when requesting a time range in which they were running.
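A hedged sketch of that alternative, assuming a per-series lastUpdate is copied onto each request at find time; the Req type here is illustrative, not the real api.Models.Req:

```go
package main

import "fmt"

// Req is an illustrative per-series fetch request; the real api.Models.Req
// carries more fields.
type Req struct {
	Key        string
	From, To   int64
	LastUpdate int64 // copied from the index entry when the request is built
}

// fetch skips the Cassandra round trip for series whose last point predates
// the requested range: they cannot have any data for it.
func fetch(req Req) []float64 {
	if req.LastUpdate < req.From {
		return nil // known-empty series: don't hit cassandra
	}
	return queryStore(req) // placeholder for the real cassandra read
}

func queryStore(req Req) []float64 {
	return []float64{1, 2, 3} // stub data
}

func main() {
	fmt.Println(fetch(Req{Key: "servers.a.cpu", From: 800, To: 900, LastUpdate: 500}))  // []
	fmt.Println(fetch(Req{Key: "servers.b.cpu", From: 800, To: 900, LastUpdate: 1000})) // [1 2 3]
}
```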
One thing to keep in mind is that the Ceres backend for Graphite (intended to replace Whisper) only returns series that have data for the requested range. Based on this, and the fact that I like the idea of being able to keep old series in the index so old time ranges can be rendered, I think we should merge this PR.
Also see grafana/grafana#8055.
I think the argument that metrics about containers which were stopped some time ago should still be viewable is a good one.
As an end user, there can be use cases where it's useful, even necessary, to see which series are returned, even when they're all null.
Assuming you mean lastUpdate is set by metrictank (not by the user), based on the lastUpdate it got while querying the index when creating all the requests inside the renderMetrics handler, then yes, this seems like a good idea. Or it could simply be a bool like … So I think by default we shouldn't do any …
No. Graphite already supports the parameters. It doesn't use them with Whisper, but does with Ceres, and will also pass them through to metrictank.
@Dieterbe can you give an example?
Is that not exactly what the removeEmptySeries() function does?
But if you do that, the query editor in Grafana and template vars will show every metric that ever existed (unless index pruning is enabled), regardless of whether there is any data for it. Given the trend toward more dynamic infrastructure, that is a big issue IMHO.
oh. huh. seems so.
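For reference, a minimal sketch of what a removeEmptySeries-style filter does after rendering; Graphite's real implementation is Python, and here nil pointers stand in for nulls:

```go
package main

import "fmt"

// Series is an illustrative rendered series; nil points represent nulls.
type Series struct {
	Name   string
	Points []*float64
}

// removeEmptySeries drops series whose datapoints are all null, which is
// the post-render counterpart of filtering find results by time range.
func removeEmptySeries(in []Series) []Series {
	var out []Series
	for _, s := range in {
		for _, p := range s.Points {
			if p != nil {
				out = append(out, s)
				break
			}
		}
	}
	return out
}

func main() {
	v := 1.5
	all := []Series{
		{Name: "a", Points: []*float64{nil, &v}},
		{Name: "b", Points: []*float64{nil, nil}},
	}
	for _, s := range removeEmptySeries(all) {
		fmt.Println(s.Name) // only "a" survives
	}
}
```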
I ran into this at Vimeo a couple of times. I forget what the exact use cases were, but I think it was probably mostly just sanity checks:
- which servers have reported data, i.e. have monitoring set up correctly
- when you want to plot a fraction where the denominator is the number of servers
- dividing statsd counts (which can be legitimately null, which then means 0) by the number of series that exist
I'm talking about render requests. You're talking about find requests. For find requests, I think this is useful; see my comment in grafana/grafana#8055.
But they should have the same behavior. A render request performs a find request, then renders all of the returned series. It would be weird if the query editor said there were 10 metrics, but your query only returned 5. That is already a problem right now (for metrictank), and grafana/grafana#8055 will resolve it. What you are suggesting would be even weirder, with the template vars or query editor showing 5 metrics but the query returning 10.
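A sketch of that coupling, with all names illustrative: because a render is a find followed by a fetch per returned series, any filtering difference between the two paths makes the query editor and the graph disagree about how many series exist.

```go
package main

import "fmt"

type entry struct {
	path       string
	lastUpdate int64
}

type series struct {
	name   string
	points []float64
}

// find filters by time range; the metrics/find endpoint should apply the
// same filter so its counts match what render returns.
func find(index []entry, from int64) []entry {
	var out []entry
	for _, e := range index {
		if e.lastUpdate >= from {
			out = append(out, e)
		}
	}
	return out
}

// fetchSeries is a stub; real code reads the points from the store.
func fetchSeries(e entry, from, to int64) series {
	return series{name: e.path}
}

// renderMetrics: a render is a find plus a fetch for every returned series.
func renderMetrics(index []entry, from, to int64) []series {
	entries := find(index, from)
	out := make([]series, 0, len(entries))
	for _, e := range entries {
		out = append(out, fetchSeries(e, from, to))
	}
	return out
}

func main() {
	idx := []entry{{"a", 1000}, {"b", 500}}
	fmt.Println(len(renderMetrics(idx, 800, 900))) // 1
}
```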
But if they haven't sent any data within the query time range, then they are not reporting data.
This is a valid use case. But I believe that statsd should report 0 for a 0 count, not null. Null means no information, which is very different from the knowledge of a number being 0.
OK, I am going to merge this.
While this is a legit concern, the current behavior, where we have to prune metrics from the index after 2-3 days in order to clear out metrics that are no longer sending data, means that these kinds of queries will potentially be wrong anyway if the query time range is greater than the 2-3 day max-stale setting. And it won't just be the series counts that are wrong: all of the counts from series that had data but stopped sending 3 days ago will be lost.
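For context, a hedged sketch of the kind of index setting being discussed; the section and option names follow the cassandraIdx / max-stale naming used in this thread, and the exact syntax, value format, and defaults should be checked against the shipped sample config:

```ini
[cassandra-idx]
# prune series from the index once they have received no data for this long.
# the argument above: with find results filtered by time range, this could be
# raised from days to months, so metrics from deleted containers remain
# renderable for historical time ranges.
max-stale = 3d
```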
I generally agree, but this is a fundamental problem with statsd for which there is no good solution. If statsd keeps sending zeroes but your service goes offline, it'll just keep sending zeroes, and it'll look like the service has no work when actually it is down. With stateful services there are kludgy workarounds, e.g. reporting the current unix time in a different series, but I haven't seen anyone do this in practice because it complicates alerting. (Of course, if you have a stateful service you're better off sending aggregated metrics directly to Graphite.) And with stateless servers (e.g. PHP on Apache) you can only send metrics when a page is being requested (unless you set up additional monitoring). Statsd has options like deleteCounters; in practice some people set it to true, others to false, and both cause trouble. We may not like it, but we have to take these things into account.
There are use cases where occasionally not sending data is totally legitimate, or even irrelevant to your information needs, such as the one I described: you want to know which servers have reported data in the past. (Crontabs and batch jobs are also an example: you want to know whether they've been able to run and send metrics in the past, whether or not there's data for the currently selected time range.)
(emphasis above mine)
To me this sounds like an index max-stale that is set so aggressively that it becomes very harmful.
Good point, that would be weird. I think you missed a spot: