This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

remove seenAfter offset #595

Merged
woodsaj merged 1 commit from removeIndexFindOffset into master on Apr 7, 2017

Conversation

@woodsaj
Member

woodsaj commented Apr 6, 2017

The lastUpdate timestamp of items in the index is now updated every time a new point is seen, so we don't need to add an offset to the lastSeen param passed to metricIndex.Find().

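For illustration, a minimal before/after sketch of the call site (the Index interface and the names here are assumptions for the sketch, not the exact metrictank types):

package example

// Index stands in for the metric index; Find filters out series whose
// lastUpdate is older than seenAfter.
type Index interface {
	Find(orgID int, pattern string, seenAfter int64) ([]string, error)
}

// findSeriesOld shows the previous behaviour: pad the lower bound by a day
// because lastUpdate was only refreshed periodically.
func findSeriesOld(idx Index, orgID int, pattern string, fromUnix int64) ([]string, error) {
	return idx.Find(orgID, pattern, fromUnix-86400)
}

// findSeriesNew shows the behaviour after this change: lastUpdate is refreshed
// on every point, so the requested lower bound is passed through unmodified.
func findSeriesNew(idx Index, orgID int, pattern string, fromUnix int64) ([]string, error) {
	return idx.Find(orgID, pattern, fromUnix)
}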
@woodsaj
Member Author

woodsaj commented Apr 6, 2017

see also #119

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

I am unsure if this is the right approach.

The primary reason for filtering the search results is so that we don't try to hit cassandra for series we know have no data.

So perhaps a better approach is to just add an additional lastUpdate field to api.Models.Req.
If lastUpdate < from, then we don't need to query for data.
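A minimal sketch of that idea (the field name and the Req shape here are assumptions, not the real api/models definitions):

package example

// Req stands in for api.Models.Req with a hypothetical LastUpdate field,
// populated from the index entry when the request is built.
type Req struct {
	Key        string
	From       uint32
	To         uint32
	LastUpdate uint32 // unix timestamp of the last point seen for this series
}

// needsStoreQuery reports whether cassandra needs to be queried at all: if the
// series stopped updating before the requested window starts, there is no data
// in range and an all-null series can be returned directly.
func needsStoreQuery(r Req) bool {
	return r.LastUpdate >= r.From
}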

We can then remove the lastSeen offset and always just return all series in the index, instead relying on the max-stale setting in cassandraIdx to clean out metrics that are no longer being updated.

However, in dynamic environments like k8s, it would be preferable to use this change and update Grafana to pass the time range for metrics/find calls (in the query editor and in template vars). This would allow us to increase the max-stale setting to a few months, so that we will still be able to see metrics from deleted containers if we request a time range from when they were running.

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

One thing to keep in mind is that the Ceres backend for Graphite (intended to replace whisper) only returns series that have data for the requested range.

https://github.com/graphite-project/graphite-web/blob/master/webapp/graphite/finders/ceres.py#L31

Based on this, and the fact that I like the idea of being able to keep old series in the index so old time ranges can be rendered, I think we should merge this PR.

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

also see grafana/grafana#8055

@replay
Contributor

replay commented Apr 6, 2017

I think the argument that metrics about containers which were stopped some time ago should still be viewable is a good one.
If we modify Grafana to pass in the time range, do we also need to modify graphite-api for that? Or does that require that, by then, we no longer have graphite-api between Grafana and Metrictank?

@Dieterbe
Contributor

Dieterbe commented Apr 6, 2017

As an end user, there are use cases where it's useful, even necessary, to see which series are returned, even when they're all null.
OTOH there are cases where this behavior is annoying and you don't want to see any nulls. Though grafana lets you hide them, graphite has no option to not include them. I always envisioned we could add a parameter or function to the graphite-api to exclude null series.

So perhaps a better approach is to just add an additional lastUpdate field to api.Models.Req.
If lastUpdate < from, then we don't need to query for data.

Assuming you mean that lastUpdate is set by metrictank (not by the user), based on the lastUpdate it got while querying the index when creating all the requests inside the renderMetrics handler, then yes, this seems like a good idea. Or it could simply be a bool like metaOnly (meaning only return the series name with all nulls, not any actual data). Or we could even (ab)use one of the existing properties (like setting archive to -1). Adding a timestamp field to the Req type just seems a bit excessive for a simple boolean choice we can determine when creating the request.

So I think by default we shouldn't do any seenAfter-based filtering from the index, though I want to add a function or parameter to enable it (yet another reason for a MT datasource in grafana)
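A sketch of the metaOnly variant (a hypothetical field, not an existing metrictank type): the flag is decided from the lastUpdate the index reported while the per-series requests are built, and a flagged request is answered with the series name and all-null values instead of a store lookup.

package example

// seriesReq is a stand-in for the per-series request built in the render path.
type seriesReq struct {
	Key      string
	From, To uint32
	MetaOnly bool // true: return only the series name with null values, skip the store
}

// buildReq sets MetaOnly when the index says the series has no points in range.
func buildReq(key string, from, to, lastUpdate uint32) seriesReq {
	return seriesReq{
		Key:      key,
		From:     from,
		To:       to,
		MetaOnly: lastUpdate < from, // no data can exist in the requested range
	}
}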

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

If we modify Grafana to pass in the time range, do we also need to modify graphite-api for that?

No. Graphite already supports the parameters. It doesn't use them with whisper, but does with Ceres, and will also pass them through to metrictank.
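For example (purely illustrative values; graphite-web's find endpoint accepts from/until as unix timestamps), a query-editor lookup with a time range would look something like:

/metrics/find?query=some.metrics.*&from=1491436800&until=1491523200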

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

As an end user, there are use cases where it's useful, even necessary, to see which series are returned, even when they're all null.

@Dieterbe can you give an example?

graphite has no option to not include them

Is that not exactly what the removeEmptySeries() function does?

So I think by default we shouldn't do any seenAfter-based filtering from the index, though I want to add a function or parameter to enable it (yet another reason for a MT datasource in grafana)

But if you do that, the query editor in Grafana and template vars will show every metric that ever existed (unless index pruning is enabled), regardless of whether there is any data for it. Given the trend toward more dynamic infrastructure, that is a big issue IMHO.

@Dieterbe
Contributor

Dieterbe commented Apr 6, 2017

Is that not exactly what the removeEmptySeries() function does?

oh. huh. seems so.

can you give an example?

I ran into this at Vimeo a couple of times. I forgot what the exact use cases were, but I think it was probably mostly just sanity checks: e.g. which servers have reported data, i.e. have monitoring set up correctly; and when you want to plot a fraction where the denominator is something like the number of servers, e.g. dividing statsd counts (which can be legitimately null, which then means 0) by the number of series that exist.

But if you do that, the query editor in Grafana and template vars will show every metric that ever existed (unless index pruning is enabled), regardless of whether there is any data for it. Given the trend toward more dynamic infrastructure, that is a big issue IMHO.

I'm talking about render requests. You're talking about find requests. For find requests, I think this is useful; see my comment in grafana/grafana#8055.

@woodsaj
Member Author

woodsaj commented Apr 6, 2017

I'm talking about render requests. You're talking about find requests

But they should have the same behavior. A render request performs a find request, then renders all of the returned series. It would be weird if the query editor said there were 10 metrics, but your query only returned 5. That is already a problem right now (for metrictank), and grafana/grafana#8055 will resolve it.

What you are suggesting would be even weirder, with the template vars or query editor showing 5 metrics, but the query returning 10.

mostly just sanity checks: e.g. which servers have reported data, i.e. have monitoring set up correctly

But if they haven't sent any data within the query time range, then they are not reporting data.

when you want to plot a fraction where the denominator is something like the number of servers, e.g. dividing statsd counts (which can be legitimately null, which then means 0) by the number of series that exist.

This is a valid use case, but I believe that statsd should report 0 for a 0 count, not null. Null means no information, which is very different from knowing that a number is 0.

@woodsaj
Member Author

woodsaj commented Apr 7, 2017

OK, I am going to merge this.

when you want to plot a fraction where the denominator is something like the number of servers

While this is a legitimate concern, the current behavior, where we have to prune metrics from the index after 2-3 days in order to clear out metrics that are no longer sending data, means that these kinds of queries will potentially be wrong anyway if the query time range is greater than the 2-3 day max-stale setting. And it won't just be the series counts that are wrong; all of the counts from series that had data but stopped sending 3 days ago will be lost.

woodsaj merged commit 02eadce into master on Apr 7, 2017
woodsaj deleted the removeIndexFindOffset branch on April 7, 2017 at 06:53
@Dieterbe
Contributor

Dieterbe commented Apr 10, 2017

But I believe that statsd should report 0 for a 0 count, not null. Null means no information, which is very different from knowing that a number is 0.

I generally agree, but this is a fundamental problem with statsd for which there is no good solution. If statsd kept sending zeroes and your service went offline, it would just keep sending zeroes, and it would look like the service has no work when actually it is down. With stateful services there are kludgy workarounds, e.g. reporting the current unix time in a different series, but I haven't seen anyone do this in practice because it complicates alerting. (Of course, if you have a stateful service you're better off sending aggregated metrics directly to graphite.) And with stateless servers (e.g. PHP on Apache) you can only send metrics when a page is being requested (unless you set up additional monitoring). Statsd has options like deleteCounters; in practice some people have it true, others have it false, and both result in trouble. We may not like it, but we have to take these things into account.

if they haven't sent any data within the query time range, then they are not reporting data.

There are use cases where occasionally not sending data is totally legitimate, or even irrelevant to information needs such as the one I described, i.e. you want to know which servers have reported data in the past. (Crontabs and batch jobs are also an example: you want to know whether they've been able to run and send metrics in the past, whether or not there's data for the currently selected time range.)

the current behavior, where we have to prune metrics from the index after 2-3 days in order to clear out metrics that are no longer sending data, means that these kinds of queries will potentially be wrong anyway if the query time range is greater than the 2-3 day max-stale setting.

(emphasis above mine)
Do we really "have to"? I always viewed the index max-stale setting as a measure that helps with controlling index size, and hence the latency of index operations, but one that should always be set large enough that it doesn't interfere with a user's needs. Until now it has been hard to balance this against the downside of stale metrics showing up when you don't want them (especially when there's lots of churn in dynamic environments), but now that we're going to implement the filtering on index requests this becomes perfectly achievable: metrics you don't care about won't show up in the query builder, nor in the render output, so you can make max-stale very conservative (and keep the metricdefs for a long time). There are optimizations we can do to the index to make this doable at larger scale as well.
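As a rough illustration of what "very conservative" could mean here (the option is the cassandra-idx max-stale setting discussed in this thread; the value and its exact syntax are only examples, not a recommendation from this PR):

[cassandra-idx]
# keep metricdefs for stale series around for months, so that old time ranges
# stay renderable; rely on time-range filtering to keep them out of the way
max-stale = 90d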

And it won't just be the series counts that are wrong; all of the counts from series that had data but stopped sending 3 days ago will be lost.

To me this sounds like an index max-stale that is set so aggressively that it becomes very harmful.

What you are suggesting would be even weirder, with the template vars or query editor showing 5 metrics, but the query returning 10.

Good point, that would be weird.
But do you agree that the time filtering should be optional and controllable from the query editor, for the reasons I explained above and also in grafana/grafana#8055?

I think you missed a spot:

func (s *Server) indexFind(ctx *middleware.Context, req models.IndexFind) {
	// metricDefs only get updated periodically (when using CassandraIdx), so we add a 1day (86400seconds) buffer when
	// filtering by our From timestamp.  This should be moved to a configuration option
	if req.From != 0 {
		req.From -= 86400
	}
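For completeness, a hypothetical sketch of the matching fix, mirroring this PR (not a quote of an actual follow-up commit):

func (s *Server) indexFind(ctx *middleware.Context, req models.IndexFind) {
	// lastUpdate is now refreshed on every point, so req.From can be passed
	// through to the index unmodified; the 1-day buffer above simply goes away.
	// (rest of the handler unchanged)
}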
