enforce max series for metrics queries #4525
base: main
Conversation
the way this is implemented, tempo will truncate the final results at the frontend level. we could instead implement it so that it returns as soon as 1000 series are reached, regardless of how many data points are in each series, to exit early. not sure which we prefer.
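A minimal sketch of the early-exit condition being discussed, assuming the combiner's quit callback sees the partially combined response (the helper name below is illustrative, not from the PR):

```go
package combiner

import "github.com/grafana/tempo/pkg/tempopb"

// maxSeriesReached sketches the early-exit check: once the combined response
// already holds maxSeries series, the combiner can stop waiting for further
// subquery results, even if each series only has a few data points so far.
func maxSeriesReached(resp *tempopb.QueryRangeResponse, maxSeries int) bool {
	return maxSeries > 0 && len(resp.Series) >= maxSeries
}
```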
Thank you for adding docs.
```
@@ -55,15 +61,18 @@ func NewQueryRange(req *tempopb.QueryRangeRequest, trackDiffs bool) (Combiner, e
			sortResponse(resp)
			return resp, nil
		},
		quit: func(resp *tempopb.QueryRangeResponse) bool {
```
can we add a test for early exit from the combiner?
Force-pushed from 604930d to 2c986ed
```
metrics:
  # Maximum number of time series returned for a metrics query.
  [max_response_series: <int> | default = 1000]
```
this is an interesting choice. normally we would communicate the max series through a query param from the frontend to the queriers. the negative of your approach is that we have to make sure the two settings are aligned or tempo may appear subtly broken. the advantage is that we don't repeatedly marshal something like series=1000 for every subquery.
can you bring this up with the team and see if we have consensus either way?
this does make me wonder if we should have a shared section of the config for querying like we do for storage. that feels like overkill for one setting tho.
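Purely as a sketch of that idea, a shared query section might look like the following; no such section exists today and the names are made up:

```yaml
# hypothetical shared section read by both the frontend and the queriers
query:
  max_response_series: 1000
```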
```yaml
query_backend_after: 0 # setting these both to 0 will force all range searches to hit the backend
query_ingesters_until: 0
metrics:
  max_response_series: 3
```
for the test should we set it on the querier as well?
```go
sendLoop:
	for {
		select {
		case <-ticker.C:
```
curious as to why you chose this loop structure. the goal seems to be to loop 10 times and send data?
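If the intent really is "send data 10 times at a fixed interval", a plainer shape might be the sketch below; sendBatch is a hypothetical stand-in for whatever the test sends on each tick:

```go
package main

import "time"

// sendTenBatches makes the "loop 10 times and send data" intent explicit
// instead of using a labelled for/select loop with a separate counter.
func sendTenBatches(sendBatch func() error) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 10; i++ {
		<-ticker.C // wait for the next tick before sending
		if err := sendBatch(); err != nil {
			return err
		}
	}
	return nil
}
```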
```
@@ -17,6 +17,7 @@
* [CHANGE] **BREAKING CHANGE** Enforce max attribute size at event, link, and instrumentation scope. Make config per-tenant.
  Renamed max_span_attr_byte to max_attribute_bytes
  [#4633](https://github.com/grafana/tempo/pull/4633) (@ie-pham)
* [CHANGE] Enforce max series in response for metrics queries [#4525](https://github.com/grafana/tempo/pull/4525) (@ie-pham)
```
i'd mention the addition of the config param(s) to control behavior.
```
@@ -45,6 +46,13 @@ func NewQueryRange(req *tempopb.QueryRangeRequest) (Combiner, error) {
			if resp == nil {
				resp = &tempopb.QueryRangeResponse{}
			}
			if maxSeries > 0 && len(resp.Series) >= maxSeries {
```
i believe we need a similar check in the diff function?
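As a sketch of what that similar check could look like on the diff path, assuming the diff function also assembles a *tempopb.QueryRangeResponse (truncation shown for illustration; the PR may prefer to quit early instead):

```go
package combiner

import "github.com/grafana/tempo/pkg/tempopb"

// capSeries applies the same guard the combine path uses: drop anything past
// the configured maximum so the diff never exceeds the limit.
func capSeries(resp *tempopb.QueryRangeResponse, maxSeries int) {
	if resp == nil || maxSeries <= 0 {
		return
	}
	if len(resp.Series) > maxSeries {
		resp.Series = resp.Series[:maxSeries]
	}
}
```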
```go
	if err != nil {
		return err
	}

	collector := pipeline.NewGRPCCollector(next, cfg.ResponseConsumers, c, func(qrr *tempopb.QueryRangeResponse) error {
		// Translate each diff into the instant version and send it
		resp := translateQueryRangeToInstant(*qrr)
		// series already limited by the query range combiner just need to copy the status and message
		resp.Status = qrr.Status
```
why don't we do this in translateQueryRangeToInstant? seems like we need similar logic in the http handler
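A sketch of that suggestion, folding the status/message copy into the translation helper so both the gRPC collector and the HTTP handler pick it up. The struct types below are placeholders standing in for tempopb.QueryRangeResponse and the instant-query response, whose full definitions aren't visible in this diff:

```go
package sketch

// Placeholder types: only the fields relevant to the suggestion are modeled.
type queryRangeResponse struct {
	Status  string
	Message string
	// series omitted
}

type queryInstantResponse struct {
	Status  string
	Message string
	// series omitted
}

// translateQueryRangeToInstant copies the status and message itself, so
// callers don't have to repeat the copy after every translation.
func translateQueryRangeToInstant(qrr queryRangeResponse) queryInstantResponse {
	resp := queryInstantResponse{}
	// ... series translation elided ...
	resp.Status = qrr.Status
	resp.Message = qrr.Message
	return resp
}
```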
```go
	if err != nil {
		return nil, err
	}

	mtx := sync.Mutex{} // combiner doesn't lock, so take lock before calling Combine to make it safe
	forEach := func(ctx context.Context, client tempopb.MetricsGeneratorClient) error {
		if c.MaxSeriesReached() {
```
should we not enforce this in the generators to prevent {} | rate() by (span:id) or whatever from overwhelming them? have you tested such a query on this branch?
also, don't we need similar code on the backend path? we also need to be thoughtful about the situation where the output series are different than the intermediate series. for instance, a quantile_over_time() calculation will pass up intermediate histograms that are then turned into quantiles in the frontend, so the queriers/generators may actually be handling more series than the output result. i'm wondering if we want a 2 tier limit where the queriers/generators that are doing the intermediate work have a higher limit than the frontend? can you push this discussion internally?
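Purely illustrative, a two-tier limit could be expressed roughly as below; only max_response_series exists in this PR, and the intermediate setting, its name, and its placement are hypothetical:

```yaml
query_frontend:
  metrics:
    max_response_series: 1000        # final, user-visible series
querier:
  metrics:
    max_intermediate_series: 10000   # hypothetical higher limit for intermediate work
```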
```go
	// used to track which series were updated since the previous diff
	// todo: it may not be worth it to track the diffs per series. it would be simpler (and possibly nearly as effective) to just calculate a global
	// max/min for all series
	seriesUpdated map[string]tsRange
```
i think this was from a previous feature that's been removed
```go
		return
	}

	// Here is where the job results are reentered into the pipeline
	q.eval.ObserveSeries(resp.Series)

	if q.maxSeries > 0 && len(q.eval.Results()) >= q.maxSeries {
```
i'm guessing "Results()" can get quite expensive. it might be easier to count and limit input series? this would be similar in concept to limiting input streams in a loki or prometheus query. can you do some research here to determine the cost?
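One possible shape for that alternative: count distinct input series as they are observed rather than materializing Results() per response. The names are illustrative, and the assumption that each TimeSeries carries a PromLabels key for deduplication is an assumption about the tempopb types, not something shown in this diff:

```go
package frontend

import "github.com/grafana/tempo/pkg/tempopb"

// seriesCounter tracks how many distinct input series have been observed so
// the limit can be checked cheaply, without recomputing the full result set.
type seriesCounter struct {
	seen map[string]struct{}
	max  int
}

func newSeriesCounter(max int) *seriesCounter {
	return &seriesCounter{seen: map[string]struct{}{}, max: max}
}

// observe records the incoming series and reports whether the limit is hit.
func (c *seriesCounter) observe(series []*tempopb.TimeSeries) bool {
	for _, s := range series {
		c.seen[s.PromLabels] = struct{}{}
	}
	return c.max > 0 && len(c.seen) >= c.max
}
```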
What this PR does: Adds config to enforce the maximum number of time series returned by a metrics query. This is enforced in the combiner: as soon as the max series is reached, the shouldQuit function returns true and the combiner returns everything it has combined so far, even if each series has only a single data point when the limit is hit.
New config: max_response_series <default 1000>
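For reference, a sketch of where the new setting would sit; the placement under the query-frontend metrics block is an assumption based on the config docs diff earlier in this PR:

```yaml
query_frontend:
  metrics:
    # Maximum number of time series returned for a metrics query.
    max_response_series: 1000
```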
Which issue(s) this PR fixes:
Fixes #
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]