
enforce max series for metrics queries #4525

Open · wants to merge 7 commits into base: main
Conversation

@ie-pham (Contributor) commented Jan 7, 2025:

What this PR does: Adds a config option to enforce a maximum number of time series returned in a metrics query. The limit is enforced in the combiner: as soon as the max series count is reached, the shouldQuit function returns true and the combiner returns everything it has combined so far, even if each series contains only a single data point at that point.

New config: max_response_series (default: 1000)
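
For illustration, a minimal Go sketch (not the PR's exact code) of the check the combiner applies, where maxSeries stands in for the configured max_response_series value:

package combiner

import "github.com/grafana/tempo/pkg/tempopb"

// maxSeriesReached reports whether a combined response has hit the configured
// limit; the combiner's quit hook returns true at that point and the partial
// result is returned as-is. Sketch only, not the PR's implementation.
func maxSeriesReached(resp *tempopb.QueryRangeResponse, maxSeries int) bool {
	// a limit of 0 means unlimited
	return maxSeries > 0 && resp != nil && len(resp.Series) >= maxSeries
}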

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@ie-pham (Contributor, Author) commented Jan 21, 2025:

The way this is implemented, Tempo truncates the final results at the frontend level. Alternatively, we could return as soon as 1000 series are reached, regardless of how many data points each series has, so the query exits early. Not sure which we prefer.

Review threads on changed files:
  • pkg/api/http.go (outdated)
  • pkg/tempopb/tempo.proto
  • CHANGELOG.md
  • docs/sources/tempo/configuration/_index.md (outdated)
  • modules/frontend/combiner/metrics_query_range.go (outdated)
@knylander-grafana (Contributor) left a comment:

Thank you for adding docs.

@@ -55,15 +61,18 @@ func NewQueryRange(req *tempopb.QueryRangeRequest, trackDiffs bool) (Combiner, e
sortResponse(resp)
return resp, nil
},
quit: func(resp *tempopb.QueryRangeResponse) bool {
@electron0zero (Member) commented Feb 4, 2025:

can we add a test for early exit from the combiner?


metrics:
  # Maximum number of time series returned for a metrics query.
  [max_response_series: <int> | default = 1000]
A reviewer (Member) commented:

This is an interesting choice. Normally we would communicate the max series through a query param from the frontend to the queriers. The downside of your approach is that we have to make sure the two settings stay aligned, or Tempo may appear subtly broken; the advantage is that we don't have to marshal something like series=1000 into every subquery.

Can you bring this up with the team and see if we have consensus either way?
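
For context, a rough sketch of the query-param alternative described above: the frontend would stamp the limit onto each subrequest before fanning it out. The parameter name and helper below are hypothetical, not Tempo's actual API:

package frontend

import (
	"net/http"
	"strconv"
)

// addMaxSeriesParam is an illustrative sketch only: the frontend copies the
// limit onto every subquery it fans out so downstream queriers can stop
// early. "maxSeries" is a hypothetical query parameter, not an existing one.
func addMaxSeriesParam(subReq *http.Request, maxSeries int) {
	q := subReq.URL.Query()
	q.Set("maxSeries", strconv.Itoa(maxSeries)) // re-encoded for every subquery
	subReq.URL.RawQuery = q.Encode()
}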

A reviewer (Member) commented:

This does make me wonder if we should have a shared section of the config for querying, like we do for storage. That feels like overkill for one setting, though.

query_backend_after: 0   # setting these both to 0 will force all range searches to hit the backend
query_ingesters_until: 0
metrics:
  max_response_series: 3
A reviewer (Member) commented:

for the test should we set it on the querier as well?

sendLoop:
for {
select {
case <-ticker.C:
A reviewer (Member) commented:

curious as to why you chose this loop structure. the goal seems to be loop 10 times and send data?
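
If the intent is simply "wait for the ticker and send ten batches", a plain counted loop might read more directly; a runnable sketch (with a placeholder body) for comparison:

package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	// counted alternative to the labeled sendLoop: one send per tick, ten times
	for i := 0; i < 10; i++ {
		<-ticker.C
		fmt.Println("send batch", i) // placeholder for the real send
	}
}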

@@ -17,6 +17,7 @@
* [CHANGE] **BREAKING CHANGE** Enforce max attribute size at event, link, and instrumentation scope. Make config per-tenant.
Renamed max_span_attr_byte to max_attribute_bytes
[#4633](https://github.com/grafana/tempo/pull/4633) (@ie-pham)
* [CHANGE] Enforce max series in response for metrics queries [#4525](https://github.com/grafana/tempo/pull/4525) (@ie-pham)
A reviewer (Member) commented:

i'd mention the addition of the config param(s) to control behavior.

@@ -45,6 +46,13 @@ func NewQueryRange(req *tempopb.QueryRangeRequest) (Combiner, error) {
if resp == nil {
resp = &tempopb.QueryRangeResponse{}
}
if maxSeries > 0 && len(resp.Series) >= maxSeries {
A reviewer (Member) commented:

i believe we need a similar check in the diff function?

if err != nil {
return err
}

collector := pipeline.NewGRPCCollector(next, cfg.ResponseConsumers, c, func(qrr *tempopb.QueryRangeResponse) error {
// Translate each diff into the instant version and send it
resp := translateQueryRangeToInstant(*qrr)
// series already limited by the query range combiner just need to copy the status and message
resp.Status = qrr.Status
A reviewer (Member) commented:

why don't we do this in translateQueryRangeToInstant? seems like we need similar logic in the http handler

if err != nil {
return nil, err
}

mtx := sync.Mutex{} // combiner doesn't lock, so take the lock before calling Combine to make it safe
forEach := func(ctx context.Context, client tempopb.MetricsGeneratorClient) error {
if c.MaxSeriesReached() {
A reviewer (Member) commented:

should we not enforce this in the generators to prevent {} | rate() by (span:id) or whatever from overwhelming them?

have you tested such a query on this branch?

A reviewer (Member) commented:

also, don't we need similar code on the backend path? we also need to be thoughtful about the situation where the output series are different than the intermediate series.

for instance a quantile_over_time() calculation will pass up intermediate histograms that are then turned into quantiles in the frontend. so the queriers/generators may actually be handling more series than the output result. i'm wondering if we want a 2 tier limit where the queriers/generators that are doing the intermediate work have a higher limit than the frontend? can you push this discussion internally?
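
One possible shape for the two-tier limit being floated here, sketched as a Go struct; the field names are illustrative, not existing Tempo config options:

package config

// seriesLimits sketches the two-tier idea: queriers/generators doing the
// intermediate work (e.g. the per-bucket histogram series behind a
// quantile_over_time()) get a higher ceiling than the frontend's final output.
type seriesLimits struct {
	maxOutputSeries       int // cap on the final, combined result at the frontend
	maxIntermediateSeries int // higher cap on intermediate series at queriers/generators
}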

// used to track which series were updated since the previous diff
// todo: it may not be worth it to track the diffs per series. it would be simpler (and possibly nearly as effective) to just calculate a global
// max/min for all series
seriesUpdated map[string]tsRange
A reviewer (Member) commented:

i think this was from a previous feature that's been removed

return
}

// Here is where the job results are reentered into the pipeline
q.eval.ObserveSeries(resp.Series)

if q.maxSeries > 0 && len(q.eval.Results()) >= q.maxSeries {
A reviewer (Member) commented:

i'm guessing "Results()" can get quite expensive. it might be easier to count and limit input series? this would be similar in concept to limiting input streams in a loki or prometheus query. can you do some research here to determine the cost?
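
A sketch of the cheaper alternative being suggested: keep a running set of observed input series keys instead of calling Results() after every job. The type and key format below are assumptions for illustration:

package frontend

// countingLimiter counts distinct input series as they are observed, similar
// in spirit to limiting input streams in a Loki or Prometheus query. Sketch
// only; the key format (e.g. a prom-style label string per series) is an
// assumption.
type countingLimiter struct {
	maxSeries int
	seen      map[string]struct{}
}

// observe records one job's series keys and reports whether the limit is hit.
func (l *countingLimiter) observe(seriesKeys []string) bool {
	if l.seen == nil {
		l.seen = map[string]struct{}{}
	}
	for _, k := range seriesKeys {
		l.seen[k] = struct{}{}
	}
	return l.maxSeries > 0 && len(l.seen) >= l.maxSeries
}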
