[new feature] API: Introduce exemplar api #7974
Conversation
http response text:

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

- ingester            -1.8%
+ distributor           0%
- querier             -3.2%
+ querier/queryrange    0%
- iter                -24%
- storage             -7.7%
- chunkenc            -6%
- logql               -15.6%
+ loki                  0%
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

- ingester            -1.8%
+ distributor           0%
- querier             -3.2%
+ querier/queryrange    0.1%
- iter                -24%
- storage             -7.7%
- chunkenc            -6%
- logql               -15.6%
+ loki                  0%
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

- ingester            -1.8%
+ distributor           0%
- querier             -3.2%
- querier/queryrange  -0.1%
- iter                -24%
- storage             -7.7%
- chunkenc            -6%
- logql               -15.6%
+ loki                  0%
This is very interesting @liguozhong! I am wondering, though, if it might be simpler to implement such a feature in just the front end. If you were to click on a metric graph at any point, couldn't Grafana make a query for the corresponding log line(s) in a lookup query? This won't be perfect, however, because a point on a graph can encompass many log lines. A related question: how are the exemplars chosen?
./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

- ingester            -1.8%
+ distributor           0%
- querier             -3.2%
+ querier/queryrange    0%
- iter                -24%
- storage             -7.7%
- chunkenc            -6%
- logql               -15.6%
+ loki                  0%
@slim-bean, thanks for such a timely review. This idea is great: compared with the implementation in this PR, it would avoid a lot of S3 query computation, and I hadn't thought of that approach before. Still, maybe we can provide two kinds of exemplar API. Although the amount of code in this PR is bloated, from a mathematical point of view its implementation gives the correct answer, and it follows the design of Prometheus's exemplars. As far as I know, the exemplar concept was designed by Google's Monarch system after a lot of practice; exemplars are a very good monitoring standard, and Loki can follow that standard too. So I suggest we provide two exemplar APIs for users to choose from. The computation cost of this PR is basically the same as /query_range, since it has to issue a very large number of query requests to S3. But even if this PR's implementation is as slow as /query_range, we can later write the exemplars into Prometheus or Mimir through a recording rule and complete the exemplar linkage from logs to metrics.
func (r *streamRangeExemplarIterator) load(start, end int64) {
	// Walk the underlying sample iterator, feeding each sample into the
	// aggregator registered for its series.
	for lbs, sample, hasNext := r.iter.Peek(); hasNext; lbs, sample, hasNext = r.iter.Peek() {
		// Stop once the sample falls past the current window. (The excerpt
		// as posted never consulted start/end; this bounds check is the
		// presumed intent.)
		if sample.TimestampMs >= end {
			break
		}
		rangeAgg, ok := r.windowRangeAgg[lbs]
		if !ok {
			// No aggregator for this series; skip the sample.
			_ = r.iter.Next()
			continue
		}
		p := exemplar.Exemplar{
			Ts:     sample.TimestampMs,
			Value:  sample.Value,
			Labels: logproto.FromLabelAdaptersToLabels(sample.Labels),
		}
		rangeAgg.agg(p)
		_ = r.iter.Next()
	}
}
func (r *streamRangeExemplarIterator) At() (int64, []exemplar.QueryResult) {
	if r.exemplars == nil {
		r.exemplars = make([]exemplar.QueryResult, 0, len(r.windowRangeAgg))
	}
	r.exemplars = r.exemplars[:0]
	ts := r.current/1e+6 + r.offset/1e+6
	for lbs, rangeAgg := range r.windowRangeAgg {
		exp := exemplar.Exemplar{
			Labels: rangeAgg.at().Labels,
			Ts:     ts,
		}
		r.exemplars = append(r.exemplars, exemplar.QueryResult{
			SeriesLabels: r.metrics[lbs],
			Exemplars:    []exemplar.Exemplar{exp},
		})
	}
	return ts, r.exemplars
}

type ExemplarAgg struct {
	exemplar exemplar.Exemplar
}

func (a *ExemplarAgg) agg(exemplar exemplar.Exemplar) {
	a.exemplar = exemplar
}

func (a *ExemplarAgg) at() exemplar.Exemplar {
	return a.exemplar
}
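The excerpt above implicitly answers the "how are exemplars chosen?" question: `ExemplarAgg.agg` simply overwrites its stored value, so the last sample seen in the window wins. A minimal standalone sketch of that semantics, with the types trimmed down from the PR's `exemplar.Exemplar`:

```go
package main

import "fmt"

// Exemplar is a trimmed-down stand-in for the PR's exemplar.Exemplar.
type Exemplar struct {
	Ts    int64
	Value float64
}

// ExemplarAgg keeps only the most recent exemplar it is fed,
// mirroring the last-wins behavior of the excerpt above.
type ExemplarAgg struct {
	exemplar Exemplar
}

func (a *ExemplarAgg) agg(e Exemplar) { a.exemplar = e }
func (a *ExemplarAgg) at() Exemplar   { return a.exemplar }

func main() {
	agg := &ExemplarAgg{}
	for _, e := range []Exemplar{{Ts: 1, Value: 10}, {Ts: 2, Value: 20}, {Ts: 3, Value: 30}} {
		agg.agg(e) // every sample is fed in; only the latest survives
	}
	fmt.Println(agg.at().Ts, agg.at().Value)
}
```

Last-wins is cheap per sample but, as the review below points out, it still requires scanning every log line in the window.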
This is a really cool PR, but I have a few reservations.
- Performance: Exemplars are likely meant to run alongside queries. How can we ensure this doesn't double query load?
- Selection: All logs are processed in order to select the last log line as an exemplar (related to performance). I don't think we can accept this approach, especially on large datasets.
- Maintainability: This PR duplicates most of the read path -- I don't think this is a sustainable pattern.
Considering the downsides, do we need this now? I don't think so. I'd be interested in figuring out a more scalable & sustainable approach to this problem. Perhaps we could extend the QuerySample API to include exemplars, scanning the data only once? It feels bad to abandon such a large and well-constructed PR, and I feel terrible blocking it; in the future it's worth talking about possible solutions before writing all the code (GitHub issues are a great place to do this).
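The "scan the data only once" alternative floated here could look roughly like the following: a wrapper over the sample stream that records an exemplar per series as samples are consumed, so exemplar selection piggybacks on the existing read path instead of duplicating it. All names below are hypothetical illustrations, not Loki's actual API:

```go
package main

import "fmt"

// Sample is a hypothetical stand-in for one point on a series.
type Sample struct {
	Series string
	Ts     int64
	Value  float64
}

// exemplarTap wraps a stream of samples and, as they are consumed,
// remembers one exemplar per series (here, the last sample seen),
// so no second pass over the data is needed.
type exemplarTap struct {
	src       []Sample
	pos       int
	exemplars map[string]Sample
}

func newExemplarTap(src []Sample) *exemplarTap {
	return &exemplarTap{src: src, exemplars: map[string]Sample{}}
}

// Next returns the next sample, recording it as the series' current exemplar.
func (t *exemplarTap) Next() (Sample, bool) {
	if t.pos >= len(t.src) {
		return Sample{}, false
	}
	s := t.src[t.pos]
	t.pos++
	t.exemplars[s.Series] = s
	return s, true
}

func main() {
	tap := newExemplarTap([]Sample{
		{"a", 1, 1.0}, {"b", 1, 2.0}, {"a", 2, 3.0},
	})
	for _, ok := tap.Next(); ok; _, ok = tap.Next() {
		// the normal query consumes samples here; exemplars accrue for free
	}
	fmt.Println(tap.exemplars["a"].Ts, tap.exemplars["b"].Ts)
}
```

The point of the design is that the exemplar bookkeeping is O(1) per sample on top of work the query already does, addressing the "double query load" concern.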
I agree with you. This PR took us a week, and the code is difficult to maintain, so it is not suitable for merging into master.
I personally support two different APIs (/loki/api/v1/query_exemplars + /loki/api/v1/query_range), which is more consistent with Prometheus.
I'm going to close this PR in the meantime since we don't plan to accept it.
What this PR does / why we need it:
Provides Loki's HTTP API for this feature: introduces an exemplar endpoint (/loki/api/v1/query_exemplars).
Prometheus exemplar doc link: https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars
query_range:
http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2
vs.
query_exemplars:
http://localhost:3100/loki/api/v1/query_exemplars?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2
Which issue(s) this PR fixes:
Fixes #7876
Special notes for your reviewer:
Why are the API parameters of /loki/api/v1/query_exemplars exactly the same as those of /loki/api/v1/query_range?
Because Loki labels can be generated dynamically during a LogQL query through parsers such as | json and | regexp, unlike the fixed labels of Prometheus. Therefore Loki's exemplar API cannot take label matchers the way Prometheus's does; it should take the same LogQL parameters as query_range, to make the best use of Loki's schema-on-read design.
As a result, this exemplar PR has become relatively long, which makes the review hard work. I am sorry for this.
Why is the test coverage rate dropping dramatically?
Because the amount of code in this PR is already large, this PR only completes the prototype of the exemplar feature with the smallest amount of code; the tests and docs will be completed in a follow-up PR.
Checklist
- Reviewed the CONTRIBUTING.md guide
- CHANGELOG.md updated
- docs/sources/upgrading/_index.md updated