
[new feature] API: Introduce exemplar api #7974

Closed
wants to merge 18 commits into from

Conversation

liguozhong
Contributor

@liguozhong liguozhong commented Dec 20, 2022

What this PR does / why we need it:
Provides Loki's HTTP API for this feature.

Introduces an exemplar HTTP API (/loki/api/v1/query_exemplars).
Prometheus exemplar docs: https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars

query_range:
http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2

VS
query_exemplars:
http://localhost:3100/loki/api/v1/query_exemplars?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2

Which issue(s) this PR fixes:
Fixes #7876

Special notes for your reviewer:
Why are the API parameters of /loki/api/v1/query_exemplars exactly the same as those of /loki/api/v1/query_range?
Because Loki labels can be generated dynamically during LogQL query evaluation through parsers such as | json and | regexp, unlike the fixed labels of Prometheus. Therefore, Loki's exemplars API cannot use label matchers the way Prometheus does; it should accept the same LogQL parameters as query_range to take full advantage of Loki's schema-on-read design.
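To make that parity concrete, here is a small illustrative Go sketch (not code from this PR; the helper name is hypothetical) showing that the only difference between the two calls is the endpoint path — the LogQL expression, including any parser stages such as `| json`, is passed through unchanged:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL builds a Loki query URL. The same parameter set works for
// both /loki/api/v1/query_range and /loki/api/v1/query_exemplars, because
// the exemplars endpoint in this PR takes a full LogQL expression rather
// than Prometheus-style label matchers.
func buildQueryURL(base, endpoint, logql string, start, end int64, step int) string {
	v := url.Values{}
	v.Set("query", logql) // may include parser stages such as `| json`
	v.Set("start", fmt.Sprintf("%d", start))
	v.Set("end", fmt.Sprintf("%d", end))
	v.Set("step", fmt.Sprintf("%d", step))
	return base + endpoint + "?" + v.Encode()
}

func main() {
	q := `count_over_time({job="varlogs"}[1m])`
	u := buildQueryURL("http://localhost:3100", "/loki/api/v1/query_exemplars",
		q, 1671449435000000000, 1671453036000000000, 2)
	fmt.Println(u)
}
```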

As a result, this PR is relatively long, which makes reviewing it hard work. I am sorry for that.

Why does the test coverage rate drop dramatically?
Because this PR is already large, it only implements a minimal prototype of the exemplar feature; tests and documentation will be completed in a follow-up PR.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner December 20, 2022 06:12
@liguozhong
Contributor Author

liguozhong commented Dec 20, 2022

HTTP response body:

{
	status: "success",
	data: {
		resultType: "exemplars",
		result: [{
				metric: {
					filename: "/var/log/McAfeeSecurity.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "Dec 15 15:02:12 B-Q6PJMD6M-0154 McAfee: [173]: Info: LogTime: 2022-Dec-15 15:02:12 HealthCheckManager::reviveFM - Starting Up FM with id=6 as an attempt to revive it",
						filename: "/var/log/McAfeeSecurity.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/cloudprintupdate.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "[Info]2022/12/19 20:36:07 main.go:43: Ali cloud print updater check again at 2022-12-19 20:36:07.363764 +0800 CST m=+295201.049795879",
							filename: "/var/log/cloudprintupdate.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "[Info]2022/12/19 20:36:07 main.go:43: Ali cloud print updater check again at 2022-12-19 20:36:07.363764 +0800 CST m=+295201.049795879",
							filename: "/var/log/cloudprintupdate.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					}
				]
			},
			{
				metric: {
					filename: "/var/log/fsck_apfs.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "",
						filename: "/var/log/fsck_apfs.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/fsck_apfs_error.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "fsck_apfs completed at Mon Dec 19 14:20:32 2022",
						filename: "/var/log/fsck_apfs_error.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/fsck_hfs.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "",
						filename: "/var/log/fsck_hfs.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/install.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452987
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452989
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452991
					}
				]
			},
			{
				metric: {
					filename: "/var/log/system.log",
					job: "varlogs"
				},
				values: [{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452989
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671453011
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671453033
					}
				]
			},
			{
				metric: {
					filename: "/var/log/wifi.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "Mon Dec 19 19:05:21.749  Mon Dec 19 20:56:13.930 <kernel> postMessage::1349 APPLE80211_M_BSSID_CHANGED received",
							filename: "/var/log/wifi.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "Mon Dec 19 19:05:21.749  Mon Dec 19 20:56:13.930 <kernel> postMessage::1349 APPLE80211_M_BSSID_CHANGED received",
							filename: "/var/log/wifi.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					}
				]
			}
		]
	}
}

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0.1%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
- querier/queryrange	-0.1%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@slim-bean
Collaborator

This is very interesting @liguozhong!

I am wondering though if it might be simpler to implement such a feature in just the front end?

If you were to click on a metric graph at any point, couldn't Grafana make a query for the corresponding log line(s) in a lookup query? This won't be perfect, however, because a point on a graph can encompass many log lines.

A related question: how are the exemplars chosen?

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@liguozhong
Contributor Author

liguozhong commented Dec 21, 2022

If you were to click on a metric graph at any point couldn't Grafana make a query for the corresponding log line(s) in a lookup query? This won't be perfect however because a point on a graph can encompass many log lines.

@slim-bean, thanks for such a timely review. This idea is great. Compared with the implementation of this PR, it could avoid a lot of S3 query computation. I hadn't thought of this approach before. But maybe we can offer two kinds of exemplar API as options.

Although the code in this PR is bloated, from a mathematical point of view its implementation is the correct answer. It also mirrors the design of Prometheus's exemplars. As far as I know, the exemplar concept was developed by Google's Monarch system after a lot of practice; exemplars are a very good monitoring standard, and Loki can follow that standard too.

I suggest we provide two exemplar APIs for users to choose from. The computational cost of this PR is basically the same as /query_range: it has to issue a very large number of query requests to S3. Even if this PR's implementation is as slow as /query_range, we can later write exemplars into Prometheus or Mimir through recording rules and complete the exemplar linkage from logs to metrics.

@liguozhong
Contributor Author

A related question? How are the exemplars chosen?

The code in range_exemplar.go implements the exemplar selection logic. As the code shows, the last log in the range is selected as the exemplar.

func (r *streamRangeExemplarIterator) load(start, end int64) {
	for lbs, sample, hasNext := r.iter.Peek(); hasNext; lbs, sample, hasNext = r.iter.Peek() {
		if rangeAgg, ok := r.windowRangeAgg[lbs]; ok {
			p := exemplar.Exemplar{
				Ts:     sample.TimestampMs,
				Value:  sample.Value,
				Labels: logproto.FromLabelAdaptersToLabels(sample.Labels),
			}
			// agg overwrites the stored exemplar, so the last sample in the window wins.
			rangeAgg.agg(p)
		}
		_ = r.iter.Next()
	}
}
func (r *streamRangeExemplarIterator) At() (int64, []exemplar.QueryResult) {
	if r.exemplars == nil {
		r.exemplars = make([]exemplar.QueryResult, 0, len(r.windowRangeAgg))
	}
	r.exemplars = r.exemplars[:0]
	ts := r.current/1e+6 + r.offset/1e+6
	for lbs, rangeAgg := range r.windowRangeAgg {
		exp := exemplar.Exemplar{
			Labels: rangeAgg.at().Labels,
			Ts:     ts,
		}

		eps := make([]exemplar.Exemplar, 0)
		eps = append(eps, exp)

		r.exemplars = append(r.exemplars, exemplar.QueryResult{
			SeriesLabels: r.metrics[lbs],
			Exemplars:    eps,
		})
	}
	return ts, r.exemplars
}
type ExemplarAgg struct {
	exemplar exemplar.Exemplar
}

func (a *ExemplarAgg) agg(exemplar exemplar.Exemplar) {
	a.exemplar = exemplar
}

func (a *ExemplarAgg) at() exemplar.Exemplar {
	return a.exemplar
}

@github-actions github-actions bot added the type/docs label Dec 21, 2022
Member

@owen-d owen-d left a comment


This is a really cool PR, but I have a few reservations.

  • Performance: Exemplars are likely meant to run alongside queries. How can we ensure this doesn't double query load?
  • Selection: All logs are processed in order to select the last log line as an exemplar (related to performance). I don't think we can accept this approach, especially on large datasets.
  • Maintainability: This PR duplicates most of the read path -- I don't think this is a sustainable pattern.

Considering the downsides, do we need this now? I don't think so. I'd be interested in figuring out a more scalable and sustainable approach to this problem. Perhaps we could extend the QuerySample API to include exemplars, scanning the data only once? It feels bad to abandon such a large and well-constructed PR, and I feel terrible blocking it -- in the future it's worth talking about possible solutions before writing all the code (GitHub issues are great places to do this).
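A rough sketch of the single-scan idea (purely illustrative; the function and types are hypothetical, not from the Loki codebase): compute the range aggregation and capture an exemplar in the same pass over the data, so exemplars add no extra query load.

```go
package main

import "fmt"

type sample struct {
	Ts    int64
	Value float64
	Line  string
}

// countOverTimeWithExemplar scans the window once, returning both the
// aggregate (here a simple count, standing in for count_over_time) and
// the last sample in the window as the exemplar.
func countOverTimeWithExemplar(window []sample) (count int, ex *sample) {
	for i := range window {
		count++
		ex = &window[i] // exemplar updated in the same pass
	}
	return count, ex
}

func main() {
	w := []sample{{1, 0, "a"}, {2, 0, "b"}, {3, 0, "c"}}
	c, ex := countOverTimeWithExemplar(w)
	fmt.Println(c, ex.Line) // prints "3 c"
}
```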

@liguozhong
Contributor Author

I agree with you. This PR took us a week, and the code is difficult to maintain, so it is not suitable for merging into master.
But this PR is quite useful for our company until the master branch implements the exemplar feature. Since we are going to implement logreduce, we will maintain this PR separately in our private branch.

@liguozhong
Contributor Author

I personally favor two separate APIs (/loki/api/v1/query_exemplars + /loki/api/v1/query_range), which is more consistent with Prometheus.
But Loki and Prometheus are indeed very different, so I also agree that obtaining both results in one scan is more cost-effective in terms of performance.

@owen-d
Member

owen-d commented Feb 9, 2023

I'm going to close this PR in the meantime since we don't plan to accept it.

Labels
size/XXL, type/docs
Projects
None yet
Development


4 participants