
[new feature] API: Introduce exemplar api #7974

Closed
wants to merge 18 commits into from

Conversation

liguozhong
Contributor

@liguozhong liguozhong commented Dec 20, 2022

What this PR does / why we need it:
Provides Loki's HTTP API for this feature.

Introduces an exemplar HTTP API (/loki/api/v1/query_exemplars).
Prometheus exemplar docs: https://prometheus.io/docs/prometheus/latest/querying/api/#querying-exemplars

query_range:
http://localhost:3100/loki/api/v1/query_range?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2

VS
query_exemplars:
http://localhost:3100/loki/api/v1/query_exemplars?direction=BACKWARD&limit=1732&query=count_over_time({job="varlogs"}[1m])&start=1671449435000000000&end=1671453036000000000&step=2

Which issue(s) this PR fixes:
Fixes #7876

Special notes for your reviewer:
Why are the API parameters of /loki/api/v1/query_exemplars exactly the same as those of /loki/api/v1/query_range?
Because Loki labels can be generated dynamically during LogQL query evaluation through parsers such as | json and | regexp, unlike the fixed labels of Prometheus. Therefore, Loki's exemplars API cannot use label matchers the way Prometheus does; it should accept the same LogQL parameters as query_range to take full advantage of Loki's schema-on-read design.
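To make that parity concrete, here is a small illustrative Go sketch (not code from this PR; the helper name is hypothetical) showing that the only difference between the two calls is the endpoint path — the LogQL expression, including any parser stages such as `| json`, is passed through unchanged:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL builds a Loki query URL. The same parameter set works for
// both /loki/api/v1/query_range and /loki/api/v1/query_exemplars, because
// the exemplars endpoint in this PR takes a full LogQL expression rather
// than Prometheus-style label matchers.
func buildQueryURL(base, endpoint, logql string, start, end int64, step int) string {
	v := url.Values{}
	v.Set("query", logql) // may include parser stages such as `| json`
	v.Set("start", fmt.Sprintf("%d", start))
	v.Set("end", fmt.Sprintf("%d", end))
	v.Set("step", fmt.Sprintf("%d", step))
	return base + endpoint + "?" + v.Encode()
}

func main() {
	q := `count_over_time({job="varlogs"}[1m])`
	u := buildQueryURL("http://localhost:3100", "/loki/api/v1/query_exemplars",
		q, 1671449435000000000, 1671453036000000000, 2)
	fmt.Println(u)
}
```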

As a result, this PR is relatively long, which makes reviewing it hard work. I am sorry for that.

Why does the test coverage rate drop dramatically?
Because this PR is already large, it only implements a minimal prototype of the exemplar feature; tests and documentation will be completed in a follow-up PR.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@liguozhong liguozhong requested a review from a team as a code owner December 20, 2022 06:12
@liguozhong
Contributor Author

liguozhong commented Dec 20, 2022

HTTP response body:

{
	status: "success",
	data: {
		resultType: "exemplars",
		result: [{
				metric: {
					filename: "/var/log/McAfeeSecurity.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "Dec 15 15:02:12 B-Q6PJMD6M-0154 McAfee: [173]: Info: LogTime: 2022-Dec-15 15:02:12 HealthCheckManager::reviveFM - Starting Up FM with id=6 as an attempt to revive it",
						filename: "/var/log/McAfeeSecurity.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/cloudprintupdate.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "[Info]2022/12/19 20:36:07 main.go:43: Ali cloud print updater check again at 2022-12-19 20:36:07.363764 +0800 CST m=+295201.049795879",
							filename: "/var/log/cloudprintupdate.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "[Info]2022/12/19 20:36:07 main.go:43: Ali cloud print updater check again at 2022-12-19 20:36:07.363764 +0800 CST m=+295201.049795879",
							filename: "/var/log/cloudprintupdate.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					}
				]
			},
			{
				metric: {
					filename: "/var/log/fsck_apfs.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "",
						filename: "/var/log/fsck_apfs.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/fsck_apfs_error.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "fsck_apfs completed at Mon Dec 19 14:20:32 2022",
						filename: "/var/log/fsck_apfs_error.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/fsck_hfs.log",
					job: "varlogs"
				},
				values: [{
					labels: {
						_line: "",
						filename: "/var/log/fsck_hfs.log",
						job: "varlogs"
					},
					value: "0",
					timestamp: 1671452983
				}]
			},
			{
				metric: {
					filename: "/var/log/install.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452987
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452989
					},
					{
						labels: {
							_line: "2022-12-19 19:05:58+08 B-Q6PJMD6M-0154 suhelperd[92167]: Exiting Daemon SUHelperExitCodeNoSenders",
							filename: "/var/log/install.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452991
					}
				]
			},
			{
				metric: {
					filename: "/var/log/system.log",
					job: "varlogs"
				},
				values: [{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452989
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671453011
					},
					{
						labels: {
						_line: "Dec 19 20:31:50 B-Q6PJMD6M-0154 com.apple.xpc.launchd[1] (com.apple.xpc.launchd.domain.user.502): Service \"com.apple.xpc.launchd.unmanaged.loginwindow.203\" tried to register for endpoint \"com.apple.tsm.uiserver\" already registered by owner: com.apple.TextInputMenuAgent",
							filename: "/var/log/system.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671453033
					}
				]
			},
			{
				metric: {
					filename: "/var/log/wifi.log",
					job: "varlogs"
				},
				values: [{
						labels: {
							_line: "Mon Dec 19 19:05:21.749  Mon Dec 19 20:56:13.930 <kernel> postMessage::1349 APPLE80211_M_BSSID_CHANGED received",
							filename: "/var/log/wifi.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452983
					},
					{
						labels: {
							_line: "Mon Dec 19 19:05:21.749  Mon Dec 19 20:56:13.930 <kernel> postMessage::1349 APPLE80211_M_BSSID_CHANGED received",
							filename: "/var/log/wifi.log",
							job: "varlogs"
						},
						value: "0",
						timestamp: 1671452985
					}
				]
			}
		]
	}
}

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0.1%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
- querier/queryrange	-0.1%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@slim-bean
Collaborator

This is very interesting @liguozhong!

I am wondering though if it might be simpler to implement such a feature in just the front end?

If you were to click on a metric graph at any point, couldn't Grafana make a query for the corresponding log line(s) in a lookup query? This won't be perfect, however, because a point on a graph can encompass many log lines.

A related question: how are the exemplars chosen?

@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

-           ingester	-1.8%
+        distributor	0%
-            querier	-3.2%
+ querier/queryrange	0%
-               iter	-24%
-            storage	-7.7%
-           chunkenc	-6%
-              logql	-15.6%
+               loki	0%

@liguozhong
Contributor Author

liguozhong commented Dec 21, 2022

If you were to click on a metric graph at any point couldn't Grafana make a query for the corresponding log line(s) in a lookup query? This won't be perfect however because a point on a graph can encompass many log lines.

@slim-bean, thanks for such a timely review. This idea is great. Compared with the implementation of this PR, it could avoid a lot of S3 query computation. I hadn't thought of this approach before. But maybe we can offer two kinds of exemplar API as options.

Although the code in this PR is bloated, from a mathematical point of view its implementation is the correct answer. It also mirrors the design of Prometheus's exemplars. As far as I know, the exemplar concept was developed by Google's Monarch system after a lot of practice; exemplars are a very good monitoring standard, and Loki can follow that standard too.

I suggest we provide two exemplar APIs for users to choose from. The computational cost of this PR is basically the same as /query_range: it has to issue a very large number of query requests to S3. Even if this PR's implementation is as slow as /query_range, we can later write exemplars into Prometheus or Mimir through recording rules and complete the exemplar linkage from logs to metrics.

@liguozhong
Contributor Author

A related question? How are the exemplars chosen?

The code in range_exemplar.go implements the exemplar selection logic. As the code shows, the last log in the range is selected as the exemplar.

func (r *streamRangeExemplarIterator) load(start, end int64) {
	for lbs, sample, hasNext := r.iter.Peek(); hasNext; lbs, sample, hasNext = r.iter.Peek() {
		if rangeAgg, ok := r.windowRangeAgg[lbs]; ok {
			p := exemplar.Exemplar{
				Ts:     sample.TimestampMs,
				Value:  sample.Value,
				Labels: logproto.FromLabelAdaptersToLabels(sample.Labels),
			}
			// agg overwrites the stored exemplar, so the last sample in the window wins.
			rangeAgg.agg(p)
		}
		_ = r.iter.Next()
	}
}
func (r *streamRangeExemplarIterator) At() (int64, []exemplar.QueryResult) {
	if r.exemplars == nil {
		r.exemplars = make([]exemplar.QueryResult, 0, len(r.windowRangeAgg))
	}
	r.exemplars = r.exemplars[:0]
	ts := r.current/1e+6 + r.offset/1e+6
	for lbs, rangeAgg := range r.windowRangeAgg {
		exp := exemplar.Exemplar{
			Labels: rangeAgg.at().Labels,
			Ts:     ts,
		}

		eps := make([]exemplar.Exemplar, 0)
		eps = append(eps, exp)

		r.exemplars = append(r.exemplars, exemplar.QueryResult{
			SeriesLabels: r.metrics[lbs],
			Exemplars:    eps,
		})
	}
	return ts, r.exemplars
}
type ExemplarAgg struct {
	exemplar exemplar.Exemplar
}

func (a *ExemplarAgg) agg(exemplar exemplar.Exemplar) {
	a.exemplar = exemplar
}

func (a *ExemplarAgg) at() exemplar.Exemplar {
	return a.exemplar
}

@github-actions github-actions bot added the type/docs label Dec 21, 2022
Member

@owen-d owen-d left a comment


This is a really cool PR, but I have a few reservations.

  • Performance: Exemplars are likely meant to run alongside queries. How can we ensure this doesn't double query load?
  • Selection: All logs are processed in order to select the last log line as an exemplar (related to performance). I don't think we can accept this approach, especially on large datasets.
  • Maintainability: This PR duplicates most of the read path -- I don't think this is a sustainable pattern.

Considering the downsides, do we need this now? I don't think so. I'd be interested in figuring out a more scalable and sustainable approach to this problem. Perhaps we could extend the QuerySample API to include exemplars, scanning the data only once? It feels bad to abandon such a large and well-constructed PR, and I feel terrible blocking it -- in the future it's worth talking about possible solutions before writing all the code (GitHub issues are great places to do this).
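A rough sketch of the single-scan idea (purely illustrative; the function and types are hypothetical, not from the Loki codebase): compute the range aggregation and capture an exemplar in the same pass over the data, so exemplars add no extra query load.

```go
package main

import "fmt"

type sample struct {
	Ts    int64
	Value float64
	Line  string
}

// countOverTimeWithExemplar scans the window once, returning both the
// aggregate (here a simple count, standing in for count_over_time) and
// the last sample in the window as the exemplar.
func countOverTimeWithExemplar(window []sample) (count int, ex *sample) {
	for i := range window {
		count++
		ex = &window[i] // exemplar updated in the same pass
	}
	return count, ex
}

func main() {
	w := []sample{{1, 0, "a"}, {2, 0, "b"}, {3, 0, "c"}}
	c, ex := countOverTimeWithExemplar(w)
	fmt.Println(c, ex.Line) // prints "3 c"
}
```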

@liguozhong
Contributor Author

I agree with you. This PR took us a week, and the code is difficult to maintain, so it is not suitable for merging into master.
But this PR is quite useful for our company until the master branch implements the exemplar feature. Since we are going to implement logreduce, we will maintain this PR separately in our private branch.

@liguozhong
Contributor Author

I personally favor two separate APIs (/loki/api/v1/query_exemplars + /loki/api/v1/query_range), which is more consistent with Prometheus.
But Loki and Prometheus are indeed very different, so I also agree that obtaining both results in one scan is more cost-effective in terms of performance.

@owen-d
Member

owen-d commented Feb 9, 2023

I'm going to close this PR in the meantime since we don't plan to accept it.

Labels
size/XXL, type/docs
Projects
None yet
Development


4 participants