NETOBSERV-740: Metrics integration - console plugin frontend #516

jpinsonneau · 2024-04-16T15:15:00Z

Description

Add data source to query option dropdown

Dependencies

Based on #513

Operator: NETOBSERV-739: Add prometheus network-observability-operator#613

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci · 2024-04-16T15:15:05Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

This introduces Prometheus as a new datasource that can be used either as a replacement for Loki or as a complement to it. It doesn't require changing the frontend/backend API: on every frontend query, the backend checks whether that query is transposable to Prometheus and, if so, runs it on Prometheus; otherwise it falls back on Loki (if it's enabled). There's some refactoring of the config and topology handlers to make room for Prometheus.
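The selection logic described above can be sketched roughly as follows. This is only an illustration, not the PR's code: the `Query` type, the `promLabels` set, the `pickDataSource` function and the datasource constants are all hypothetical names, and the real check relies on the configured metric definitions.

```go
package main

import "errors"

// DataSource identifies a query backend (illustrative values, not the plugin's constants).
type DataSource string

const (
	Prometheus DataSource = "prom"
	Loki       DataSource = "loki"
)

// ErrNoDataSource is returned when neither backend can serve the query.
var ErrNoDataSource = errors.New("query is not transposable to Prometheus and Loki is disabled")

// Query is a hypothetical, simplified view of a frontend query.
type Query struct {
	Filters []string // label names used for filtering
	GroupBy []string // label names used for aggregation
}

// promLabels is an assumed set of labels available on the predefined metrics.
var promLabels = map[string]bool{
	"SrcK8S_Namespace": true,
	"DstK8S_Namespace": true,
}

// transposable reports whether every label the query needs exists on a metric.
func transposable(q Query) bool {
	needed := append(append([]string{}, q.Filters...), q.GroupBy...)
	for _, label := range needed {
		if !promLabels[label] {
			return false
		}
	}
	return true
}

// pickDataSource prefers Prometheus and falls back on Loki when it is enabled.
func pickDataSource(q Query, promEnabled, lokiEnabled bool) (DataSource, error) {
	if promEnabled && transposable(q) {
		return Prometheus, nil
	}
	if lokiEnabled {
		return Loki, nil
	}
	return "", ErrNoDataSource
}
```

With both datasources enabled, a query filtering on `SrcK8S_Namespace` would go to Prometheus in this sketch, while one filtering on a label that no metric carries would go to Loki.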


jotak · 2024-05-10T12:13:10Z

Added 2 commits:

one to avoid requiring a new config that binds filters to fields. Such a binding already exists, although it's done more implicitly. In fact, at first I was getting errors with this PR, as selecting Prometheus as a datasource was breaking the display; it turned out these fields-filters bindings were missing. I guess the intent was to create a new PR on the operator to add them, but in the end I preferred to remove the requirement for these explicit bindings: we have another way to find them, and IMO it's better not to define them in two places, which could be error-prone.
the second just adapts the display of the available groups to the allowed scopes

jotak · 2024-05-10T12:21:35Z

A few thoughts, IMO no blockers, perhaps that's for the next release, if we don't find time here:

New JIRA https://issues.redhat.com/browse/NETOBSERV-1643 to improve error handling & UX
Detection of invalid queries could be improved (perhaps on both backend+frontend) to tell things like: "given the desired filters, it's not possible to group by X" (e.g. if a user filters on a given namespace then we have no metric to group by node)
The metrics eligible for use are currently limited to the predefined metrics. E.g. there's no zone-based topology possible, because of that. We might open eligible metrics to custom ones.

web/src/components/netflow-traffic.tsx

codecov-commenter · 2024-05-10T15:35:07Z

Codecov Report

Attention: Patch coverage is 54.52696% with 447 lines in your changes are missing coverage. Please review.

Project coverage is 56.71%. Comparing base (994ed1e) to head (3f35b6c).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #516      +/-   ##
==========================================
- Coverage   57.27%   56.71%   -0.56%     
==========================================
  Files         169      174       +5     
  Lines        8487     8990     +503     
  Branches     1160     1179      +19     
==========================================
+ Hits         4861     5099     +238     
- Misses       3299     3528     +229     
- Partials      327      363      +36

| Flag | Coverage Δ | |
|---|---|---|
| uitests | `57.85% <58.25%> (-0.08%)` | ⬇️ |
| unittests | `53.45% <54.09%> (-1.56%)` | ⬇️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ | |
|---|---|---|
| pkg/config/loki.go | `55.55% <ø> (ø)` | |
| pkg/handler/lokiclientmock/loki_client_mock.go | `0.00% <ø> (ø)` | |
| pkg/httpclient/http_client.go | `43.90% <100.00%> (+4.42%)` | ⬆️ |
| pkg/model/fields/fields.go | `88.88% <ø> (ø)` | |
| pkg/model/loki.go | `38.66% <ø> (ø)` | |
| pkg/server/server.go | `65.51% <100.00%> (ø)` | |
| pkg/utils/utils.go | `77.77% <ø> (+2.77%)` | ⬆️ |
| web/src/api/loki.ts | `85.71% <ø> (ø)` | |
| web/src/components/__tests-data__/flows.ts | `100.00% <ø> (ø)` | |
| web/src/components/dropdowns/group-dropdown.tsx | `71.42% <100.00%> (-0.80%)` | ⬇️ |

... and 40 more

... and 1 file with indirect coverage changes

- Manage FlowDirection: first check if the query can be performed using a direction-agnostic metric (Any); else, check both the Ingress and Egress metrics, and combine them in PromQL with OR
- Handle "getNamesForPrefix" with prom
- Split "getTopologyFlows" in 2 parts to reduce cyclomatic complexity
- Improve error messages when a query can't be performed using Prometheus
- Add tests
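As a rough sketch of the FlowDirection handling described in the first item: prefer a direction-agnostic metric, otherwise combine the Ingress and Egress metrics with the PromQL `or` operator. The `Metric` type, the function names and the fixed 1-minute rate window are assumptions made for illustration, and this sketch assumes both directional metrics must be present.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Metric is a hypothetical description of a predefined metric.
type Metric struct {
	Name      string
	Direction string // "Any", "Ingress" or "Egress"
}

// rateBy builds a per-second rate over an assumed 1-minute window, grouped by labels.
func rateBy(name string, groupBy []string) string {
	return fmt.Sprintf("sum by (%s) (rate(%s[1m]))", strings.Join(groupBy, ","), name)
}

// findMetric returns the first metric matching the given direction.
func findMetric(metrics []Metric, direction string) (Metric, bool) {
	for _, m := range metrics {
		if m.Direction == direction {
			return m, true
		}
	}
	return Metric{}, false
}

// buildRateQuery prefers a direction-agnostic metric; otherwise it combines
// the Ingress and Egress metrics with the PromQL "or" operator.
func buildRateQuery(metrics []Metric, groupBy []string) (string, error) {
	if m, ok := findMetric(metrics, "Any"); ok {
		return rateBy(m.Name, groupBy), nil
	}
	ingress, okIn := findMetric(metrics, "Ingress")
	egress, okEg := findMetric(metrics, "Egress")
	if okIn && okEg {
		return rateBy(ingress.Name, groupBy) + " or " + rateBy(egress.Name, groupBy), nil
	}
	return "", errors.New("no metric covers the requested flow direction")
}
```

Note that PromQL's `or` keeps every series from the left-hand side and adds only those right-hand series whose label sets have no match on the left, so the combined result is a union rather than a sum.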

memodi · 2024-05-17T14:08:06Z

> @memodi do you have a cluster up that I can use to investigate?

Unfortunately, no. But I did manage to capture the API response to the Prometheus query: it's a datapoint with timestamp 1715875868.586. Like I said, I only observed this in one one-off instance, when I let it run over a long period, and I am not sure it's reproducible if I run the same test again. However, I would like to understand why it happens and whether there's a chance of this getting amplified in large workloads.

jotak · 2024-05-17T14:27:04Z

> Unfortunately, no. But I did manage to capture the API response to the Prometheus query: it's a datapoint with timestamp 1715875868.586. Like I said, I only observed this in one one-off instance, when I let it run over a long period, and I am not sure it's reproducible if I run the same test again. However, I would like to understand why it happens and whether there's a chance of this getting amplified in large workloads.

@memodi I don't have much of an idea what could cause that :( We've seen doubled rates before, when Prometheus was double-configured (user workload + cluster monitoring)... Any chance it could be that? (Like having both of them set for a short period of time, before removing one.)

memodi · 2024-05-17T15:11:40Z

> any chance it could be that

Ah, there's definitely that chance, because we use this script: https://gitlab.cee.redhat.com/netobserv-qe/netobserv-qe-scripts/-/blob/main/netobserv.sh?ref_type=heads#L309 which enables user workload monitoring. But I think I didn't see metrics being populated (or probably not fast enough), so I patched the CSV with DOWNSTREAM=true.

memodi · 2024-05-17T15:28:01Z

/ok-to-test

github-actions · 2024-05-17T15:31:23Z

New image:
quay.io/netobserv/network-observability-console-plugin:9794230

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=9794230 make set-plugin-image


jotak · 2024-05-21T10:54:29Z

@memodi new build fixes an issue with percentiles (stats on RTT or DNS latency)
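For context on the percentile fix: Prometheus histograms expose their buckets as series named `<metric>_bucket` with an `le` label, and percentiles are computed with `histogram_quantile`. A minimal sketch of query generation that takes the metric name without the suffix and appends it could look like this. The function name and parameters are hypothetical, not the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// percentileQuery builds a histogram_quantile expression. The metric name is
// expected WITHOUT the "_bucket" suffix; the suffix is appended here.
// The "le" label must be kept in the aggregation for histogram_quantile to work.
func percentileQuery(metric string, quantile float64, groupBy []string, window string) string {
	by := append([]string{"le"}, groupBy...)
	return fmt.Sprintf(
		"histogram_quantile(%g, sum by (%s) (rate(%s_bucket[%s])))",
		quantile, strings.Join(by, ","), metric, window,
	)
}
```

For example, calling `percentileQuery("my_rtt_seconds", 0.99, []string{"SrcK8S_Namespace"}, "2m")` with a made-up metric name yields a query over `my_rtt_seconds_bucket`, grouped by `le` and `SrcK8S_Namespace`.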

github-actions · 2024-05-21T10:56:30Z

New image:
quay.io/netobserv/network-observability-console-plugin:1b91ab5

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1b91ab5 make set-plugin-image

msherif1234 · 2024-05-21T15:28:59Z

pkg/handler/topology.go

+			// Unfortunately we don't know a safe and generic way to pre-flight check if the user will be authorized
+			hlog.Info("Retrying with Loki...")
+			flows, code, err = h.getTopologyFlows(ctx, clients, params, constants.DataSourceLoki)
+		}


Why not use something like wait.PollUntilContextTimeout to implement a proper retry instead of retrying just once?

It's not a "retry until something" kind of thing; it's replaying the same query while switching to the other datasource.

memodi · 2024-05-21T19:19:51Z

@jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording.
Environment I was using:

OCP: 4.16.0-0.nightly-2024-05-21-043355
NetObserv operator: v0.0.0-3e9c83b
Loki: v5.9.2
eBPF-agent: main
FLP: main
ConsolePlugin: 1b91ab5

jotak · 2024-05-22T06:57:58Z

> @jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording. Environment I was using:

Oh sure, that's because I didn't implement datasource memorization. I fixed NETOBSERV-1652 the other way around, i.e. by replaying the query against Loki when I get a 401 or 403 error from Prometheus, which should be enough to fix the UX issue.

What are the other settings that are not memorized? I'm not sure every setting has to be memorized; I think it's on purpose that we keep only the ones we consider most important... Perhaps that will be something to check with @jpinsonneau when he's back.

msherif1234 · 2024-05-22T10:19:21Z

/lgtm

msherif1234 · 2024-05-22T11:13:27Z

/test plugin-cypress

memodi · 2024-05-22T15:46:08Z

> > @jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording. Environment I was using:
>
> Oh sure, that's because I didn't implement datasource memorization. I fixed NETOBSERV-1652 the other way around, i.e. by replaying the query against Loki when I get a 401 or 403 error from Prometheus, which should be enough to fix the UX issue.
>
> What are the other settings that are not memorized? I'm not sure every setting has to be memorized; I think it's on purpose that we keep only the ones we consider most important... Perhaps that will be something to check with @jpinsonneau when he's back.

> What are the other settings that are not memorized?

It's the filters: in the recording, at 0:19, when I come back to the page, the topology view shows all traffic, including the infra traffic, whereas an "app" filter is already applied. I didn't mean to suggest caching everything, but this seems like a regression. I think it only happens when the datasource is Loki, or when the Prometheus query returns 401/403 and Loki is used as a fallback.

memodi · 2024-05-22T16:21:25Z

/retest

jotak · 2024-05-23T08:32:55Z

> It's the filters: in the recording, at 0:19, when I come back to the page, the topology view shows all traffic, including the infra traffic, whereas an "app" filter is already applied. I didn't mean to suggest caching everything, but this seems like a regression. I think it only happens when the datasource is Loki, or when the Prometheus query returns 401/403 and Loki is used as a fallback.

I don't think this is related to this PR. I think it happens because, when opening the topology (after refreshing or coming from another page), a first query is made with no filters and is immediately replaced with another query with filters; cf. screenshot (two subsequent queries on the metrics endpoint):

And that's not from this PR, this is the same on main. The second query is the one with the correct filters and is supposed to hide the results of the first one, but I guess in some rare cases the first one arrives too late and hides the correct one. I would say this is a bug to open on main, but not for this PR. Ideally we should only run a single query, not two.

I think this bug is also what makes the topology sort of "blink" on first load; I've noticed that.

BTW, is that something you can frequently reproduce? On my side I can't reproduce it, even though I think I get the root cause.

memodi · 2024-05-23T16:42:26Z

> And that's not from this PR, this is the same on main.

You're right, I confirm the issue exists on main too. I filed a bug: https://issues.redhat.com/browse/NETOBSERV-1661, and I think this is good to go.

/label qe-approved

openshift-ci-robot · 2024-05-23T16:42:31Z

jotak · 2024-05-27T07:04:10Z

/approve

openshift-ci · 2024-05-27T07:04:20Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jotak]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label Apr 16, 2024

openshift-merge-robot added the needs-rebase label Apr 16, 2024

jpinsonneau force-pushed the 740 branch from 7fa20d4 to 551ae05 April 17, 2024 11:08

jpinsonneau force-pushed the 740 branch from 872c92f to ae999b1 April 25, 2024 13:58

jotak self-assigned this May 7, 2024

jotak and others added 12 commits May 10, 2024 08:42

Add prom client vendor

7243b10

Manage DnsFlagsResponseCode

f1c861d

Use explicit metrics defs; better error message

9147041

Use explicit metrics config; handle prom labels values

26db7d9

return datasources in query result

8af6113

add datasource param & fix tests

6121085

remove disabled check

31aa7a1

manage datasources

6350c7a

hide filters not available

45fe9bc

Detect available filters without new config

644878a

Do this by probing with the existing filter encoder function

Do not show invalid groups

04d45b6

Also, merge all "allowXXX" scope props into a single "allowedScopes" prop

jotak force-pushed the 740 branch from ae999b1 to 04d45b6 May 10, 2024 12:08

openshift-merge-robot removed the needs-rebase label May 10, 2024

jotak marked this pull request as ready for review May 10, 2024 12:22

openshift-ci bot removed the do-not-merge/work-in-progress label May 10, 2024

jotak reviewed May 10, 2024

web/src/components/netflow-traffic.tsx (resolved)

jotak added 3 commits May 10, 2024 16:47

fix front lint

3515413

Fix tab tooltip not showing

a37d06d

run prettier

8889828

NETOBSERV-1652: retry with Loki on prom 401/403

7122c09

github-actions bot removed the ok-to-test label May 17, 2024

openshift-ci bot added the ok-to-test label May 17, 2024

Fix percentile queries, metric name has no suffix

3f35b6c

In a previous implementation, metric names came with their "_bucket" suffix for histograms; this is no longer the case, but the query generation wasn't updated accordingly

github-actions bot removed the ok-to-test label May 21, 2024

jotak added the ok-to-test label May 21, 2024

msherif1234 reviewed May 21, 2024

openshift-ci bot assigned msherif1234 May 22, 2024

openshift-ci bot added the lgtm label May 22, 2024

openshift-ci bot added the qe-approved label May 23, 2024

openshift-ci bot added the approved label May 27, 2024

openshift-merge-bot bot merged commit fade3b6 into netobserv:main May 27, 2024
13 checks passed
