NETOBSERV-740: Metrics integration - console plugin frontend #516

Merged
merged 22 commits into from
May 27, 2024

Conversation

jpinsonneau
Contributor

@jpinsonneau jpinsonneau commented Apr 16, 2024

Description

Add data source to query option dropdown

Dependencies

Based on #513

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).


openshift-ci bot commented Apr 16, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

jotak and others added 12 commits May 10, 2024 08:42
This introduces Prometheus as a new datasource, which can be used either
as a replacement for Loki or as a complement to it. It doesn't require
changing the frontend/backend API: on every frontend query, the backend
checks whether that query is transposable to Prometheus and, if so, runs
it on Prometheus; otherwise it falls back to Loki (if enabled).

There's some refactoring of the config and topology handlers to make
room for Prometheus.
Do this by probing with the existing filter encoder function.
Also, merge all "allowXXX" scope props into a single "allowedScopes"
prop
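
The datasource selection described in the first commit message above can be pictured roughly as follows. This is a minimal sketch, not the plugin's actual code: the `metricDef` type, the `pickDataSource` function and the metric name used in the usage note are hypothetical, and it assumes a query counts as "transposable" when some known metric exposes every label the query filters or groups on.

```go
package main

// metricDef is a hypothetical description of a Prometheus metric:
// its name and the labels it exposes.
type metricDef struct {
	Name   string
	Labels []string
}

// hasAll reports whether every wanted label is present in labels.
func hasAll(labels, wanted []string) bool {
	set := make(map[string]struct{}, len(labels))
	for _, l := range labels {
		set[l] = struct{}{}
	}
	for _, w := range wanted {
		if _, ok := set[w]; !ok {
			return false
		}
	}
	return true
}

// pickDataSource returns "prom" and the metric to use when some metric
// covers all labels needed by the query. Otherwise it returns "loki" if
// Loki is enabled, or an empty datasource when nothing can serve the query.
func pickDataSource(metrics []metricDef, neededLabels []string, lokiEnabled bool) (ds, metric string) {
	for _, m := range metrics {
		if hasAll(m.Labels, neededLabels) {
			return "prom", m.Name
		}
	}
	if lokiEnabled {
		return "loki", ""
	}
	return "", ""
}
```

For instance, with a single metric `example_namespace_flows_total` labelled by source and destination namespace, a query needing only the source namespace would be served by Prometheus, while a query needing a node label would go to Loki, or to no datasource at all if Loki is disabled.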
@jotak
Member

jotak commented May 10, 2024

Added 2 commits:

  • one to avoid requiring a new config that binds filters to fields, mainly because such a binding already exists, although it's done more implicitly. In fact, at first I was getting errors with this PR, as selecting Prometheus as a datasource was breaking the display; it turned out that these fields-filters bindings were missing. I guess the intent was to create a new PR on the operator to add them, but in the end I preferred to remove the requirement for these explicit bindings, as we have another way to find them. IMO it's better not to define them in two places, which could be error-prone.
  • the second just adapts the display of the available groups to the allowed scopes

@jotak
Member

jotak commented May 10, 2024

A few thoughts, IMO no blockers, perhaps that's for the next release, if we don't find time here:

  • New JIRA https://issues.redhat.com/browse/NETOBSERV-1643 to improve error handling & UX
  • Detection of invalid queries could be improved (perhaps on both backend+frontend) to tell things like: "given the desired filters, it's not possible to group by X" (e.g. if a user filters on a given namespace then we have no metric to group by node)
  • The metrics eligible for use are currently limited to the predefined metrics. E.g. there's no zone-based topology possible, because of that. We might open eligible metrics to custom ones.

@jotak jotak marked this pull request as ready for review May 10, 2024 12:22
@codecov-commenter

codecov-commenter commented May 10, 2024

Codecov Report

Attention: Patch coverage is 54.52696% with 447 lines in your changes are missing coverage. Please review.

Project coverage is 56.71%. Comparing base (994ed1e) to head (3f35b6c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #516      +/-   ##
==========================================
- Coverage   57.27%   56.71%   -0.56%     
==========================================
  Files         169      174       +5     
  Lines        8487     8990     +503     
  Branches     1160     1179      +19     
==========================================
+ Hits         4861     5099     +238     
- Misses       3299     3528     +229     
- Partials      327      363      +36     
Flag Coverage Δ
uitests 57.85% <58.25%> (-0.08%) ⬇️
unittests 53.45% <54.09%> (-1.56%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
pkg/config/loki.go 55.55% <ø> (ø)
pkg/handler/lokiclientmock/loki_client_mock.go 0.00% <ø> (ø)
pkg/httpclient/http_client.go 43.90% <100.00%> (+4.42%) ⬆️
pkg/model/fields/fields.go 88.88% <ø> (ø)
pkg/model/loki.go 38.66% <ø> (ø)
pkg/server/server.go 65.51% <100.00%> (ø)
pkg/utils/utils.go 77.77% <ø> (+2.77%) ⬆️
web/src/api/loki.ts 85.71% <ø> (ø)
web/src/components/__tests-data__/flows.ts 100.00% <ø> (ø)
web/src/components/dropdowns/group-dropdown.tsx 71.42% <100.00%> (-0.80%) ⬇️
... and 40 more

... and 1 file with indirect coverage changes

- Manage FlowDirection: first check if the query can be performed using a
  direction-agnostic metric (Any); else, check both the Ingress and
  Egress metrics, and combine them in PromQL with OR
- Handle "getNamesForPrefix" with Prometheus
- Split "getTopologyFlows" into 2 parts to reduce cyclomatic complexity
- Improve error messages when a query can't be performed using Prometheus
- Add tests
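
The FlowDirection handling in the first bullet could look something like the sketch below. It is only an illustration under stated assumptions: `directionMetrics`, `rateExpr` and `buildDirectionQuery` are hypothetical names, the query shape (a summed rate grouped by labels) is assumed, and the sketch requires both an Ingress and an Egress metric before combining them.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// directionMetrics is a hypothetical lookup of metric names per flow
// direction. An empty string means no metric exists for that direction.
type directionMetrics struct {
	Any, Ingress, Egress string
}

// rateExpr builds a PromQL rate expression for one metric, grouped by labels.
func rateExpr(metric, window string, by []string) string {
	return fmt.Sprintf("sum by (%s) (rate(%s[%s]))", strings.Join(by, ", "), metric, window)
}

// buildDirectionQuery prefers the direction-agnostic metric. Failing that,
// it combines the Ingress and Egress metrics with the PromQL "or" operator,
// which keeps every series from the left side plus the right-side series
// whose label sets are not already present on the left.
func buildDirectionQuery(m directionMetrics, window string, by []string) (string, error) {
	if m.Any != "" {
		return rateExpr(m.Any, window, by), nil
	}
	if m.Ingress != "" && m.Egress != "" {
		return fmt.Sprintf("(%s) or (%s)",
			rateExpr(m.Ingress, window, by), rateExpr(m.Egress, window, by)), nil
	}
	return "", errors.New("no metric available for the requested direction")
}
```

Because "or" matches on label sets, a flow reported by both sides appears once (from the Ingress side) rather than being summed, which is the behaviour this sketch assumes is wanted.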
@memodi
Contributor

memodi commented May 17, 2024

@memodi do you have a cluster up that I can use to investigate?

unfortunately, no. But I did manage to capture the API response to the Prometheus query: it's a datapoint with timestamp 1715875868.586. Like I said, I only observed this in a one-off instance when I let it run over a long period, and I am not sure it's reproducible if I ran the same test again. However, I would like to understand why it happens and whether there's a chance of this getting amplified with large workloads.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 17, 2024
@jotak
Member

jotak commented May 17, 2024

unfortunately, no. But I did manage to capture the API response to the Prometheus query: it's a datapoint with timestamp 1715875868.586. Like I said, I only observed this in a one-off instance when I let it run over a long period, and I am not sure it's reproducible if I ran the same test again. However, I would like to understand why it happens and whether there's a chance of this getting amplified with large workloads.

@memodi I don't have much idea on what could cause that :( We've seen doubled rates before, when there was a double-configured prometheus (user workload + cluster monitoring) .. any chance it could be that? (like, having set both of them for a short period of time, before removing one)

@memodi
Contributor

memodi commented May 17, 2024

any chance it could be that

ah, there's definitely that chance, because we use this script: https://gitlab.cee.redhat.com/netobserv-qe/netobserv-qe-scripts/-/blob/main/netobserv.sh?ref_type=heads#L309 which enables user workload monitoring. I think I didn't see metrics being populated (probably not fast enough), so I patched the CSV with DOWNSTREAM=true

@memodi
Contributor

memodi commented May 17, 2024

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 17, 2024

New image:
quay.io/netobserv/network-observability-console-plugin:9794230

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=9794230 make set-plugin-image

In a previous implementation, metric names came with their "_bucket"
suffix for histograms; this is no longer the case, but the query
generation wasn't updated accordingly
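
For context, a Prometheus histogram exposes its buckets as `<name>_bucket` series carrying an `le` label, and `histogram_quantile` expects those series. When only the base metric name is known, the query generator has to append the suffix itself. A rough sketch of that idea follows; `quantileQuery` and its parameters are hypothetical and do not reproduce the plugin's query builder.

```go
package main

import (
	"fmt"
	"strings"
)

// quantileQuery builds a PromQL percentile query from a histogram name.
// The "_bucket" suffix is appended when missing, and "le" is always kept
// in the grouping because histogram_quantile needs the bucket boundaries.
func quantileQuery(name string, quantile float64, window string, by []string) string {
	if !strings.HasSuffix(name, "_bucket") {
		name += "_bucket"
	}
	groups := append([]string{"le"}, by...)
	return fmt.Sprintf("histogram_quantile(%g, sum by (%s) (rate(%s[%s])))",
		quantile, strings.Join(groups, ", "), name, window)
}
```

Accepting both forms of the name is a defensive choice made for this sketch, so that a caller still holding suffixed names would not end up with a doubled suffix.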
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 21, 2024
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label May 21, 2024
@jotak
Member

jotak commented May 21, 2024

@memodi new build fixes an issue with percentiles (stats on RTT or DNS latency)


New image:
quay.io/netobserv/network-observability-console-plugin:1b91ab5

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1b91ab5 make set-plugin-image

// Unfortunately we don't know a safe and generic way to pre-flight check if the user will be authorized
hlog.Info("Retrying with Loki...")
flows, code, err = h.getTopologyFlows(ctx, clients, params, constants.DataSourceLoki)
}
Contributor

Why not use something like wait.PollUntilContextTimeout to implement a proper retry instead of retrying just once?

Member

@jotak jotak May 22, 2024

It's not a "retry until something" kind of thing, but replaying the same query while switching to the other datasource
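
To make the distinction concrete, here is a minimal sketch of a single replay on the other datasource, as opposed to a polling retry. The names `queryFunc` and `fetchWithFallback` are hypothetical; the sketch assumes, as the excerpt above suggests, that the query function returns an HTTP status code alongside its result and error, and that the replay is triggered by a 401 or 403 from Prometheus.

```go
package main

import "net/http"

// queryFunc is a hypothetical query runner: it executes one query against
// the named datasource and returns the result, an HTTP status code and an error.
type queryFunc func(dataSource string) (result string, code int, err error)

// fetchWithFallback runs the query on Prometheus first. If Prometheus
// rejects the user (401 or 403) and Loki is enabled, the same query is
// replayed exactly once on Loki. Any other outcome is returned as is,
// so there is no loop and no waiting.
func fetchWithFallback(run queryFunc, lokiEnabled bool) (string, int, error) {
	res, code, err := run("prom")
	denied := code == http.StatusUnauthorized || code == http.StatusForbidden
	if denied && lokiEnabled {
		return run("loki")
	}
	return res, code, err
}
```

A polling helper would re-issue the same request to the same backend until it succeeds or times out, which would not help here since the authorization outcome is not expected to change between attempts.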

@memodi
Contributor

memodi commented May 21, 2024

@jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording.
Environment I was using:

OCP: 4.16.0-0.nightly-2024-05-21-043355
NetObserv operator: v0.0.0-3e9c83b
Loki: v5.9.2
eBPF-agent: main
FLP: main
ConsolePlugin: 1b91ab5

@jotak
Member

jotak commented May 22, 2024

@jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording. Environment I was using:

Oh sure, that's because I didn't implement datasource memorization. I fixed NETOBSERV-1652 the other way around, i.e. by replaying the query on Loki when I get a 401 or 403 error from Prometheus, which should be enough to fix the UX issue.

What are the other settings that are not memorized? I'm not sure every setting has to be memorized; I think it's on purpose that we keep only the ones we consider most important... perhaps that will be something to check with @jpinsonneau when he's back

@msherif1234
Contributor

/lgtm

@msherif1234
Contributor

/test plugin-cypress

@memodi
Contributor

memodi commented May 22, 2024

@jotak - it doesn't look like NETOBSERV-1652 is fixed, and I see another issue with the topology view where filters are not remembered; take a look at the recording. Environment I was using:

Oh sure, that's because I didn't implement datasource memorization. I fixed NETOBSERV-1652 the other way around, i.e. by replaying the query on Loki when I get a 401 or 403 error from Prometheus, which should be enough to fix the UX issue.

What are the other settings that are not memorized? I'm not sure every setting has to be memorized; I think it's on purpose that we keep only the ones we consider most important... perhaps that will be something to check with @jpinsonneau when he's back

it's the filters: in the recording at 0:19, when I come back to the page, the topology view shows all traffic, including the infra traffic, even though the "app" filter is already added. I didn't mean to suggest caching everything, but this seems like a regression. I think it only happens when the datasource is Loki, or when the Prometheus query returns 401/403 and Loki is used as a fallback.

What's the other settings that are not memorized?

@memodi
Contributor

memodi commented May 22, 2024

/retest

@jotak
Member

jotak commented May 23, 2024

it's the filters: in the recording at 0:19, when I come back to the page, the topology view shows all traffic, including the infra traffic, even though the "app" filter is already added. I didn't mean to suggest caching everything, but this seems like a regression. I think it only happens when the datasource is Loki, or when the Prometheus query returns 401/403 and Loki is used as a fallback.

I don't think this is related to this PR. I think this is because, when opening the topology (after refreshing or coming from another page), a first query is made with no filters and is immediately replaced with another query with filters, cf. screenshot:
[screenshot: two subsequent queries on the metrics endpoint]

And that's not from this PR, this is the same on main. The second query is the one with correct filters and is supposed to hide results of the first one but I guess in some rare cases the first arrives too late and hides the correct one. I would say this is a bug to open on main but not for this PR. Ideally we should only run a single query, not two.

I think this bug is also what makes the topology sort of "blinking" on first load, I've noticed that.

BTW is that something you can frequently reproduce? On my side I can't reproduce it, even though I think I get the root cause
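
As an aside, one common way to guard against a late response overwriting a newer one is to tag each request with a sequence number and drop any result that is no longer the latest. This is not what the PR does, and running a single query remains the cleaner fix suggested above; the sketch below only illustrates the "latest wins" idea. The `latestOnly` type is hypothetical, and although the plugin frontend is TypeScript, the idea is language-agnostic and is shown here in Go.

```go
package main

import "sync"

// latestOnly keeps only the result of the most recently issued request.
type latestOnly struct {
	mu      sync.Mutex
	lastSeq int
	value   string
}

// begin registers a new request and returns its sequence number.
func (l *latestOnly) begin() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lastSeq++
	return l.lastSeq
}

// complete stores the result only if seq still identifies the latest
// request, and reports whether the result was accepted.
func (l *latestOnly) complete(seq int, result string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if seq != l.lastSeq {
		return false
	}
	l.value = result
	return true
}
```

With this guard, if the unfiltered query is issued first and the filtered one second, the unfiltered response is discarded whenever it arrives, even if it arrives last.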

@memodi
Contributor

memodi commented May 23, 2024

And that's not from this PR, this is the same on main.

you're right, I confirm the issue exists on main too. I filed a bug: https://issues.redhat.com/browse/NETOBSERV-1661 and I think this is good to go.

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label May 23, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented May 23, 2024

@jpinsonneau: This pull request references NETOBSERV-740 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member

jotak commented May 27, 2024

/approve


openshift-ci bot commented May 27, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit fade3b6 into netobserv:main May 27, 2024
13 checks passed
Labels
approved jira/valid-reference lgtm ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. qe-approved QE has approved this pull request

7 participants