NETOBSERV-739: Add prometheus #613

jotak · 2024-04-10T16:20:54Z

Description

Adds a new config section to the CRD to allow reading metrics via Prometheus (or Thanos):

Auto mode (when used in OpenShift, automatically uses the monitoring stack's Thanos):

  prometheus:
    querier:
      enable: true
      mode: Auto
      timeout: 30s

Manual mode: to manually configure the Prometheus endpoint, TLS, etc.

This is enabled by default.

Dependencies

NETOBSERV-740: Metrics integration - console plugin frontend network-observability-console-plugin#516

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci · 2024-04-10T16:21:00Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2024-04-10T16:21:02Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-04-10T16:21:18Z

openshift-ci-robot · 2024-04-11T12:26:32Z

github-actions · 2024-04-11T12:38:33Z

New images:

quay.io/netobserv/network-observability-operator:dd87eac
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-dd87eac
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-dd87eac

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:dd87eac make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-dd87eac

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-dd87eac
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

jpinsonneau · 2024-04-16T10:24:57Z

apis/flowcollector/v1beta2/flowcollector_types.go

+	// `prometheusClient` defines Prometheus client settings, used to fetch metrics from the Console plugin.
+	PrometheusClient FlowCollectorPrometheusClient `json:"prometheusClient,omitempty"`
+


Suggested change

-	// `prometheusClient` defines Prometheus client settings, used to fetch metrics from the Console plugin.
-	PrometheusClient FlowCollectorPrometheusClient `json:"prometheusClient,omitempty"`
+	// `prometheus` defines Prometheus client settings, used to fetch metrics from the Console plugin.
+	Prometheus FlowCollectorPrometheus `json:"prometheus,omitempty"`

For consistency, I feel it would be better to simply have prometheus, the same as loki.

That could also contain FLP prometheus settings in the future.

idk, I added "client" precisely to avoid confusion with the metrics-producing side.
For instance, if we use prometheus, what would the user expect the setting prometheus.enable to do? Likely that it turns off metrics generation, which is not the case.
Or should we come up with something like:

prometheus:
  querier:
    <insert all settings>

?

That makes sense indeed. We should also eventually allow users to disable prometheus export from FLP.

the current way to do this would be to set an empty includeList of metrics
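As an illustration of the nested layout discussed in this thread, here is a minimal Go sketch of what a `prometheus.querier` section could look like. The type names `Prometheus` and `PrometheusQuerier` and the helper `toJSON` are hypothetical and not taken from the operator's code; only the field names `enable`, `mode` and `timeout` come from the PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PrometheusQuerier is a hypothetical stand-in for the querier settings.
// Field names mirror the YAML shown in the PR description.
type PrometheusQuerier struct {
	Enable  *bool  `json:"enable,omitempty"`
	Mode    string `json:"mode,omitempty"`
	Timeout string `json:"timeout,omitempty"`
}

// Prometheus is a hypothetical wrapper that nests the querier settings,
// leaving room for producer-side settings later.
type Prometheus struct {
	Querier PrometheusQuerier `json:"querier"`
}

// toJSON serializes the settings so the nesting can be inspected.
func toJSON(p Prometheus) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	enabled := true
	out, err := toJSON(Prometheus{Querier: PrometheusQuerier{
		Enable:  &enabled,
		Mode:    "Auto",
		Timeout: "30s",
	}})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Nesting under `querier` keeps the setting name unambiguous: `prometheus.querier.enable` cannot be read as switching off metrics generation.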

github-actions · 2024-04-16T11:18:35Z

New images:

quay.io/netobserv/network-observability-operator:a7cd60f
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-a7cd60f
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-a7cd60f

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:a7cd60f make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-a7cd60f

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-a7cd60f
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m


codecov-commenter · 2024-04-25T16:50:39Z

Codecov Report

Attention: Patch coverage is 51.19048%, with 123 lines in your changes missing coverage. Please review.

Project coverage is 66.53%. Comparing base (18f3da6) to head (daf0382).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
- Coverage   67.00%   66.53%   -0.47%     
==========================================
  Files          68       70       +2     
  Lines        7804     8095     +291     
==========================================
+ Hits         5229     5386     +157     
- Misses       2197     2315     +118     
- Partials      378      394      +16

Flag	Coverage Δ
unittests	`66.53% <51.19%> (-0.47%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
apis/flowcollector/v1beta1/flowcollector_types.go	`100.00% <ø> (ø)`
apis/flowcollector/v1beta2/flowcollector_types.go	`100.00% <ø> (ø)`
controllers/consoleplugin/config/config.go	`75.00% <ø> (ø)`
...trollers/consoleplugin/consoleplugin_reconciler.go	`71.21% <100.00%> (-0.22%)`	⬇️
pkg/helper/flowcollector.go	`82.21% <100.00%> (+0.17%)`	⬆️
pkg/metrics/predefined_metrics.go	`100.00% <100.00%> (ø)`
...pis/flowcollector/v1beta2/zz_generated.deepcopy.go	`43.76% <41.66%> (-0.12%)`	⬇️
controllers/consoleplugin/consoleplugin_objects.go	`87.22% <72.63%> (-4.40%)`	⬇️
...pis/flowcollector/v1beta1/zz_generated.deepcopy.go	`0.00% <0.00%> (ø)`
...s/flowcollector/v1beta1/zz_generated.conversion.go	`39.27% <45.94%> (+0.65%)`	⬆️

... and 1 file with indirect coverage changes

openshift-ci-robot · 2024-05-10T07:17:16Z

openshift-ci-robot · 2024-05-10T07:20:30Z

github-actions · 2024-05-21T10:07:02Z

New images:

quay.io/netobserv/network-observability-operator:3e9c83b
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-3e9c83b
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-3e9c83b

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:3e9c83b make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-3e9c83b

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-3e9c83b
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

jotak · 2024-05-21T10:10:08Z

@memodi FYI: last build adds fix for NETOBSERV-1654 and also a small perf improvement, using exact match with the default quick filters instead of partial matching

msherif1234 · 2024-05-21T15:16:46Z

config/samples/flows_v1beta2_flowcollector.yaml

      default: true
    - name: Services network
      filter:
-        dst_kind: 'Service'
+        dst_kind: '"Service"'


This is a bit ugly and not user-friendly. Why did you change the APIs?
https://github.com/netobserv/network-observability-operator/pull/613/files#diff-1aad53ab28e869c35c03e94f19d4e37402b869c34eabfd55317931ab5e2b1436R863

cc'd @memodi

It's a perf improvement for queries: having quotes means doing an exact match instead of a partial match, which is more performant.
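A minimal sketch of the quoted-versus-unquoted distinction described above, assuming that a filter value wrapped in double quotes means exact match and anything else means partial (substring) match. The function `matches` is hypothetical and works on in-memory strings only; it does not reproduce how the console plugin actually builds its queries.

```go
package main

import (
	"fmt"
	"strings"
)

// matches reports whether fieldValue satisfies filterValue.
// Assumption: a value wrapped in double quotes requests an exact match,
// an unquoted value requests a partial (substring) match.
func matches(filterValue, fieldValue string) bool {
	quoted := len(filterValue) >= 2 &&
		strings.HasPrefix(filterValue, `"`) &&
		strings.HasSuffix(filterValue, `"`)
	if quoted {
		// Strip the surrounding quotes and compare for equality.
		return fieldValue == filterValue[1:len(filterValue)-1]
	}
	return strings.Contains(fieldValue, filterValue)
}

func main() {
	fmt.Println(matches(`"Service"`, "Service"))        // true
	fmt.Println(matches(`"Service"`, "ServiceAccount")) // false
	fmt.Println(matches("Service", "ServiceAccount"))   // true
}
```

An equality test can typically be answered more cheaply than a pattern or substring test, which is the performance argument made in the comment.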

msherif1234 · 2024-05-21T15:19:41Z

Other than the API question above,
/LGTM

memodi · 2024-05-21T19:04:20Z

@memodi FYI: last build adds fix for NETOBSERV-1654 and also a small perf improvement, using exact match with the default quick filters instead of partial matching

This would need an update to our tests that verify "filters".

jotak · 2024-05-22T10:09:59Z

this would need update to our tests that verifies "filters"

If you feel this shouldn't be in this PR I can remove it, but it seems to me it's a quick & easy way to improve query perf

memodi · 2024-05-22T15:05:43Z

this would need update to our tests that verifies "filters"

If you feel this shouldn't be in this PR I can remove it, but it seems to me it's a quick & easy way to improve query perf

no worries, it's okay to have it in this PR, I just wanted to make a note of it.

memodi · 2024-05-22T18:44:41Z

@jotak the other thing I'm noticing: when a FlowCollector is newly created, the plugin goes from Running --> Terminating --> Running, and I see the events below for the plugin before it stabilizes. Not sure why it scales down and back up:

netobserv-plugin-7d76c88489-89cf5   1/1     Running             0          5s
netobserv-plugin-bf7d457d8-rgv2z    1/1     Terminating         0          7s
netobserv-plugin-bf7d457d8-rgv2z    0/1     Terminating         0          7s
netobserv-plugin-bf7d457d8-rgv2z    0/1     Terminating         0          8s
netobserv-plugin-bf7d457d8-rgv2z    0/1     Terminating         0          8s
netobserv-plugin-bf7d457d8-rgv2z    0/1     Terminating         0          8s

$ oc get events
7m36s       Normal    Scheduled                pod/netobserv-plugin-7d76c88489-89cf5    Successfully assigned netobserv/netobserv-plugin-7d76c88489-89cf5 to memodi-05221042-qqhgt-worker-c-9zrdj
7m36s       Normal    AddedInterface           pod/netobserv-plugin-7d76c88489-89cf5    Add eth0 [10.130.2.30/23] from ovn-kubernetes
7m36s       Normal    Pulling                  pod/netobserv-plugin-7d76c88489-89cf5    Pulling image "quay.io/netobserv/network-observability-console-plugin:1b91ab5"
7m33s       Normal    Pulled                   pod/netobserv-plugin-7d76c88489-89cf5    Successfully pulled image "quay.io/netobserv/network-observability-console-plugin:1b91ab5" in 2.292s (2.293s including waiting)
7m33s       Normal    Created                  pod/netobserv-plugin-7d76c88489-89cf5    Created container netobserv-plugin
7m33s       Normal    Started                  pod/netobserv-plugin-7d76c88489-89cf5    Started container netobserv-plugin
7m36s       Normal    SuccessfulCreate         replicaset/netobserv-plugin-7d76c88489   Created pod: netobserv-plugin-7d76c88489-89cf5
7m38s       Normal    Scheduled                pod/netobserv-plugin-bf7d457d8-rgv2z     Successfully assigned netobserv/netobserv-plugin-bf7d457d8-rgv2z to memodi-05221042-qqhgt-worker-c-9zrdj
7m38s       Warning   FailedMount              pod/netobserv-plugin-bf7d457d8-rgv2z     MountVolume.SetUp failed for volume "console-serving-cert" : secret "console-serving-cert" not found
7m37s       Normal    AddedInterface           pod/netobserv-plugin-bf7d457d8-rgv2z     Add eth0 [10.130.2.28/23] from ovn-kubernetes
7m37s       Normal    Pulling                  pod/netobserv-plugin-bf7d457d8-rgv2z     Pulling image "quay.io/netobserv/network-observability-console-plugin:1b91ab5"
7m33s       Normal    Pulled                   pod/netobserv-plugin-bf7d457d8-rgv2z     Successfully pulled image "quay.io/netobserv/network-observability-console-plugin:1b91ab5" in 3.411s (3.411s including waiting)
7m33s       Normal    Created                  pod/netobserv-plugin-bf7d457d8-rgv2z     Created container netobserv-plugin
7m33s       Normal    Started                  pod/netobserv-plugin-bf7d457d8-rgv2z     Started container netobserv-plugin
7m31s       Normal    Killing                  pod/netobserv-plugin-bf7d457d8-rgv2z     Stopping container netobserv-plugin
7m38s       Normal    SuccessfulCreate         replicaset/netobserv-plugin-bf7d457d8    Created pod: netobserv-plugin-bf7d457d8-rgv2z
7m33s       Normal    SuccessfulDelete         replicaset/netobserv-plugin-bf7d457d8    Deleted pod: netobserv-plugin-bf7d457d8-rgv2z
7m38s       Normal    ScalingReplicaSet        deployment/netobserv-plugin              Scaled up replica set netobserv-plugin-bf7d457d8 to 1
7m36s       Normal    ScalingReplicaSet        deployment/netobserv-plugin              Scaled up replica set netobserv-plugin-7d76c88489 to 1
7m33s       Normal    ScalingReplicaSet        deployment/netobserv-plugin              Scaled down replica set netobserv-plugin-bf7d457d8 to 0 from 1

Confirming this doesn't happen on the latest bundle, only when the Operator bundle is v0.0.0-3e9c83b and the Plugin image is 1b91ab5.

jotak · 2024-05-23T08:47:04Z

@memodi I don't reproduce this either :-(
Can you share the operator logs when that happens? And the FlowCollector config?
And if possible get the plugin's pod YAML before and after restart

memodi · 2024-05-23T14:47:19Z

@memodi I don't reproduce this either :-( Can you share the operator logs when that happens? And the FlowCollector config? And if possible get the plugin's pod YAML before and after restart

@jotak here's the must-gather tar file: https://drive.google.com/file/d/1O6eduOKzr8SqoxZyylJ1HKxsG8u9eZm9/view?usp=drive_link

- FlowCollector CRD: new config for prometheus client
- Allow disabling Loki and still use the console plugin (with prometheus)
- Add some labels in metrics to maximize coverage of plugin queries to prom: K8S_FlowLayer, Src|DstK8S_Type on workload metrics

Fix configuring metrics with openshift

Use explicit metrics config, use enable bool for prom

fix tests

Use nested `prometheus.querier` in API

openshift-ci · 2024-05-23T16:11:20Z

New changes are detected. LGTM label has been removed.

jotak · 2024-05-23T16:16:20Z

Ok @memodi, I think I've fixed it. It's a weird thing I never noticed before: when you apply a custom resource without setting fields explicitly, relying on their default values, there's a short window where the CR is actually stored without those fields set, and shortly after it is automatically updated with the defaults. Here, I found that your custom resource did not have any config for prom, so the first version seen had prometheus.querier.mode="", which was then quickly amended to prometheus.querier.mode="Auto", the default.
I was not able to reproduce it because I didn't rely on defaults in my CR: I had this mode explicitly defined. When I remove the prom config, I can reproduce it.

So my fix just treats an empty mode as equivalent to "Auto" mode.
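The fix described above can be sketched roughly as follows. This is an illustration only: the type `PrometheusMode`, the constant names, the `"Manual"` value and the helper `effectiveMode` are assumptions and are not copied from the operator's source.

```go
package main

import "fmt"

// PrometheusMode is a hypothetical type for the prometheus.querier.mode field.
type PrometheusMode string

const (
	ModeAuto   PrometheusMode = "Auto"
	ModeManual PrometheusMode = "Manual" // assumed value for manual mode
)

// effectiveMode treats an unset (empty) mode as Auto, so that a CR observed
// before API-server defaulting resolves to the same mode as the defaulted CR.
func effectiveMode(m PrometheusMode) PrometheusMode {
	if m == "" {
		return ModeAuto
	}
	return m
}

func main() {
	for _, m := range []PrometheusMode{"", ModeAuto, ModeManual} {
		fmt.Printf("%q -> %q\n", m, effectiveMode(m))
	}
}
```

With this normalization, the not-yet-defaulted and the defaulted versions of the resource produce the same plugin configuration, so no extra rollout is triggered between them.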

github-actions · 2024-05-23T16:21:07Z

New images:

quay.io/netobserv/network-observability-operator:1c6ec02
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-1c6ec02
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-1c6ec02

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1c6ec02 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-1c6ec02

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-1c6ec02
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

memodi · 2024-05-23T16:37:05Z

thanks @jotak , yes, we usually work with defaults in our tests as much as possible and turn on/off only if we need to for our testing. Last commit fixed the bug.

/label qe-approved

openshift-ci-robot · 2024-05-23T16:37:10Z

jotak · 2024-05-23T16:38:09Z

unit tests are failing weirdly on CI, they work fine locally - I'll see on Monday if that's all clean before merging

openshift-ci bot added the do-not-merge/work-in-progress label Apr 10, 2024

jotak changed the title ~~wip~~ NETOBSERV-739: WIP Add prometheus Apr 10, 2024

openshift-ci-robot added the jira/valid-reference label Apr 10, 2024

jotak force-pushed the add-prom branch from eca5ad4 to ce5f58f Compare April 11, 2024 12:25

jotak mentioned this pull request Apr 11, 2024

NETOBSERV-739: WIP Add prometheus netobserv/network-observability-console-plugin#513

Closed

10 tasks

jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 11, 2024

jotak force-pushed the add-prom branch from ce5f58f to 8930140 Compare April 11, 2024 15:28

github-actions bot removed the ok-to-test label Apr 11, 2024

jpinsonneau reviewed Apr 16, 2024


jotak added the ok-to-test label Apr 16, 2024

jotak force-pushed the add-prom branch from b8dd874 to 912fa6a Compare April 16, 2024 11:41

github-actions bot removed the ok-to-test label Apr 16, 2024

jotak force-pushed the add-prom branch from 912fa6a to a049d42 Compare April 18, 2024 13:20

jotak commented Apr 18, 2024

Makefile (outdated, resolved)

openshift-merge-robot added the needs-rebase label Apr 25, 2024

jotak force-pushed the add-prom branch from a049d42 to 455cedf Compare May 10, 2024 06:37

openshift-merge-robot removed the needs-rebase label May 10, 2024

jotak changed the title ~~NETOBSERV-739: WIP Add prometheus~~ NETOBSERV-739: Add prometheus May 10, 2024

jotak marked this pull request as ready for review May 10, 2024 07:16

openshift-ci bot removed the do-not-merge/work-in-progress label May 10, 2024

jotak added the ok-to-test label May 10, 2024

msherif1234 reviewed May 21, 2024


openshift-ci bot assigned msherif1234 May 21, 2024

openshift-ci bot added the lgtm label May 21, 2024

jotak added 6 commits May 23, 2024 18:11

Provide metric direction to console plugin

d2111b6

Address feedback + add SubnetLabel to metrics

75a5802

NETOBSERV-1654: enable workload-based metrics when loki is disabled

1630755

nit: use exact match in quick filters for better query performance

35fecc1

Fix using auto mode if prom mode unset

daf0382

jotak force-pushed the add-prom branch from bb16522 to daf0382 Compare May 23, 2024 16:11

openshift-ci bot removed the lgtm label May 23, 2024

github-actions bot removed the ok-to-test label May 23, 2024

jotak added the ok-to-test label May 23, 2024

openshift-ci bot added the qe-approved QE has approved this pull request label May 23, 2024

jotak merged commit f278746 into netobserv:main May 27, 2024
11 of 12 checks passed
