
NETOBSERV-1688: relax ebpf drop alert #674

Merged
merged 2 commits into from
Jun 11, 2024

Conversation

jotak
Member

@jotak jotak commented Jun 10, 2024

Reformulate some of the text for the eBPF drop alert, and relax the threshold a little bit.

Description

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 10, 2024

@jotak: This pull request references NETOBSERV-1688 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

reformulate some of the text for ebpf drop alert, and relax threshold a little bit.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member Author

jotak commented Jun 10, 2024

@msherif1234 @memodi please also let me know if the new doc & alert text reads better like this. The changes I'm making are meant to convey:

  • that the root cause might be the map being "busy" (EBUSY errno), which in my understanding can mean it is full, but not necessarily
  • that this could be "just" missed packets rather than missing flows (i.e. only the bytes/packets counters of an existing flow won't be updated on a miss; the flow itself will still appear)
  • also, I prefer the word "missing" over "dropping", because I don't want users to think we could actually drop a packet, i.e. interfere with the traffic


codecov bot commented Jun 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.78%. Comparing base (bc0bcef) to head (dc257fb).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #674      +/-   ##
==========================================
+ Coverage   66.60%   66.78%   +0.18%     
==========================================
  Files          70       69       -1     
  Lines        8115     8093      -22     
==========================================
  Hits         5405     5405              
+ Misses       2315     2293      -22     
  Partials      395      395              
Flag Coverage Δ
unittests 66.78% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
apis/flowcollector/v1beta1/flowcollector_types.go 100.00% <ø> (ø)
apis/flowcollector/v1beta2/flowcollector_types.go 100.00% <ø> (ø)
controllers/ebpf/agent_controller.go 59.64% <ø> (ø)
controllers/ebpf/agent_metrics.go 89.56% <100.00%> (ø)

 },
-Expr: intstr.FromString("sum(rate(netobserv_agent_dropped_flows_total[1m])) > 0"),
+Expr: intstr.FromString("sum(rate(netobserv_agent_dropped_flows_total[1m])) > 100"),
Contributor

just put this 100 limit in the constants

Member Author

done

@@ -124,10 +124,10 @@ func (c *AgentController) agentPrometheusRule(target *flowslatest.FlowCollectorE
 rules = append(rules, monitoringv1.Rule{
 	Alert: string(flowslatest.AlertDroppedFlows),
 	Annotations: map[string]string{
-		"description": "NetObserv eBPF agent is not able to process new flows. Possible reasons are the BPF hashmap being full, or the capacity limiter being triggered. Both causes can be worked around by increasing cacheMaxFlows value in Flowcollector resource.",
-		"summary": "NetObserv eBPF is not able to process any new flows",
+		"description": "NetObserv eBPF agent is missing packets or flows. The metric netobserv_agent_dropped_flows_total provides more information on the cause. Possible reasons are the BPF hashmap being busy or full, or the capacity limiter being triggered. This may be worked around by increasing cacheMaxFlows value in Flowcollector resource.",
Contributor

By saying "full", doesn't that mean the capacity (of the hashmap) is reached? If it's the same thing, then we can shorten the text "busy or full, or the capacity limiter being triggered", since it's a repetition.

Member Author

The capacity limiter is something different from the hashmap capacity; it's something that comes later in the process (just before flows are about to be sent over the network to Kafka or FLP).
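To illustrate the distinction, here is a minimal Go sketch of a capacity limiter in this sense: a later pipeline stage that refuses to buffer more than a maximum number of flows before export. This is not the agent's actual implementation, just a conceptual sketch; all names are illustrative.

```go
package main

import "fmt"

// capacityLimiter is an illustrative sketch, NOT the agent's real code:
// a stage that caps how many flows can wait for export at once.
type capacityLimiter struct {
	maxFlows int      // plays the role of the cacheMaxFlows setting
	buffer   []string // pending flows awaiting export
	dropped  int      // flows counted as missed (what the drop metric would track)
}

// offer either buffers a flow or counts it as dropped when the cap is hit.
func (c *capacityLimiter) offer(flow string) {
	if len(c.buffer) >= c.maxFlows {
		c.dropped++
		return
	}
	c.buffer = append(c.buffer, flow)
}

func main() {
	l := &capacityLimiter{maxFlows: 2}
	for _, f := range []string{"a", "b", "c"} {
		l.offer(f)
	}
	fmt.Printf("buffered=%d dropped=%d\n", len(l.buffer), l.dropped)
	// prints: buffered=2 dropped=1
}
```

The hashmap capacity, by contrast, limits aggregation inside the kernel; this limiter acts on the userspace side before flows leave the agent, which is why increasing cacheMaxFlows relieves both.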

Contributor

I am fine with new description

Contributor

I see, so increasing the cacheMaxFlows value will help when flows/packets are dropped due to the hashmap being full, but it wouldn't really help if the capacity limiter is triggered, correct?

Member Author

@jotak jotak Jun 11, 2024

Actually, no. The capacity limiter also obeys the cacheMaxFlows setting, so increasing it is a solution in that case too.
The case where cacheMaxFlows won't help is the very one we've seen here: when packets are missed because the hashmap is "busy" but not full. Unfortunately, we haven't found so far a way to distinguish this sort of "busy" error from the map being full that would be easily visible to users. One thing they can do is correlate this error with the other metrics that report the current size usage of the map, and check whether or not it's close to its maximum capacity.

There's certainly more we could do here in follow-ups, like creating another alert precisely for when the map size comes close to its capacity.

 bpfTraceMountPath = "/sys/kernel/debug"
 bpfNetNSMountName = "var-run-netns"
 bpfNetNSMountPath = "/var/run/netns"
+droppedFlowsAlertThreshold = 100
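For illustration, a minimal Go sketch of how the new constant can feed the alert expression so the threshold lives in one named place (the exact wiring in the repo may differ; `droppedFlowsExpr` is an illustrative helper name):

```go
package main

import "fmt"

// droppedFlowsAlertThreshold mirrors the constant added in this PR.
const droppedFlowsAlertThreshold = 100

// droppedFlowsExpr builds the alert's PromQL expression from the constant,
// matching the expression shown in the diff above.
func droppedFlowsExpr() string {
	return fmt.Sprintf("sum(rate(netobserv_agent_dropped_flows_total[1m])) > %d",
		droppedFlowsAlertThreshold)
}

func main() {
	fmt.Println(droppedFlowsExpr())
	// prints: sum(rate(netobserv_agent_dropped_flows_total[1m])) > 100
}
```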
Contributor

The threshold of 100: is it based on some testing, or an arbitrary value that we think is good (not too high, not too low)?

Contributor

random number atm

Member Author

That's arbitrary, but I don't see how to make it non-arbitrary, short of making it a configurable threshold. People who want a different limit can still copy the rule into a new one and turn off this alert in NetObserv, so the rule is still useful in that sense.

Still, I did test it on my cluster.
With my "medium" hey-ho test intensity (which is actually quite high already), I'm between 10 and 15 drops/sec. With my "high" intensity, I'm around 25-30.

Bigger clusters would likely see more drops/sec under stress.
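As a sketch of that copy/paste workaround (the rule name, namespace, `for` duration, and the 500 threshold below are illustrative choices, not from this PR): users can carry the same expression into their own PrometheusRule with whatever threshold fits their cluster, then disable the built-in alert.

```yaml
# Illustrative custom rule; adjust names and threshold to your cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: netobserv-ebpf-drops-custom
  namespace: netobserv
spec:
  groups:
    - name: NetObservCustomAlerts
      rules:
        - alert: NetObservDroppedFlowsCustom
          # Same expression as the built-in alert, with a custom threshold
          expr: sum(rate(netobserv_agent_dropped_flows_total[1m])) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "NetObserv eBPF agent is missing packets or flows"
```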

Contributor

got it, thanks!

@memodi
Contributor

memodi commented Jun 10, 2024

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 10, 2024

New images:

  • quay.io/netobserv/network-observability-operator:cb21b5e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-cb21b5e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-cb21b5e

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:cb21b5e make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-cb21b5e

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-cb21b5e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@@ -1,56 +0,0 @@
package ebpf
Contributor

any reason for deleting this unit-test ?

Member Author

Yes I explained it in the commit message :)
dc257fb

@jotak jotak requested a review from memodi June 10, 2024 16:37
Contributor

@memodi memodi left a comment


/label qe-approved

(based on testing #675)

@openshift-ci openshift-ci bot added qe-approved QE has approved this pull request lgtm labels Jun 11, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 11, 2024

@jotak: This pull request references NETOBSERV-1688 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

reformulate some of the text for ebpf drop alert, and relax threshold a little bit.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member Author

jotak commented Jun 11, 2024

/approve


openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

reformulate some of the text for ebpf drop alert, and
relax threshold a little bit.

On removing the test: agent-metrics-test.go never ran; it should have
been named "agent_metrics_test.go" (underscore instead of hyphen) for Go
to pick it up. When renamed, it actually fails. Given how little value
these tests add (they only check that the names of created objects match
the expected constants), I'm removing them rather than spending time
fixing them.

The controller integration tests cover more, are much more relevant,
and actually run.
@openshift-ci openshift-ci bot removed the lgtm label Jun 11, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 11, 2024
@msherif1234
Contributor

/lgtm

@openshift-merge-bot openshift-merge-bot bot merged commit e1c9f3c into netobserv:main Jun 11, 2024
8 of 9 checks passed
jotak added a commit that referenced this pull request Jun 17, 2024
* NETOBSERV-1688: relax ebpf drop alert

* Use const for alert threshold; remove unused test
4 participants