
NETOBSERV-1688: relax ebpf drop alert #674

Merged
merged 2 commits into from
Jun 11, 2024

Conversation

jotak
Member

@jotak jotak commented Jun 10, 2024

Reformulate some of the text for the eBPF drop alert, and relax the threshold a little bit.

Description

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 10, 2024

@jotak: This pull request references NETOBSERV-1688 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

reformulate some of the text for ebpf drop alert, and relax threshold a little bit.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member Author

jotak commented Jun 10, 2024

@msherif1234 @memodi please also let me know if the new doc & alert text reads better like this. The changes I'm making are meant to convey:

  • that the root cause might be the map being "busy" (EBUSY errno), which in my understanding can mean it is full, but not necessarily
  • that this could be "just" missed packets rather than missing flows (i.e. only the bytes/packets counters of an existing flow won't be updated on a miss; the flow itself will still appear)
  • also, I prefer the word "missing" over "dropping", because I don't want users to think we could actually drop a packet, i.e. interfere with the traffic


codecov bot commented Jun 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.78%. Comparing base (bc0bcef) to head (dc257fb).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #674      +/-   ##
==========================================
+ Coverage   66.60%   66.78%   +0.18%     
==========================================
  Files          70       69       -1     
  Lines        8115     8093      -22     
==========================================
  Hits         5405     5405              
+ Misses       2315     2293      -22     
  Partials      395      395              
Flag Coverage Δ
unittests 66.78% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
apis/flowcollector/v1beta1/flowcollector_types.go 100.00% <ø> (ø)
apis/flowcollector/v1beta2/flowcollector_types.go 100.00% <ø> (ø)
controllers/ebpf/agent_controller.go 59.64% <ø> (ø)
controllers/ebpf/agent_metrics.go 89.56% <100.00%> (ø)

 },
-Expr: intstr.FromString("sum(rate(netobserv_agent_dropped_flows_total[1m])) > 0"),
+Expr: intstr.FromString("sum(rate(netobserv_agent_dropped_flows_total[1m])) > 100"),
Contributor

just put this 100 limit in the constants

Member Author

done

@@ -124,10 +124,10 @@ func (c *AgentController) agentPrometheusRule(target *flowslatest.FlowCollectorE
 rules = append(rules, monitoringv1.Rule{
 	Alert: string(flowslatest.AlertDroppedFlows),
 	Annotations: map[string]string{
-		"description": "NetObserv eBPF agent is not able to process new flows. Possible reasons are the BPF hashmap being full, or the capacity limiter being triggered. Both causes can be worked around by increasing cacheMaxFlows value in Flowcollector resource.",
-		"summary": "NetObserv eBPF is not able to process any new flows",
+		"description": "NetObserv eBPF agent is missing packets or flows. The metric netobserv_agent_dropped_flows_total provides more information on the cause. Possible reasons are the BPF hashmap being busy or full, or the capacity limiter being triggered. This may be worked around by increasing cacheMaxFlows value in Flowcollector resource.",
Contributor

By saying "full", doesn't that mean the capacity (of the hashmap) is reached? If it's the same thing, then we can shorten the text "busy or full, or the capacity limiter being triggered", since it's a repetition.

Member Author

The capacity limiter is something different from the hashmap capacity; it's something that comes later in the process (just before flows are about to be sent over the network to Kafka or FLP).
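To illustrate the distinction, here is a minimal Go sketch of a capacity limiter in this sense: a later pipeline stage that refuses to buffer more than a maximum number of flows before export. This is not the agent's actual implementation, just a conceptual sketch; all names are illustrative.

```go
package main

import "fmt"

// capacityLimiter is an illustrative sketch, NOT the agent's real code:
// a stage that caps how many flows can wait for export at once.
type capacityLimiter struct {
	maxFlows int      // plays the role of the cacheMaxFlows setting
	buffer   []string // pending flows awaiting export
	dropped  int      // flows counted as missed (what the drop metric would track)
}

// offer either buffers a flow or counts it as dropped when the cap is hit.
func (c *capacityLimiter) offer(flow string) {
	if len(c.buffer) >= c.maxFlows {
		c.dropped++
		return
	}
	c.buffer = append(c.buffer, flow)
}

func main() {
	l := &capacityLimiter{maxFlows: 2}
	for _, f := range []string{"a", "b", "c"} {
		l.offer(f)
	}
	fmt.Printf("buffered=%d dropped=%d\n", len(l.buffer), l.dropped)
	// prints: buffered=2 dropped=1
}
```

The hashmap capacity, by contrast, limits aggregation inside the kernel; this limiter acts on the userspace side before flows leave the agent, which is why increasing cacheMaxFlows relieves both.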

Contributor

I am fine with new description

Contributor

I see, so increasing the cacheMaxFlows value will help when flows/packets are dropped due to the hashmap being full, but it wouldn't really help if the capacity limiter is triggered, correct?

Member Author

@jotak jotak Jun 11, 2024

Actually, no. The capacity limiter also obeys the cacheMaxFlows setting, so increasing it is a solution in that case too.
The case where cacheMaxFlows won't help is the very one we've seen here: when packets are missed because the hashmap is "busy" but not full. Unfortunately, we haven't found so far a way to distinguish this sort of "busy" error from the map being full that would be easily visible to users. One thing they can do is correlate this error with the other metrics that report the current size usage of the map, and check whether or not it's close to its maximum capacity.

There's certainly more we could do here in follow-ups, like creating another alert precisely for when the map size comes close to its capacity.

 bpfTraceMountPath = "/sys/kernel/debug"
 bpfNetNSMountName = "var-run-netns"
 bpfNetNSMountPath = "/var/run/netns"
+droppedFlowsAlertThreshold = 100
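For illustration, a minimal Go sketch of how the new constant can feed the alert expression so the threshold lives in one named place (the exact wiring in the repo may differ; `droppedFlowsExpr` is an illustrative helper name):

```go
package main

import "fmt"

// droppedFlowsAlertThreshold mirrors the constant added in this PR.
const droppedFlowsAlertThreshold = 100

// droppedFlowsExpr builds the alert's PromQL expression from the constant,
// matching the expression shown in the diff above.
func droppedFlowsExpr() string {
	return fmt.Sprintf("sum(rate(netobserv_agent_dropped_flows_total[1m])) > %d",
		droppedFlowsAlertThreshold)
}

func main() {
	fmt.Println(droppedFlowsExpr())
	// prints: sum(rate(netobserv_agent_dropped_flows_total[1m])) > 100
}
```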
Contributor

The threshold of 100: is it based on some testing, or an arbitrary value that we think is good (not too high, not too low)?

Contributor

random number atm

Member Author

That's arbitrary, but I don't see how to make it non-arbitrary, short of making it a configurable threshold. People who want a different limit can still copy the rule into a new one and turn off this alert in NetObserv, so the rule is still useful in that sense.

Still, I did test it on my cluster.
With my "medium" hey-ho test intensity (which is actually quite high already), I'm between 10 and 15 drops/sec. With my "high" intensity, I'm around 25-30.

Bigger clusters would likely see more drops/sec under stress.
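As a sketch of that copy/paste workaround (the rule name, namespace, `for` duration, and the 500 threshold below are illustrative choices, not from this PR): users can carry the same expression into their own PrometheusRule with whatever threshold fits their cluster, then disable the built-in alert.

```yaml
# Illustrative custom rule; adjust names and threshold to your cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: netobserv-ebpf-drops-custom
  namespace: netobserv
spec:
  groups:
    - name: NetObservCustomAlerts
      rules:
        - alert: NetObservDroppedFlowsCustom
          # Same expression as the built-in alert, with a custom threshold
          expr: sum(rate(netobserv_agent_dropped_flows_total[1m])) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "NetObserv eBPF agent is missing packets or flows"
```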

Contributor

got it, thanks!

@memodi
Contributor

memodi commented Jun 10, 2024

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 10, 2024

New images:

  • quay.io/netobserv/network-observability-operator:cb21b5e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-cb21b5e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-cb21b5e

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:cb21b5e make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-cb21b5e

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-cb21b5e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@@ -1,56 +0,0 @@
package ebpf
Contributor

any reason for deleting this unit-test ?

Member Author

Yes I explained it in the commit message :)
dc257fb

@jotak jotak requested a review from memodi June 10, 2024 16:37
Contributor

@memodi memodi left a comment


/label qe-approved

(based on testing #675)

@openshift-ci openshift-ci bot added qe-approved QE has approved this pull request lgtm labels Jun 11, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 11, 2024

@jotak: This pull request references NETOBSERV-1688 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

reformulate some of the text for ebpf drop alert, and relax threshold a little bit.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member Author

jotak commented Jun 11, 2024

/approve


openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

reformulate some of the text for ebpf drop alert, and
relax threshold a little bit.

On removing the test: agent-metrics-test.go never ran; it should have
been named "agent_metrics_test.go" (underscore instead of hyphen) for Go
to pick it up. When renamed, it actually fails. Given how little value
these tests add (they only check that the names of created objects match
the expected constants), I'm removing them rather than spending time
fixing them.

The controller integration tests cover more, are much more relevant,
and actually run.
@openshift-ci openshift-ci bot removed the lgtm label Jun 11, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 11, 2024
@msherif1234
Contributor

/lgtm

@openshift-merge-bot openshift-merge-bot bot merged commit e1c9f3c into netobserv:main Jun 11, 2024
8 of 9 checks passed
jotak added a commit that referenced this pull request Jun 17, 2024
* NETOBSERV-1688: relax ebpf drop alert

* Use const for alert threshold; remove unused test
4 participants