NETOBSERV-1625: (follow-up) mention other possible cause for ebpf drops #640

Merged
merged 2 commits into from
May 15, 2024


jotak
Member

@jotak jotak commented May 10, 2024

Description

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented May 10, 2024

@jotak: This pull request references NETOBSERV-1625 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Member Author

jotak commented May 10, 2024

hey @msherif1234 : I think #632 missed some potential causes for drops. In agent, the drops metric is used in several scenarios:

On the first one, I see another problem: the alert would be triggered if there are filtered-out flows with the new filter feature. I think that filter feature should use a different metric, wdyt?

@msherif1234
Contributor

> hey @msherif1234 : I think #632 missed some potential causes for drops. In agent, the drops metric is used in several scenarios:
>
> On the first one, I see another problem: the alert would be triggered if there are filtered-out flows with the new filter feature. I think that filter feature should use a different metric, wdyt?

The alert uses https://github.com/netobserv/network-observability-operator/blob/main/controllers/ebpf/agent-metrics.go#L130, whereas the filter metrics use filtered_flows_total in their query. This alert is specific to the hashmap-full drop reason, because that was the customer case: they were sending many flows and not seeing the right number back.

@msherif1234
Contributor

I am fine with the API edit, but as I said above there is no issue with the filter metrics. Regarding the limiter drops, I don't recall adding this, and I'm not sure what real cases can lead to it.

// Possible values are:<br>
// - `NetObservDroppedFlows`, which is triggered when eBPF agent hashmap table is full.<br>
// `NetObservDroppedFlows`, which is triggered when the eBPF agent is dropping flows, such as when the BPF hashmap is full.<br>
Contributor

Can we state the other possible reason for drops here, i.e. limiter capacity exceeded, and a possible recovery config, if any? From what I see, I don't think anything can be done to avoid the limiter; isn't it more of an internal Go limit?

Member Author

done

Member Author

@jotak jotak May 15, 2024

The capacity limiter is not related to Go limits (if you're talking about GOMEMLIMIT); it's a form of backpressure management. See the warning text that is logged when it's triggered: https://github.com/netobserv/netobserv-ebpf-agent/blob/bf91cbef0008f6d6bc8b1c748729c18ec6b14d35/pkg/flow/limiter.go#L50-L53
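
For readers unfamiliar with this kind of backpressure management, the general idea can be sketched in a few lines of Go. This is a minimal illustration only, not the agent's limiter code: `forwardOrDrop`, the `[]int` batch type, and the buffer size are hypothetical names and values chosen for the example. A batch is forwarded only if the downstream channel has room. Otherwise it is discarded and counted as a drop, so a slow consumer cannot block the producer.

```go
package main

import "fmt"

// forwardOrDrop is a hypothetical helper, not part of the agent's API.
// It tries to hand one batch to out without blocking. If out's buffer is
// full, the batch is discarded and false is returned so that the caller
// can count a drop.
func forwardOrDrop(out chan<- []int, batch []int) bool {
	select {
	case out <- batch:
		return true
	default:
		return false
	}
}

func main() {
	// A small buffer with no reader stands in for a slow exporter.
	out := make(chan []int, 2)
	dropped := 0
	for i := 0; i < 5; i++ {
		if !forwardOrDrop(out, []int{i}) {
			dropped++
		}
	}
	fmt.Printf("forwarded=%d dropped=%d\n", len(out), dropped)
	// prints "forwarded=2 dropped=3"
}
```

In this sketch the drop decision depends only on how full the downstream buffer is, which is why such drops are a different cause from a full BPF hashmap.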

Contributor

Cool Thanks!!

Contributor

cc @skrthomas: this might need to be documented under the eBPF agent alerting, WDYT?

@jotak
Member Author

jotak commented May 10, 2024

> The alert uses https://github.com/netobserv/network-observability-operator/blob/main/controllers/ebpf/agent-metrics.go#L130, whereas the filter metrics use filtered_flows_total in their query. This alert is specific to the hashmap-full drop reason, because that was the customer case: they were sending many flows and not seeing the right number back.

Oh yes good point I didn't notice
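
The point agreed on here, that filtered-out flows are counted separately and so do not feed the drop alert, can be illustrated with a small Go sketch. It is an assumption-laden toy model rather than the operator's or agent's code: the `counters` type, the reason strings `"hashmap-full"` and `"limiter-full"`, and the threshold are all hypothetical.

```go
package main

import "fmt"

// counters is a toy stand-in for two separate metrics: a drop counter
// labeled by reason, and an independent counter for filtered-out flows.
// All names and label values here are illustrative assumptions.
type counters struct {
	droppedByReason map[string]int
	filtered        int
}

func (c *counters) drop(reason string, n int) { c.droppedByReason[reason] += n }

func (c *counters) filter(n int) { c.filtered += n }

// alertFires mimics an alert that watches a single drop reason and
// ignores every other counter.
func (c *counters) alertFires(reason string, threshold int) bool {
	return c.droppedByReason[reason] > threshold
}

func main() {
	c := &counters{droppedByReason: map[string]int{}}

	c.filter(1000)            // flows excluded on purpose by a filter rule
	c.drop("limiter-full", 5) // a different drop cause
	fmt.Println(c.alertFires("hashmap-full", 0)) // prints "false"

	c.drop("hashmap-full", 1)
	fmt.Println(c.alertFires("hashmap-full", 0)) // prints "true"
}
```

Keeping intentional filtering in its own counter means a large number of filtered flows never looks like data loss to an alert of this shape.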

@jotak added the "no-qe" label (This PR doesn't necessitate QE approval) May 15, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented May 15, 2024

@jotak: This pull request references NETOBSERV-1625 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@codecov-commenter

codecov-commenter commented May 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 66.96%. Comparing base (8d0f97a) to head (d51ad27).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #640      +/-   ##
==========================================
- Coverage   67.10%   66.96%   -0.15%     
==========================================
  Files          68       68              
  Lines        7804     7804              
==========================================
- Hits         5237     5226      -11     
- Misses       2192     2199       +7     
- Partials      375      379       +4     
Flag Coverage Δ
unittests 66.96% <100.00%> (-0.15%) ⬇️


@msherif1234
Contributor

/lgtm

@jotak
Member Author

jotak commented May 15, 2024

thanks @msherif1234
/approve


openshift-ci bot commented May 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak


@openshift-merge-bot openshift-merge-bot bot merged commit 2d85102 into netobserv:main May 15, 2024
12 checks passed
Labels
approved jira/valid-reference lgtm no-qe This PR doesn't necessitate QE approval

4 participants