NETOBSERV-1625: Add ebpf altering for flows drop #632

Merged

Conversation

msherif1234
Contributor

@msherif1234 msherif1234 commented Apr 29, 2024

Description

Add an alert for eBPF flow drops, to detect when the eBPF hash table is full.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@msherif1234 msherif1234 changed the title enable ebpf agent metrics by default Enable ebpf agent metrics by default Apr 29, 2024
@msherif1234 msherif1234 changed the title Enable ebpf agent metrics by default NETOBSERV-1625: Add ebpf altering for flows drop Apr 29, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 29, 2024

@msherif1234: This pull request references NETOBSERV-1625 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Description

Enable eBPF agent metrics by default so they can be used to generate alerts when the flow table is full.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@msherif1234 msherif1234 changed the title NETOBSERV-1625: Add ebpf altering for flows drop WIP: NETOBSERV-1625: Add ebpf altering for flows drop Apr 29, 2024
@msherif1234 msherif1234 force-pushed the agent_metrics_on_by_default branch 3 times, most recently from b85c21c to 2a846b5 Compare April 29, 2024 23:47
@msherif1234
Contributor Author

msherif1234 commented Apr 29, 2024

$ oc get prometheusrules -n netobserv-privileged ebpf-agent-prom-alert -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: "2024-04-29T23:53:40Z"
  generation: 1
  labels:
    app: netobserv-ebpf-agent
  name: ebpf-agent-prom-alert
  namespace: netobserv-privileged
  ownerReferences:
  - apiVersion: flows.netobserv.io/v1beta2
    blockOwnerDeletion: true
    controller: true
    kind: FlowCollector
    name: cluster
    uid: 7376e577-a0d4-4ffd-a83d-cacc0199e14e
  resourceVersion: "11150"
  uid: 51e833bc-aef1-4f9c-8210-4eea4500acd9
spec:
  groups:
  - name: NetobservEBPFAgentAlerts
    rules:
    - alert: NetObservAgentFlowsDropped
      annotations:
        description: NetObserv eBPF agent hashmap table is full, it means that the
          eBPF agent is not able to process new flows. Please consider to increase
          the hashmap table size.
        summary: NetObserv eBPF is not able to process any new flows
      expr: sum(rate(netobserv_agent_dropped_flows_total[1m])) == 0
      for: 10m
      labels:
        app: netobserv
        severity: warning

image

@msherif1234 msherif1234 changed the title WIP: NETOBSERV-1625: Add ebpf altering for flows drop NETOBSERV-1625: Add ebpf altering for flows drop Apr 30, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Apr 30, 2024

@msherif1234: This pull request references NETOBSERV-1625 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Description

Add alert for ebpf flows drop to detect when ebpf hash table is full

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 30, 2024

New images:

  • quay.io/netobserv/network-observability-operator:77ee9b1
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-77ee9b1
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-77ee9b1

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:77ee9b1 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-77ee9b1

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-77ee9b1
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@msherif1234
Contributor Author

/retest

Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 30, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 30, 2024

New images:

  • quay.io/netobserv/network-observability-operator:76239ae
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-76239ae
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-76239ae

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:76239ae make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-76239ae

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-76239ae
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@msherif1234
Contributor Author

msherif1234 commented Apr 30, 2024

To emulate the hashmap-table-full alert:

  • Set sampling to 1.
  • Reduce the hashmap table size to 100.
  • Enable eBPF agent metrics.
  • Use the hey-ho tool to generate traffic: ./hey-ho.sh -r 5 -d 3 -z 10m -n 4 -q 2 -p -b

image

image

@codecov-commenter

codecov-commenter commented Apr 30, 2024

Codecov Report

Attention: Patch coverage is 76.78571%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 66.42%. Comparing base (216e3e6) to head (c90c5ee).
Report is 1 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| controllers/ebpf/agent-metrics.go | 88.63% | 3 Missing and 2 partials ⚠️ |
| ...pis/flowcollector/v1beta1/zz_generated.deepcopy.go | 0.00% | 4 Missing ⚠️ |
| ...pis/flowcollector/v1beta2/zz_generated.deepcopy.go | 0.00% | 3 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #632      +/-   ##
==========================================
+ Coverage   66.34%   66.42%   +0.07%     
==========================================
  Files          67       67              
  Lines        7527     7583      +56     
==========================================
+ Hits         4994     5037      +43     
- Misses       2163     2173      +10     
- Partials      370      373       +3     
Flag Coverage Δ
unittests 66.42% <76.78%> (+0.07%) ⬆️

Contributor

@jpinsonneau jpinsonneau left a comment

Code looks good! Thanks @msherif1234

@memodi
Contributor

memodi commented Apr 30, 2024

@msherif1234 I deployed the operator bundle included in the catalog image; however, I don't see the Prometheus rule being configured in the netobserv-privileged namespace. Am I missing something?

$ oc get prometheusrules -A  | egrep netobserv
netobserv                                    flowlogs-pipeline-alert                        78s
image

@memodi
Contributor

memodi commented Apr 30, 2024

Oh wait, I had eBPF metrics disabled; enabling them now.

@memodi
Contributor

memodi commented Apr 30, 2024

/label qe-approved
@msherif1234 just one comment on Description of the alert, should we reword to say:
NetObserv eBPF agent is not able to process new flows as it's hashmap is full. Hashmap table size can be increased by increasing cacheMaxFlows value in Flowcollector resource wdyt?

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Apr 30, 2024
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Apr 30, 2024

openshift-ci bot commented Apr 30, 2024

New changes are detected. LGTM label has been removed.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Apr 30, 2024
@msherif1234
Contributor Author

/approve


openshift-ci bot commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 13b2a78 into netobserv:main Apr 30, 2024
10 checks passed
@msherif1234 msherif1234 deleted the agent_metrics_on_by_default branch April 30, 2024 18:07

5 participants