NETOBSERV-557: add eBPF agent metrics for troubleshooting #263

msherif1234 · 2024-02-07T19:33:34Z

Description

Add Prometheus metrics to the eBPF agent, with the ability to export them to a Prometheus server.

unit-test

tested locally using standalone ebpf-agent

sudo LOG_LEVEL=debug FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 METRICS_PROMO_ENABLE="true" ./bin/netobserv-ebpf-agent
From another terminal: curl 127.0.0.1:9090/metrics | grep ebpf_agent
Results:

# TYPE ebpf_agent_err_can_not_delete_flow_entries counter
ebpf_agent_err_can_not_delete_flow_entries{operational="errors while deleting flows"} 0
# HELP ebpf_agent_err_can_not_write_to_grpc Error can not write to GRPC
# TYPE ebpf_agent_err_can_not_write_to_grpc counter
ebpf_agent_err_can_not_write_to_grpc{operational="err_export_by_grpc"} 0
# HELP ebpf_agent_hashmap_evictions Number of hashmap evictions
# TYPE ebpf_agent_hashmap_evictions counter
ebpf_agent_hashmap_evictions{operational="hash map evictions"} 16
# HELP ebpf_agent_number_of_evicted_flows Number of evicted flows
# TYPE ebpf_agent_number_of_evicted_flows gauge
ebpf_agent_number_of_evicted_flows{operational="number of evicted flows"} 41
# HELP ebpf_agent_number_of_flows_received_via_ring_buffer Number of flows received via ring buffer
# TYPE ebpf_agent_number_of_flows_received_via_ring_buffer gauge
ebpf_agent_number_of_flows_received_via_ring_buffer{operational="number_of_flows_received"} 0
# HELP ebpf_agent_number_of_records_received_by_grpc Number of records received by GRPC
# TYPE ebpf_agent_number_of_records_received_by_grpc counter
ebpf_agent_number_of_records_received_by_grpc{operational="number_of_records_received_by_grpc"} 41
# HELP ebpf_agent_sampling_rate Sampling rate
# TYPE ebpf_agent_sampling_rate gauge
ebpf_agent_sampling_rate{operational="sampling rate"} 50
# HELP ebpf_agent_time_spent_in_lookup_and_delete_map Time spent in lookup and delete map
# TYPE ebpf_agent_time_spent_in_lookup_and_delete_map histogram
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.001"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.01"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.1"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="100"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1000"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10000"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="+Inf"} 16
ebpf_agent_time_spent_in_lookup_and_delete_map_sum{operational="time spent in lookup and delete"} 0.005711665
ebpf_agent_time_spent_in_lookup_and_delete_map_count{operational="time spent in lookup and delete"} 16
# HELP ebpf_agent_userspace_evictions Number of userspace evictions
# TYPE ebpf_agent_userspace_evictions counter
ebpf_agent_userspace_evictions{operational="user space evictions"} 0

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci-robot · 2024-02-07T19:33:37Z

codecov · 2024-02-07T19:58:27Z

Codecov Report

Attention: 62 lines in your changes are missing coverage. Please review.

Comparison is base (349fd30) 33.53% compared to head (77d43d1) 35.90%.

Files	Patch %	Lines
pkg/agent/agent.go	39.53%	25 Missing and 1 partial ⚠️
pkg/prometheus/prom_server.go	70.58%	14 Missing and 6 partials ⚠️
pkg/metrics/metrics.go	91.20%	6 Missing and 2 partials ⚠️
pkg/ebpf/tracer.go	0.00%	4 Missing ⚠️
pkg/flow/tracer_ringbuf.go	66.66%	2 Missing ⚠️
pkg/exporter/grpc_proto.go	90.90%	1 Missing ⚠️
pkg/exporter/kafka_proto.go	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #263      +/-   ##
==========================================
+ Coverage   33.53%   35.90%   +2.36%     
==========================================
  Files          40       42       +2     
  Lines        3554     3777     +223     
==========================================
+ Hits         1192     1356     +164     
- Misses       2293     2343      +50     
- Partials       69       78       +9

Flag	Coverage Δ
unittests	`35.90% <75.20%> (+2.36%)`	⬆️

openshift-ci-robot · 2024-02-08T14:27:43Z

@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Description

add promo metrics to eBPF agent with the ability to export metrics to promo Server

unit-test

tested locally using standalone ebpf-agent

sudo LOG_LEVEL=debug FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 METRICS_PROMO_ENABLE="true" ./bin/netobserv-ebpf-agent

from another terminal curl 127.0.0.1:9090/metrics | grep ebpf_agent
results
# TYPE ebpf_agent_bytes_submitted_via_grpc counter
ebpf_agent_bytes_submitted_via_grpc{operational="bytes_received_by_grpc"} 192
# HELP ebpf_agent_err_can_not_delete_flow_entries Error can not delete flow entries
# TYPE ebpf_agent_err_can_not_delete_flow_entries counter
ebpf_agent_err_can_not_delete_flow_entries{operational="errors while deleting flows"} 0
# HELP ebpf_agent_err_can_not_write_to_grpc Error can not write to GRPC
# TYPE ebpf_agent_err_can_not_write_to_grpc counter
ebpf_agent_err_can_not_write_to_grpc{operational="err_export_by_grpc"} 0
# HELP ebpf_agent_hashmap_evictions Number of hashmap evictions
# TYPE ebpf_agent_hashmap_evictions counter
ebpf_agent_hashmap_evictions{operational="hash map evictions"} 6
# HELP ebpf_agent_number_of_evicted_flows Number of evicted flows
# TYPE ebpf_agent_number_of_evicted_flows gauge
ebpf_agent_number_of_evicted_flows{operational="number of evicted flows"} 192
# HELP ebpf_agent_number_of_flows_received_via_ring_buffer Number of flows received via ring buffer
# TYPE ebpf_agent_number_of_flows_received_via_ring_buffer gauge
ebpf_agent_number_of_flows_received_via_ring_buffer{operational="number_of_flows_received"} 0
# HELP ebpf_agent_time_spent_in_lookup_and_delete_map Time spent in lookup and delete map
# TYPE ebpf_agent_time_spent_in_lookup_and_delete_map histogram
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.001"} 1
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.01"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="0.1"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="100"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="1000"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="10000"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_bucket{operational="time spent in lookup and delete",le="+Inf"} 6
ebpf_agent_time_spent_in_lookup_and_delete_map_sum{operational="time spent in lookup and delete"} 0.013046000999999998
ebpf_agent_time_spent_in_lookup_and_delete_map_count{operational="time spent in lookup and delete"} 6
# HELP ebpf_agent_userspace_evictions Number of userspace evictions
# TYPE ebpf_agent_userspace_evictions counter
ebpf_agent_userspace_evictions{operational="user space evictions"} 0
Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.

Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).

Does this PR require product documentation?

If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.

Does this PR require a product release notes entry?

If so, fill in "Release Note Text" in the JIRA.

Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.

If so, make sure it is described in the JIRA ticket.

QE requirements (check 1 from the list):

Standard QE validation, with pre-merge tests unless stated otherwise.

Regression tests only (e.g. refactoring with no user-facing change).

No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-02-08T14:49:12Z

@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

openshift-ci-robot · 2024-02-08T18:30:48Z

@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

github-actions · 2024-02-19T12:36:48Z

New image:
quay.io/netobserv/netobserv-ebpf-agent:1ed6690

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1ed6690 make set-agent-image

jotak · 2024-02-19T12:40:16Z

pkg/agent/config.go

+	// MetricsPromoEnable enables prometheus server to collect ebpf agent metrics, default is false.
+	MetricsPromoEnable bool `env:"METRICS_PROMO_ENABLE" envDefault:"false"`
+	// MetricsPromoServerAddress is the address of the prometheus server that collects ebpf agent metrics.
+	MetricsPromoServerAddress string `env:"METRICS_PROMO_SERVER_ADDRESS"`
+	// MetricsPromoPort is the port of the prometheus server that collects ebpf agent metrics.
+	MetricsPromoPort int `env:"METRICS_PROMO_PORT" envDefault:"9090"`


I'm curious about this "Promo" in the name, I guess it stands for "prometheus" but it's the first time I see it abbreviated like that ... anyway, I'm not sure we need to tell about Prometheus in the variable names, as it is the de-facto standard in k8s anyway. Can we just say "MetricsEnable" (or "EnableMetrics" to be more consistent with the other EnableSomething in the settings?), MetricsServerAddress, etc.
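To make the naming suggestion concrete, here is a minimal sketch of how the settings might look once "Promo" is dropped. The field names EnableMetrics, MetricsServerAddress and MetricsPort, and the environment variables METRICS_ENABLE, METRICS_SERVER_ADDRESS and METRICS_SERVER_PORT, are assumptions for illustration only, not names the PR settled on. The sketch parses the environment with the standard library so it stays self-contained; the agent itself declares these settings with env struct tags.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// MetricsConfig is a hypothetical, trimmed-down view of the agent settings
// with the "Promo" infix removed from the field names.
type MetricsConfig struct {
	EnableMetrics        bool
	MetricsServerAddress string
	MetricsPort          int
}

// loadMetricsConfig reads the (assumed) environment variables through lookup.
// Defaults mirror the diff under review: metrics disabled, port 9090.
func loadMetricsConfig(lookup func(string) (string, bool)) (MetricsConfig, error) {
	cfg := MetricsConfig{EnableMetrics: false, MetricsPort: 9090}
	if v, ok := lookup("METRICS_ENABLE"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("METRICS_ENABLE: %w", err)
		}
		cfg.EnableMetrics = b
	}
	if v, ok := lookup("METRICS_SERVER_ADDRESS"); ok {
		cfg.MetricsServerAddress = v
	}
	if v, ok := lookup("METRICS_SERVER_PORT"); ok {
		p, err := strconv.Atoi(v)
		if err != nil {
			return cfg, fmt.Errorf("METRICS_SERVER_PORT: %w", err)
		}
		cfg.MetricsPort = p
	}
	return cfg, nil
}

func main() {
	cfg, err := loadMetricsConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter is only a convenience for testing the sketch without touching the real process environment.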

jotak · 2024-02-19T12:43:44Z

pkg/agent/config.go

+	// MetricsPrefix is the prefix of the metrics that are sent to prometheus server.
+	MetricsPrefix string `env:"METRICS_PREFIX" envDefault:"ebpf_agent_"`
+	// MetricsNoPanic disables panic on metrics errors, default is false.
+	MetricsNoPanic bool `env:"METRICS_NO_PANIC" envDefault:"false"`


Maybe we can remove the NoPanic option, not sure if it really brings something useful?
(In FLP we have this "NoPanic" option mostly because the code was initially written with panics and we didn't want that in the operator ... there's not the same background here)

jotak · 2024-02-19T12:52:02Z

pkg/exporter/grpc_proto.go

+		hostPort:                      hostPort,
+		clientConn:                    clientConn,
+		maxFlowsPerMessage:            maxFlowsPerMessage,
+		numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"),


What about a counter named records_written_total, shared across exporters, with a label such as transport or exporter that could be either grpc or kafka or anything else?
Also it may be useful to count records but also the number of batches, so there could be a second metric batches_written_total with again an exporter or transport label.

On naming metrics and labels, it's recommended to read this: https://prometheus.io/docs/practices/naming/

It doesn't seem there is a way to find out how many batches have been written; it's just a single send call with all the records?

At line 52, just after for inputRecords := range input {, if you increment a batch counter by 1, wouldn't it give the batch count?
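The two ideas in this thread (one shared metric distinguished by an exporter label, and a batch counter incremented once per loop iteration) can be sketched as follows. To stay self-contained, the sketch uses a small hand-rolled labeledCounter as a stand-in for a Prometheus counter vector; that type, exportLoop and the send callback are hypothetical and not taken from the PR. Only the metric names records_written_total and batches_written_total come from the review.

```go
package main

import (
	"fmt"
	"sync"
)

// labeledCounter is a stdlib-only stand-in for a counter with one label.
// It is hypothetical and exists only to keep this sketch self-contained.
type labeledCounter struct {
	mu     sync.Mutex
	values map[string]float64
}

func newLabeledCounter() *labeledCounter {
	return &labeledCounter{values: map[string]float64{}}
}

// Add increases the series identified by label by n.
func (c *labeledCounter) Add(label string, n float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[label] += n
}

// Get returns the current value of the series identified by label.
func (c *labeledCounter) Get(label string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[label]
}

// exportLoop drains input. Each value received from the channel is one batch,
// so the batch counter goes up by 1 per iteration, and the record counter
// goes up by the batch length after a successful send.
func exportLoop(exporter string, input <-chan []string, send func([]string) error, records, batches *labeledCounter) {
	for batch := range input {
		batches.Add(exporter, 1)
		if err := send(batch); err != nil {
			continue // a real exporter would also report the error
		}
		records.Add(exporter, float64(len(batch)))
	}
}

func main() {
	records, batches := newLabeledCounter(), newLabeledCounter()
	input := make(chan []string, 2)
	input <- []string{"flow-a", "flow-b", "flow-c"}
	input <- []string{"flow-d"}
	close(input)

	exportLoop("grpc", input, func([]string) error { return nil }, records, batches)
	fmt.Printf("records_written_total{exporter=\"grpc\"} %v\n", records.Get("grpc"))
	fmt.Printf("batches_written_total{exporter=\"grpc\"} %v\n", batches.Get("grpc"))
}
```

Because the label selects the series, a Kafka exporter could reuse the same two counters with a different label value instead of defining its own metrics.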

jotak · 2024-02-19T12:53:14Z

pkg/exporter/grpc_proto.go

+		clientConn:                    clientConn,
+		maxFlowsPerMessage:            maxFlowsPerMessage,
+		numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"),
+		errExportByGRPC:               m.CreateErrorCanNotWriteToGRPC("err_export_by_grpc"),


Here also I think we could have a single metric, with the appropriate label, for reporting errors, rather than one different metric per exporter

jotak · 2024-02-19T12:59:03Z

pkg/agent/agent.go

-	rbTracer := flow.NewRingBufTracer(fetcher, mapTracer, cfg.CacheActiveTimeout)
+	m := metrics.NewMetrics(metricsSettings)
+	samplingGauge := m.CreateSamplingRate("sampling rate")
+	samplingGauge.Add(float64(cfg.Sampling))


Although it is technically the same since the gauge is initialized at 0, I would use .Set rather than .Add, as we are setting a value regardless of what the previous value was.

Suggested change

samplingGauge.Add(float64(cfg.Sampling))

samplingGauge.Set(float64(cfg.Sampling))

jotak · 2024-02-19T13:32:27Z

pkg/metrics/metrics.go

+	return c
+}
+
+func (m *Metrics) CreateHashMapCounter(stage string) prometheus.Counter {


In all the helper functions here, the "stage" param I think can be removed, that's something more relevant to FLP (because in FLP we configure custom pipelines, each with a stage name, so it was relevant to tie metrics to their stage)

jotak · 2024-02-19T13:35:31Z

pkg/flow/tracer_map.go

+	evictionCond               *sync.Cond
+	lastEvictionNs             uint64
+	hmapEvictionCounter        prometheus.Counter
+	numberOfEvictedFlows       prometheus.Gauge


I'm not sure why this one would be a gauge, this isn't something that will vary up and down, right? Looking at code it's only adding .. so that would rather be a counter

jotak · 2024-02-19T13:56:06Z

pkg/metrics/metrics.go

+		TypeCounter,
+		"operational",
+	)
+	timeSpentInLookupandDeleteMapSecondsTotal = defineMetric(


name suggestion: lookup_and_delete_map_duration_seconds
(we don't use the "total" suffix here since we are not adding measurements)

jotak · 2024-02-19T13:57:52Z

pkg/metrics/metrics.go

+		"hashmap_evictions_total",
+		"Number of hashmap evictions total",
+		TypeCounter,
+		"operational",


I think here and below "operational" is also a FLP leftover that can be removed

jotak · 2024-02-19T14:06:34Z

p99 of lookup&delete map is kinda scary... almost 1s

msherif1234 · 2024-02-19T14:11:17Z

p99 of lookup&delete map is kinda scary... almost 1s

Is this at scale? We know this path is the busiest path and a resource hog in the agent.

jotak · 2024-02-19T14:26:57Z

pkg/metrics/metrics.go

+}
+
+var (
+	hmapEvictionsTotal = defineMetric(


I have the feeling that we could merge metrics used for ringbuf with ones used for maps.
E.g. we could have:

evictions_total{source=bpf_ringbuf | bpf_maps}

evicted_flows_total{source=bpf_ringbuf | bpf_maps}

That would replace hmapEvictionsTotal, userspaceNumberOfEvictionsTotal, numberOfevictedFlowsTotal and numberofFlowsreceivedviaRingBufferTotal

Also, is it possible to get a "reason" label here that could be for instance "timeout" or "batch full" ? (or anything else that can trigger an eviction)

Since the ring buffer path is only used when the map is full, which shouldn't happen that often, I think that from a debugging point of view you need to see both, so you can use this as a hint to resize your hashmap table. If you merge them, won't you lose this visibility?
I will check for the eviction reason.

From looking at the code, there is either a timer eviction for the hashmap or an event for the ring buffer; there are no different reasons for evictions.

Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>

jotak · 2024-02-21T13:31:28Z

pkg/prometheus/prom_server.go

+	return httpServer
+}
+
+func defaultServer(srv *http.Server) *http.Server {


I wonder if we should come up with a common lib in netobserv org for this kind of code... the same code is used in console plugin, FLP and now here.
Anyway, not for this PR

jotak

lgtm
I will play more with that, create a dashboard etc. so perhaps will have further changes or additions to bring, but let's start from here and iterate if needed

thanks @msherif1234 !

msherif1234 · 2024-02-21T13:35:56Z

/approve

openshift-ci · 2024-02-21T13:36:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

Needs approval from an approver in each of these files:

~~OWNERS~~ [msherif1234]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the jira/valid-reference label Feb 7, 2024

msherif1234 force-pushed the promo-stats branch 3 times, most recently from a5e0ae2 to 01f6dac Compare February 7, 2024 19:52

msherif1234 changed the title ~~NETOBSERV-557: add eBPF agent metrics for troubleshooting~~ WIP: NETOBSERV-557: add eBPF agent metrics for troubleshooting Feb 7, 2024

openshift-ci bot added the do-not-merge/work-in-progress label Feb 7, 2024

msherif1234 force-pushed the promo-stats branch 3 times, most recently from 7a12625 to 05e4311 Compare February 8, 2024 14:24

msherif1234 changed the title ~~WIP: NETOBSERV-557: add eBPF agent metrics for troubleshooting~~ NETOBSERV-557: add eBPF agent metrics for troubleshooting Feb 8, 2024

openshift-ci bot removed the do-not-merge/work-in-progress label Feb 8, 2024

msherif1234 force-pushed the promo-stats branch from 05e4311 to 4e275d9 Compare February 8, 2024 18:33

jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 19, 2024