-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NETOBSERV-557: add eBPF agent metrics for troubleshooting #263
Conversation
@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a5e0ae2
to
01f6dac
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #263 +/- ##
==========================================
+ Coverage 33.53% 35.90% +2.36%
==========================================
Files 40 42 +2
Lines 3554 3777 +223
==========================================
+ Hits 1192 1356 +164
- Misses 2293 2343 +50
- Partials 69 78 +9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
7a12625
to
05e4311
Compare
@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@msherif1234: This pull request references NETOBSERV-557 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
05e4311
to
4e275d9
Compare
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=1ed6690 make set-agent-image |
pkg/agent/config.go
Outdated
// MetricsPromoEnable enables prometheus server to collect ebpf agent metrics, default is false. | ||
MetricsPromoEnable bool `env:"METRICS_PROMO_ENABLE" envDefault:"false"` | ||
// MetricsPromoServerAddress is the address of the prometheus server that collects ebpf agent metrics. | ||
MetricsPromoServerAddress string `env:"METRICS_PROMO_SERVER_ADDRESS"` | ||
// MetricsPromoPort is the port of the prometheus server that collects ebpf agent metrics. | ||
MetricsPromoPort int `env:"METRICS_PROMO_PORT" envDefault:"9090"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about this "Promo" in the name, I guess it stands for "prometheus" but it's the first time I see it abbreviated like that ... anyway, I'm not sure we need to tell about Prometheus in the variable names, as it is the de-facto standard in k8s anyway. Can we just say "MetricsEnable" (or "EnableMetrics" to be more consistent with the other EnableSomething in the settings?), MetricsServerAddress, etc.
pkg/agent/config.go
Outdated
// MetricsPrefix is the prefix of the metrics that are sent to prometheus server. | ||
MetricsPrefix string `env:"METRICS_PREFIX" envDefault:"ebpf_agent_"` | ||
// MetricsNoPanic disables panic on metrics errors, default is false. | ||
MetricsNoPanic bool `env:"METRICS_NO_PANIC" envDefault:"false"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can remove the NoPanic option, not sure if it really brings something useful?
(In FLP we have this "NoPanic" option mostly because the code was initially written with panics and we didn't want that in the operator ... there's not the same background here)
4e275d9
to
c916a40
Compare
pkg/exporter/grpc_proto.go
Outdated
hostPort: hostPort, | ||
clientConn: clientConn, | ||
maxFlowsPerMessage: maxFlowsPerMessage, | ||
numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a counter named records_written_total
, shared across exporters, with a label such as transport
or exporter
that could be either grpc
or kafka
or anything else?
Also it may be useful to count records but also the number of batches, so there could be a second metric batches_written_total
with again an exporter
or transport
label.
On naming metrics and labels, it's recommend to read this: https://prometheus.io/docs/practices/naming/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't seem there is a way to findout how many batches been written its just a single send call with all records ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 52 just after for inputRecords := range input {
if you increment a batch counter by 1, wouldn't it give the batch count?
pkg/exporter/grpc_proto.go
Outdated
clientConn: clientConn, | ||
maxFlowsPerMessage: maxFlowsPerMessage, | ||
numberOfRecordsReceivedByGRPC: m.CreateNumberOfRecordsReceivedByGRPC("number_of_records_received_by_grpc"), | ||
errExportByGRPC: m.CreateErrorCanNotWriteToGRPC("err_export_by_grpc"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also I think we could have a single metric, with the appropriate label, for reporting errors, rather than one different metric per exporter
pkg/agent/agent.go
Outdated
rbTracer := flow.NewRingBufTracer(fetcher, mapTracer, cfg.CacheActiveTimeout) | ||
m := metrics.NewMetrics(metricsSettings) | ||
samplingGauge := m.CreateSamplingRate("sampling rate") | ||
samplingGauge.Add(float64(cfg.Sampling)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although it is technically the same since the gauge is initialized at 0, I would use .Set
rather than .Add
, as we are setting a value regardless what was the previous value.
samplingGauge.Add(float64(cfg.Sampling)) | |
samplingGauge.Set(float64(cfg.Sampling)) |
pkg/metrics/metrics.go
Outdated
return c | ||
} | ||
|
||
func (m *Metrics) CreateHashMapCounter(stage string) prometheus.Counter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In all the helper functions here, the "stage" param I think can be removed, that's something more relevant to FLP (because in FLP we configure custom pipelines, each with a stage name, so it was relevant to tie metrics to their stage)
c916a40
to
8f82b7e
Compare
pkg/flow/tracer_map.go
Outdated
evictionCond *sync.Cond | ||
lastEvictionNs uint64 | ||
hmapEvictionCounter prometheus.Counter | ||
numberOfEvictedFlows prometheus.Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this one would be a gauge, this isn't something that will vary up and down, right? Looking at code it's only adding .. so that would rather be a counter
8f82b7e
to
5e877eb
Compare
pkg/metrics/metrics.go
Outdated
TypeCounter, | ||
"operational", | ||
) | ||
timeSpentInLookupandDeleteMapSecondsTotal = defineMetric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name suggestion: lookup_and_delete_map_duration_seconds
(we don't use the "total" suffix here since we are not adding measurements)
pkg/metrics/metrics.go
Outdated
"hashmap_evictions_total", | ||
"Number of hashmap evictions total", | ||
TypeCounter, | ||
"operational", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here and below "operational" is also a FLP leftover that can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
5e877eb
to
40135c5
Compare
} | ||
|
||
var ( | ||
hmapEvictionsTotal = defineMetric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that we could merge metrics used for ringbuf with ones used for maps.
E.g. we could have:
- evictions_total{source=bpf_ringbuf | bpf_maps}
- evicted_flows_total{source=bpf_ringbuf | bpf_maps}
That would replace hmapEvictionsTotal, userspaceNumberOfEvictionsTotal, numberOfevictedFlowsTotal and numberofFlowsreceivedviaRingBufferTotal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it possible to get a "reason" label here that could be for instance "timeout" or "batch full" ? (or anything else that can trigger an eviction)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since rb map is when map is full which shouldn't happen that often I think from debugging pov u need to see both so u can use this as hint to resize ur hmap table but if u merge u will lose this visibility ?
I will check for eviction reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is either timer eviction for hmap or event for rb not different reasons for evictions from looking at the code
40135c5
to
46ecf2a
Compare
Signed-off-by: Mohamed Mahmoud <mmahmoud@redhat.com>
46ecf2a
to
77d43d1
Compare
return httpServer | ||
} | ||
|
||
func defaultServer(srv *http.Server) *http.Server { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should come up with a common lib in netobserv org for this kind of code... the same code is used in console plugin, FLP and now here.
Anyway, not for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
I will play more with that, create a dashboard etc. so perhaps will have further changes or additions to bring, but let's start from here and iterate if needed
thanks @msherif1234 !
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msherif1234 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description
add promo metrics to eBPF agent with the ability to export metrics to promo Server
unit-test
tested locally using standalone ebpf-agent
sudo LOG_LEVEL=debug FLOWS_TARGET_HOST=127.0.0.1 FLOWS_TARGET_PORT=9999 METRICS_PROMO_ENABLE="true" ./bin/netobserv-ebpf-agent
curl 127.0.0.1:9090/metrics | grep ebpf_agent
results
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.