metrics: Add SELinux related audit metrics to the enricher #916
Codecov Report
@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   50.64%   50.75%   +0.10%
==========================================
  Files          42       42
  Lines        4626     4660      +34
==========================================
+ Hits         2343     2365      +22
- Misses       2198     2209      +11
- Partials       85       86       +1
Thank you for the review @saschagrunert
/test pull-security-profiles-operator-test-e2e
I'm concerned about this dramatically increasing the cardinality of the SELinux metrics...
As a general rule of thumb I'd avoid having any metric whose cardinality on a /metrics could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100. These are only rules of thumb, if you know for example that the number of replicas of your service will always be tiny then you can exceed these a bit, conversely if you're expecting thousands of replicas I'd carefully consider every time series on the /metrics.
What are you trying to achieve with these metrics? Maybe there is another solution that doesn't require adding these new labels.
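As a rough illustration of why extra labels are expensive: in the worst case every combination of label values becomes its own time series, so the series count for one metric is the product of the number of distinct values each label takes. The Go sketch below assumes purely hypothetical value counts for the four labels proposed in this PR (perms, scontext, tcontext, tclass); the type and function names are illustrative and not part of the operator.

```go
package main

import "fmt"

// labelCardinality describes one label and how many distinct values it
// might take on a single /metrics endpoint (hypothetical numbers).
type labelCardinality struct {
	name   string
	values int
}

// worstCaseSeries returns the upper bound on the number of time series a
// single metric can expose: one series per combination of label values.
func worstCaseSeries(labels []labelCardinality) int {
	total := 1
	for _, l := range labels {
		total *= l.values
	}
	return total
}

func main() {
	// Illustrative value counts only, not measurements from the operator.
	labels := []labelCardinality{
		{name: "scontext", values: 5},
		{name: "tcontext", values: 8},
		{name: "tclass", values: 4},
		{name: "perm", values: 6},
	}
	fmt.Println("all four labels:", worstCaseSeries(labels))
	fmt.Println("scontext+tcontext only:", worstCaseSeries(labels[:2]))
}
```

With these made-up numbers the metric could grow to 960 series per endpoint, while keeping only the source and target contexts caps it at 40.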
metricsLabelPerms,
metricsLabelScontext,
metricsLabelTcontext,
metricsLabelTclass,
Wouldn't doing this dramatically increase the cardinality of this metric? I really think we shouldn't do this...
What real-world implication does increasing the cardinality have?
The main issue (the real-world implication) is that increasing the cardinality of metrics will overload Prometheus. Thus, if we add more than what's minimally needed to debug problems at scale, we make the SPO a noisy neighbour in terms of metrics.
To have a way of observing what is being recorded. There are two reasons:
If the number of labels is a concern, I guess we could remove /some/, but at least the source and target contexts should stay so that you can see what is touching what and then consult the full log enricher logs.
btw, the problem with "watching log enricher logs" is that each spod instance has its own. The metrics allow you to watch all of them from a central place.
/hold for @JAORMX
OK, I removed the perm and the class labels. This still retains scontext and tcontext, which should give an idea of who is touching what, and brings the number of labels to just one more than the already existing seccomp metric.
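The effect of dropping perm and tclass can also be seen on a concrete set of events: denials that share source and target contexts but differ only in class or permission collapse into a single series. A minimal sketch with hypothetical AVC events (the SELinux contexts are made-up examples, and the type and function names are illustrative only):

```go
package main

import "fmt"

// avcEvent holds the four label values proposed originally. Being a struct
// of strings, it is comparable and can be used directly as a map key.
type avcEvent struct {
	scontext, tcontext, tclass, perm string
}

// contextPair keeps only the labels retained after review.
type contextPair struct {
	scontext, tcontext string
}

// countSeries returns how many distinct series the events would create
// with all four labels and with only scontext/tcontext.
func countSeries(events []avcEvent) (full, reduced int) {
	fullSet := make(map[avcEvent]struct{})
	reducedSet := make(map[contextPair]struct{})
	for _, e := range events {
		fullSet[e] = struct{}{}
		reducedSet[contextPair{e.scontext, e.tcontext}] = struct{}{}
	}
	return len(fullSet), len(reducedSet)
}

func main() {
	// Hypothetical contexts for a container process touching host files.
	src := "system_u:system_r:container_t:s0:c4,c808"
	varLib := "system_u:object_r:var_lib_t:s0"
	etc := "system_u:object_r:etc_t:s0"
	events := []avcEvent{
		{src, varLib, "dir", "read"},
		{src, varLib, "dir", "search"},
		{src, varLib, "file", "open"},
		{src, varLib, "file", "read"},
		{src, etc, "file", "read"},
	}
	full, reduced := countSeries(events)
	fmt.Println(full, reduced) // 5 2
}
```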
I think this is a good compromise. Thanks!
ugh, another CI failure, let me reopen the PR.
/hold cancel
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JAORMX, jhrozek, saschagrunert
/retest
What type of PR is this?
/kind feature
What this PR does / why we need it:
On dispatching an AVC line from the enricher via GRPC, also issue a
metric with the details of the AVC, similarly to what we do in the
seccomp case.
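The flow described above (take the AVC line, extract the labels, bump a counter) can be sketched with the Go standard library alone. This is not the enricher's actual code: a real implementation would more likely use a labelled Prometheus counter and the operator's own audit parsing, and `parseAVC`, `avcCounter` and `recordAVC` are hypothetical names. The sketch keeps only the scontext and tcontext labels agreed on in review, and uses a mutex-protected map as a stand-in for the counter.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// avcLabels are the two labels kept after review.
type avcLabels struct {
	scontext string
	tcontext string
}

// avcCounter is a hypothetical in-memory stand-in for a labelled counter;
// it is safe for concurrent use.
type avcCounter struct {
	mu     sync.Mutex
	counts map[avcLabels]uint64
}

func newAVCCounter() *avcCounter {
	return &avcCounter{counts: make(map[avcLabels]uint64)}
}

// parseAVC extracts scontext and tcontext from the space-separated
// key=value fields of an audit AVC line. It reports false if either is missing.
func parseAVC(line string) (avcLabels, bool) {
	var l avcLabels
	for _, field := range strings.Fields(line) {
		key, value, ok := strings.Cut(field, "=") // requires Go 1.18+
		if !ok {
			continue
		}
		switch key {
		case "scontext":
			l.scontext = value
		case "tcontext":
			l.tcontext = value
		}
	}
	return l, l.scontext != "" && l.tcontext != ""
}

// recordAVC would be called alongside sending the AVC over gRPC; it
// increments the series for the line's source/target context pair.
func (c *avcCounter) recordAVC(line string) bool {
	l, ok := parseAVC(line)
	if !ok {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[l]++
	return true
}

func main() {
	c := newAVCCounter()
	// Example AVC record with made-up values.
	line := `type=AVC msg=audit(1613173578.156:2945): avc:  denied  { read } for  pid=75593 comm="ls" name="data" scontext=system_u:system_r:container_t:s0:c4,c808 tcontext=system_u:object_r:var_lib_t:s0 tclass=dir permissive=0`
	c.recordAVC(line)
	c.recordAVC(line)
	for l, n := range c.counts {
		fmt.Printf("%s -> %s: %d\n", l.scontext, l.tcontext, n)
	}
}
```

A plain whitespace split is enough for the two context fields in this sketch; other audit fields such as `comm` or `name` may be quoted or hex-encoded and would need more careful handling.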
Which issue(s) this PR fixes:
Fixes #609
Does this PR have test?
No. I'm not sure whether our way of retrieving metrics through the service
is stable enough for tests.
Special notes for your reviewer:
N/A
Does this PR introduce a user-facing change?