Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Add SELinux related audit metrics to the enricher #916

Merged
merged 1 commit into from
May 9, 2022

Conversation

jhrozek
Copy link
Contributor

@jhrozek jhrozek commented May 3, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

On dispatching an AVC line from the enricher via GRPC, also issue a
metric with the details of the AVC, similarly to what we do in the
seccomp case.

Which issue(s) this PR fixes:

Fixes #609

Does this PR have test?

No, I'm not sure if our way of getting metrics through the service
is stable enough for tests.

Special notes for your reviewer:

N/A

Does this PR introduce a user-facing change?

The security_profiles_operator_selinux_profile_audit_total metric was actually
enabled and uses the appropriate labels scraped from the audit.log file.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 3, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #916 (6dd8132) into main (72a0786) will increase coverage by 0.10%.
The diff coverage is 61.53%.

@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   50.64%   50.75%   +0.10%     
==========================================
  Files          42       42              
  Lines        4626     4660      +34     
==========================================
+ Hits         2343     2365      +22     
- Misses       2198     2209      +11     
- Partials       85       86       +1     

internal/pkg/daemon/metrics/grpc.go Outdated Show resolved Hide resolved
internal/pkg/daemon/enricher/enricher.go Outdated Show resolved Hide resolved
internal/pkg/daemon/metrics/grpc.go Outdated Show resolved Hide resolved
@jhrozek
Copy link
Contributor Author

jhrozek commented May 3, 2022

Thank you for the review @saschagrunert

@jhrozek
Copy link
Contributor Author

jhrozek commented May 3, 2022

/test pull-security-profiles-operator-test-e2e

Copy link
Contributor

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about this dramatically increasing the cardinality of the SELinux metrics...

From an article on the topic

As a general rule of thumb I'd avoid having any metric whose cardinality on a /metrics could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100. These are only rules of thumb, if you know for example that the number of replicas of your service will always be tiny then you can exceed these a bit, conversely if you're expecting thousands of replicas I'd carefully consider every time series on the /metrics.

What are you trying to get with these metrics? Maybe there can be another solution instead of adding these new labels.

metricsLabelPerms,
metricsLabelScontext,
metricsLabelTcontext,
metricsLabelTclass,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't doing this dramatically increase the cardinality of this metric? I really think we shouldn't do this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What real-world implication does increasing the cardinality have?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue (real-world implication) is that increasing cardinality of metrics will lead to overloading Prometheus. Thus, by not adding only what's minimally needed for us to debug problems at scale we make the SPO a noisy neighbour in terms of metrics

@jhrozek
Copy link
Contributor Author

jhrozek commented May 4, 2022

I'm concerned about this dramatically increasing the cardinality of the SELinux metrics...

From an article on the topic

As a general rule of thumb I'd avoid having any metric whose cardinality on a /metrics could potentially go over 10 due to growth in label values. The way I think of it is that if it has already grown to be 10 today, it might be 15 in a year's time. It can be additionally okay to have no more than a literal handful within a given Prometheus that go to 100. These are only rules of thumb, if you know for example that the number of replicas of your service will always be tiny then you can exceed these a bit, conversely if you're expecting thousands of replicas I'd carefully consider every time series on the /metrics.

What are you trying to get with these metrics? Maybe there can be another solution instead of adding these new labels.

Have a way of observing what is being recorded. There's two reasons:

  • for the recording itself
  • I'm still thinking about "watching" workloads with a policy when introducing the policy into production and I was thinking about using the log enricher so that you could see what AVCs does the workload trigger and can adjust either the workload or the policy.

If the number of labels is a concern, I guess we could remove /some/ but at least the source and target contexts should stay so that you could see what is touching what and go to see the full log enricher logs.

@jhrozek
Copy link
Contributor Author

jhrozek commented May 4, 2022

btw the problem with "watching log enricher logs" is that each spod instance has their own. The metrics allow you to watch all of them from a central place.

Copy link
Member

@saschagrunert saschagrunert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold for @JAORMX

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 4, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2022
@jhrozek
Copy link
Contributor Author

jhrozek commented May 4, 2022

OK, I removed the perm and the class. This still retains scontext and tcontext which should give an idea of who is touching what and brings the amount of labels to just one more than the already existing seccomp metric

@JAORMX
Copy link
Contributor

JAORMX commented May 4, 2022

OK, I removed the perm and the class. This still retains scontext and tcontext which should give an idea of who is touching what and brings the amount of labels to just one more than the already existing seccomp metric

I think this is a good compromise. Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 4, 2022
@jhrozek
Copy link
Contributor Author

jhrozek commented May 5, 2022

ugh another CI failure, let me reopen the PR..

@jhrozek jhrozek closed this May 5, 2022
@jhrozek jhrozek reopened this May 5, 2022
@jhrozek
Copy link
Contributor Author

jhrozek commented May 5, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2022
On dispatching an AVC line from the enricher via GRPC, also issue a
metric with the details of the AVC, similarly to what we do in the
seccomp case.
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 6, 2022
@saschagrunert
Copy link
Member

/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JAORMX, jhrozek, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JAORMX,jhrozek,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JAORMX
Copy link
Contributor

JAORMX commented May 9, 2022

/retest

@k8s-ci-robot k8s-ci-robot merged commit fe57597 into kubernetes-sigs:main May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SELinux recording metrics are missing
5 participants