Constraint status content is non-deterministic #3261

Closed
prachirp opened this issue Feb 9, 2024 · 5 comments · Fixed by #3277 or #3293
Labels: enhancement (New feature or request)

Comments

prachirp (Contributor) commented Feb 9, 2024

Describe the solution you'd like
When viewing the status of a constraint with violations, an arbitrary subset of the totalViolations is listed in the status, and this subset may change each time the cluster is audited. It would be preferable to sort the violations so that the same subset is listed in the constraint status; that way we can determine that a given violation was remediated or no longer exists.

Anything else you would like to add:
Seems like a few-line fix to the audit manager's updateConstraintStatus method: sort auditResults before appending them to the constraint status. https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/audit/manager.go#L863
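
A minimal sketch of the idea (the AuditResult type, its fields, and the helper below are hypothetical stand-ins, not the actual types in pkg/audit/manager.go; the real change would live in updateConstraintStatus):

```go
package main

import (
	"fmt"
	"sort"
)

// AuditResult is a stand-in for the per-violation record the audit manager
// collects; the real type in pkg/audit has different fields.
type AuditResult struct {
	Namespace string
	Name      string
	Message   string
}

// sortAuditResults orders results deterministically (namespace, then name)
// so that the truncated subset written to the constraint status is stable
// across audit runs, assuming the underlying violations have not changed.
func sortAuditResults(results []AuditResult) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Namespace != results[j].Namespace {
			return results[i].Namespace < results[j].Namespace
		}
		return results[i].Name < results[j].Name
	})
}

func main() {
	results := []AuditResult{
		{Namespace: "prod", Name: "zzz-pod"},
		{Namespace: "dev", Name: "aaa-pod"},
		{Namespace: "prod", Name: "aaa-pod"},
	}
	sortAuditResults(results)
	for _, r := range results {
		fmt.Println(r.Namespace + "/" + r.Name) // dev/aaa-pod, prod/aaa-pod, prod/zzz-pod
	}
}
```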

Environment:

  • Gatekeeper version:
  • Kubernetes version: (use kubectl version):
prachirp added the enhancement (New feature or request) label on Feb 9, 2024
JaydipGabani (Contributor) commented

@prachirp To get around this problem, you can use the constraint-violations-limit flag to increase the number of violations reported in the status, up to a maximum of 500 (the default is 20). If you are seeing more than 500 violations per constraint, you may be able to use pubsub to export violations instead.
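
For example (exact deployment names and chart values vary by install method, so treat this as an assumption about a typical setup), the flag would normally be passed as an argument to the audit pod, e.g. --constraint-violations-limit=500.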

maxsmythe (Contributor) commented Feb 15, 2024

Adding to @JaydipGabani 's comment, another option would be to read violations from the audit pod's logs.

WRT the specific proposal... unfortunately I don't think sorting violations solves the issue:

  • Assume we have a constraint with 50 violations and a cap of 20 maximum reported violations
  • Let's also assume the violations are stably sorted (the suggestion)
  • Let's assume some new violation, AAAA, is created. According to the sorting rules, it comes first, which pushes violation ZZZZ off the list of violations reported on the constraint.
  • It is impossible to know whether violation ZZZZ has been removed because it has been fixed or if it has merely been pushed off the list. This is the original problem all over again.

It may be worth sorting violations just to have more stable results, but unfortunately doing so doesn't allow us to make any conclusions about whether a violation has been remediated.

Ultimately, the audit report on the status of the resource is informational, but not authoritative. Due to scaling issues it will likely never be authoritative.

JaydipGabani (Contributor) commented

@prachirp To be precise, the limitation is not 500 violations per constraint. Gatekeeper can report violations on a constraint for as long as the etcd limit for the object is respected. By default, each object is allowed a maximum of 1.5 MiB - more info here. So Gatekeeper can report as many violations on one constraint resource as will fit while the size of the object remains under the etcd limit.
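
As a rough back-of-the-envelope illustration (the per-violation size here is an assumption, not a measured figure): if each serialized violation entry takes roughly 300 bytes, a 1.5 MiB object budget works out to about 1.5 * 1024 * 1024 / 300 ≈ 5,000 violations on a single constraint, minus whatever the rest of the object consumes.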

prachirp (Contributor, Author) commented

Hey @JaydipGabani @maxsmythe, thanks for responding! For some background: the violations from the constraint status are sufficient for us, so we won't move to alternate sources.

Secondly, we do see policies with thousands of total violations, so we know we will always miss some violations in the constraint status no matter the per-constraint violation limit. That is also acceptable.

The main reason for this proposal is to reduce churn in the violations reported in the constraint status and make the results slightly more predictable. I was wrong earlier: a finding disappearing from the sorted list won't indicate that it was remediated. But sorting the list will help us understand whether the reported violations have fully churned. Currently the set of total violations might be the same while the ones reported in the constraint status are completely different; with a sorted list that would no longer be the case. Being able to tell this apart would help us a lot.

Sorting violations would make it easier to reason about the state. For example, suppose constraint A had 40 findings starting with X, Y, Z at time 0 and later had 40 findings starting with A, X, Y at time 1. We could easily see that the state changed between those times, whereas before we could only see that there were 40 total findings both times and couldn't tell whether there was churn. A concrete case would be a cluster restart happening between time 0 and time 1: we can be more confident that the cluster has reached eventual consistency when the first 20 of the 40 total findings reported in the constraint status exactly match the first 20 of 40 reported before the restart.
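
A sketch of the kind of comparison this enables (names and the string representation of findings are illustrative; this would be consumer-side tooling, not Gatekeeper code):

```go
package main

import "fmt"

// reportedUnchanged reports whether the truncated, sorted subset of
// violations in the constraint status is identical at two points in time.
// With sorted violations this comparison is meaningful; with an arbitrary
// subset, identical totals could still hide full churn.
func reportedUnchanged(before, after []string) bool {
	if len(before) != len(after) {
		return false
	}
	for i := range before {
		if before[i] != after[i] {
			return false
		}
	}
	return true
}

func main() {
	t0 := []string{"X", "Y", "Z"} // first findings reported before restart
	t1 := []string{"A", "X", "Y"} // first findings reported after restart
	fmt.Println(reportedUnchanged(t0, t0)) // true: reported subset identical
	fmt.Println(reportedUnchanged(t0, t1)) // false: state changed between audits
}
```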

Hope the background helps! And let me know if there is any reason not to take the proposed solution of sorting auditResults before appending to constraint status.

prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 20, 2024
prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 20, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
maxsmythe (Contributor) commented

I think this makes sense; if nothing else, it has the potential to make diffs less noisy. As long as it's in a place where the performance impact is limited, SGTM.

prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 21, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
maxsmythe added a commit that referenced this issue Feb 22, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
Co-authored-by: Max Smythe <smythe@google.com>
leewoobin789 pushed a commit to softlee-io/gatekeeper that referenced this issue Apr 1, 2024
…-policy-agent#3277)

Signed-off-by: Prachi Pendse <prachirp@google.com>
Co-authored-by: Max Smythe <smythe@google.com>