Constraint status content is non-deterministic #3261

Closed
prachirp opened this issue Feb 9, 2024 · 5 comments · Fixed by #3277 or #3293
Labels: enhancement (New feature or request)

Comments

prachirp (Contributor) commented Feb 9, 2024

Describe the solution you'd like
When viewing the status of a constraint with violations, an arbitrary subset of the totalViolations is listed in the status, and this subset may change each time the cluster is audited. It would be preferable to sort the violations so that the same subset is listed in the constraint status; that way we can determine that a given violation was remediated or no longer exists.

Anything else you would like to add:
Seems like a few-line fix to the audit manager's updateConstraintStatus method: sort auditResults before appending them to the constraint status. https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/audit/manager.go#L863
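
A minimal sketch of the idea (the AuditResult type, its fields, and the helper below are hypothetical stand-ins, not the actual types in pkg/audit/manager.go; the real change would live in updateConstraintStatus):

```go
package main

import (
	"fmt"
	"sort"
)

// AuditResult is a stand-in for the per-violation record the audit manager
// collects; the real type in pkg/audit has different fields.
type AuditResult struct {
	Namespace string
	Name      string
	Message   string
}

// sortAuditResults orders results deterministically (namespace, then name)
// so that the truncated subset written to the constraint status is stable
// across audit runs, assuming the underlying violations have not changed.
func sortAuditResults(results []AuditResult) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Namespace != results[j].Namespace {
			return results[i].Namespace < results[j].Namespace
		}
		return results[i].Name < results[j].Name
	})
}

func main() {
	results := []AuditResult{
		{Namespace: "prod", Name: "zzz-pod"},
		{Namespace: "dev", Name: "aaa-pod"},
		{Namespace: "prod", Name: "aaa-pod"},
	}
	sortAuditResults(results)
	for _, r := range results {
		fmt.Println(r.Namespace + "/" + r.Name) // dev/aaa-pod, prod/aaa-pod, prod/zzz-pod
	}
}
```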

Environment:

  • Gatekeeper version:
  • Kubernetes version: (use kubectl version):
prachirp added the enhancement (New feature or request) label on Feb 9, 2024
JaydipGabani (Contributor) commented

@prachirp To get around this problem, you can use the constraint-violations-limit flag to increase the number of violations reported in the status, up to a maximum of 500 (the default is 20). If you are seeing more than 500 violations per constraint, you may be able to use pubsub to export violations instead.
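
For example (exact deployment names and chart values vary by install method, so treat this as an assumption about a typical setup), the flag would normally be passed as an argument to the audit pod, e.g. --constraint-violations-limit=500.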

maxsmythe (Contributor) commented Feb 15, 2024

Adding to @JaydipGabani 's comment, another option would be to read violations from the audit pod's logs.

WRT the specific proposal... unfortunately I don't think sorting violations solves the issue:

  • Assume we have a constraint with 50 violations and a cap of 20 maximum reported violations
  • Let's also assume the violations are stably sorted (the suggestion)
  • Let's assume some new violation, AAAA, is created. According to the sorting rules, it comes first, which pushes violation ZZZZ off the list of violations reported on the constraint.
  • It is impossible to know whether violation ZZZZ has been removed because it has been fixed or if it has merely been pushed off the list. This is the original problem all over again.

It may be worth sorting violations just to have more stable results, but unfortunately doing so doesn't allow us to make any conclusions about whether a violation has been remediated.

Ultimately, the audit report on the status of the resource is informational, but not authoritative. Due to scaling issues it will likely never be authoritative.

JaydipGabani (Contributor) commented

@prachirp To be precise, the limitation is not 500 violations per constraint. Gatekeeper can report violations on a constraint for as long as the etcd limit for the object is respected. By default, each object is allowed a maximum of 1.5 MiB - more info here. So Gatekeeper can report as many violations on one constraint resource as will fit while the size of the object remains under the etcd limit.
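
As a rough back-of-the-envelope illustration (the per-violation size here is an assumption, not a measured figure): if each serialized violation entry takes roughly 300 bytes, a 1.5 MiB object budget works out to about 1.5 * 1024 * 1024 / 300 ≈ 5,000 violations on a single constraint, minus whatever the rest of the object consumes.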

prachirp (Contributor, Author) commented

Hey @JaydipGabani @maxsmythe, thanks for responding! For some background: the violations from the constraint status are sufficient for us, so we won't move to alternate sources.

Secondly, we do see policies with thousands of total violations, so we know we will always miss some violations in the constraint status no matter the per-constraint violation limit. That is also acceptable.

The main reason for this proposal is to reduce churn in the violations reported in the constraint status and make the results slightly more predictable. I was wrong earlier: a finding disappearing from the sorted list won't indicate that it was remediated. But sorting the list will help us understand whether the reported violations have fully churned. Currently the set of total violations might be the same while the ones reported in the constraint status are completely different; with a sorted list that would no longer be the case. Being able to tell this apart would help us a lot.

Sorting violations would make it easier to reason about the state. For example, suppose constraint A had 40 findings starting with X, Y, Z at time 0 and later had 40 findings starting with A, X, Y at time 1. We could easily see that the state changed between those times, whereas before we could only see that there were 40 total findings both times and couldn't tell whether there was churn. A concrete case would be a cluster restart happening between time 0 and time 1: we can be more confident that the cluster has reached eventual consistency when the first 20 of the 40 total findings reported in the constraint status exactly match the first 20 of 40 reported before the restart.
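
A sketch of the kind of comparison this enables (names and the string representation of findings are illustrative; this would be consumer-side tooling, not Gatekeeper code):

```go
package main

import "fmt"

// reportedUnchanged reports whether the truncated, sorted subset of
// violations in the constraint status is identical at two points in time.
// With sorted violations this comparison is meaningful; with an arbitrary
// subset, identical totals could still hide full churn.
func reportedUnchanged(before, after []string) bool {
	if len(before) != len(after) {
		return false
	}
	for i := range before {
		if before[i] != after[i] {
			return false
		}
	}
	return true
}

func main() {
	t0 := []string{"X", "Y", "Z"} // first findings reported before restart
	t1 := []string{"A", "X", "Y"} // first findings reported after restart
	fmt.Println(reportedUnchanged(t0, t0)) // true: reported subset identical
	fmt.Println(reportedUnchanged(t0, t1)) // false: state changed between audits
}
```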

Hope the background helps! And let me know if there is any reason not to take the proposed solution of sorting auditResults before appending to constraint status.

prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 20, 2024
prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 20, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
maxsmythe (Contributor) commented

I think this makes sense; if nothing else, it has the potential to make diffs less noisy. As long as it's in a place where the performance impact is limited, SGTM.

prachirp added a commit to prachirp/gatekeeper that referenced this issue Feb 21, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
maxsmythe added a commit that referenced this issue Feb 22, 2024
Signed-off-by: Prachi Pendse <prachirp@google.com>
Co-authored-by: Max Smythe <smythe@google.com>
leewoobin789 pushed a commit to softlee-io/gatekeeper that referenced this issue Apr 1, 2024
…-policy-agent#3277)

Signed-off-by: Prachi Pendse <prachirp@google.com>
Co-authored-by: Max Smythe <smythe@google.com>