-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance in queries that perform "group by" operations #2276
Labels
Comments
tsandall
added a commit
to tsandall/opa
that referenced
this issue
Apr 13, 2020
This commit adds a benchmark that exercises comprehensions that perform cross products. This will give us a baseline for the improvement tracked by open-policy-agent#2276. Signed-off-by: Torin Sandall <torinsandall@gmail.com>
tsandall
added a commit
that referenced
this issue
Apr 13, 2020
This commit adds a benchmark that exercises comprehensions that perform cross products. This will give us a baseline for the improvement tracked by #2276. Signed-off-by: Torin Sandall <torinsandall@gmail.com>
tsandall
added a commit
to tsandall/opa
that referenced
this issue
May 1, 2020
This commit adds a new kind of indexing to the compiler and topdown to help avoid recomputing comprehensions. This helps with queries that perform "group by" operations. This optimization allows policies to perform group-by/aggregation in O(n) instead of O(n^2). The optimization works by computing a set of index keys for the comprehension at compile-time and then computing the collection once at evaluation-time and indexing the result based on the keys. The index keys are variables in the outer query that limit the values produced by the comprehension. In the simple group-by case these are the object values themselves. During evaluation, topdown checks if indexing is possible and builds the index by computing the comprehension without creating a closure over the outer query. This computes ALL values in the collection defined by the comprehension. The results are keyed by the assignments to the variables indicated in the comprehension index. This way the comprehension does not have to be recomputed for each set of assignments in the outer query. The index is exposed on both the compiler and the query compiler so that ad-hoc queries can benefit from the indexing as well. This is important for things like the playground where users may select a rule body and run it. If that exhibited n^2 behaviour it would be quite confusing. In order to be indexed, the comprehension must meet a few conditions. Importantly, the indexing should not worsen overall performance. To ensure this, comprehensions containing refs or walk() calls that include output vars that close over the outer query are not indexed. This means that if the caller were pushing down assignments to those vars, OPA will not compute the entire collection. In the future we can improve the index to cover more kinds of comprehensions. One improvement that would be particularly nice would be to allow the comprehension index to close over specific local variables in the parent scope. This would let us build the index in more cases--however, the analysis would need to be careful to take into account the count of closure variables. Variables with multiple assignments would be poor candidates. Benchmark results (before, O(n^2) runtime): BenchmarkComprehensionIndexing/10-16 13831 85821 ns/op BenchmarkComprehensionIndexing/100-16 208 5662625 ns/op BenchmarkComprehensionIndexing/1000-16 2 549295038 ns/op Benchmark results (after, O(n) runtime): BenchmarkComprehensionIndexing/10-16 35809 33369 ns/op BenchmarkComprehensionIndexing/100-16 3756 274546 ns/op BenchmarkComprehensionIndexing/1000-16 438 2725152 ns/op Fixes open-policy-agent#2276 Signed-off-by: Torin Sandall <torinsandall@gmail.com>
tsandall
added a commit
that referenced
this issue
May 1, 2020
This commit adds a new kind of indexing to the compiler and topdown to help avoid recomputing comprehensions. This helps with queries that perform "group by" operations. This optimization allows policies to perform group-by/aggregation in O(n) instead of O(n^2). The optimization works by computing a set of index keys for the comprehension at compile-time and then computing the collection once at evaluation-time and indexing the result based on the keys. The index keys are variables in the outer query that limit the values produced by the comprehension. In the simple group-by case these are the object values themselves. During evaluation, topdown checks if indexing is possible and builds the index by computing the comprehension without creating a closure over the outer query. This computes ALL values in the collection defined by the comprehension. The results are keyed by the assignments to the variables indicated in the comprehension index. This way the comprehension does not have to be recomputed for each set of assignments in the outer query. The index is exposed on both the compiler and the query compiler so that ad-hoc queries can benefit from the indexing as well. This is important for things like the playground where users may select a rule body and run it. If that exhibited n^2 behaviour it would be quite confusing. In order to be indexed, the comprehension must meet a few conditions. Importantly, the indexing should not worsen overall performance. To ensure this, comprehensions containing refs or walk() calls that include output vars that close over the outer query are not indexed. This means that if the caller were pushing down assignments to those vars, OPA will not compute the entire collection. In the future we can improve the index to cover more kinds of comprehensions. One improvement that would be particularly nice would be to allow the comprehension index to close over specific local variables in the parent scope. This would let us build the index in more cases--however, the analysis would need to be careful to take into account the count of closure variables. Variables with multiple assignments would be poor candidates. Benchmark results (before, O(n^2) runtime): BenchmarkComprehensionIndexing/10-16 13831 85821 ns/op BenchmarkComprehensionIndexing/100-16 208 5662625 ns/op BenchmarkComprehensionIndexing/1000-16 2 549295038 ns/op Benchmark results (after, O(n) runtime): BenchmarkComprehensionIndexing/10-16 35809 33369 ns/op BenchmarkComprehensionIndexing/100-16 3756 274546 ns/op BenchmarkComprehensionIndexing/1000-16 438 2725152 ns/op Fixes #2276 Signed-off-by: Torin Sandall <torinsandall@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Rego doesn't support mutation and there's no explicit "group by" operation. As a result, when policies need to aggregate or group data by some key, they need to use a comprehension and close over values from inside the query. This leads to cross-products in Rego when implementations in other languages would have linear runtime. This can be seen going back all the way to the old scheduler policy. For example:
This rule has O(nodes*pods) runtime. Another common case is inverting an object (which ends up being O(object^2). For example:
The latter case is trivial and could be addressed by a built-in function. The scheduler case above is not as easy. E.g., the built-in would have to accept a parameter that indicates how to key any index it builds.
A better solution would be for OPA to identify comprehensions that could and should be indexed (as well as the keys to use for those indices.) During evaluation, the evaluator can lazily build the comprehension index that would allow the query to efficiently lookup the comprehension value based on local variable bindings.
For example, in the scheduler case above, the index would be keyed by "node_name". In the second example, the index would be keyed by "v". The index could be hierarchical to avoid needing to prune lookup results.
Only comprehensions that are safe when excluding the parent query can be indexed. For example:
Note, in some cases it'll be less efficient to index comprehensions. For example:
In this case if O(p)>>O(q) then comprehension indexing is effective. OTOH, if O(q)>>O(p) then indexing may be more expensive because all values of 'x' are not known at indexing time. A simple heuristic could be to avoid indexing when keys are used as ref operands because they are probably used to perform constant-time lookups that shrink the search space.
Misc. notes
The text was updated successfully, but these errors were encountered: