Improve performance in queries that perform "group by" operations #2276

tsandall · 2020-04-07T12:46:44Z

Rego doesn't support mutation and there's no explicit "group by" operation. As a result, when policies need to aggregate or group data by some key, they need to use a comprehension and close over values from inside the query. This leads to cross-products in Rego when implementations in other languages would have linear runtime. This can be seen going back all the way to the old scheduler policy. For example:

# this is from test/scheduler/scheduler_test.go
pods_on_node[node_id] = pds {
    node_name := nodes[node_id].metadata.name
    pds := [p | pods[i].spec.nodeName == node_name; p := pods[i]]
}

This rule has O(nodes*pods) runtime. Another common case is inverting an object (which ends up being O(object^2). For example:

v := obj[_]
ks := [k | some k; v==obj[k]]

The latter case is trivial and could be addressed by a built-in function. The scheduler case above is not as easy. E.g., the built-in would have to accept a parameter that indicates how to key any index it builds.

A better solution would be for OPA to identify comprehensions that could and should be indexed (as well as the keys to use for those indices.) During evaluation, the evaluator can lazily build the comprehension index that would allow the query to efficiently lookup the comprehension value based on local variable bindings.

For example, in the scheduler case above, the index would be keyed by "node_name". In the second example, the index would be keyed by "v". The index could be hierarchical to avoid needing to prune lookup results.

Only comprehensions that are safe when excluding the parent query can be indexed. For example:

x := p[_]
_ = {1 | y = x[_]}  # cannot be indexed because 'x' is not safe

Note, in some cases it'll be less efficient to index comprehensions. For example:

p[x]
ys := {y | q[x][y] = 1}

In this case if O(p)>>O(q) then comprehension indexing is effective. OTOH, if O(q)>>O(p) then indexing may be more expensive because all values of 'x' are not known at indexing time. A simple heuristic could be to avoid indexing when keys are used as ref operands because they are probably used to perform constant-time lookups that shrink the search space.

Misc. notes

Comprehension index must be invalidated by with statements

This commit adds a benchmark that exercises comprehensions that perform cross products. This will give us a baseline for the improvement tracked by open-policy-agent#2276. Signed-off-by: Torin Sandall <torinsandall@gmail.com>

This commit adds a benchmark that exercises comprehensions that perform cross products. This will give us a baseline for the improvement tracked by #2276. Signed-off-by: Torin Sandall <torinsandall@gmail.com>

This commit adds a new kind of indexing to the compiler and topdown to help avoid recomputing comprehensions. This helps with queries that perform "group by" operations. This optimization allows policies to perform group-by/aggregation in O(n) instead of O(n^2). The optimization works by computing a set of index keys for the comprehension at compile-time and then computing the collection once at evaluation-time and indexing the result based on the keys. The index keys are variables in the outer query that limit the values produced by the comprehension. In the simple group-by case these are the object values themselves. During evaluation, topdown checks if indexing is possible and builds the index by computing the comprehension without creating a closure over the outer query. This computes ALL values in the collection defined by the comprehension. The results are keyed by the assignments to the variables indicated in the comprehension index. This way the comprehension does not have to be recomputed for each set of assignments in the outer query. The index is exposed on both the compiler and the query compiler so that ad-hoc queries can benefit from the indexing as well. This is important for things like the playground where users may select a rule body and run it. If that exhibited n^2 behaviour it would be quite confusing. In order to be indexed, the comprehension must meet a few conditions. Importantly, the indexing should not worsen overall performance. To ensure this, comprehensions containing refs or walk() calls that include output vars that close over the outer query are not indexed. This means that if the caller were pushing down assignments to those vars, OPA will not compute the entire collection. In the future we can improve the index to cover more kinds of comprehensions. One improvement that would be particularly nice would be to allow the comprehension index to close over specific local variables in the parent scope. This would let us build the index in more cases--however, the analysis would need to be careful to take into account the count of closure variables. Variables with multiple assignments would be poor candidates. Benchmark results (before, O(n^2) runtime): BenchmarkComprehensionIndexing/10-16 13831 85821 ns/op BenchmarkComprehensionIndexing/100-16 208 5662625 ns/op BenchmarkComprehensionIndexing/1000-16 2 549295038 ns/op Benchmark results (after, O(n) runtime): BenchmarkComprehensionIndexing/10-16 35809 33369 ns/op BenchmarkComprehensionIndexing/100-16 3756 274546 ns/op BenchmarkComprehensionIndexing/1000-16 438 2725152 ns/op Fixes open-policy-agent#2276 Signed-off-by: Torin Sandall <torinsandall@gmail.com>

This commit adds a new kind of indexing to the compiler and topdown to help avoid recomputing comprehensions. This helps with queries that perform "group by" operations. This optimization allows policies to perform group-by/aggregation in O(n) instead of O(n^2). The optimization works by computing a set of index keys for the comprehension at compile-time and then computing the collection once at evaluation-time and indexing the result based on the keys. The index keys are variables in the outer query that limit the values produced by the comprehension. In the simple group-by case these are the object values themselves. During evaluation, topdown checks if indexing is possible and builds the index by computing the comprehension without creating a closure over the outer query. This computes ALL values in the collection defined by the comprehension. The results are keyed by the assignments to the variables indicated in the comprehension index. This way the comprehension does not have to be recomputed for each set of assignments in the outer query. The index is exposed on both the compiler and the query compiler so that ad-hoc queries can benefit from the indexing as well. This is important for things like the playground where users may select a rule body and run it. If that exhibited n^2 behaviour it would be quite confusing. In order to be indexed, the comprehension must meet a few conditions. Importantly, the indexing should not worsen overall performance. To ensure this, comprehensions containing refs or walk() calls that include output vars that close over the outer query are not indexed. This means that if the caller were pushing down assignments to those vars, OPA will not compute the entire collection. In the future we can improve the index to cover more kinds of comprehensions. One improvement that would be particularly nice would be to allow the comprehension index to close over specific local variables in the parent scope. This would let us build the index in more cases--however, the analysis would need to be careful to take into account the count of closure variables. Variables with multiple assignments would be poor candidates. Benchmark results (before, O(n^2) runtime): BenchmarkComprehensionIndexing/10-16 13831 85821 ns/op BenchmarkComprehensionIndexing/100-16 208 5662625 ns/op BenchmarkComprehensionIndexing/1000-16 2 549295038 ns/op Benchmark results (after, O(n) runtime): BenchmarkComprehensionIndexing/10-16 35809 33369 ns/op BenchmarkComprehensionIndexing/100-16 3756 274546 ns/op BenchmarkComprehensionIndexing/1000-16 438 2725152 ns/op Fixes #2276 Signed-off-by: Torin Sandall <torinsandall@gmail.com>

tsandall added the optimization label Apr 7, 2020

tsandall self-assigned this Apr 7, 2020

tsandall mentioned this issue Apr 13, 2020

topdown: Benchmark comprehensions that perform cross products #2290

Merged

tsandall mentioned this issue Apr 29, 2020

ast, topdown: Index comprehensions to avoid unnecessary work #2357

Merged

tsandall closed this as completed in #2357 May 1, 2020

ashutosh-narkar added this to Open Policy Agent Aug 5, 2024

ashutosh-narkar moved this to Done in Open Policy Agent Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance in queries that perform "group by" operations #2276

Improve performance in queries that perform "group by" operations #2276

tsandall commented Apr 7, 2020 •

edited

Loading

Improve performance in queries that perform "group by" operations #2276

Improve performance in queries that perform "group by" operations #2276

Comments

tsandall commented Apr 7, 2020 • edited Loading

tsandall commented Apr 7, 2020 •

edited

Loading