High memory & CPU use, slow decisions #1443
@aeneasr thanks for the excellent write-up. The eval latency is growing linearly with the number of ACPs, which is what I'd expect for the decide_allow function. With 1 worker and 30,000 ACPs the average latency is ~500ms (or 2 QPS per core). As the number of workers increases, queries are going to start queuing and the end-to-end latency goes up. I will see how we can bring down the latency. In the exact case, if we refactor the decide_allow function, we should be able to use partial eval and rule indexing to get close to constant-time. I'm traveling back from KubeCon EU on Monday so expect an update by Tuesday/Wednesday next week.
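As a rough illustration of the rule-indexing idea (a hypothetical sketch, not Keto's actual decide_allow policy): when every allow rule only compares input fields against constant values, OPA's rule indexer can jump straight to the matching rules instead of trying each one in turn, so adding more rules of that shape barely changes per-query latency. A minimal sketch using OPA's Go API, with made-up package and field names:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
)

// Hypothetical module: each allow rule only tests input fields for equality
// with constants, which is the shape the rule indexer can exploit.
const indexable = `
package example

default allow = false

allow {
	input.subject == "users:alice"
	input.action == "get"
	input.resource == "articles:1"
}

allow {
	input.subject == "users:bob"
	input.action == "delete"
	input.resource == "articles:2"
}
`

func main() {
	rs, err := rego.New(
		rego.Query("data.example.allow"),
		rego.Module("example.rego", indexable),
		rego.Input(map[string]interface{}{
			"subject":  "users:bob",
			"action":   "delete",
			"resource": "articles:2",
		}),
	).Eval(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("allow:", rs[0].Expressions[0].Value) // true
}
```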
Awesome, thank you, Torin, for helping with the investigation. Besides the high response latency, I think the memory consumption is something that should also be investigated :)
@aeneasr I've started looking into this. What I found immediately is that a significant portion of the latency is due to conversion between the raw Go data structures that store the ACPs in-memory and the AST types used by the evaluator. I'm looking into whether we can use partial evaluation to solve this. If that doesn't work, I'll look into other ways of avoiding the conversion. Once the latency is addressed we can deal with the memory consumption.
Thank you for following up, Torin! I've observed high memory use in the past stemming from repeated json <-> go conversion. It's possible that those two are related. If I can be of any help, feel free to ping me any time (better here than on Slack as I'm not really monitoring Slack). edit:// Sorry, I misread regarding json <-> go! In any case, thank you for spending time on this!
Hi. Any progress on this?
This is also relevant to me and I wouldn't mind helping out if I knew where to look.
I've been looking into this over the past couple of days. Here are my initial findings and a tentative plan to address the latency issue.
@aeneasr I've got some questions about the use case:
Thank you for looking into this. We can give people guidance on using glob instead of regex when only glob-like patterns are being used. I think a somewhat slow cold start could work.
Update. With #1517 merged it's possible to write IAM-style policies that use the glob.match built-in. The main issue now is that the partial evaluation and compile step takes way too long. I'm looking into the cause right now (it appears to be due to one of the compile steps allocating too much memory). I've pushed some changes to https://github.com/tsandall/iambench to enable testing with glob.match. The evaluation latency improvements are promising (mean ~50 microseconds, 99.9% < 1ms).
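For reference, a small, hedged example of what a glob-flavored rule can look like (the package, field names, and patterns are invented here, not the ones generated by iambench); glob.match takes a pattern, a list of delimiters, and the value to match:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
)

// Hypothetical glob-flavored rule using the glob.match built-in.
const globPolicy = `
package example

default allow = false

allow {
	input.subject == "users:alice"
	input.action == "get"
	glob.match("articles:*", [":"], input.resource)
}
`

func main() {
	rs, err := rego.New(
		rego.Query("data.example.allow"),
		rego.Module("example.rego", globPolicy),
		rego.Input(map[string]interface{}{
			"subject":  "users:alice",
			"action":   "get",
			"resource": "articles:42",
		}),
	).Eval(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("allow:", rs[0].Expressions[0].Value) // true
}
```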
I tried out the 0.12.1 version of opa with the test suite from above and unfortunately could not see any improvements. Memory use is still insane, CPU load is too high, and responses are too slow. I'll probably have to rewrite some of the policies to make use of partial evaluation to see if and how that solves the issues, but since that won't solve the regex case I'm not sure if it's worth investing time in. What's the general stance on resource intensity and latency for OPA? Are real-time systems (such as Ingress/API Gateways) even a good fit for OPA? Is the primary intention of OPA to replace/imitate systems like AWS or GCP IAM (the policy/role part)? Or is it really about e.g. SSH login or e.g. Kubernetes deployments, where an extra 500ms for logging in isn't that terrible?
Hey @aeneasr, thanks for testing v0.12.1. I'm sorry you weren't able to get an improvement out of the box, but that's not surprising because the optimization mentioned above only applies if you can partially evaluate the policy (or write the policy initially in such a way that it gets indexed). The "problem" is that these policies currently require a linear scan over every single resource in every single ACP object. For the original test case in the issue this means checking 240,000 matches (30K ACPs with 8 resources each):
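A hedged sketch of a policy with that shape (the field names and data layout are invented, not Keto's actual ACP schema): every query walks all ACPs and all of their resource/subject/action patterns, so work grows with the product of those counts.

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
	"github.com/open-policy-agent/opa/storage/inmem"
)

// Illustrative module: re_match (OPA's regex built-in) is applied to every
// pattern of every ACP under data.store.policies, i.e. a linear scan.
const linearScan = `
package ory.regex

default allow = false

allow {
	p := data.store.policies[_]
	re_match(p.resources[_], input.resource)
	re_match(p.subjects[_], input.subject)
	re_match(p.actions[_], input.action)
	p.effect == "allow"
}
`

func main() {
	// One hypothetical ACP; the benchmark loads tens of thousands of these.
	store := inmem.NewFromObject(map[string]interface{}{
		"store": map[string]interface{}{
			"policies": []interface{}{
				map[string]interface{}{
					"subjects":  []interface{}{"users:.*"},
					"actions":   []interface{}{"get", "delete"},
					"resources": []interface{}{"articles:[0-9]+"},
					"effect":    "allow",
				},
			},
		},
	})

	rs, err := rego.New(
		rego.Query("data.ory.regex.allow"),
		rego.Module("policy.rego", linearScan),
		rego.Store(store),
		rego.Input(map[string]interface{}{
			"subject":  "users:alice",
			"action":   "get",
			"resource": "articles:42",
		}),
	).Eval(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Println("allow:", rs[0].Expressions[0].Value) // true
}
```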
Reordering those matches to subject->action->resource or action->subject->resource (instead of resource->subject->action) would probably help in most cases; however, latency is still going to grow linearly with the number of ACPs. At some point GC is going to become a significant factor and performance will degrade further. If you want to check 30,000 ACPs in 100ms that means ~3µs per check. On my dev laptop, one check takes ~40µs. I'd hazard a guess we can reduce this by 2-3x (maybe 5x?) with micro-optimizations like removing heap allocs, but I think that'll be a losing battle (and could take a lot of work). With that said, there are some things we can do for some of your policies. For example, if we refactor your top-level allow rules that do the subject/action/resource matching, we can partially evaluate the policy and the result will be indexed (yielding constant-time eval). The eval performance looks like this (1K and 10K below):
$ go run cmd/main.go -flavor glob -amount 1000
2019/07/12 16:50:29 Running partial evaluation...
2019/07/12 16:50:32 Partial evaluation metrics: {
"timer_rego_input_parse_ns": 318,
"timer_rego_module_compile_ns": 2483633398,
"timer_rego_module_parse_ns": 9498979,
"timer_rego_partial_eval_ns": 1170182203,
"timer_rego_query_compile_ns": 89208,
"timer_rego_query_parse_ns": 178441
}
2019/07/12 16:50:32 Running evaluation...
2019/07/12 16:50:32 mean 90% 99% 99.9%
2019/07/12 16:50:37 29.91µs 37.112µs 120.714µs 236.884µs
2019/07/12 16:50:42 29.994µs 35.808µs 121.604µs 239.032µs
2019/07/12 16:50:47 30.394µs 38.974µs 106.313µs 263.604µs
2019/07/12 16:50:52 30.28µs 36.461µs 109.695µs 197.814µs
2019/07/12 16:50:57 32.773µs 40.219µs 137.237µs 238.739µs
2019/07/12 16:51:02 36.126µs 42.194µs 137.664µs 274.453µs
2019/07/12 16:51:07 39.658µs 42.085µs 128.264µs 3.506384ms
2019/07/12 16:51:12 35.118µs 42.324µs 121.485µs 237.226µs
Memory usage:
Of course, this will only work for your exact and glob match policies. Adding indexing support for arbitrary regexp patterns would be hard and is unlikely in the near future. We could investigate a small fragment of regexp, but then I'm not sure what the benefit would be over globs. If you need to evaluate a large number of arbitrary regexp patterns (e.g., 10K, 100K) in a short amount of time (e.g., 100ms), then OPA is probably not a good fit. As far as general guidance:
So, yes, we expect OPA to be used in low-latency API authorization use cases, and it has been to date. You can get sub-millisecond response times for fairly large rulesets (e.g., the 10K example above involves 80,000 rules...). You do need to be somewhat careful with the fragment of Rego you use, but it can be done. One thing that would help here is tooling that warns when you exit the fragment that's supported by the indexer, partial evaluation, etc.
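To make the split between the expensive one-off partial-eval/compile step (the timer_rego_module_compile_ns and timer_rego_partial_eval_ns numbers above) and the cheap per-decision evaluation concrete, here is a minimal sketch using OPA's Go API; the policy, query, and input fields are placeholders rather than the actual iambench or Keto ones:

```go
package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
)

// Placeholder policy standing in for the generated IAM-style policies.
const module = `
package example

default allow = false

allow {
	input.subject == "users:alice"
	input.action == "get"
}
`

func main() {
	ctx := context.Background()

	// Startup: parse, compile, and partially evaluate once. This is the slow,
	// memory-hungry part reported in the metrics above.
	pr, err := rego.New(
		rego.Query("data.example.allow"),
		rego.Module("example.rego", module),
	).PartialResult(ctx)
	if err != nil {
		panic(err)
	}

	// Per decision: bind the input and evaluate only the residual, indexed rules.
	rs, err := pr.Rego(rego.Input(map[string]interface{}{
		"subject": "users:alice",
		"action":  "get",
	})).Eval(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println("allow:", rs[0].Expressions[0].Value) // true
}
```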
Thank you, Torin, for the excellent response (as usual). I will take that input and try to experiment further. I saw that there are several doc sections about optimizing Rego, which is awesome. Maybe adding some tooling support (a linter?) that helps with avoiding the most common performance pitfalls would be a really good idea!
Closing as resolved.
We're using opa as an embedded service in ORY Keto, which was recently moved into production by a couple of adopters. With ORY Keto we provide a couple of "well-defined" APIs around standardized access control mechanisms (currently only what we call Access Control Policies).
It was reported that, with a moderate number (20-30k) of those ACPs (not to be confused with Rego policies), high CPU usage (100%), high memory consumption (5GB+), and slow responses (10-30s) are consistently observed.
At first we thought it was the way we embed opa, but the focus shifted towards opa itself. We created a reproducible code base at: https://github.com/aeneasr/opa-bench
To set up a reproducible test case, run:
Next, run the benchmark tool. You can set the number of concurrent workers (--workers), the number of Access Control Policies (--policies), as well as the string matching strategy (supports regex, glob, exact - ordered by computational complexity, with exact having the lowest computational complexity as it's a simple string equals).
The benchmark will run for about 10-12 seconds. It will first create 30k Access Control Policies and then create 25 workers that continuously poll localhost:8181/v1/data/ory/<flavor>/allow with a variety of (valid) inputs that either return true or false as the access control decision.
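As a hedged example of what such a poll can look like against OPA's Data API (the input field names here are invented; the ones opa-bench actually sends may differ):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical input document for one access control decision.
	body, err := json.Marshal(map[string]interface{}{
		"input": map[string]interface{}{
			"subject":  "users:alice",
			"action":   "get",
			"resource": "articles:42",
		},
	})
	if err != nil {
		panic(err)
	}

	// The decision is exposed under OPA's Data API at /v1/data/<package>/<rule>.
	resp, err := http.Post(
		"http://localhost:8181/v1/data/ory/exact/allow",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("allow:", out.Result)
}
```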
Running opa-bench opa will first clean up all existing data by doing a DELETE /v1/data/store, which removes all relevant data entries before (re-)creating the Access Control Policies.
Actual Behavior
It is expected that memory and CPU consumption stay within reasonable boundaries and that response times are below 100ms.
We're seeing that as we're adding more concurrent workers and more policies, response times (and CPU/memory use) increase significantly. Here are a couple of runs which show a significant time increase with increasing Access Control Policies / Workers for the different "flavors":
exact
glob
regex
We also tested a few access control policies with up to 50 concurrent workers. Here too we saw an increase in response latency:
Memory use goes up to 4-5GB of RAM with 30k policies and response times are above the 20 second mark. The core OPA is running on is at full capacity (100%).
While it is expected that more expensive matchers like glob or regex increase the overall computational time, it is interesting to see that even "simple" string equality takes up to 4.5s, 5GB of RAM, and 100% CPU time per decision. We think it's also interesting how quickly the response latency goes up as the number of workers increases.
Expected Behavior
Even with some data loaded (30k isn't that much if I imagine e.g. a service with 10m+ MAUs), there shouldn't be so much strain on the CPU, and the memory consumption looks very suspicious. Responses should be much quicker.
Additional Info
After running our tests we actually saw OPA at 15GB of RAM (we even observed 35GB but that's when the system went down so no screenshot)!
Version