High memory & CPU use, slow decisions #1443

Closed
aeneasr opened this issue May 20, 2019 · 13 comments

@aeneasr

aeneasr commented May 20, 2019

We're using OPA as an embedded service in ORY Keto, which was recently moved into production by a couple of adopters. With ORY Keto we provide a couple of "well-defined" APIs around standardized access control mechanisms (currently only what we call Access Control Policies).

It was reported that a medium number (20-30k) of those ACPs (not to be confused with Rego policies) consistently causes high CPU usage (100%), high memory consumption (5GB+), and slow responses (10-30s).

At first we thought it was the way we embed OPA, but the focus has since shifted towards OPA itself. We created a reproducible code base at: https://github.com/aeneasr/opa-bench

To set up a reproducible test case, run:

$ opa run --server --log-level error

# In a new terminal:
$ git clone https://github.com/aeneasr/opa-bench.git
$ cd opa-bench
$ GO111MODULE=on go install .

# Import rego policies:
$ ./opa-policies.sh

Next, run the benchmark tool. You can set the number of concurrent workers (--workers), the number of Access Control Policies (--policies), and the string matching strategy (regex, glob, or exact, ordered by decreasing computational complexity; exact is the cheapest as it is a simple string equality check):

$ opa-bench opa exact --workers 25 --policies 30000

The benchmark will run for about 10-12 seconds. It will first create 30k Access Control Policies and then start 25 workers that continuously poll localhost:8181/v1/data/ory/<flavor>/allow with a variety of (valid) inputs that return either true or false as the access control decision.

Running opa-bench opa will first clean up all existing data by issuing a DELETE /v1/data/store, which removes all relevant data entries, before (re-)creating the Access Control Policies.
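
For reference, a minimal Rego sketch of how the three matching flavors differ (hypothetical rule and field names, not the actual policies; those live in the opa-bench repository linked above):

package example

# exact: plain string equality against a resource listed in the ACP
resource_matches_exact(acp, resource) {
    acp.resources[_] == resource
}

# glob: delimiter-aware wildcard matching via the glob.match built-in
resource_matches_glob(acp, resource) {
    glob.match(acp.resources[_], [":"], resource)
}

# regex: full regular-expression matching via re_match
resource_matches_regex(acp, resource) {
    re_match(acp.resources[_], resource)
}

Each flavor iterates over acp.resources, so the per-request cost grows with the number of resources checked.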

Actual Behavior

We expected memory and CPU consumption to stay within reasonable boundaries and response times to stay below 100ms.

Instead, as we add more concurrent workers and more policies, response times (and CPU/memory use) increase significantly. Here are a couple of runs showing how latency grows with the number of Access Control Policies and workers for the different "flavors":

exact

$ opa-bench opa exact --workers 1 --policies 10
Took 0.00084977s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 1 --policies 100
Took 0.00229134s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 1 --policies 1000
Took 0.01577556s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 1 --policies 10000
Took 0.14910213s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 1 --policies 20000
Took 0.29849962s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 1 --policies 30000
Took 0.44687458s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 5 --policies 30000
Took 0.76463677s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 10 --policies 30000
Took 1.56407603s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 20 --policies 30000
Took 3.08979077s on average per request to: http://localhost:8181/v1/data/ory/exact/allow
$ opa-bench opa exact --workers 25 --policies 30000
Took 4.46234530s on average per request to: http://localhost:8181/v1/data/ory/exact/allow

glob

$ opa-bench opa glob --workers 1 --policies 10
Took 0.00129360s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 1 --policies 100
Took 0.00643750s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 1 --policies 1000
Took 0.05908481s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 1 --policies 10000
Took 0.62110626s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 1 --policies 20000
Took 1.27386651s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 1 --policies 30000
Took 1.90759661s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 5 --policies 30000
Took 2.82143575s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 10 --policies 30000
Took 5.47034376s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 20 --policies 30000
Took 11.16535831s on average per request to: http://localhost:8181/v1/data/ory/glob/allow
$ opa-bench opa glob --workers 25 --policies 30000
Took 13.80117587s on average per request to: http://localhost:8181/v1/data/ory/glob/allow

regex

$ opa-bench opa regex --workers 1 --policies 10
Took 0.00157184s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 1 --policies 100
Took 0.00754616s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 1 --policies 1000
Took 0.06946957s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 1 --policies 10000
Took 0.80506640s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 1 --policies 20000
Took 2.19482781s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 1 --policies 30000
Took 3.37429522s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 5 --policies 30000
Took 4.57060719s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 10 --policies 30000
Took 12.47872744s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 20 --policies 30000
Took 36.90770357s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 25 --policies 30000
Took 37.36773969s on average per request to: http://localhost:8181/v1/data/ory/regex/allow

We also tested a small number of Access Control Policies with up to 50 concurrent workers. Here, too, we saw an increase in response latency:

$ opa-bench opa regex --workers 1 --policies 100
Took 0.00717944s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 10 --policies 100
Took 0.01405837s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 25 --policies 100
Took 0.03782333s on average per request to: http://localhost:8181/v1/data/ory/regex/allow
$ opa-bench opa regex --workers 50 --policies 100
Took 0.08254780s on average per request to: http://localhost:8181/v1/data/ory/regex/allow

Memory use goes up to 4-5GB of RAM with 30k policies, and response times exceed the 20-second mark. The core OPA is running on is at full capacity (100%).

While it is expected that more expensive matchers like glob or regex increase the overall computation time, it is interesting to see that even "simple" string equality takes up to 4.5s, 5GB of RAM, and 100% CPU per decision. We also find it interesting how quickly the response latency increases with the number of workers.

Expected Behavior

Even with some data loaded (30k policies isn't that much if you imagine, e.g., a service with 10m+ MAUs), there shouldn't be this much strain on the CPU, and the memory consumption looks very suspicious. Responses should be much quicker.

Additional Info

After running our tests, we actually saw OPA at 15GB of RAM (we even observed 35GB, but that's when the system went down, so there's no screenshot of that)!

(Screenshot from 2019-05-20, 16:43, showing OPA memory usage)

Version

$ opa version
Version: 0.10.7
Build Commit: 0f39c27e
Build Timestamp: 2019-04-09T00:29:06Z
Build Hostname: 5ec141f488d4
@tsandall
Member

@aeneasr thanks for the excellent write up.

The eval latency is growing linearly with the number of ACPs, which is what I'd expect for the decide_allow function. With 1 worker and 30,000 ACPs the average latency is ~500ms (or 2 QPS per core). As the number of workers increases, queries start queuing and the end-to-end latency goes up.

I will see how we can bring down the latency. In the exact case, if we refactor the decide_allow function, we should be able to use partial eval and rule indexing to get close to constant-time.
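
For illustration, here is roughly the rule shape the indexer can dispatch on (hypothetical ACP values, not the actual Keto policy): once partial evaluation inlines the exact-match ACP data, each ACP becomes a rule whose body only compares input fields against constants.

package ory.exact

default allow = false

# One rule per inlined ACP; constant comparisons let the rule indexer
# select candidate rules without scanning every ACP.
allow {
    input.subject == "users:alice"
    input.action == "actions:read"
    input.resource == "resources:articles:1"
}

allow {
    input.subject == "users:bob"
    input.action == "actions:write"
    input.resource == "resources:articles:2"
}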

I'm traveling back from KubeCon EU on Monday so expect an update by Tuesday/Wednesday next week.

@aeneasr
Author

aeneasr commented May 26, 2019

Awesome, thank you, Torin, for helping with the investigation. Besides the high response latency, I think the memory consumption is something that should also be investigated :)

@tsandall
Member

@aeneasr I've started looking into this. What I found immediately is that a significant portion of the latency is due to conversion between the raw Go data structures that store the ACPs in-memory and the AST types used by the evaluator. I'm looking into whether we can use partial evaluation to solve this. If that doesn't work, I'll look into other ways of avoiding the conversion. Once the latency is addressed we can deal with the memory consumption.

@aeneasr
Author

aeneasr commented May 30, 2019

Thank you for following up, Torin! I've observed high memory use in the past stemming from repeated JSON <-> Go conversion. It's possible that the two are related. If I can be of any help, feel free to ping me any time (better here than on Slack, as I'm not really monitoring Slack).

edit:// Sorry, I misread regarding JSON <-> Go! In any case, thank you for spending time on this!

@tsandall tsandall self-assigned this May 31, 2019
@tsandall tsandall added the investigating Issues being actively investigated label May 31, 2019
@elwinar

elwinar commented Jun 4, 2019

Hi. Any progress on this?
I'm currently benchmarking various solutions for implementing a permission system, and ORY Keto seems to suffer from poor performance due to OPA.
I'm willing to contribute if needed, and I can provide my benchmark files (Keto policies and roles) if they would be of any help.

@jrmullins

This is also relevant to me and I wouldn't mind helping out if I knew where to look.

@tsandall
Member

tsandall commented Jun 6, 2019

I've been looking into this over the past couple days. Here are my initial findings and a tentative plan to address the latency issue.

  • The exact, glob, and regex policies in Keto perform a full scan over the ACP objects. In the linked Keto issue ("Latency with large number of policies", ory/keto#104) this amounts to ~240,000 objects in total (30,000 policies * 8 resources per policy). It's worth noting that a full scan takes ~600ms (on my machine) without any additional statements (i.e., no subject/action/resource checks, just iterating over the objects and returning "true"). Profiling OPA doesn't reveal any obvious bottlenecks; a significant portion of time is spent in GC due to all of the heap allocations that are happening. Running escape analysis didn't reveal any obvious low-hanging fruit. Nonetheless, even if we did identify some bottlenecks, I think it's unlikely we would get more than a 3-5x improvement.

  • The exact and glob policies in Keto can be rewritten such that partial evaluation can be used to inline most of the policy. The inlined version can be indexed by OPA's rule indexer, which will yield nearly constant-time evaluation for these large data sets. My benchmarks show that 30,000 exact ACPs can be checked in less than 100 microseconds. That said, partial evaluation is expensive, and running it on 30,000 policies (8 resources per policy) takes ~15 seconds on my machine. This precludes running it synchronously when policies are updated. You can find the benchmark code here if you're interested. Note that these benchmarks require some WIP changes to OPA: https://github.com/tsandall/opa/tree/spike-iam-scale. Once the policies have been partially evaluated, OPA consumes about 2GB of memory.

@aeneasr I've got some questions about the use case:

  1. Would it be possible to use glob.match instead of regex at this scale? It should be relatively easy to update the rule indexer to support glob.match. Doing the same for regex would be much more difficult. In the linked Keto issue, it seems the user is only using regex wildcards, so globs should be enough?

  2. How are ACPs updated? Would asynchronous updates be OK? I imagine that a slow cold start (e.g., 15 seconds) is acceptable but if you need to update a few ACPs synchronously then users would expect a fast response.

@aeneasr
Author

aeneasr commented Jun 11, 2019

Thank you for looking into this. We can give people guidance on using glob instead of regex when only glob-like patterns are being used.

I think a somewhat slow cold start could work.

@tsandall
Member

tsandall commented Jun 28, 2019

Update: with #1517 merged, it's possible to write IAM-style policies that use glob.match and get indexed. See https://www.openpolicyagent.org/docs/edge/faq/#glob-statements.
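
As a rough sketch (hypothetical values; see the linked FAQ for the exact requirements), an indexable rule keeps the glob pattern and delimiters constant and matches them against an input value:

package ory.glob

default allow = false

allow {
    input.subject == "users:alice"
    input.action == "actions:read"
    # constant pattern and delimiters, matched against the input resource
    glob.match("resources:articles:*", [":"], input.resource)
}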

The main issue now is that the partial evaluation and compile step takes way too long. I'm looking into the cause right now (it appears to be due to one of the compile steps allocating too much memory).

I've pushed some changes to https://github.com/tsandall/iambench to enable testing with glob.match.

The evaluation latency improvements are promising (mean ~50 microseconds, 99.9% < 1ms).

@aeneasr
Author

aeneasr commented Jul 12, 2019

I tried out version 0.12.1 of OPA with the test suite from above and unfortunately could not see any improvements. Memory use is still insane, CPU load is too high, and responses are too slow. I'll probably have to rewrite some of the policies to make use of partial evaluation to see if and how that solves the issues, but since that won't help with regex, I'm not sure it's worth investing time in.

What's the general stance on resource intensity and latency for OPA? Are real-time systems (such as Ingress/API gateways) even a good fit for OPA? Is the primary intention of OPA to replace/imitate systems like AWS or GCP IAM (the policy/role part)? Or is it really about, e.g., SSH login or Kubernetes deployments, where an extra 500ms for logging in isn't that terrible?

@tsandall
Member

Hey @aeneasr, thanks for testing v0.12.1. I'm sorry you weren't able to get an improvement out of the box, but that's not surprising, because the optimization mentioned above only applies if you can partially evaluate the policy (or write the policy initially in such a way that it gets indexed).

The "problem" is these policies currently require a linear scan over every single resource in every single ACP object. For the original test case in the issue this means checking 240,000 matches (30K ACPs with 8 resources each):

effect := policies[i].effect                                   # scans all policies (i is unbound)
policies[i].resources[_] == request.resource                   # scans all resources in policy 'i'
match_subjects(policies[i].subjects, roles, request.subject)   # scans all subjects in policy 'i'
policies[i].actions[_] == request.action                       # scans all actions in policy 'i'
condition.all_conditions_true(policies[i])

Reordering those matches to subject->action->resource or action->subject->resource (instead of resource->subject->action) would probably help in most cases; however, latency is still going to grow linearly with the number of ACPs. At some point GC is going to become a significant factor and performance will degrade further. If you want to check 30,000 ACPs in 100ms, that means ~3µs per check. On my dev laptop, one check takes ~40µs. I'd hazard a guess we can reduce this by 2-3x (maybe 5x?) with micro-optimizations like removing heap allocs, but I think that'll be a losing battle (and could take a lot of work).

With that said, there are some things we can do for some of your policies. For example, if we refactor your top-level allow rules that do the subject/action/resource matching, we can partially evaluate the policy and the result will be indexed (yielding constant-time eval). The eval performance looks like this (1K and 10K ACPs below):

$ go run cmd/main.go -flavor glob -amount 1000
2019/07/12 16:50:29 Running partial evaluation...
2019/07/12 16:50:32 Partial evaluation metrics: {
  "timer_rego_input_parse_ns": 318,
  "timer_rego_module_compile_ns": 2483633398,
  "timer_rego_module_parse_ns": 9498979,
  "timer_rego_partial_eval_ns": 1170182203,
  "timer_rego_query_compile_ns": 89208,
  "timer_rego_query_parse_ns": 178441
}
2019/07/12 16:50:32 Running evaluation...
2019/07/12 16:50:32 mean                 90%                  99%                  99.9%               
2019/07/12 16:50:37 29.91µs              37.112µs             120.714µs            236.884µs           
2019/07/12 16:50:42 29.994µs             35.808µs             121.604µs            239.032µs           
2019/07/12 16:50:47 30.394µs             38.974µs             106.313µs            263.604µs           
2019/07/12 16:50:52 30.28µs              36.461µs             109.695µs            197.814µs           
2019/07/12 16:50:57 32.773µs             40.219µs             137.237µs            238.739µs           
2019/07/12 16:51:02 36.126µs             42.194µs             137.664µs            274.453µs           
2019/07/12 16:51:07 39.658µs             42.085µs             128.264µs            3.506384ms          
2019/07/12 16:51:12 35.118µs             42.324µs             121.485µs            237.226µs 
$ go run cmd/main.go -flavor glob -amount 10000
2019/07/12 17:05:59 Running partial evaluation...
2019/07/12 17:06:40 Partial evaluation metrics: {
  "timer_rego_input_parse_ns": 539,
  "timer_rego_module_compile_ns": 28095862732,
  "timer_rego_module_parse_ns": 9192583,
  "timer_rego_partial_eval_ns": 12609325091,
  "timer_rego_query_compile_ns": 75859,
  "timer_rego_query_parse_ns": 157540
}
2019/07/12 17:06:40 Running evaluation...
2019/07/12 17:06:40 mean                 90%                  99%                  99.9%               
2019/07/12 17:06:45 35.862µs             42.521µs             120.667µs            1.169337ms          
2019/07/12 17:06:50 41.553µs             46.646µs             226.178µs            3.406305ms          
2019/07/12 17:06:55 36.926µs             39.322µs             175.905µs            1.956829ms          
2019/07/12 17:07:00 36.95µs              41.523µs             175.493µs            1.931025ms          
2019/07/12 17:07:05 37.101µs             41.662µs             215.841µs            313.556µs           
2019/07/12 17:07:10 35.769µs             40.776µs             183.814µs            717.434µs           
2019/07/12 17:07:15 36.485µs             41.773µs             180.778µs            316.275µs           
2019/07/12 17:07:20 36.262µs             42.066µs             182.687µs            242.675µs           
2019/07/12 17:07:25 38.519µs             45.133µs             205.925µs            312.903µs      

Memory usage:

1K glob ACPs with 8 resources per: ~170MB RSS
10K glob ACPs with 8 resources per: ~1.4GB RSS (~10x as expected)

Of course, this will only work for your exact and glob match policies. Adding indexing support for arbitrary regexp patterns would be hard and is unlikely in the near future. We could investigate a small fragment of regexp, but then I'm not sure what the benefit would be over globs. If you need to evaluate a large number of arbitrary regexp patterns (e.g., 10K, 100K) in a short amount of time (e.g., 100ms), then OPA is probably not a good fit.

As far as general guidance:

  • One general rule of thumb with OPA/Rego is to prefer maps to arrays. Arrays result in full scans. If you need N-way indices then arrays make sense, but if you can structure your data as maps, that often allows you to do constant-time lookups (see the sketch after this list).
  • If your policy can be expressed in the fragment supported by the indexer (or partial evaluation can transform the policy into that fragment), even better.
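
A minimal sketch of the maps-versus-arrays point (hypothetical data layout):

package example

# Array form: deciding whether a user is denied scans every element.
deny_array {
    data.blacklist_array[_] == input.user
}

# Set (or object) form: the same check becomes a key lookup instead of a scan.
deny_set {
    data.blacklist_set[input.user]
}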

So, yes, we expect OPA to be used in low-latency API authorization use cases, and it has been to date. You can get sub-millisecond response times for fairly large rulesets (e.g., the 10K example above involves 80,000 rules). You do need to be somewhat careful with the fragment of Rego you use, but it can be done. One thing that would help here is tooling that warns when you leave the fragment supported by the indexer, partial evaluation, etc.

@aeneasr
Author

aeneasr commented Jul 13, 2019

Thank you, Torin, for the excellent response (as usual). I will take that input and experiment further. I saw that there are several docs sections about optimizing Rego, which is awesome. Maybe adding some tooling support (a linter?) that helps avoid the most common performance pitfalls would be a really good idea!

@anderseknert
Member

Closing as resolved.
