[RFC] Security Performance Test Suite #3903
Good list of scenarios, as there are clear classes where the security plugin has undefined CPU/memory performance characteristics, O(n^2) or O(2^N) based on the number of indices. This would help identify where we can provide better instructions with best practices to existing clusters, and also give us a way to measure the impact of performance improvements over time.
I feel like this represents a different 'class' of test case - one that is distinct from correctness. I think of the correctness and perf areas as covering measures such as the number of indices or the number of roles assigned to a user. For TLS versions, the considerations depend on compliance with standards such as FIPS - they need a level of configurability. I think it is a very useful area to test - but I'd break that dimension out separately, and I'm less concerned about getting results until we are looking to add additional configuration options and want to make sure they are in alignment.
I'd prefer we keep it simple and always run them; see ResourceFocusedTests.java and the associated GitHub workflow for an example of these 'constrained' tests. What do you think of addressing specific performance time issues, such as having a snapshot that can be restored with the test data set rather than creating items with individual requests? Maybe the benchmark folks can help with recommendations.
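For illustration, a minimal sketch of what a snapshot-based setup could look like in a test fixture, assuming the transport client admin API and a filesystem snapshot shipped with the test resources; the repository path, repository name and snapshot name below are hypothetical placeholders, not existing artifacts of this repo:

```java
import org.opensearch.client.Client;
import org.opensearch.common.settings.Settings;

public class BenchmarkDataFixture {

    // Hypothetical sketch: restore a pre-built snapshot containing the benchmark
    // data set instead of indexing documents one by one.
    public static void restoreBenchmarkData(Client client) {
        // Register a filesystem repository that points at the snapshot bundled
        // with the test resources.
        client.admin().cluster().preparePutRepository("perf-test-repo")
            .setType("fs")
            .setSettings(Settings.builder().put("location", "src/test/resources/perf-snapshot"))
            .get();

        // Restore the snapshot and block until completion, so the actual benchmark
        // measurements are not polluted by setup work.
        client.admin().cluster().prepareRestoreSnapshot("perf-test-repo", "perf-baseline")
            .setWaitForCompletion(true)
            .get();
    }
}
```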
I would be concerned about creating new infrastructure for this - while it would be possible, such as by using OpenSearch-ci, that becomes a management burden compared to the managed GitHub runners. Using dedicated infrastructure also creates a bottleneck for testing and deployment of new features or test areas in forks. Note: I know this makes a trade-off where you have a hard time testing locally - I'd rather have PRs get this testing than rely on contributors / maintainers to remember to run the tests.
Not sure what is available in this space, I'd love to see proposals. What codecov provides on pull requests is really nice - but I don't know what platforms are out there for these multi-dimensional perf metrics. In a more bootstrapping way, we could use an LKG that is built and runs the tests, then run the PR, then have another test compare the results. This might also offer a way for these tests to be run locally via gradle, triggered by a GHA.
Closing thoughts - Reliability
I believe the most important part of these tests would be that it is clear what they are testing, and also clear what is wrong when they fail. If we have a failing perf test, it would be unfortunate for it to be off-handedly ignored because its impact isn't clear. I'd rather have only a few highly trusted test cases than a bunch of edge case coverage with high variation in results and reliability.
Have you seen github-action-benchmark? Maybe this would be a good approach to tackle the data collection / analysis.
Good point! This helps define the scope of the issue.
Do you have an intuition for what the maximum duration of the performance tests should be, then?
Good idea! Will have a look.
@nibix How about we start with a max of 20 minutes and work backwards from there? I pulled that number from the current max PR check times.
Sorry for the delay, but I can now follow up with some further responses and items to discuss.
I can totally follow your reasoning here. Yet, there's one further problem we have to consider, which also plays into the trade-offs for the infrastructure question: the variance of benchmark results we will have to expect. If you look at the existing OpenSearch benchmarks, you can see a variation of up to 18% between runs for some operations. This poses a problem if one wants to use the benchmark results to automatically verify the absence of performance regressions in pull requests. We cannot just test for "equal or better" performance, but have to add some tolerance interval. However:
By default, the benchmark GitHub action will trigger an error only if the performance regressed by a factor of 2: https://github.com/benchmark-action/github-action-benchmark#alert-comment-on-commit-page To be honest, I have some doubts whether this provides sufficient control. Let's consider what factors influence the magnitude of the variance:
Cluster benchmarks vs micro benchmarks
Originally, I was thinking about running benchmarks using "real" clusters. This is the "natural" way to benchmark OpenSearch; there is a dedicated tool to perform such tests: opensearch-benchmark. However, we see in the discussion that there might be some requirements which are possibly not compatible with this approach. One of these requirements is the ability to provide trustworthy benchmark results in the context of a pull request. Another approach is micro benchmarks, which focus on a particular code section. The relationship between these two approaches is similar to the relationship between integration tests and unit tests. Let's compare the approaches a bit more!
Cluster level benchmarks
Micro benchmarks
Discussion
At the moment, I am not too concerned about the initialization of test data. Most test cases will only depend on the number of indices, not the number of documents. Even if we need tens of thousands of indices, it should be possible to create these quickly. An exception might be DLS/FLS, but even there it is debatable whether a significant number of documents is required. Still, having the benchmark folks' opinion on the whole topic would of course be extremely helpful :-)
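As a rough illustration, the setup could look like the sketch below, assuming the transport client admin API; the index naming and settings are arbitrary placeholders:

```java
import org.opensearch.client.Client;
import org.opensearch.common.settings.Settings;

public class ManyIndicesFixture {

    // Creates `count` empty indices. Since most test cases only depend on the
    // number of indices, no documents need to be indexed here. For very large
    // counts, the requests could also be issued in parallel.
    public static void createIndices(Client client, int count) {
        for (int i = 0; i < count; i++) {
            client.admin().indices().prepareCreate("perf_index_" + i)
                .setSettings(Settings.builder()
                    .put("index.number_of_shards", 1)
                    .put("index.number_of_replicas", 0))
                .get();
        }
    }
}
```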
BTW, to increase visibility, it might make sense to tag this with the
I like this thought process quite a bit - it satisfies my desire for test reliability. I suppose what I'm curious about is what 'micro' encompasses; for example, I think the 'base' test case for the privilege evaluator's 'allowed' response is something that we should capture. Is that too big already? What kind of examples did you have in mind?
I would think the same. Roughly, the call of
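For illustration, a minimal JMH sketch of what such a micro benchmark could look like; the `PrivilegeCheck` interface below is a hypothetical stand-in for whatever privilege evaluation entry point is eventually chosen and does not reflect the actual class or method signature in the security plugin:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PrivilegeEvaluationBenchmark {

    /** Hypothetical stand-in for the privilege evaluator entry point. */
    public interface PrivilegeCheck {
        boolean isAllowed(String user, String action, String index);
    }

    // Test dimensions: number of indices in the cluster metadata and number of
    // roles assigned to the benchmark user.
    @Param({"100", "1000", "10000"})
    public int numberOfIndices;

    @Param({"2", "50", "200"})
    public int numberOfRoles;

    private PrivilegeCheck privilegeCheck;

    @Setup
    public void setUp() {
        // Build cluster metadata and role configuration of the requested size,
        // then construct the evaluator under test (omitted in this sketch).
        privilegeCheck = (user, action, index) -> true; // placeholder
    }

    @Benchmark
    public boolean evaluateAllowed() {
        // The "base" case: a request that is allowed by the role configuration.
        return privilegeCheck.isAllowed("perf_user", "indices:data/read/search", "perf_index_42");
    }
}
```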
Great proposal @nibix. The biggest pain point of adding any new workload is sourcing the data, prepping it and then integrating it with opensearch-benchmark, but if we are not too keen on indexing performance, we can quickly get started by using a snapshot from an existing cluster that caters to our requirements.
[Triage] Thanks for filing this issue @nibix. Since this is an RFC, we can mark this as triaged for further engagement and steps for moving forward.
Just to clarify what I'd expect of a performance test: I don't think we can reliably calibrate tests on microsecond/millisecond scales. There are far too many low-level system variables that might be hard to reproduce without dedicated hardware, which I'm not in favor of. Tests that I do think would be viable without environment constraints:
There might be ways to do more/better, but I think we can establish a performance baseline and detect regressions as we build features over time. What do you think @nibix?
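As a rough illustration of what such an environment-tolerant check could look like: the bound below is an arbitrary, deliberately generous placeholder, so that jitter on shared runners does not cause flakiness, and the workload method is hypothetical:

```java
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class CoarsePerformanceRegressionTest {

    // Arbitrary, deliberately generous budget: the goal is only to catch
    // order-of-magnitude regressions, not microsecond-level differences.
    private static final long MAX_MILLIS = 30_000;

    @Test
    public void representativeWorkloadStaysWithinBudget() {
        long start = System.nanoTime();
        runRepresentativeWorkload();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        assertTrue("Workload took " + elapsedMillis + " ms, budget is " + MAX_MILLIS + " ms",
            elapsedMillis <= MAX_MILLIS);
    }

    private void runRepresentativeWorkload() {
        // Placeholder for the actual workload executed against the test cluster,
        // e.g. a batch of search requests on a cluster with many indices.
    }
}
```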
Generally, that makes sense. However, I am still a bit skeptical that the jitter induced by the environment will allow detecting regressions at a realistic scale. If the tolerance X% is too high (like the 200% used by default by github-action-benchmark), the benchmarks will have no practical benefit and might actually give a false sense of security. Minor regressions will just go unnoticed and accumulate over time. In the end, doing the benchmarks as part of pull requests has the downside that you can only consider a single A/B test. This won't give you a sense of the present jitter and also won't detect accumulated minor regressions. On the other hand, a graph similar to the ones at https://opensearch.org/benchmarks/ gives you these insights - while lacking automatic actions upon regressions. But, in the end, I guess we just have to try things out. So, I would now propose the following. Let's create a prototype, which:
We would need to collect some data points with this prototype to see whether the results are reliable enough to be used for automatic regression testing. If we become confident that this is suitable, we can implement these checks in the CI. If not, we would have to reconsider what to do with the data.
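To make such a comparison less sensitive to jitter, the prototype could compare medians over several runs rather than single measurements; a minimal sketch, where the tolerance factor is just an example value:

```java
import java.util.Arrays;

public class RegressionCheck {

    // Returns true if the candidate's median duration exceeds the baseline's
    // median by more than the given tolerance factor (e.g. 1.2 for 20%).
    public static boolean isRegression(double[] baselineMillis, double[] candidateMillis, double tolerance) {
        return median(candidateMillis) > median(baselineMillis) * tolerance;
    }

    private static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int middle = sorted.length / 2;
        return (sorted.length % 2 == 1)
            ? sorted[middle]
            : (sorted[middle - 1] + sorted[middle]) / 2.0;
    }
}
```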
👍
Yea - this is definitely a concern in my mind as well. Given that we currently have no capacity to detect even a 200% regression, I'd say even at a minimal level this is a great improvement. Beyond detecting large regressions when issues that prompt investigation [1] [2] arrive, having the framework in place would help provide a starting point for measurement and validation in a developer's environment.
@nibix Thank you for the detailed RFC! I like the idea of adding performance tests in this repo that can help to establish a baseline for the "cost" of common (security) plugin operations under different scenarios. For instance, this user reports that there are CPU spikes when there is a config update in a cluster with 300 users: #4008. A performance test suite can help to ensure that performance issues are properly mitigated and help to understand how performance degrades or improves as changes are merged in. This will be especially helpful when assessing the performance gains from the related RFC: #3903 I think other plugins could benefit from this test suite as well.
@cwperks #4008 is an interesting issue, indeed. Generally, I think performance needs to be tackled on several dimensions or levels. In this RFC, we have already concluded that we need two separate approaches:
Additionally, the following thing just crossed my mind looking at #4008. Just thinking out loud: to facilitate the diagnosis of things like #4008, it might be useful to have built-in instrumentation to gain performance metrics on production systems. The security plugin, especially, often needs to talk to external systems - like LDAP servers. This can quickly have a big impact on cluster performance. Thus, having a low-threshold diagnosis system would be beneficial. A "related" project 😉 has this: https://git.floragunn.com/search-guard/search-guard-suite-enterprise/-/merge_requests/200
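Just to make the idea concrete, a minimal sketch of what such built-in instrumentation could look like; this is a hypothetical wrapper, not an existing API of the security plugin:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical low-overhead metrics collector for calls to external systems such as LDAP. */
public class ExternalCallMetrics {

    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> callCounts = new ConcurrentHashMap<>();

    // Wraps a call to an external system and records its duration under the given target name.
    public <T> T record(String target, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            totalNanos.computeIfAbsent(target, k -> new LongAdder()).add(elapsed);
            callCounts.computeIfAbsent(target, k -> new LongAdder()).increment();
        }
    }

    /** Average duration in milliseconds for the given target, e.g. "ldap". */
    public double averageMillis(String target) {
        long count = callCounts.getOrDefault(target, new LongAdder()).sum();
        return count == 0 ? 0.0 : totalNanos.get(target).sum() / (double) count / 1_000_000.0;
    }
}
```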
@nibix Agreed on making this a two-part approach. What would be the plan forward based on all the discussions so far?
There's already a draft PR for the test suite, but a bit of work is still necessary. The micro benchmarks for privilege evaluation should come with PR #3870, which should also come up very soon.
Background
The OpenSearch project already systematically conducts benchmark test runs and publishes the results: https://opensearch.org/benchmarks/ These are based on the pre-defined benchmark-workloads, which are executed using the tool opensearch-benchmark.
Additionally, there is ongoing work to define testing processes for performance related quality assurance: opensearch-project/OpenSearch#7499
These efforts, however, focus on the OpenSearch core; they test the efficiency of the code implementing data retrieval and indexing. In some cases, there is a comparison between OpenSearch with security enabled and OpenSearch without security enabled. While this can give a first idea of the overhead produced by the security plugin, the results cannot be generalized, because the performance overhead of the security plugin depends on a number of different factors. See below for more information on this.
Ideally, there would be an additional security-focused performance test suite. This test suite could be used for assuring that the security plugin does not suffer any performance regressions. Additionally, it could be used to determine the actual benefit of performance enhancing code changes (for example #3870).
The goal of this RFC is to collect requirements for such a test suite, learn about ongoing efforts, and find a way to integrate it into the software development process.
Performance influencing factors
The performance overhead of the security plugin depends on a number of different factors:
Number of indices on a cluster. The security plugin access controls involve resolving requested index patterns and index patterns in the role definitions to actual indices. For clusters with many indices, this can slow things down (see for example Optimized Privilege Evaluation #3870); a simplified sketch of this resolution step follows after this list of factors.
Number of roles of a user. While the example configurations usually demonstrate users with one or two roles, there are environments which are very liberal in assigning roles to users. Users with hundreds of roles do happen in reality. In order to determine the authorization status of a user, the security plugin needs to loop through every role of the user for each request.
Size of the user object. During inter-node communication, the security plugin needs to pass on the user object to the target node; for this, the user object needs to be serialized and deserialized. Larger user objects cause higher overhead during serialization and deserialization. Additionally, larger user objects put more stress onto the network.
Use of patterns vs constant index names in the role configuration.
Use of dynamic index patterns. The security plugin allows defining roles which use expressions to dynamically define the indices for which privileges are granted. Additionally, date math can be used to dynamically grant privileges. Both cases add a certain overhead to privilege evaluation.
Operations on aliases vs operations on indices. Privilege evaluation for operations on aliases requires an additional resolution of the aliases. This can have a significant impact if an alias has many member indices.
Use of DLS, FLS or field masking. If DLS, FLS or field masking is used, the security plugin hooks at a low level into the OpenSearch/Lucene operations. This is particularly hot code where even subtle details can have big performance impacts. Additionally, the current DLS/FLS implementation requires the serialization/deserialization of the whole DLS/FLS configuration. This can also have performance impacts for clusters with complex configurations.
Type of the OpenSearch operation. While the operation itself (such as bulk index, search or aggregate) should have little effect on the performance, there can be operation specific optimizations which then lead to performance characteristics that vary by operation. For example, the privilege evaluation code contains special cases to skip it on the top level for bulk index operations in favour of checks on the shard level.
TLS version, ciphers, implementation. TLS actually offers quite a broad range of settings and implementations which might have an effect on performance. As the security plugin adds TLS to the communication, it might be good to gain more insights into its performance effects.
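To illustrate why the number of indices matters (first factor above): privilege evaluation has to match the requested index expression and the index patterns of every applicable role against the concrete indices on the cluster, which naively scales with roles × patterns × indices. A simplified sketch of that resolution step, not the actual implementation in the security plugin:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class NaivePatternResolution {

    // Simplified stand-in for resolving role index patterns against the cluster
    // state: every pattern is matched against every index name. With many roles,
    // many patterns and tens of thousands of indices, this inner loop dominates
    // the cost of privilege evaluation.
    public static List<String> resolve(Set<String> rolePatterns, Set<String> clusterIndices) {
        List<String> resolved = new ArrayList<>();
        for (String rolePattern : rolePatterns) {
            Pattern regex = Pattern.compile(rolePattern.replace("*", ".*"));
            for (String index : clusterIndices) {
                if (regex.matcher(index).matches()) {
                    resolved.add(index);
                }
            }
        }
        return resolved;
    }
}
```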
Proposal for test dimensions
Given the complex dependencies, a three-dimensional test matrix seems to be necessary. I would propose these dimensions:
Note: Some combinations in this matrix might not make sense and can be skipped; for example, DLS/FLS makes indices read-only, so testing indexing does not make sense in that case.
Questions:
Process definition
We need to define when and how the performance tests should be executed.
Factors which might influence the decision are:
Related work
There are a number of other ongoing efforts/proposals covering OpenSearch benchmarking. In some cases, it might make sense to coordinate with these efforts:
Question: Are there further efforts?