OpenSearch 1.2.0 Performance regression of 10% compared to 1.1.0 #1560
@peternied taking a look at this data, it looks like we have significant degradation in index & query latencies across the board. Index requests per second are down ~10%, mean index latency is up 11% (p50 +9%, p90 +11%, p99 +37%, p100 +46%), and mean query latency is up 116% (p50 +116%, p90 +108%, p99 +106%, p100 +118%). Looking to see what went on here, but it looks like the bundle for 1.1 contains different components than 1.2. Can we get a run against the same components that were included with the 1.1 bundle to get a baseline?
Working with Marc to investigate.
We are going to deploy min versions for testing, using the 1.1.0 release min version.
Started a CDK deployment for the min version of 1.2.0 following the performance setup process; as soon as that is done, will repeat for 1.2.0. Note we are using build
Working on getting rally set up to run the min comparison locally; will report back.
New environment is up, starting on the 1.1.0.
Both environments are now up.
Additional data from our internal performance data service: while default query / distance_amount_agg were marginally impacted, index took a bigger hit, and range, autohisto_agg, and date_histogram_agg were double or worse. OpenSearch 1.1, test run:
Operation Type | P50 | P90 | P99 | P100 |
---|---|---|---|---|
default | 51.149 | 54.791 | 57.001 | 57.167 |
distance_amount_agg | 49.971 | 50.788 | 51.073 | 53.019 |
index | 1,840.6 | 2,409.8 | 3,128.4 | 5,320.1 |
range | 347.2 | 370.6 | 386.7 | 392.8 |
autohisto_agg | 198.9 | 202.9 | 237.5 | 255.1 |
date_histogram_agg | 202.1 | 206 | 259.8 | 268.2 |
OpenSearch 1.2, test run f76c89f3-24ce-4c30-a285-209b750f3d68
Operation Type | P50 | P90 | P99 | P100 |
---|---|---|---|---|
default | 50.938 | 53.415 | 58.209 | 62.961 |
distance_amount_agg | 49.017 | 53.129 | 53.367 | 53.582 |
index | 2,085.9 | 2,846.6 | 4,563.3 | 6,995.3 |
range | 922.2 | 942.6 | 949 | 983.1 |
date_histogram_agg | 665.8 | 669.6 | 693.7 | 698.6 |
autohisto_agg | 606.3 | 669.6 | 689 | 691.8 |
After deploying the min instances, they needed to be manually started; that has been done.
Started up the
Details on how the queries that regressed are executed, using the nyc_taxis dataset driven by the ESRally client. The range query from 1.1 vs 1.2 showed a >100% increase in latency (#1560 (comment)). Operation info from the test:
The percolator results are inconclusive. We are going to run additional tests overnight with the nyc_taxis dataset and the same configurations.
The regression appears for both architectures (arm/x64) and with/without security enabled, so for these tests we will pick a configuration with security enabled and use x64. We are also going to run against older builds of 1.2 between Sept 14th (around when 1.x was bumped to 1.2) and the latest on Nov 17th to try and identify when this could have been introduced.
I compared the 1.1 and 1.2 performance runs with regard to the queries that are slow on 1.2: performance test created
Excellent work @skkosuri-amzn! I have reviewed the data from the overnight runs that I scheduled, and all of them show the same query latency characteristics, which agrees with your findings. Since we have both the older ec2 instances around and the new ones, I have scheduled shorter-running nyc_taxis tests on our `` and
I can also confirm that I see a single shard on both tests that are in flight:
- perf-x64-1-1-0-with-security/Instance
- OpenSearch-1-2-0--ROUND3-1052-x64-enable/Instance
Tests on the
From the rally logs:
Nothing useful in the host logs that I can see.
Looking at the settings for the index, it was auto-generated and the
While the test methodology is slightly different, here is a comparison between 1.1 (one shard) and the original results from 1.2 (one shard). range, autohisto_agg, and date_histogram_agg are all significantly closer: the +100% variance has dialed down to ±2% for P50 latencies. Indexing latency is heavily skewed towards the 1.1 results, coming in a whole second faster at both P50/P90.

OpenSearch 1.1 x64 with security - 564636b0-30b1-4318-b63f-2c7ef8584231, Latency (ms)
OpenSearch 1.2 x64 with security - cd07932d-f330-4ad2-88d6-9ce3b9cfb716, Latency (ms)
In order to have a basis of comparison for 1.1 and rule out test configuration differences, I've triggered 4 new tests against OpenSearch 1.1; they should be complete in ~8 hours.
Some quick observations:

In IndexingPressureService.java#L29:

```java
public Releasable markCoordinatingOperationStarted(long bytes, boolean forceExecution) {
    if (isShardIndexingPressureEnabled() == false) {
        return shardIndexingPressure.markCoordinatingOperationStarted(bytes, forceExecution);
    } else {
        return () -> {};
    }
}
```

This is called from TransportBulkAction.java#L215:

```java
final Releasable releasable = indexingPressureService.markCoordinatingOperationStarted(indexingBytes, isOnlySystem);
```

Which does some seemingly innocuous atomic operations per shard (but could nevertheless amortize):

```java
long combinedBytes = this.currentCombinedCoordinatingAndPrimaryBytes.addAndGet(bytes);
long replicaWriteBytes = this.currentReplicaBytes.get();
long totalBytes = combinedBytes + replicaWriteBytes;
...
```

Bulk indexing is the default indexing runner for the nyc_taxi rally track:

```python
def register_default_runners():
    register_runner(track.OperationType.Bulk, BulkIndex(), async_runner=True)
    ...
```

Which means that even if ShardIndexingPressure is actually disabled by default, those markCoordinatingOperationStarted atomic calls are most certainly happening during rally bulk indexing. We may only catch amortized performance degradation when trying to bulk index large volumes of data, which is exactly what the above benchmark does (~165M docs w/ ~1 second increase in latency). I think it would be worthwhile either:

changing

```java
if (isShardIndexingPressureEnabled() == false) {
```

to

```java
if (isShardIndexingPressureEnabled() == true) {
```
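To make the per-request cost being described here concrete, below is a simplified, self-contained sketch of the accounting pattern in question. It is not the actual OpenSearch `IndexingPressure` implementation; the class name and the omitted limit check are illustrative only. Each coordinated bulk request performs an atomic read-modify-write on shared counters and receives a Releasable-style handle that undoes the accounting when the request completes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of the accounting pattern discussed above;
// names are hypothetical and do not mirror the real IndexingPressure class.
public class SimplifiedIndexingPressure {

    private final AtomicLong currentCombinedCoordinatingAndPrimaryBytes = new AtomicLong();
    private final AtomicLong currentReplicaBytes = new AtomicLong();

    /** Called once per coordinated bulk request; returns a handle that undoes the accounting. */
    public AutoCloseable markCoordinatingOperationStarted(long bytes) {
        long combinedBytes = currentCombinedCoordinatingAndPrimaryBytes.addAndGet(bytes);
        long replicaWriteBytes = currentReplicaBytes.get();
        long totalBytes = combinedBytes + replicaWriteBytes;
        // A real implementation would compare totalBytes against a limit and possibly reject.
        return () -> currentCombinedCoordinatingAndPrimaryBytes.addAndGet(-bytes);
    }

    public static void main(String[] args) throws Exception {
        SimplifiedIndexingPressure pressure = new SimplifiedIndexingPressure();
        // Every bulk request pays for the atomic read-modify-write, even if nothing is ever rejected.
        try (AutoCloseable released = pressure.markCoordinatingOperationStarted(10_000)) {
            // ... index the bulk request ...
        }
    }
}
```

Even when nothing is rejected, the `addAndGet` on shared `AtomicLong`s happens once per bulk request, which is why any cost would only surface amortized over very large bulk loads like the ~165M-doc run above.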
I ran the stackoverflow dataset against a single node for 1.2 min and 1.2 min without ShardIndexingPressure. Each of these ran 2 warmup iterations and 3 test iterations. These are the high level results. There is a ~20% drop in p50, p90, and p99 index latency with ShardIndexingPressure removed. There is also a 23% drop in p50 throughput.
I also executed a single run against each node using esrally directly - the results are consistent with the above, baseline being 1.2 and contender being 1.2 with the revert.
Hi @nknize, thank you for sharing this detail. However, the atomic reference updates being discussed above exist even in version 1.1 and are executed by default. They came in along with the Indexing Pressure changes for all bulk indexing operations, which were actually introduced with the ES 7.9 release (and are executed by default); this is not a change specific to OS 1.2. The reason the logic is restructured here and kept behind a dynamic setting (default false) is to ensure that, until Shard Indexing Pressure is specifically enabled, the Indexing Pressure feature continues to work as is. It also ensures Shard Indexing Pressure does not add any form of execution overhead as long as this setting is disabled, which I see being called out here. Given the code execution logic has not changed, I don't see this as the root cause. We will need to identify other changes which could have impacted it, or re-run the tests a few times to avoid any discrepancy such as cold starts.
Here are the measurements for the entire test matrix between 1.1 and 1.2 with the distribution, using a single shard.

Query Latency

I can confidently say that the previous comparisons around query latency were overblown; while there is drift in performance, single shard vs multi-shard had a dramatic impact and the numbers are within ±20%. The clearest conclusion is that these tests should be run with a 5-shard setup, which we will schedule.

Indexing Latency

For 3 of 4 tests, OpenSearch 1.2 performance dropped, but for x64 with the security plugin enabled we saw an improvement. This is not conclusive, and I would recommend more measurement focused on the OpenSearch core scenarios where there could be more risk and a smaller delta.
Ah! Thanks for clarifying that @getsaurabh02, and that logic consistency makes sense (javadocs would help make this more clear). Based on the latest benchmarking above we need to dig further into the source of the added indexing overhead. Those benchmarks w/ the PR revert are consistent across both data sets (nyc_taxi and stackOverflow), giving more plausibility to the regression having been introduced by the PR.
Hindsight being 20/20, the backports should've occurred incrementally w/ benchmarks for each change instead of the single sweeping 7K merge benchmarked at the zero hour. This is certainly making it more difficult to root cause. The other area of concern I have is the additional "shard level accounting" logic added to TransportBulkAction, in particular the potentially expensive call to ramBytesUsed, which is accumulated regardless of Shard Indexing Pressure being enabled:

```java
// Add the shard level accounting for coordinating and supply the listener
final boolean isOnlySystem = isOnlySystem(bulkRequest, clusterService.state().metadata().getIndicesLookup(), systemIndices);
final Releasable releasable = indexingPressureService.markCoordinatingOperationStarted(
    shardId,
    bulkShardRequest.ramBytesUsed(),
    isOnlySystem
);
```

In the case of the nyc_taxi dataset this is going to stream a window of 10000 DocWriteRequest objects per bulk call and accumulate the total RamUsage:

```java
return SHALLOW_SIZE + Stream.of(items).mapToLong(Accountable::ramBytesUsed).sum();
```

If Shard Indexing Pressure is disabled this computation is wasted on a noop. Rather than revert the whole PR for 1.2 we could consider wrapping this critical path in something like:
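The snippet the author had in mind is elided above. Purely as an illustration of the "defer the computation" idea (not the actual proposal, nor the eventual #1589 change), a guard could accept the byte count as a `LongSupplier` so `ramBytesUsed()` only runs when shard indexing pressure is enabled. All names below are hypothetical stand-ins, not OpenSearch code:

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified sketch: defer the expensive size computation so it
// only runs when shard indexing pressure is actually enabled.
public class LazyAccountingSketch {

    // Stand-in for the dynamic cluster setting (default false).
    private static volatile boolean shardIndexingPressureEnabled = false;

    // Stand-in for IndexingPressureService#markCoordinatingOperationStarted,
    // taking a supplier instead of an eagerly computed long.
    static AutoCloseable markCoordinatingOperationStarted(LongSupplier bytesSupplier) {
        if (!shardIndexingPressureEnabled) {
            return () -> {};                     // no accounting, and no ramBytesUsed() call
        }
        long bytes = bytesSupplier.getAsLong();  // only computed when the feature is on
        // ... add `bytes` to the shard-level counters and return a handle that removes it ...
        return () -> {};
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for bulkShardRequest.ramBytesUsed(), which streams over ~10k items per bulk call.
        LongSupplier ramBytesUsed = () -> {
            System.out.println("computing ramBytesUsed()");
            return 42L;
        };
        try (AutoCloseable ignored = markCoordinatingOperationStarted(ramBytesUsed)) {
            // ... execute the bulk request ...
        }
    }
}
```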
Hi @nknize, I think having a nightly CI job that benchmarks and compares the results with the previous build would be a good way to detect regressions in general. Since we had incrementally pushed out the changes in the feature/1.x branch, we would have detected any such issue much earlier, even before the backport to the 1.2 branch.
However, in order to maintain sanity between these operation calls and optimise this request size computation further, I have raised a PR to delay the computation until the point where it is absolutely necessary. This will keep the fix ready (to be merged if needed) and is functionally in line with your suggestion. It doesn't require any additional flag check for the feature: #1589
Can we run a JMH benchmark test for OpenSearch/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java line 212 (at commit c459282)?
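For what it's worth, a minimal JMH harness along those lines might look like the sketch below. It benchmarks a stand-alone summation over a bulk-sized window of fake per-item byte estimates rather than the real `TransportBulkAction` path, so the class name, sizes, and values are illustrative only; it needs the `jmh-core` and `jmh-generator-annprocess` dependencies to run.

```java
import java.util.concurrent.TimeUnit;
import java.util.stream.LongStream;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative micro-benchmark of summing per-item byte estimates over a
// bulk-sized window, roughly what bulkShardRequest.ramBytesUsed() does over its items.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class RamBytesAccumulationBenchmark {

    @Param({"1000", "10000"})   // nyc_taxis uses a bulk window of 10000 docs
    public int items;

    public long[] perItemBytes;

    @Setup
    public void setup() {
        // Fake per-document byte estimates; real values would come from Accountable#ramBytesUsed.
        perItemBytes = LongStream.range(0, items).map(i -> 200 + (i % 50)).toArray();
    }

    @Benchmark
    public long sumRamBytesUsed() {
        long sum = 0;
        for (long b : perItemBytes) {
            sum += b;
        }
        return sum;
    }
}
```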
We (myself and @mch2) ran the Rally tests again today to compare the 1.2 changes with and without the shard indexing pressure commit. This was done using the same branch and just isolating the one ShardIndexingPressure commit. Since the previous tests were done against different distributions, it was not an equivalent setup. Here the numbers are pretty much similar, and definitely not as alarming as the 20% called out in the earlier comment. The worst-case increase is around 2% for p100; however, since p99 is the same, it mitigates any risk.

Indexing Operation Latencies (ms) Comparison
Throughput (req/s) Comparison
The server side metrics for these tests, such as CPU, memory, threadpool utilization, and garbage collections, are all identical in trend and similar in numbers. Some of the differences in the tests can be attributed to client-side metric collection. Please let me know if there are any questions related to these runs. On the other hand, for the one possible optimization found (as covered in the above comment, #1589), I have already raised a PR. It will further reduce one CPU computation in the indexing flow. Although it doesn't look like a blocker, we can get that change in as well if needed.
Specifically, what distributions did we run this against the first time? It sounds like there was a distribution that was 20% better?
@dblock From what I understood of the previous setup from @mch2, which reported the 20% degradation, one test was run by checking out the branch locally and building the distribution locally after reverting the commit, while the other was downloaded directly from the archive. So this time we used the same setup branch, checked out locally for both tests, while one had the commit reverted.
So we can verify using the official build process (matching JDKs, etc.), we're going to run the benchmarks again but with the CI-built tarballs instead of local checkouts.
Following up on where we are with the overall 1.1 vs 1.2 comparisons: we have the additional run data with 5 shards instead of 1 shard from the performance test. This netted a large reduction in ALL latency numbers; how we shard can have dramatic impacts on performance.

Indexing Latency P90 (milliseconds)
Reviewing the indexing performance numbers showed the same pattern as seen previously: more latency on arm64 in 1.2, an improvement for x64 with security enabled, and results within the error rate for x64 with security disabled. When we reviewed the results of the build-over-build tests, we had a lot of data and no 'bright line' distinction for a bad change getting introduced. We will be conducting another run, but instead focusing on arm64 with security enabled, where there is the larger distinction between 1.1 and 1.2. I will only post findings if there is clear data.
OpenSearch 1.2 Indexing Backpressure Removed vs Normal Distribution

Looking over these results, they are within the 5% error margin, with no 20% outliers. This aligns with the earlier findings indicating that the presence of Indexing Backpressure does not have a [edit] significant [/edit] impact on the overall performance. 🍾 @getsaurabh02 @nknize

Indexing Latency milliseconds delta
Indexing Latency milliseconds delta %
Excellent news @peternied!!! So to summarize, the initially reported 20% indexing degradation was the result of a difference in the benchmarking configuration between the 1.1 and 1.2 builds? The initial issue above indicates the 1.2.0 distro was built from CI. Were we comparing that distro build against a 1.1.0 built from a local branch checkout, thereby leading to different JVM and machine configs? Many thanks to all who put a huge effort into verifying this performance! @peternied @mch2 @getsaurabh02 @Bukhtawar Looking forward to getting ongoing benchmarking stood up to catch any potential issues early!
Yes, with a slight correction: it was between a 1.2 min built by our CI vs a local build of
In the 1.2 build of OpenSearch we are seeing a reproducible performance impact of ~2-3% from the backpressure indexing feature. I believe that, given the benefit of the feature, this is an acceptable trade-off and we should release 1.2, but we should continue to improve the performance of the feature in subsequent releases. Below is a summary of our findings. This data was collected using 2 warm-up iterations and 5 test iterations of the StackOverflow dataset to measure indexing latency. The tests executed against a single x86 node running with 5 shards and no installed plugins. During this investigation we saw some anomalies that need further investigation, including a performance degradation specifically on arm64. At this time we believe it is related to the testing setup, so it has been excluded from the tables below.

Delta between 1.1 and 1.2
Further investigations and action items (non-blocking):
This issue was ultimately determined to be a configuration mismatch between the 1.1.0 and 1.2.0 performance testing setups. Closing this as we have shipped 1.2.0.
We are seeing 10%+ regression across the board compared to the OpenSearch 1.1.0 release.
Thanks to @mch2 for these numbers
Performance test data is available on opensearch-project/opensearch-build#963, please review and create any issues if follow up is needed.
Builds under test: