-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding integration tests for search backpressure #5308
Adding integration tests for search backpressure #5308
Conversation
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## main #5308 +/- ##
============================================
- Coverage 71.03% 71.00% -0.04%
- Complexity 58129 58132 +3
============================================
Files 4704 4704
Lines 277266 277266
Branches 40147 40147
============================================
- Hits 196957 196871 -86
- Misses 64129 64303 +174
+ Partials 16180 16092 -88
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is unstable, I just run in locally and it passes from 3rd attempt:
./gradlew :server:internalClusterTest --tests "org.opensearch.search.backpressure.SearchBackpressureIT
> Task :server:internalClusterTest
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/home/andriy.redko/Workspaces/Experiments/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighElapsedTime" -Dtests.seed=899384162E632993 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-CR -Dtests.timezone=Etc/Zulu -Druntime.java=19
org.opensearch.search.backpressure.SearchBackpressureIT > testSearchShardTaskCancellationWithHighElapsedTime FAILED
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.assertTrue(Assert.java:42)
at org.junit.Assert.assertTrue(Assert.java:53)
at org.opensearch.search.backpressure.SearchBackpressureIT$1.onFailure(SearchBackpressureIT.java:101)
at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:122)
at org.opensearch.search.backpressure.SearchBackpressureIT$TestTransportAction.doExecute(SearchBackpressureIT.java:268)
at org.opensearch.search.backpressure.SearchBackpressureIT$TestTransportAction.doExecute(SearchBackpressureIT.java:233)
at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188)
at org.opensearch.action.support.TransportAction.execute(TransportAction.java:107)
at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110)
at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97)
at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:461)
at org.opensearch.client.FilterClient.doExecute(FilterClient.java:83)
at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:461)
at org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighElapsedTime(SearchBackpressureIT.java:90)
java.lang.AssertionError: test leaves persistent cluster metadata behind
Expected: an empty collection
but: <[search_backpressure.search_shard_task.elapsed_time_millis_threshold]>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
Please make sure the tests suite passes all the time.
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
private TestRequest generateRequestWithHigCpuUsage(double load, long duration) { | ||
return new TestRequest(() -> { | ||
long startTime = System.currentTimeMillis(); | ||
try { | ||
while (System.currentTimeMillis() - startTime < duration) { | ||
// sleeping for the percentage of unladen time every 100ms | ||
if (System.currentTimeMillis() % 100 == 0) { | ||
Thread.sleep((long) Math.floor((1 - load) * 100)); | ||
} | ||
} | ||
} catch (InterruptedException e) { | ||
Thread.currentThread().interrupt(); | ||
} | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same comment applies here as well. We don't really have to simulate an x%
CPU load for y
seconds.
We just need to execute simple instructions that consume some CPU cycles, and be small enough that it can be interleaved with the cancellation checks. The busy-wait loop already takes care of re-running this 'work' until enough CPU cycles have been consumed for this task cancellation to take place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
Signed-off-by: PritLadani <pritkladani@gmail.com>
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Outdated
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Show resolved
Hide resolved
...er/src/internalClusterTest/java/org/opensearch/search/backpressure/SearchBackpressureIT.java
Show resolved
Hide resolved
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: PritLadani <pritkladani@gmail.com>
…/OpenSearch into searchbackpressure/integtests
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better now, thanks @PritLadani
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test takes about a minute to complete. Is there anyway to make it faster? (our tests are quite slow as it is...)
assertEquals(TaskCancelledException.class, e.getClass()); | ||
assertTrue(e.getMessage().contains("elapsed time exceeded")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting these assertions in the async callback makes the output on a failure confusing, and also relies on the testing framework to fail on an uncaught exception. This also never counts down the latch in case of failure so await
will have to wait until the timeout. It's better to make this callback just capture the exception, e.g.
private static class ExceptionCatchingListener implements ActionListener<TestResponse> {
private final CountDownLatch latch = new CountDownLatch(1);
private Exception e = null;
@Override
public void onResponse(TestResponse r) {
latch.countDown();
}
@Override
public void onFailure(Exception e) {
this.e = e;
latch.countDown();
}
}
and then make assertions about e
after you've blocked on the latch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting these assertions in the async callback makes the output on a failure confusing, and also relies on the testing framework to fail on an uncaught exception.
Added ExceptionCatchingListener
as suggested to handle the exceptions and then asserting on the caught exceptions.
This also never counts down the latch in case of failure so await will have to wait until the timeout.
In case of testcase failure, we are failing the testcase by calling fail()
and it will finish the test execution, so CountDownLatch
doesn't have to wait until the timeout. In case when testcase passes, we are already decrementing the latch count after assertions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrross Good point regarding test execution time!
@PritLadani May be set ElapsedTimeTracker.SETTING_ELAPSED_TIME_MILLIS_THRESHOLD
CpuUsageTracker.SETTING_CPU_TIME_MILLIS_THRESHOLD
in the tests to 1 second instead of 15 seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ketanv3 Sure Ketan, will update the same in the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of testcase failure, we are failing the testcase by calling fail() and it will finish the test execution
fail()
doesn't do anything other than throw a AssertionError, so the test doesn't actually fail until the test method completes (after the latch times out) and something in the tear down method of the base class notices the unhandled exception and fails the test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the newly introduced ExceptionCatchingListener
, we will not encounter such case where latch has to wait until timeout happens.
} | ||
}); | ||
|
||
latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should assert that all these await
calls return true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are having the same timeout value for the request execution as well, and if the TestRequest times out, it will make a call to onResponse()
and we already have necessary assertions there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize it is fine now but this is more about maintainability. Future changes may violate some of the current assumptions so it is better to explicitly verify that things are working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Asserting on the return value of latch.wait()
.
cf0beb1
to
425b577
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: PritLadani <pritkladani@gmail.com>
425b577
to
cbeb1e6
Compare
Gradle Check (Jenkins) Run Completed with:
|
// Before SearchBackpressureService cancels a task based on its heap usage, we need to build up the heap moving average | ||
// To build up the heap moving average, we need to hit the same node with multiple requests and then hit the same node with a | ||
// request having higher heap usage | ||
String node = internalCluster().startDataOnlyNode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will start a new data node for the test. Maybe your intent is to pick up a random node from the existing cluster.
String node = randomFrom(internalCluster().getNodeNames());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. Change it to String node = randomFrom(internalCluster().getNodeNames());
.
final int MOVING_AVERAGE_WINDOW_SIZE = 10; | ||
Settings request = Settings.builder() | ||
.put(SearchBackpressureSettings.SETTING_MODE.getKey(), "enforced") | ||
.put(HeapUsageTracker.SETTING_HEAP_PERCENT_THRESHOLD.getKey(), 0.00001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to zero so that we can be sure that heap threshold is breaching.
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
CountDownLatch latch = new CountDownLatch(1); | ||
ExceptionCatchingListener listener = new ExceptionCatchingListener(latch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, but you can just create the latch inside ExceptionCatchingListener then access here with listener.latch
since the outer class can access private fields of nested classes so you don't even need a getter. Just makes the code a bit more concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Changed it.
} | ||
}); | ||
|
||
latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize it is fine now but this is more about maintainability. Future changes may violate some of the current assumptions so it is better to explicitly verify that things are working as expected.
assertEquals(TaskCancelledException.class, caughtException.getClass()); | ||
assertTrue(caughtException.getMessage().contains("cpu usage exceeded")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nitpick but you'll get much better error messages with following style:
MatcherAssert.assertThat(caughtException, instanceOf(TaskCancalledException.class));
MatcherAssert.assertThat(caughtException.getMessage(), containsString("cpu usage exceeded"));
because the error message will show what was found versus what was expected instead of just indicating a failure. Not a big deal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Changed it.
CountDownLatch latch = new CountDownLatch(1); | ||
ExceptionCatchingListener listener = new ExceptionCatchingListener(latch); | ||
client().execute(TestTransportAction.ACTION, new TestRequest(RequestType.HIGH_ELAPSED_TIME), listener); | ||
latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I run this test this call always returns false and the test takes the entire 30 seconds. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, yes. We have set the TIMEOUT to be 30 seconds and it lets the request run for 30 seconds before timing out. We want to be sure that even if the search backpressure thresholds are breads, it is not cancelling the task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anyway to make that faster? Is it possible to validate that the backpressure threshold is indeed breached?
In this case you should assert that the latch does time out, otherwise the request could successfully complete and the test would still pass.
Signed-off-by: PritLadani <pritkladani@gmail.com>
Gradle Check (Jenkins) Run Completed with:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Signed-off-by: PritLadani <pritkladani@gmail.com> (cherry picked from commit bdb8efe) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: PritLadani <pritkladani@gmail.com> (cherry picked from commit bdb8efe) Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Signed-off-by: PritLadani <pritkladani@gmail.com> Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
This commit adds integrations tests to tests that SearchBackpressureService is correctly cancelling the tasks.
Issues Resolved
#1181
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Signed-off-by: PritLadani pritkladani@gmail.com