Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding integration tests for search backpressure #5308

Merged

Conversation

PritLadani
Copy link
Contributor

@PritLadani PritLadani commented Nov 18, 2022

Description

This commit adds integrations tests to tests that SearchBackpressureService is correctly cancelling the tasks.

Issues Resolved

#1181

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: PritLadani pritkladani@gmail.com

Signed-off-by: PritLadani <pritkladani@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Nov 18, 2022

Codecov Report

Merging #5308 (e336a04) into main (7dc137f) will decrease coverage by 0.03%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main    #5308      +/-   ##
============================================
- Coverage     71.03%   71.00%   -0.04%     
- Complexity    58129    58132       +3     
============================================
  Files          4704     4704              
  Lines        277266   277266              
  Branches      40147    40147              
============================================
- Hits         196957   196871      -86     
- Misses        64129    64303     +174     
+ Partials      16180    16092      -88     
Impacted Files Coverage Δ
...adonly/AddIndexBlockClusterStateUpdateRequest.java 0.00% <0.00%> (-75.00%) ⬇️
...r/src/main/java/org/opensearch/http/HttpUtils.java 16.66% <0.00%> (-50.00%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...opensearch/persistent/PersistentTasksExecutor.java 22.22% <0.00%> (-44.45%) ⬇️
...readonly/TransportVerifyShardIndexBlockAction.java 9.75% <0.00%> (-41.47%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 60.00% <0.00%> (-40.00%) ⬇️
...h/action/ingest/SimulateDocumentVerboseResult.java 60.71% <0.00%> (-39.29%) ⬇️
...indices/readonly/TransportAddIndexBlockAction.java 20.68% <0.00%> (-37.94%) ⬇️
...luster/routing/allocation/RoutingExplanations.java 62.06% <0.00%> (-37.94%) ⬇️
...ensearch/client/indices/DetailAnalyzeResponse.java 20.54% <0.00%> (-34.25%) ⬇️
... and 475 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Collaborator

@reta reta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is unstable, I just run in locally and it passes from 3rd attempt:

./gradlew :server:internalClusterTest --tests "org.opensearch.search.backpressure.SearchBackpressureIT
> Task :server:internalClusterTest                                                                                                                                                                                                                                                                                                                                                          
WARNING: A terminally deprecated method in java.lang.System has been called                                                                                                                                                                                                                                                                                                                 
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/home/andriy.redko/Workspaces/Experiments/OpenSearch/test/framework/build/distributions/framework-3.0.0-SNAPSHOT.jar)                                                                                                                                                             
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting                                                                                                                                                                                                                                                                                  
WARNING: System::setSecurityManager will be removed in a future release                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                            
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighElapsedTime" -Dtests.seed=899384162E632993 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-CR -Dtests.timezone=Etc/Zulu -Druntime.java=19       
                                                                                                                                                                                                                                                                                                                                                                                            
org.opensearch.search.backpressure.SearchBackpressureIT > testSearchShardTaskCancellationWithHighElapsedTime FAILED                                                                                                                                                                                                                                                                         
    java.lang.AssertionError                                                                                                                                                                                                                                                                                                                                                                
        at org.junit.Assert.fail(Assert.java:87)                                                                                                                                                                                                                                                                                                                                            
        at org.junit.Assert.assertTrue(Assert.java:42)                                                                                                                                                                                                                                                                                                                                      
        at org.junit.Assert.assertTrue(Assert.java:53)                                                                                                                                                                                                                                                                                                                                      
        at org.opensearch.search.backpressure.SearchBackpressureIT$1.onFailure(SearchBackpressureIT.java:101)                                                                                                                                                                                                                                                                               
        at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:122)                                                                                                                                                                                                                                                                                              
        at org.opensearch.search.backpressure.SearchBackpressureIT$TestTransportAction.doExecute(SearchBackpressureIT.java:268)                                                                                                                                                                                                                                                             
        at org.opensearch.search.backpressure.SearchBackpressureIT$TestTransportAction.doExecute(SearchBackpressureIT.java:233)                                                                                                                                                                                                                                                             
        at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218)                                                                                                                                                                                                                                                                               
        at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188)                                                                                                    
        at org.opensearch.action.support.TransportAction.execute(TransportAction.java:107)                                                                                                                                                                                                                                                                                                  
        at org.opensearch.client.node.NodeClient.executeLocally(NodeClient.java:110)                                                                                                                                                                                                                                                                                                        
        at org.opensearch.client.node.NodeClient.doExecute(NodeClient.java:97)                                                                                                                
        at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:461)                                                                                                                                                                                                                                                                                                    
        at org.opensearch.client.FilterClient.doExecute(FilterClient.java:83)                                                                                                                 
        at org.opensearch.client.support.AbstractClient.execute(AbstractClient.java:461)                                                                                                                                                                                                                                                                                                    
        at org.opensearch.search.backpressure.SearchBackpressureIT.testSearchShardTaskCancellationWithHighElapsedTime(SearchBackpressureIT.java:90)                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                            
    java.lang.AssertionError: test leaves persistent cluster metadata behind                                                                                                                                                                                                                                                                                                                
    Expected: an empty collection                                                                                                                                                                                                                                                                                                                                                           
         but: <[search_backpressure.search_shard_task.elapsed_time_millis_threshold]>                                                                                                                                                                                                                                                                                                       
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)                

Please make sure the tests suite passes all the time.

Comment on lines 180 to 194
private TestRequest generateRequestWithHigCpuUsage(double load, long duration) {
return new TestRequest(() -> {
long startTime = System.currentTimeMillis();
try {
while (System.currentTimeMillis() - startTime < duration) {
// sleeping for the percentage of unladen time every 100ms
if (System.currentTimeMillis() % 100 == 0) {
Thread.sleep((long) Math.floor((1 - load) * 100));
}
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment applies here as well. We don't really have to simulate an x% CPU load for y seconds.

We just need to execute simple instructions that consume some CPU cycles, and be small enough that it can be interleaved with the cancellation checks. The busy-wait loop already takes care of re-running this 'work' until enough CPU cycles have been consumed for this task cancellation to take place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: PritLadani <pritkladani@gmail.com>
…/OpenSearch into searchbackpressure/integtests
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@PritLadani PritLadani requested a review from reta November 21, 2022 16:24
Copy link
Collaborator

@reta reta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better now, thanks @PritLadani

Copy link
Member

@andrross andrross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test takes about a minute to complete. Is there anyway to make it faster? (our tests are quite slow as it is...)

Comment on lines 103 to 104
assertEquals(TaskCancelledException.class, e.getClass());
assertTrue(e.getMessage().contains("elapsed time exceeded"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting these assertions in the async callback makes the output on a failure confusing, and also relies on the testing framework to fail on an uncaught exception. This also never counts down the latch in case of failure so await will have to wait until the timeout. It's better to make this callback just capture the exception, e.g.

private static class ExceptionCatchingListener implements ActionListener<TestResponse> {
    private final CountDownLatch latch = new CountDownLatch(1);
    private Exception e = null;

    @Override
    public void onResponse(TestResponse r) {
        latch.countDown();
    }

    @Override
    public void onFailure(Exception e) {
        this.e = e;
        latch.countDown();
    }
}

and then make assertions about e after you've blocked on the latch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting these assertions in the async callback makes the output on a failure confusing, and also relies on the testing framework to fail on an uncaught exception.

Added ExceptionCatchingListener as suggested to handle the exceptions and then asserting on the caught exceptions.

This also never counts down the latch in case of failure so await will have to wait until the timeout.

In case of testcase failure, we are failing the testcase by calling fail() and it will finish the test execution, so CountDownLatch doesn't have to wait until the timeout. In case when testcase passes, we are already decrementing the latch count after assertions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andrross Good point regarding test execution time!

@PritLadani May be set ElapsedTimeTracker.SETTING_ELAPSED_TIME_MILLIS_THRESHOLD CpuUsageTracker.SETTING_CPU_TIME_MILLIS_THRESHOLD in the tests to 1 second instead of 15 seconds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ketanv3 Sure Ketan, will update the same in the next commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of testcase failure, we are failing the testcase by calling fail() and it will finish the test execution

fail() doesn't do anything other than throw a AssertionError, so the test doesn't actually fail until the test method completes (after the latch times out) and something in the tear down method of the base class notices the unhandled exception and fails the test case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the newly introduced ExceptionCatchingListener, we will not encounter such case where latch has to wait until timeout happens.

}
});

latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should assert that all these await calls return true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are having the same timeout value for the request execution as well, and if the TestRequest times out, it will make a call to onResponse() and we already have necessary assertions there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it is fine now but this is more about maintainability. Future changes may violate some of the current assumptions so it is better to explicitly verify that things are working as expected.

Copy link
Contributor Author

@PritLadani PritLadani Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Asserting on the return value of latch.wait().

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      2 org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedUpdateToShardLimitsAndRejections

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex

Signed-off-by: PritLadani <pritkladani@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.search.aggregations.metrics.TDigestPercentilesIT.testMultiValuedFieldWithValueScriptReverse

@PritLadani PritLadani requested review from andrross and ketanv3 and removed request for andrross November 22, 2022 12:36
// Before SearchBackpressureService cancels a task based on its heap usage, we need to build up the heap moving average
// To build up the heap moving average, we need to hit the same node with multiple requests and then hit the same node with a
// request having higher heap usage
String node = internalCluster().startDataOnlyNode();
Copy link
Contributor

@ketanv3 ketanv3 Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will start a new data node for the test. Maybe your intent is to pick up a random node from the existing cluster.

String node = randomFrom(internalCluster().getNodeNames());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Change it to String node = randomFrom(internalCluster().getNodeNames());.

final int MOVING_AVERAGE_WINDOW_SIZE = 10;
Settings request = Settings.builder()
.put(SearchBackpressureSettings.SETTING_MODE.getKey(), "enforced")
.put(HeapUsageTracker.SETTING_HEAP_PERCENT_THRESHOLD.getKey(), 0.00001)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to zero so that we can be sure that heap threshold is breaching.

Signed-off-by: PritLadani <pritkladani@gmail.com>
@PritLadani PritLadani requested review from ketanv3 and andrross and removed request for ketanv3 November 22, 2022 16:42
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue

Comment on lines 93 to 94
CountDownLatch latch = new CountDownLatch(1);
ExceptionCatchingListener listener = new ExceptionCatchingListener(latch);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, but you can just create the latch inside ExceptionCatchingListener then access here with listener.latch since the outer class can access private fields of nested classes so you don't even need a getter. Just makes the code a bit more concise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Changed it.

}
});

latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it is fine now but this is more about maintainability. Future changes may violate some of the current assumptions so it is better to explicitly verify that things are working as expected.

Comment on lines 118 to 119
assertEquals(TaskCancelledException.class, caughtException.getClass());
assertTrue(caughtException.getMessage().contains("cpu usage exceeded"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nitpick but you'll get much better error messages with following style:

MatcherAssert.assertThat(caughtException, instanceOf(TaskCancalledException.class));
MatcherAssert.assertThat(caughtException.getMessage(), containsString("cpu usage exceeded"));

because the error message will show what was found versus what was expected instead of just indicating a failure. Not a big deal though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Changed it.

CountDownLatch latch = new CountDownLatch(1);
ExceptionCatchingListener listener = new ExceptionCatchingListener(latch);
client().execute(TestTransportAction.ACTION, new TestRequest(RequestType.HIGH_ELAPSED_TIME), listener);
latch.await(TIMEOUT.getSeconds(), TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run this test this call always returns false and the test takes the entire 30 seconds. Is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, yes. We have set the TIMEOUT to be 30 seconds and it lets the request run for 30 seconds before timing out. We want to be sure that even if the search backpressure thresholds are breads, it is not cancelling the task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway to make that faster? Is it possible to validate that the backpressure threshold is indeed breached?

In this case you should assert that the latch does time out, otherwise the request could successfully complete and the test would still pass.

Signed-off-by: PritLadani <pritkladani@gmail.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Contributor

@ketanv3 ketanv3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@andrross andrross merged commit bdb8efe into opensearch-project:main Nov 22, 2022
@andrross andrross added the backport 2.x Backport to 2.x branch label Nov 22, 2022
opensearch-trigger-bot bot pushed a commit that referenced this pull request Nov 22, 2022
Signed-off-by: PritLadani <pritkladani@gmail.com>
(cherry picked from commit bdb8efe)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Nov 22, 2022
Signed-off-by: PritLadani <pritkladani@gmail.com>
(cherry picked from commit bdb8efe)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Signed-off-by: PritLadani <pritkladani@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants