Change bulk's retry condition to be based on RestStatus #29329

Merged
javanna merged 9 commits into elastic:master on May 9, 2018

Contributor

@PnPie PnPie commented Apr 1, 2018

Previously, bulk's retry logic was based on the exception type (EsRejectedExecutionException) of the failed response. This PR changes it to be based on RestStatus (RestStatus.TOO_MANY_REQUESTS) in order to support the high-level REST client (more information can be found in #29254).

Close #28885

Previously bulk's retry logic was based on the exception type of the failed response; here we change it to be based on RestStatus, in order to support requests from the high-level REST client.
@elasticmachine
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

Member

@javanna javanna left a comment

looks good @PnPie thanks a lot for opening this! I left a few comments but this is already pretty close.

- public Retry(Class<? extends Throwable> retryOnThrowable, BackoffPolicy backoffPolicy, Scheduler scheduler) {
-     this.retryOnThrowable = retryOnThrowable;
+ public Retry(RestStatus retryOnStatus, BackoffPolicy backoffPolicy, Scheduler scheduler) {
+     this.retryOnStatus = retryOnStatus;
Member

shall we consider making this hardcoded rather than an argument given that we always pass in the same value for it?

Contributor Author

Yes, as it's only used here in bulk for this case.
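
The outcome of this thread can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `Status` and `RetrySketch` are hypothetical stand-ins for RestStatus and Retry so that the example compiles without Elasticsearch on the classpath, and the `maxRetries` counter is an assumption that replaces the real BackoffPolicy and Scheduler arguments.

```java
// Hypothetical stand-in for RestStatus; only the values needed for the example.
enum Status {
    OK(200), TOO_MANY_REQUESTS(429), INTERNAL_SERVER_ERROR(500);

    final int code;

    Status(int code) {
        this.code = code;
    }
}

// Hypothetical, simplified stand-in for the Retry class discussed above.
final class RetrySketch {
    // Hardcoded rather than passed to the constructor, since callers always pass the same value.
    static final Status RETRY_STATUS = Status.TOO_MANY_REQUESTS;

    // Assumption: a plain counter stands in for the backoff policy.
    private final int maxRetries;

    // The constructor no longer takes a retryOnStatus argument.
    RetrySketch(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // A failed item is retried only if it was rejected with RETRY_STATUS and attempts remain.
    boolean shouldRetry(Status failureStatus, int attemptsSoFar) {
        return failureStatus == RETRY_STATUS && attemptsSoFar < maxRetries;
    }
}
```

With the constant in place, callers only choose how retries are spaced, not which status triggers them.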

- final Throwable cause = bulkItemResponse.getFailure().getCause();
- final Throwable rootCause = ExceptionsHelper.unwrapCause(cause);
- if (!rootCause.getClass().equals(retryOnThrowable)) {
+ final RestStatus status = bulkItemResponse.getFailure().getStatus();
Member

just to double check: we don't need to unwrap here anymore because the status of the root cause is propagated to its ancestor?

Contributor Author

Here we read the status field of BulkItemResponse's Failure, which is a separate field from the exception cause in the Failure class, so I believe it always holds the correct value (RestStatus.TOO_MANY_REQUESTS). The exception type only changes when a BulkItemResponse goes through toXContent/fromXContent, and there the status is already parsed separately, so the status should always be correct. (The exception would be if, after toXContent/fromXContent, the BulkItemResponse were transferred again using readFrom/writeTo, which I don't think is the case?)

If that's right, I have changed bulkItemResponse.getFailure().getStatus() to bulkItemResponse.status(), because they are the same.

Member

I think this is correct. I also don't follow why readFrom/writeTo would cause issues: the exception type does change, but the status stays the same, right?
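
To make the point of this thread concrete, here is a small sketch of why a status-based check survives a REST round trip while a type-based check does not. All classes below are hypothetical stand-ins (for BulkItemResponse.Failure and EsRejectedExecutionException), and `afterRestRoundTrip` is an assumed simplification of serializing and parsing a response, not the actual toXContent/fromXContent code.

```java
// Hypothetical stand-in for BulkItemResponse.Failure: the status is stored next to the cause,
// not derived from it.
final class FailureSketch {
    final Exception cause;
    final int status; // HTTP-style status code, e.g. 429 for TOO_MANY_REQUESTS

    FailureSketch(Exception cause, int status) {
        this.cause = cause;
        this.status = status;
    }

    // Mimics a serialize/parse round trip over REST: the concrete exception class cannot be
    // rebuilt, so a generic exception carries the message, while the status is copied as-is.
    FailureSketch afterRestRoundTrip() {
        return new FailureSketch(new RuntimeException(cause.getMessage()), status);
    }
}

// Hypothetical stand-in for EsRejectedExecutionException.
final class RejectedSketchException extends RuntimeException {
    RejectedSketchException(String message) {
        super(message);
    }
}

final class RetryCheckSketch {
    static final int TOO_MANY_REQUESTS = 429;

    // Old approach: depends on the concrete exception class.
    static boolean retryByType(FailureSketch failure) {
        return failure.cause instanceof RejectedSketchException;
    }

    // New approach: depends only on the status field.
    static boolean retryByStatus(FailureSketch failure) {
        return failure.status == TOO_MANY_REQUESTS;
    }
}
```

In this sketch the type-based check stops matching once the failure has been parsed back, whereas the status-based check gives the same answer before and after.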

- if (rootCause instanceof EsRejectedExecutionException) {
-     if (rejectedExecutionExpected == false) {
+ if (failure.getStatus() == RestStatus.TOO_MANY_REQUESTS) {
+     if (!rejectedExecutionExpected) {
Member

can you leave the previous rejectedExecutionExpected == false please? we prefer this one for readability.

Member

Could you address this please?

if (rejectedExecutionExpected == false) {
if (failure.getStatus() == RestStatus.TOO_MANY_REQUESTS) {
if (!rejectedExecutionExpected) {
Throwable rootCause = ExceptionsHelper.unwrapCause(failure.getCause());
Member

maybe we do not even need to unwrap it here anymore? could we just use getCause instead when throwing assertion error below?

Member

I see that this has not been addressed. Is that on purpose? I see that you have done this on the client version of the test so it should be fine here too.


static {
System.setProperty("tests.rest.cluster", "localhost:9200");
}
Member

this static block should go away. I don't see why it's needed.

assertThat(searchResultCount, lessThan(numberOfAsyncOps));
} else {
assertThat(searchResultCount, equalTo(numberOfAsyncOps));
}
Member

the assertions seem more accurate here, thanks! Would you mind making the same change in the original test for the transport client? It should work there too right?

Contributor Author

Yes !

assertThat(searchResultCount, equalTo(numberOfAsyncOps));
}

}
Member

do you think that we should also have the last check based on the search API and the returned total hits? Or maybe now that we are using multi_get that step is not necessary?

Contributor Author

Yeah, I think the multi get request we prepared while indexing the documents already does the same thing: it fetches all the indexed documents at the end so the counts can be compared. It seems the high-level REST client tests tend to use multi get instead of search.

}

highLevelClient().indices().refresh(new RefreshRequest());
int searchResultCount = highLevelClient().multiGet(multiGetRequest).getResponses().length;
Member

can you rename this variable, this is not about search anymore, rather the result of multi_get

Contributor Author

@PnPie PnPie left a comment

@javanna Thank you for your detailed review! I updated the PR according to the comments.

Member

@javanna javanna left a comment

hi @PnPie thanks for addressing the comments. I left a couple more minor comments, but I see some tests are failing:

Tests with failures:
  - org.elasticsearch.client.CrudIT.testBulkProcessorIntegration
  - org.elasticsearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithBackoff
  - org.elasticsearch.client.BulkProcessorRetryIT.testBulkRejectionLoadWithoutBackoff
  - org.elasticsearch.client.BulkProcessorIT.testBulkProcessorConcurrentRequestsReadOnlyIndex

and

Tests with failures:
- org.elasticsearch.action.bulk.RetryTests.testRetryWithListenerBacksOff
- org.elasticsearch.action.bulk.RetryTests.testRetryBacksOff

Could you please have a look at these? I am under the impression that the tests were green before the recent changes, but I may be wrong. Let me know if you need any help.

@javanna
Member

javanna commented Apr 17, 2018

hi @PnPie would you have a chance to address my last comments? This is a good change, would love to get it in.

@javanna javanna added the >bug label Apr 17, 2018
@PnPie
Contributor Author

PnPie commented Apr 17, 2018

Hello @javanna,
Sorry for the delay. I had a very quick look last weekend but found a lot of strange things. I rechecked it today and found that I had made a very silly mistake :( I had written the retry condition as status == RETRY_STATUS -> fail instead of status != RETRY_STATUS -> fail ...
I fixed it and checked: all the tests you mentioned pass, except org.elasticsearch.client.BulkProcessorRetryIT, where I get a TimeoutException. I suspect it might be a problem with my environment. Could we launch a Jenkins build to have a look?
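
The inverted condition described above can be illustrated with a short sketch. It is not the PR's code: `CanRetrySketch` and both method names are hypothetical, and the statuses of the failed bulk items are modelled as plain integers rather than RestStatus values.

```java
import java.util.List;

final class CanRetrySketch {
    static final int RETRY_STATUS = 429; // TOO_MANY_REQUESTS

    // Buggy variant: gives up exactly when a failed item has the retryable status.
    static boolean canRetryBuggy(List<Integer> failedItemStatuses) {
        for (int status : failedItemStatuses) {
            if (status == RETRY_STATUS) {
                return false;
            }
        }
        return true;
    }

    // Corrected variant: gives up as soon as any failed item has a non-retryable status.
    static boolean canRetryFixed(List<Integer> failedItemStatuses) {
        for (int status : failedItemStatuses) {
            if (status != RETRY_STATUS) {
                return false;
            }
        }
        return true;
    }
}
```

With the buggy variant, a bulk response whose failures are all rejections would never be retried, which would be consistent with retry tests starting to fail.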

@javanna
Member

javanna commented Apr 17, 2018

test this please

@javanna
Member

javanna commented Apr 19, 2018

retest this please

@javanna
Member

javanna commented Apr 20, 2018

retest this please

@javanna
Member

javanna commented Apr 20, 2018

@PnPie tests were failing, but I am not sure the failures were related to your changes. I merged master in and triggered a new build.

@PnPie
Contributor Author

PnPie commented Apr 20, 2018

Hi @javanna,
Yes, it seems org.elasticsearch.action.bulk.BulkProcessorRetryIT sometimes fails after adding the rejectedAfterAllRetries changes. I can look at it this weekend.

@PnPie
Contributor Author

PnPie commented Apr 24, 2018

Hello @javanna,
I updated the PR. It seems BulkProcessorRetryIT previously failed sometimes because we didn't account for the "no retry (no backoff)" situation. When bulk requests are not retried, the number of indexed docs can be equal to the number of docs in the bulk requests (when there are no failures) or less than that number (when there are failures). I changed the assertions accordingly.
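
The expectation described above could be expressed along these lines. This is only a sketch under assumptions: `IndexedCountCheckSketch` and `isConsistent` are hypothetical names, and the `rejectedAfterAllRetries` flag is assumed to mean that at least one item was still rejected once no further retry was attempted.

```java
final class IndexedCountCheckSketch {
    // Returns true if the observed number of indexed documents is consistent with the run.
    static boolean isConsistent(int indexedDocs, int requestedDocs, boolean rejectedAfterAllRetries) {
        if (rejectedAfterAllRetries) {
            // Some items were still rejected at the end, so strictly fewer documents were indexed.
            return indexedDocs < requestedDocs;
        }
        // No item remained rejected, so every requested document must be present.
        return indexedDocs == requestedDocs;
    }
}
```

Without backoff both branches are reachable, which is why a single equality assertion would fail intermittently.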

Member

@javanna javanna left a comment

thanks @PnPie, LGTM. As far as I can see, the timeout exceptions are caused by another bug, see #24069. It's great that we can finally reproduce it consistently. I will merge your PR once I have fixed that other bug.

@javanna
Member

javanna commented May 4, 2018

For the record, this requires #30384 otherwise the added tests fail.

@javanna javanna added the v6.3.1 label May 9, 2018
@javanna
Member

javanna commented May 9, 2018

retest this please

@javanna javanna merged commit 2228e6e into elastic:master May 9, 2018
@javanna
Member

javanna commented May 9, 2018

thanks a lot @PnPie !

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 9, 2018
…or-you

* elastic/master: (22 commits)
  Docs: Test examples that recreate lang analyzers  (elastic#29535)
  BulkProcessor to retry based on status code (elastic#29329)
  Add GET Repository High Level REST API (elastic#30362)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Add `coordinating_only` node selector (elastic#30313)
  Stop forking groovyc (elastic#30471)
  Avoid setting connection request timeout (elastic#30384)
  Use date format in `date_range` mapping before fallback to default (elastic#29310)
  Watcher: Increase HttpClient parallel sent requests (elastic#30130)
  Mute ML upgrade test (elastic#30458)
  Stop forking javac (elastic#30462)
  Client: Deprecate many argument performRequest (elastic#30315)
  Docs: Use task_id in examples of tasks (elastic#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (elastic#30442)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (elastic#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (elastic#30365)
  Build: Switch to building javadoc with html5 (elastic#30440)
  Add a quick tour of the project to CONTRIBUTING (elastic#30187)
  Reindex: Use request flavored methods (elastic#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (elastic#30432)
  ...
javanna pushed a commit that referenced this pull request May 9, 2018
Previously `BulkProcessor` retry logic was based on the exception type of the failed response (`EsRejectedExecutionException`). This commit changes it to be based on the returned status code. This allows us to reproduce the same retry behaviour when the `BulkProcessor` is used from the high-level REST client, which was previously not the case as we cannot rebuild the same exception type when parsing back the response. This change has no effect on the transport client.

Closes #28885
dnhatn added a commit that referenced this pull request May 10, 2018
* master:
  Upgrade to Lucene-7.4-snapshot-6705632810 (#30519)
  add version compatibility from 6.4.0 after backport, see #30319 (#30390)
  Security: Simplify security index listeners (#30466)
  Add proper longitude validation in geo_polygon_query (#30497)
  Remove Discovery.AckListener.onTimeout() (#30514)
  Build: move generated-resources to build (#30366)
  Reindex: Fold "with all deps" project into reindex (#30154)
  Isolate REST client single host tests (#30504)
  Solve Gradle deprecation warnings around shadowJar (#30483)
  SAML: Process only signed data (#30420)
  Remove BWC repository test (#30500)
  Build: Remove xpack specific run task (#30487)
  AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests
  LLClient: Add setJsonEntity (#30447)
  Expose CommonStatsFlags directly in IndicesStatsRequest. (#30163)
  Silence IndexUpgradeIT test failures. (#30430)
  Bump Gradle heap to 1792m (#30484)
  [docs] add warning for read-write indices in force merge documentation (#28869)
  Avoid deadlocks in cache (#30461)
  Test: remove hardcoded list of unconfigured ciphers (#30367)
  mute SplitIndexIT due to #30416
  Docs: Test examples that recreate lang analyzers  (#29535)
  BulkProcessor to retry based on status code (#29329)
  Add GET Repository High Level REST API (#30362)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Add `coordinating_only` node selector (#30313)
  Stop forking groovyc (#30471)
  Avoid setting connection request timeout (#30384)
  Use date format in `date_range` mapping before fallback to default (#29310)
  Watcher: Increase HttpClient parallel sent requests (#30130)

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/LocalStateCompositeXPackPlugin.java
dnhatn added a commit that referenced this pull request May 10, 2018
* 6.x:
  Upgrade to Lucene-7.4-snapshot-6705632810 (#30519)
  Remove Discovery.AckListener.onTimeout() (#30514)
  Build: move generated-resources to build (#30366)
  Reindex: Fold "with all deps" project into reindex (#30154)
  Isolate REST client single host tests (#30504)
  Remove BWC repository test (#30500)
  Build: Remove xpack specific run task (#30487)
  AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests
  LLClient: Add setJsonEntity (#30447)
  [docs] add warning for read-write indices in force merge documentation (#28869)
  Avoid deadlocks in cache (#30461)
  BulkProcessor to retry based on status code (#29329)
  Avoid setting connection request timeout (#30384)
  Test: remove hardcoded list of unconfigured ciphers (#30367)
  Add GET Repository High Level REST API (#30362)
  mute SplitIndexIT due to #30416
  Docs: Test examples that recreate lang analyzers  (#29535)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Pass the task to broadcast actions (#29672)
  Stop forking groovyc (#30471)
  Add `coordinating_only` node selector (#30313)
  Fix accidental error in changelog
  Use date format in `date_range` mapping before fallback to default (#29310)
  Watcher: Increase HttpClient parallel sent requests (#30130)
  [Security][Tests] Azeri(Turkish) locale tripps opensaml dependency
@jpountz
Contributor

jpountz commented Jun 13, 2018

@javanna This may be backported now.

javanna pushed a commit that referenced this pull request Jun 14, 2018
Previously `BulkProcessor` retry logic was based on the exception type of the failed response (`EsRejectedExecutionException`). This commit changes it to be based on the returned status code. This allows us to reproduce the same retry behaviour when the `BulkProcessor` is used from the high-level REST client, which was previously not the case as we cannot rebuild the same exception type when parsing back the response. This change has no effect on the transport client.

Closes #28885