
org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests testSnapshotWithLargeSegmentFiles #51446

Closed
jakelandis opened this issue Jan 24, 2020 · 4 comments · Fixed by #51593, #51933 or #52804
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments


jakelandis commented Jan 24, 2020

Reproduce with (does not repro locally):

./gradlew ':plugins:repository-gcs:test' --tests "org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStoreRepositoryTests.testSnapshotWithLargeSegmentFiles" -Dtests.seed=79146B9E4DC8DB20 -Dtests.security.manager=true -Dtests.locale=ro -Dtests.timezone=America/La_Paz -Dcompiler.java=13 -Druntime.java=8	

Error

 java.lang.AssertionError: Only index blobs should remain in repository but found [indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__U-yvyHGGSC-HG3-rYgESAQ, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__2W1fsDR7RHqVaHUbR2ljDg, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__PYCKH3IuR8OvsesbZ53Q4A, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__ChKaNwG-QDeOLEFMBmV4hg, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__4BrkUvaDSoaRxda8zaYzmw, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__8TyJLropTGCbwEdpGgkCgg, snap-6BvhIzAVS8Wk0yEzY1yHHw.dat, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__TYjSMqKWSAyimBORtvz2hQ, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__nr1hKmAYS8mwQoYGfONLxA, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__AIvhEKrQTT6rQ95Q_a6Mcw, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/snap-6BvhIzAVS8Wk0yEzY1yHHw.dat, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__OYx13Q6MT329W_IM0oSGsg, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__lc7ep5qgSfGUi5XOU2GgVA, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__FnKJqi01RL6xd-BqebZ9qw, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__iv8fkpEiRkGaiSfcDb6cTg, indices/U3Ra9xA9QYOIf2JnrS9qiw/meta-6BvhIzAVS8Wk0yEzY1yHHw.dat, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__45KqhxnDQJyJbNxg5bOdBw, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__zLdkon40Q5aMTM5YjMsXWQ, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__UAbVvyV5QHq0RsZGVRdphg, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__jXvlBf1lTn6Ct1d2xyEeHg, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__ED_68lu_Qci0brGcdBWWqA, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__0j2RqpNKRKqy6RNbv4iniA, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__SqeeF3m4TkqcEJrcZfx0XA, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__6j7bvik8RDqSNmZIHaHdlw, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__gj5mlfNKQmC_sqRVWkQh4A, meta-6BvhIzAVS8Wk0yEzY1yHHw.dat, indices/U3Ra9xA9QYOIf2JnrS9qiw/0/__YCpLvzOYR6m3ZMbSCjvCuw]	
 
    Expected: a collection with size <0>	
         but: collection size was <27>	
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)	
        at org.junit.Assert.assertThat(Assert.java:956)	
        at org.elasticsearch.repositories.blobstore.ESMockAPIBasedRepositoryIntegTestCase.tearDownHttpServer(ESMockAPIBasedRepositoryIntegTestCase.java:112)

Build scan : https://gradle-enterprise.elastic.co/s/5agjwd5uxd4g6

90-day history: [CI failure history chart omitted]

Suspected related issues (via comments from prior failures)

Note: it seems this happens (exclusively?) on 7.5/7.6/7.x and sometimes, but not always, comes with a SocketTimeout as well. For example (a different build scan than the one above): https://gradle-enterprise.elastic.co/s/4z2vxrxrohjmq/console-log?anchor=7206

jakelandis added the :Distributed Coordination/Snapshot/Restore and >test-failure labels on Jan 24, 2020
@elasticmachine

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear self-assigned this on Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 29, 2020
This test was still very GC-heavy in Java 8 runs in particular, which seems to slow down request processing to the point of timeouts in some runs.
This PR completely removes the large number of O(MB) `byte[]` allocations that were happening in the mock HTTP handler, which cuts the allocation rate by about a factor of 5 in my local testing for the GC-heavy `testSnapshotWithLargeSegmentFiles` run.

Closes elastic#51446
original-brownbear added a commit that referenced this issue Jan 29, 2020
This test was still very GC-heavy in Java 8 runs in particular, which seems to slow down request processing to the point of timeouts in some runs.
This PR completely removes the large number of O(MB) `byte[]` allocations that were happening in the mock HTTP handler, which cuts the allocation rate by about a factor of 5 in my local testing for the GC-heavy `testSnapshotWithLargeSegmentFiles` run.

Closes #51446
Closes #50754
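For illustration only, a minimal sketch (not the actual Elasticsearch change; the class name is hypothetical) of how a mock handler can avoid per-request O(MB) `byte[]` allocations: drain the request body through a small reusable buffer instead of reading the whole upload into memory.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical mock upload handler that avoids allocating a new byte[] for
 * every multi-MB request body. Instead of InputStream#readAllBytes, it drains
 * the body through a small fixed buffer, so the allocation rate stays flat
 * regardless of segment file size.
 */
public class DrainingUploadHandler implements HttpHandler {

    private static final int BUFFER_SIZE = 8 * 1024;

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        long total = 0;
        byte[] buffer = new byte[BUFFER_SIZE]; // small, short-lived allocation
        try (InputStream body = exchange.getRequestBody()) {
            int read;
            while ((read = body.read(buffer)) != -1) {
                total += read; // count bytes; a real handler would also hash or store them
            }
        }
        // Respond only after the body has been fully consumed.
        exchange.sendResponseHeaders(200, -1);
        exchange.close();
    }
}
```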

ywelsch commented Feb 5, 2020

New failure on 7.6: https://gradle-enterprise.elastic.co/s/prwylyxj4k5ja

ywelsch reopened this on Feb 5, 2020
@original-brownbear

This looks like it's in fact this JDK bug: https://bugs.openjdk.java.net/browse/JDK-8180754, which is why we're only seeing the failure on JDK 8. Looking into a workaround ...

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 5, 2020
There is an open JDK bug that is causing an assertion in the JDK's
http server to trip if we don't drain the request body before sending response headers.
See https://bugs.openjdk.java.net/browse/JDK-8180754
Working around this issue here by always draining the request at the beginning of the handler.

Fixes elastic#51446
original-brownbear added a commit that referenced this issue Feb 5, 2020
There is an open JDK bug that is causing an assertion in the JDK's
http server to trip if we don't drain the request body before sending response headers.
See https://bugs.openjdk.java.net/browse/JDK-8180754
Working around this issue here by always draining the request at the beginning of the handler.

Fixes #51446
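A minimal sketch of the workaround described above, assuming a `com.sun.net.httpserver`-based test fixture (the helper name is hypothetical): drain whatever remains of the request body before calling `sendResponseHeaders`, so the assertion inside the JDK's HTTP server is not tripped.

```java
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper illustrating the JDK-8180754 workaround. */
final class ExchangeUtils {

    private ExchangeUtils() {}

    /** Reads and discards any remaining request body bytes. */
    static void drainRequestBody(HttpExchange exchange) throws IOException {
        try (InputStream body = exchange.getRequestBody()) {
            byte[] buffer = new byte[1024];
            while (body.read(buffer) != -1) {
                // discard
            }
        }
    }
}

// Usage at the top of every handler, before any response is written:
//
//   public void handle(HttpExchange exchange) throws IOException {
//       ExchangeUtils.drainRequestBody(exchange);
//       exchange.sendResponseHeaders(404, -1);
//       exchange.close();
//   }
```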

dnhatn commented Feb 25, 2020

@original-brownbear This happened again on 7.x: https://gradle-enterprise.elastic.co/s/w45psdcg3vtji and https://gradle-enterprise.elastic.co/s/jauynygl5zsxw. Would you mind taking another look?

dnhatn reopened this on Feb 25, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 26, 2020
We were not correctly respecting the download range, which led to the GCS SDK client closing the connection at times.
Also, we randomly choose to fail requests in these tests but force the retry count to `0`, which could lead to issues when a chunk of a ranged download fails and the retrying input stream doesn't retry the chunk (the GCS SDK strangely does not retry this case on a `5xx`).

Closes elastic#51446
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 26, 2020
We were not correctly respecting the download range, which led to the GCS SDK client closing the connection at times.

Closes elastic#51446
original-brownbear added a commit that referenced this issue Feb 26, 2020
We were not correctly respecting the download range, which led to the GCS SDK client closing the connection at times.
Also fixes another instance of failing to drain the request fully before sending the response headers.

Closes #51446
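To illustrate what "respecting the download range" involves (a sketch only, not the actual test code; names are hypothetical): the mock download handler has to parse the `Range: bytes=start-end` request header and answer with just that slice and a `206`, rather than always returning the full blob.

```java
import com.sun.net.httpserver.HttpExchange;
import java.io.IOException;
import java.io.OutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical ranged-download handling for a mocked blob store. */
final class RangedDownloads {

    private static final Pattern RANGE = Pattern.compile("bytes=(\\d+)-(\\d+)");

    static void sendBlob(HttpExchange exchange, byte[] blob) throws IOException {
        String rangeHeader = exchange.getRequestHeaders().getFirst("Range");
        int start = 0;
        int end = blob.length - 1;
        int status = 200;
        if (rangeHeader != null) {
            Matcher m = RANGE.matcher(rangeHeader);
            if (m.matches()) {
                // Honour the requested range instead of always returning the full blob.
                start = Integer.parseInt(m.group(1));
                end = Math.min(Integer.parseInt(m.group(2)), blob.length - 1);
                status = 206;
            }
        }
        int length = end - start + 1;
        exchange.sendResponseHeaders(status, length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(blob, start, length);
        }
    }
}
```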
tlrx added a commit that referenced this issue Mar 5, 2020
Tests in GoogleCloudStorageBlobStoreRepositoryTests are known to be flaky on JDK 8 (#51446, #52430) and we suspect a JDK bug (https://bugs.openjdk.java.net/browse/JDK-8180754) that triggers some assertion in the server-side logic that emulates the Google Cloud Storage service.

Sadly we were not able to reproduce the failures, even when using the same OS (Debian 9, Ubuntu 16.04) and JDK (Oracle Corporation 1.8.0_241 [Java HotSpot(TM) 64-Bit Server VM 25.241-b07]) as almost all of the test failures on CI. While we spent some time fixing code (#51933, #52431) to circumvent the JDK bug, the tests are still flaky on JDK 8. This commit mutes these tests for JDK 8 only.

Closes #52906
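A sketch of what muting tests "for JDK 8 only" can look like with a JUnit 4 assumption; the exact mechanism used in the Elasticsearch test suite may differ, and the class below is hypothetical.

```java
import org.junit.Assume;
import org.junit.Before;

/** Hypothetical base class that skips its tests when running on a 1.8 runtime. */
public abstract class Jdk8MutedTestCase {

    @Before
    public void skipOnJdk8() {
        // JDK-8180754 makes the mock GCS HTTP fixture flaky on JDK 8,
        // so the tests are skipped on that runtime only.
        String version = System.getProperty("java.specification.version");
        Assume.assumeFalse(
            "muted on JDK 8, see https://bugs.openjdk.java.net/browse/JDK-8180754",
            "1.8".equals(version));
    }
}
```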