
[CI] GoogleCloudStorageBlobStoreRepositoryTests.testWriteReadLarge (and others) causes assertion error in HTTP server #52906

Closed
dakrone opened this issue Feb 27, 2020 · 8 comments
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@dakrone
Member

dakrone commented Feb 27, 2020

On https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_RUNTIME_JAVA=corretto8,nodes=general-purpose/538/console / https://gradle-enterprise.elastic.co/s/3f3lkoii653zq

The test fails due to

09:42:14     java.lang.AssertionError: Only index blobs should remain in repository but found [foobar]
09:42:14     Expected: a collection with size <0>
09:42:14          but: collection size was <1>
09:42:14         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
09:42:14         at org.junit.Assert.assertThat(Assert.java:956)
09:42:14         at org.elasticsearch.repositories.blobstore.ESMockAPIBasedRepositoryIntegTestCase.tearDownHttpServer(ESMockAPIBasedRepositoryIntegTestCase.java:112)
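
For context, here is a minimal sketch of the kind of check that trips here (the real one lives in ESMockAPIBasedRepositoryIntegTestCase#tearDownHttpServer; the class, helper, and blob map below are illustrative only): after the test's snapshots are cleaned up, any blob whose name does not start with "index" is treated as a leftover.

```java
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasSize;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch, not the actual test base class. After cleanup, only repository
// bookkeeping blobs (names starting with "index") should remain; anything else
// (e.g. the "foobar" blob from the failure above) fails the assertion.
public class LeftoverBlobCheckSketch {

    static void assertOnlyIndexBlobsRemain(Map<String, byte[]> blobsInFakeBucket) {
        List<String> leftovers = blobsInFakeBucket.keySet().stream()
            .filter(name -> name.startsWith("index") == false)
            .collect(Collectors.toList());
        assertThat("Only index blobs should remain in repository but found " + leftovers,
            leftovers, hasSize(0));
    }
}
```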

However, this is largely a consequence of the HTTP server dying, which causes a series of:

09:42:14   2> com.google.cloud.storage.StorageException: Read timed out
09:42:14         at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:227)
09:42:14         at com.google.cloud.storage.spi.v1.HttpStorageRpc.write(HttpStorageRpc.java:762)
09:42:14         at com.google.cloud.storage.BlobWriteChannel$1.run(BlobWriteChannel.java:60)
...
09:42:14         Caused by:
09:42:14         java.net.SocketTimeoutException: Read timed out
09:42:14             at java.net.SocketInputStream.socketRead0(Native Method)
09:42:14             at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)

The server seems to die due to a very unhelpful assertion failure in the actual ServerImpl:

09:42:14   2> ADVERTENCIA: Uncaught exception in thread: Thread[Thread-6,5,TGRP-GoogleCloudStorageBlobStoreRepositoryTests]
09:42:14   2> java.lang.AssertionError
09:42:14   2> 	at __randomizedtesting.SeedInfo.seed([E55E3CFA3325D9BE]:0)
09:42:14   2> 	at sun.net.httpserver.ServerImpl.responseCompleted(ServerImpl.java:795)
09:42:14   2> 	at sun.net.httpserver.ServerImpl$Dispatcher.handleEvent(ServerImpl.java:284)
09:42:14   2> 	at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:343)
09:42:14   2> 	at java.lang.Thread.run(Thread.java:748)

I was unable to reproduce this on the 7.x branch.

@dakrone dakrone added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 27, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear original-brownbear self-assigned this Feb 27, 2020
@original-brownbear
Member

This is very likely yet another instance of a JDK bug.
See #51933 for explanation and fix. I'll try to track it down today before my vacation.

@dakrone
Member Author

dakrone commented Feb 27, 2020

This one looks like it might be related: https://gradle-enterprise.elastic.co/s/vrjiq7uidkdv4. It has the same Read timed out error, but a different (410) exception. I don't see the assertion tripping, but it may simply not be included in the output.

Let me know if it looks related or whether I should open a new issue for it.

@original-brownbear
Member

@dakrone thanks for posting that one. It's related and explains the issue perfectly :) Somehow we're failing to fully drain a stream there ...

@original-brownbear
Member

@tlrx I don't have any more time to look into this one I'm afraid. Maybe you can have a look next week while I'm away?

This is definitely the same assertion from https://bugs.openjdk.java.net/browse/JDK-8180754 but I can't find a spot where we wouldn't drain the input stream before writing the response at this point.
My suggestion would be to maybe not waste too much time and simply skip these tests on JDK-8.
I haven't seen this fail on anything but JDK-8 and that's in line with the open JDK bug that seems to only apply to 8. If we just skip 8 we retain all the coverage and this ordeal is finally over with :D
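
For reference, a minimal sketch of the pattern being discussed, assuming a plain com.sun.net.httpserver handler (names are illustrative, not the actual test fixture code): drain the request body to EOF before writing the response, so that JDK 8's sun.net.httpserver.ServerImpl bookkeeping never sees unread bytes when the exchange completes (JDK-8180754).

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Sketch only: read the request body until EOF before responding, otherwise the
// responseCompleted assertion in JDK 8's ServerImpl can trip and kill the dispatcher thread.
public class DrainingHandler implements HttpHandler {

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        try (InputStream body = exchange.getRequestBody()) {
            byte[] buffer = new byte[8192];
            while (body.read(buffer) >= 0) {
                // discard; we only need the stream to reach EOF before the response is written
            }
        }
        byte[] response = "{}".getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(200, response.length);
        exchange.getResponseBody().write(response);
        exchange.close();
    }
}
```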

@dakrone
Member Author

dakrone commented Feb 27, 2020

dakrone added a commit to dakrone/elasticsearch that referenced this issue Feb 27, 2020
These intermittently fail due to an assertion triggered by a JDK bug.

Relates to elastic#52906
dakrone added a commit that referenced this issue Feb 27, 2020
These intermittently fail due to an assertion triggered by a JDK bug.

Relates to #52906
dakrone added a commit that referenced this issue Feb 27, 2020
These intermittently fail due to an assertion triggered by a JDK bug.

Relates to #52906
dakrone added a commit that referenced this issue Feb 27, 2020
These intermittently fail due to an assertion triggered by a JDK bug.

Relates to #52906
@tlrx
Member

tlrx commented Mar 4, 2020

I've spent some time today trying to reproduce this locally with the same OS/JDK as almost all the test failures on CI, and I could not reproduce the failure. I also looked at the code and saw nothing worrying; all the streams seem to be correctly closed. I also found no message in the test execution logs indicating that we were not fully draining the request's input stream. I verified the stats and agree that it only failed on JDK-8.

Thus, I'm following your suggestion, Armin, and have opened #53119 to mute the tests on JDK 8.
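
For illustration, one way such a mute can be expressed (the actual change is in #53119; the class and property check below are a sketch, not the real code): a JUnit assumption that skips the test when running on a 1.8 JVM.

```java
import org.junit.Before;
import org.junit.Test;

import static org.junit.Assume.assumeFalse;

// Sketch only; the real mute is in #53119. On JDK 8 the "java.specification.version"
// system property is "1.8", so the assumption fails and JUnit marks the test as skipped.
public class SkipOnJdk8Sketch {

    @Before
    public void skipOnJdk8() {
        assumeFalse("Muted on JDK 8, see https://bugs.openjdk.java.net/browse/JDK-8180754",
            "1.8".equals(System.getProperty("java.specification.version")));
    }

    @Test
    public void testThatIsFlakyOnJdk8() {
        // runs only on JDK 9 and later
    }
}
```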

tlrx added a commit that referenced this issue Mar 5, 2020
Tests in GoogleCloudStorageBlobStoreRepositoryTests are known to be flaky on JDK 8 (#51446, #52430) and we suspect a JDK bug (https://bugs.openjdk.java.net/browse/JDK-8180754) that triggers an assertion in the server-side logic that emulates the Google Cloud Storage service.

Sadly we were not able to reproduce the failures, even when using the same OS (Debian 9, Ubuntu 16.04) and JDK (Oracle Corporation 1.8.0_241 [Java HotSpot(TM) 64-Bit Server VM 25.241-b07]) as almost all the test failures on CI. While we spent some time fixing code (#51933, #52431) to circumvent the JDK bug, the tests are still flaky on JDK-8. This commit mutes these tests for JDK-8 only.

Closes #52906
tlrx added a commit that referenced this issue Mar 5, 2020
Tests in GoogleCloudStorageBlobStoreRepositoryTests are known to be flaky on JDK 8 (#51446, #52430) and we suspect a JDK bug (https://bugs.openjdk.java.net/browse/JDK-8180754) that triggers an assertion in the server-side logic that emulates the Google Cloud Storage service.

Sadly we were not able to reproduce the failures, even when using the same OS (Debian 9, Ubuntu 16.04) and JDK (Oracle Corporation 1.8.0_241 [Java HotSpot(TM) 64-Bit Server VM 25.241-b07]) as almost all the test failures on CI. While we spent some time fixing code (#51933, #52431) to circumvent the JDK bug, the tests are still flaky on JDK-8. This commit mutes these tests for JDK-8 only.

Closes #52906
@tlrx
Member

tlrx commented Mar 5, 2020

Tests muted for JDK8 in #53119

@tlrx tlrx closed this as completed Mar 5, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 24, 2020
Failure elastic#52906 does not happen in master and is limited to the `7.x` branch
so it wasn't unmuted when the `7.x` fix for this landed => unmuting it here.
original-brownbear added a commit that referenced this issue Mar 24, 2020
Failure #52906 does not happen in master and is limited to the `7.x` branch
so it wasn't unmuted when the `7.x` fix for this landed => unmuting it here.