GoogleCloudStorageBlobContainerRetriesTests.testWriteLargeBlob fails #50754

Closed
dnhatn opened this issue Jan 8, 2020 · 1 comment · Fixed by #51593
Assignees
original-brownbear
Labels
:Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs)
>test-failure (Triaged test failures from CI)

Comments

@dnhatn
Member

dnhatn commented Jan 8, 2020

GoogleCloudStorageBlobContainerRetriesTests.testWriteLargeBlob has failed several times over the last week (see build stats). Some instances:

dnhatn added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Jan 8, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear self-assigned this Jan 8, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 8, 2020
It's impossible to tell why elastic#50754 fails without this change.
We're failing to close the `exchange` somewhere, and there is no
write timeout in the GCS SDK (something to look into separately),
only a read timeout on the socket. So if we fail on an assertion without
reading the full request body (at least into the read buffer), we lock up
waiting forever on `write0`.

This change ensures the `exchange` is closed in the tests where we could lock up
on a write and logs the failure so we can find out what broke elastic#50754.
original-brownbear added a commit that referenced this issue Jan 9, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 9, 2020
original-brownbear added a commit that referenced this issue Jan 9, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
original-brownbear added a commit that referenced this issue Jan 29, 2020
This test was still very GC heavy, in Java 8 runs in particular,
which seems to slow down request processing to the point of timeouts
in some runs.
This PR completely removes the large number of O(MB) `byte[]` allocations
that were happening in the mock HTTP handler, which cuts the allocation rate
by about a factor of 5 in my local testing for the GC-heavy `testSnapshotWithLargeSegmentFiles`
run.

Closes #51446
Closes #50754
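As an illustration of the allocation pattern being removed, here is a hypothetical sketch (not the actual PR) that drains a request body through a small reusable buffer instead of copying each multi-megabyte body into a fresh `byte[]`:

```java
import com.sun.net.httpserver.HttpExchange;

import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drains the request body with a small fixed buffer so the
// mock handler allocates O(KB) per request instead of O(MB), cutting GC pressure.
final class LowAllocationBodyReader {

    private static final int BUFFER_SIZE = 8 * 1024;

    /** Reads and discards the request body, returning the number of bytes consumed. */
    static long drainRequestBody(HttpExchange exchange) throws IOException {
        final byte[] buffer = new byte[BUFFER_SIZE]; // short-lived, small allocation
        long total = 0;
        try (InputStream body = exchange.getRequestBody()) {
            int read;
            while ((read = body.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }
}
```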
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 29, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 29, 2020
original-brownbear added a commit that referenced this issue Jan 29, 2020
original-brownbear added a commit that referenced this issue Jan 29, 2020