
[BUG] org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} flaky #5219

Closed · dblock opened this issue Nov 11, 2022 · 14 comments · Fixed by #5325 or #5482
Labels: bug (Something isn't working), flaky-test (Random test failure that succeeds on second run)

@dblock (Member) commented Nov 11, 2022

org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}

https://build.ci.opensearch.org/job/gradle-check/6782/
https://build.ci.opensearch.org/job/gradle-check/6779/
https://build.ci.opensearch.org/job/gradle-check/6778/
https://build.ci.opensearch.org/job/gradle-check/6766/
https://build.ci.opensearch.org/job/gradle-check/6751/
https://build.ci.opensearch.org/job/gradle-check/6750/

@dblock (Member, Author) commented Nov 11, 2022

java.lang.AssertionError: Failure at [repository_s3/20_repository_permanent_credentials:201]: expected [2xx] status code but api [snapshot.create] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"repository_exception","reason":"[repository_permanent] concurrent modification of the index-N file, expected current generation [30972] but it was not found in the repository","stack_trace":"RepositoryException[[repository_permanent] concurrent modification of the index-N file, expected current generation [30972] but it was not found in the repository]\n\tat 
java.lang.AssertionError: Failure at [repository_s3/20_repository_permanent_credentials:152]: expected [2xx] status code but api [snapshot.create] returned [400 Bad Request] [{"error":{"root_cause":[{"type":"invalid_snapshot_name_exception","reason":"[repository_permanent:snapshot-one] Invalid snapshot name [snapshot-one], snapshot with the same name already exists","stack_trace":"InvalidSnapshotNameException[[repository_permanent:snapshot-one] Invalid snapshot name [snapshot-one], snapshot with the same name already exists]\n\tat 
java.lang.AssertionError: Failure at [repository_s3/20_repository_permanent_credentials:262]: expected [2xx] status code but api [snapshot.delete] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"repository_exception","reason":"[repository_permanent] concurrent modification of the index-N file, expected current generation [30954] but it was not found in the repository","stack_trace":"RepositoryException[[repository_permanent] concurrent modification of the index-N file, expected current generation [30954] but it was not found in the repository]\n\tat 
"type" : "repository_exception",
        "reason" : "[repository_permanent] concurrent modification of the index-N file, expected current generation [30896] but it was not found in the repository",
        "stack_trace" : "RepositoryException[[repository_permanent] concurrent modification of the index-N file, expected current generation [30896] but it was not found in the repository]
java.lang.AssertionError: Failure at [repository_s3/20_repository_permanent_credentials:201]: expected [2xx] status code but api [snapshot.create] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"repository_exception","reason":"[repository_permanent] concurrent modification of the index-N file, expected current generation [30818] but it was not found in the repository","stack_trace":"RepositoryException[[repository_permanent] concurrent modification of the index-N file, expected current generation [30818] but it was not found in the repository]\n\tat 

Tells me it's some race condition.

@reta (Collaborator) commented Nov 14, 2022

@dblock I will try to fix this one. Oops, sorry, I see you are working on it; never mind.

@dblock (Member, Author) commented Nov 15, 2022

@reta Take it! I suspect the root cause is a failure in the setup/teardown of the tests, but so far I haven't found anything meaningful.

@reta (Collaborator) commented Nov 23, 2022

Fresh failure: https://build.ci.opensearch.org/job/gradle-check/7276/ but with a new cause:

"[groupSize must be greater than 0 but was -1]; nested: IllegalArgumentException[groupSize must be greater than 0 but was -1];
	at org.opensearch.OpenSearchException.guessRootCauses(OpenSearchException.java:679)
	at org.opensearch.OpenSearchException.generateFailureXContent(OpenSearchException.java:607)
	at org.opensearch.rest.BytesRestResponse.build(BytesRestResponse.java:164)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:125)
	at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:105)
	at org.opensearch.rest.action.RestActionListener.onFailure(RestActionListener.java:71)
	at org.opensearch.action.support.TransportAction$1.onFailure(TransportAction.java:112)
	at org.opensearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$2(TransportMasterNodeAction.java:193)
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109)
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188)
	at org.opensearch.action.ActionListener.onFailure(ActionListener.java:247)
	at org.opensearch.snapshots.SnapshotsService.failListenersIgnoringException(SnapshotsService.java:3006)
	at org.opensearch.snapshots.SnapshotsService.access$3300(SnapshotsService.java:141)
	at org.opensearch.snapshots.SnapshotsService$15.handleListeners(SnapshotsService.java:2764)
	at org.opensearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.clusterStateProcessed(SnapshotsService.java:2857)
	at org.opensearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:616)
	at org.opensearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:488)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
	at org.opensearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:488)
	at org.opensearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:316)
	at org.opensearch.cluster.service.MasterService.publish(MasterService.java:308)
	at org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:285)
	at org.opensearch.cluster.service.MasterService.access$000(MasterService.java:86)
	at org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:173)
	at org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:174)
	at org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:212)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:733)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:275)
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:238)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalArgumentException: groupSize must be greater than 0 but was -1
	at org.opensearch.action.support.GroupedActionListener.<init>(GroupedActionListener.java:64)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleIndices(BlobStoreRepository.java:1248)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupStaleBlobs(BlobStoreRepository.java:1102)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.cleanupUnlinkedRootAndIndicesBlobs(BlobStoreRepository.java:889)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$doDeleteShardSnapshots$7(BlobStoreRepository.java:854)

@reta (Collaborator) commented Nov 24, 2022

https://build.ci.opensearch.org/job/gradle-check/7310

RepositoryException[[repository_permanent] could not read repository data from index blob]; nested: NoSuchFileException[Blob object [base_integration_tests/index-33116] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: BV24YG8Z1EYBVKQV; S3 Extended Request ID: RzTmKdLPXdgVFSDIpmulCFgiVMTc82ccLUA/rV2Ibkdz8RUIyP6SvUn02mCH3ffjrumP/jmtyCo=; Proxy: null)];
	at org.opensearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1842)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.safeRepositoryData(BlobStoreRepository.java:798)
	at org.opensearch.repositories.blobstore.BlobStoreRepository$2.doRun(BlobStoreRepository.java:731)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: java.nio.file.NoSuchFileException: Blob object [base_integration_tests/index-33116] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: BV24YG8Z1EYBVKQV; S3 Extended Request ID: RzTmKdLPXdgVFSDIpmulCFgiVMTc82ccLUA/rV2Ibkdz8RUIyP6SvUn02mCH3ffjrumP/jmtyCo=; Proxy: null)
	at org.opensearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:118)
	at org.opensearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:96)
	at org.opensearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:79)
	at org.opensearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:112)
	at org.opensearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1827)
	... 7 more

@dblock (Member, Author) commented Nov 28, 2022

@reta Looks like this is a legit bug (thanks for opening #5371) vs. random flaky test failure. Shall we re-close this one to avoid confusion?

@reta (Collaborator) commented Nov 28, 2022

@reta Looks like this is a legit bug (thanks for opening #5371) vs. random flaky test failure. Shall we re-close this one to avoid confusion?

Thanks @dblock, the second occurrence (#5219 (comment)) is not clear to me; I suspect it is still flakiness. I will watch the builds a bit and update you, thanks!

@dreamer-89 (Member) commented Dec 6, 2022

Thanks @dblock @reta for the analysis on this so far, and @andrross for identifying recent gradle failures due to this issue. I looked into this; below is the progress so far.

From the stack traces it does look like a race condition that fails on a specific action (listed in the failure stack trace), but all attempts to reproduce it with enough logging to prove the race condition have failed. To repro, I ran ./gradlew :plugins:repository-s3:yamlRestTest but didn't see a single failure even with a repetition count of 100 (a per-suite variant with the randomized-testing knobs is sketched after the output below). Next, I tried running gradle check (mimicking the CI check and hoping for failures), but after 4 successful runs I see Java heap space failures.
@reta: Any ideas how I can fix this? I am running on latest main on an Ubuntu c5.24xlarge machine; I tried a reboot as well. The heap dump of one of the OpenSearch processes was ~5 GB, and most tools decline to parse a file of that size.

for i in $(seq 0 100) ; do echo "Iteration: $i" && ./gradlew clean > /dev/null 2>&1 && ./gradlew check >> /tmp/test-output-2022-12-05_$i.txt 2>&1; done

ubuntu@ip-172-31-41-165:~/OpenSearch$ grep "^BUILD FAILED" /tmp/test-output-2022-12-05_* | wc -l
4
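
As a hedged aside, the standard RandomizedRunner knobs used across the OpenSearch build can hammer just this suite rather than the whole check; the property names below (tests.class, tests.iters) are those knobs, but the class filter and repetition count are illustrative assumptions rather than what was actually run here.

# Sketch only: repeat the S3 yaml suite with a class filter (values illustrative).
./gradlew :plugins:repository-s3:yamlRestTest \
  -Dtests.class="org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT" \
  -Dtests.iters=25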

1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':testAggregateTestReport'.
> Multiple build operations failed.
      Java heap space
      Java heap space
   > Java heap space
   > Java heap space

Build 7310: snapshot deletion failure

java.lang.AssertionError: Failure at [repository_s3/20_repository_permanent_credentials:22]: expected [2xx] status code but api [snapshot.delete] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"repository_exception","reason":"[repository_permanent] could not read repository data from index blob","stack_trace":"RepositoryException[[repository_permanent] could not read repository data from index blob]; nested: NoSuchFileException[Blob object [****_integration_tests/index-33116] not found: The specified key does not exist

Build 6778: snapshot creation failure

java.lang.AssertionError: expected [2xx] status code but api [snapshot.create] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"snapshot_exception","reason":"[repository_permanent:snapshot-one/tBb-g3U1QsGROobhOKrjGw] failed to update snapshot in repository","stack_trace":"SnapshotException[[repository_permanent:snapshot-one/tBb-g3U1QsGROobhOKrjGw] failed to update snapshot in repository]; nested: IllegalStateException[Duplicate key docs (attempted merging values [docs/JdXo8HClSeu2GEuqJUgpMQ] and [docs/dB3pskUbQ3S3rsTds_INgw])]

Build 6779: snapshot deletion failure

java.lang.AssertionError: expected [2xx] status code but api [snapshot.delete] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"repository_exception","reason":"[repository_permanent] concurrent modification of the index-N file, expected current generation [30943] but it was not found in the repository","stack_trace":"RepositoryException[[repository_permanent] concurrent modification of the index-N file, expected current generation [30943] but it was not found in the repository]

Also, I see the repository-s3 yaml tests are invoked against external S3 endpoints, and the fixtures are only used when S3 credentials are not set; I verified that by setting the creds and running the repository-s3 tests. We updated the gradle check job in #1763 to use the external S3 endpoints, which is probably the cause of the race condition.
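
For reference, a minimal sketch of that switch, assuming the environment variable names the plugin's build.gradle conventionally checks (amazon_s3_access_key, amazon_s3_secret_key, amazon_s3_bucket, amazon_s3_base_path); treat the names as assumptions and verify them against the build file before relying on this.

# Sketch: with these variables unset the build falls back to the local fixture;
# with them set, the permanent-credentials yaml tests run against real S3.
export amazon_s3_access_key="<access-key>"            # assumed variable name
export amazon_s3_secret_key="<secret-key>"            # assumed variable name
export amazon_s3_bucket="<bucket-name>"               # assumed variable name
export amazon_s3_base_path="base_integration_tests"   # assumed variable name
./gradlew :plugins:repository-s3:yamlRestTest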

@reta (Collaborator) commented Dec 6, 2022

Thanks @dreamer-89

To repro, I ran ./gradlew :plugins:repository-s3:yamlRestTest but didn't see a single failure even with a repetition count of 100.

Same, no luck reproducing it locally

I verified that by setting the creds and running the repository-s3 tests. We updated the gradle check job in #1763 to use the external S3 endpoints.

Aha, that could be the difference, since locally the tests are run against Minio clusters.

@dreamer-89 (Member)

Unrelated to the flaky test itself and more about reproducing the test failure, which was unsuccessful :(

The previous effort to decipher the heap dumps (auto-generated by gradle check) was unsuccessful due to the large heap dump files (~5 GB).

ubuntu@ip-172-31-41-165:~/OpenSearch$ ls -al
total 35339500
...
-rwxrwxr-x  1 ubuntu ubuntu       8070 Dec  5 21:55 gradlew
-rw-rw-r--  1 ubuntu ubuntu       2763 Jun 14 19:40 gradlew.bat
-rw-------  1 ubuntu ubuntu 5219185718 Dec  6 09:25 java_pid1532657.hprof
-rw-------  1 ubuntu ubuntu 4971058607 Dec  6 04:35 java_pid1745377.hprof
-rw-------  1 ubuntu ubuntu 5271518597 Dec  6 03:05 java_pid2568.hprof
-rw-------  1 ubuntu ubuntu 5260716271 Dec  6 10:50 java_pid2681937.hprof
-rw-------  1 ubuntu ubuntu 5155513083 Dec  6 06:15 java_pid3408534.hprof
-rw-------  1 ubuntu ubuntu 5101514079 Dec  6 12:18 java_pid3882150.hprof
-rw-------  1 ubuntu ubuntu 5207750826 Dec  6 07:54 java_pid419165.hprof
...

I used jhat and Eclipse MAT, but both failed due to heap space issues; increasing the available heap for MAT to 8 GB (which slowed my machine down drastically) was still not enough. A workaround sketch follows the jhat output below.

➜  ~ jhat -port 9999 Desktop/dumps/java_pid2568.hprof
Reading from Desktop/dumps/java_pid2568.hprof...
Dump file created Mon Dec 05 19:04:38 PST 2022
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at com.sun.tools.hat.internal.model.Snapshot.makeId(Snapshot.java:586)
	at com.sun.tools.hat.internal.model.Snapshot.addHeapObject(Snapshot.java:166)
	at com.sun.tools.hat.internal.parser.HprofReader.readInstance(HprofReader.java:744)
	at com.sun.tools.hat.internal.parser.HprofReader.readHeapDump(HprofReader.java:491)
	at com.sun.tools.hat.internal.parser.HprofReader.read(HprofReader.java:275)
	at com.sun.tools.hat.internal.parser.Reader.readFile(Reader.java:92)
	at com.sun.tools.hat.Main.main(Main.java:159)
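
A small workaround sketch for the analyzer OOM: give the analysis JVM more heap than the dump needs. jhat forwards -J options to its own JVM, and Eclipse MAT takes its limit from the -Xmx line in MemoryAnalyzer.ini; the 12g value and dump path below are illustrative.

# Raise the parser heap for jhat (value illustrative; pick something above the dump size).
jhat -J-Xmx12g -port 9999 Desktop/dumps/java_pid2568.hprof
# For Eclipse MAT, raise -Xmx in MemoryAnalyzer.ini (e.g. -Xmx12g) before re-parsing the dump.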

To capture the memory issue, I re-ran gradle check and used pstree to capture long-running Java processes (>1 hr, flagged via the red color code); it shows the Gradle process with GC and jar-transform workers running.

  ├─java,5272,5272 -XX:+HeapDumpOnOutOfMemoryError -Xss2m --add-exports jdk.compiler/com.sun.tools.javac.api=ALL-UNNAMED --add-exports jdk.compiler/com.sun.tools.javac.file=ALL-UNNAMED --add-exportsjdk.compiler/com.sun.tools.javac.parser=ALL
  │   ├─{C1 CompilerThre},5334,5272
  │   ├─{C2 CompilerThre},5333,5272
  │   ├─{Cache worker fo},6470,5272
  │   ├─{Cache worker fo},8552,5272

All sub-processes from the above can be checked in the gist.

Survivor space 1 shows 100% consumption, which probably points to a garbage collection issue; maybe increasing the young generation size would help (one option is sketched after the jstat output below). I'm just curious how gradle check does not cause memory issues on CI.

ubuntu@ip-172-31-41-165:~/OpenSearch$ jstat -gcutil 5272 1000 5
  S0     S1     E      O      M     CCS    YGC     YGCT     FGC    FGCT     CGC    CGCT       GCT
  0.00 100.00  64.50  62.84  58.16  43.30    105     4.694     0     0.000    24     1.307     6.000
  0.00 100.00  64.50  62.84  58.16  43.30    105     4.694     0     0.000    24     1.307     6.000
  0.00 100.00  64.50  62.84  58.16  43.30    105     4.694     0     0.000    24     1.307     6.000
  0.00 100.00  64.50  62.84  58.16  43.30    105     4.694     0     0.000    24     1.307     6.000
  0.00 100.00  64.50  62.84  58.16  43.30    105     4.694     0     0.000    24     1.307     6.000
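
If the pressure is in the Gradle/test JVMs rather than in OpenSearch itself, one hedged option is to raise the worker heap via gradle.properties; org.gradle.jvmargs is the standard Gradle property, but the 8g value and whether this is appropriate for a full gradle check run are assumptions.

# Sketch: raise the Gradle daemon/worker heap (value illustrative). If the repo's
# gradle.properties already defines org.gradle.jvmargs, edit that line instead.
cat >> gradle.properties <<'EOF'
org.gradle.jvmargs=-Xmx8g -XX:+HeapDumpOnOutOfMemoryError
EOF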

@reta (Collaborator) commented Dec 7, 2022

@dreamer-89 I think I know what is happening (this is related to the S3 fixture failures, not the OOM): since we now run against S3 (after #1763) and the tests use the same bucket, it is highly likely that a few builds running at the same time modify the same S3 bucket concurrently, hence these random failures. I will try to randomize this part.

@dreamer-89 (Member) commented Dec 7, 2022

@dreamer-89 I think I know what is happening (this is related to the S3 fixture failures, not the OOM): since we now run against S3 (after #1763) and the tests use the same bucket, it is highly likely that a few builds running at the same time modify the same S3 bucket concurrently, hence these random failures. I will try to randomize this part.

Thanks @reta, this makes sense.

I tried creating a randomized base path and it seems to work fine.

@reta (Collaborator) commented Dec 7, 2022

I tried creating a randomized base path and it seems to work fine.

Exactly @dreamer-89, we already do that in a few places with testSeed; pull request on the way (testing).
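
A minimal sketch of the idea, under the same environment-variable assumption as above; the actual pull request wires the build's testSeed into the repository base path in build.gradle, which is the cleaner variant of the same thing.

# Give each run its own prefix in the shared bucket so concurrent CI builds
# cannot touch each other's index-N files (variable name is an assumption).
export amazon_s3_base_path="base_integration_tests_$(date +%s)_${RANDOM}"
./gradlew :plugins:repository-s3:yamlRestTest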

@dreamer-89 (Member)

Verified that in the latest runs there are no S3 repository test failures (a generic sketch of the same crawl is below the output).

➜  OpenSearch git:(main) ✗ ruby flaky-finder.rb --s 7740 --e 8258
Will crawl builds from 7740 to 8258
------------------
8 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness (7814,7864,8014,8066,8078,8166,8217,8248)
5 org.opensearch.indices.stats.IndexStatsIT.testFilterCacheStats (7932,7977,8066,8156,8169)
3 org.opensearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT.test {yaml=pit/10_basic/Delete all} (7947,8078,8154)
2 org.opensearch.index.IndexServiceTests.testAsyncTranslogTrimTaskOnClosedIndex (7976,8250)
2 org.opensearch.snapshots.DedicatedClusterSnapshotRestoreIT.testIndexDeletionDuringSnapshotCreationInQueue (8004,8122)
2 org.opensearch.cluster.service.MasterServiceTests.classMethod (8193,8193)
1 org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testNodesRemovedAfterZoneDecommission_ClusterManagerNotInToBeDecommissionedZone (7930)
1 org.opensearch.repositories.s3.S3BlobContainerRetriesTests.testReadRangeBlobWithRetries (7942)
1 org.opensearch.cluster.routing.allocation.decider.ConcurrentRecoveriesAllocationDeciderTests.testClusterConcurrentRecoveries (8035)
1 org.opensearch.client.PitIT.testDeleteAllAndListAllPits (8115)
1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringFetchPhase (8142)
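
Not the actual flaky-finder.rb, but a hedged sketch of the same crawl using the Jenkins JSON API; it assumes the gradle-check job publishes a JUnit testReport at the usual endpoint, and the build range is illustrative.

# Walk a build range and count failed test cases across builds (sketch only).
for build in $(seq 7740 7745); do
  curl -sf "https://build.ci.opensearch.org/job/gradle-check/${build}/testReport/api/json" \
    | jq -r '.suites[].cases[] | select(.status == "FAILED" or .status == "REGRESSION") | "\(.className).\(.name)"'
done | sort | uniq -c | sort -rn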
