[BUG] org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials} flaky #5219
Tells me it's some race condition.
@reta Take it! I suspect the root cause is a failure in the setup/teardown of the tests, but so far I haven't found anything meaningful.
Fresh failure: https://build.ci.opensearch.org/job/gradle-check/7276/ but with a new cause:
https://build.ci.opensearch.org/job/gradle-check/7310
Thanks @dblock, the second occurrence (#5219 (comment)) is not clear to me; I suspect it is still flakiness. I will watch the builds a bit and update you, thanks!
Thanks @dblock @reta for the analysis on this so far, and @andrross for identifying the recent gradle failures due to this issue. I looked into this and below is the progress so far. From the stack trace it does seem to be a race condition that fails on a specific action (listed in the failure stack trace), but all repro efforts with appropriate logging to prove the race condition failed. To repro, I ran
Failing builds and the action that failed:
7310: snapshot deletion failure
6778: snapshot creation
6779: snapshot deletion
Also, I see the repository-s3 YAML tests are invoked against external S3 endpoints, and the fixtures are only used when S3 credentials are not set; I verified that by setting the creds and running the repository-s3 tests. We updated the gradle check job in #1763 to use the external S3 endpoints, which is probably the cause of the race condition. A sketch of the credential check is below.
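For context, a minimal sketch of how that selection typically looks in the repository-s3 build script. This is an illustration of the pattern, not a verbatim copy; the environment variable names and dummy defaults are assumptions:

```groovy
// Hypothetical sketch (env var names are assumed, not copied from the build script):
// use the bundled fixture only when no real S3 credentials are provided.
String s3PermanentAccessKey = System.getenv('amazon_s3_access_key')
String s3PermanentSecretKey = System.getenv('amazon_s3_secret_key')
String s3PermanentBucket = System.getenv('amazon_s3_bucket')
String s3PermanentBasePath = System.getenv('amazon_s3_base_path')

boolean useFixture = false
if (!s3PermanentAccessKey && !s3PermanentSecretKey && !s3PermanentBucket && !s3PermanentBasePath) {
  // No external credentials in the environment: fall back to the local fixture (Minio) with dummy values.
  s3PermanentAccessKey = 'access_key'
  s3PermanentSecretKey = 'secret_key'
  s3PermanentBucket = 'bucket'
  s3PermanentBasePath = 'base_path'
  useFixture = true
}
```

With credentials set on CI (as after #1763), the tests hit the real bucket rather than the fixture.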
Thanks @dreamer-89
Same, no luck reproducing it locally.
Aha, that could be the difference, since locally the tests are run against Minio clusters.
Unrelated to the flaky test itself and more about repro'ing the test failure, which was unsuccessful :( A previous effort to decipher the heap dump (auto-generated by gradle check) was unsuccessful due to the large heap dump files (~5 GB).
Used
To capture the memory issue, re-ran the gradle check and used
All sub-processes from the above can be checked in the gist. Survivor space 1 shows 100% consumption, probably an issue with garbage collection; maybe increasing the young generation (YG) size would help. Just curious how gradle check is not causing memory issues on CI.
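As a rough illustration only (not the project's actual settings), bumping the young generation for the test JVMs could be tried through the Gradle Test task; the sizes below are placeholders:

```groovy
// Illustrative only: enlarge the young generation for the test JVMs to see whether
// the survivor-space pressure seen in the heap dump goes away. Sizes are placeholders.
tasks.withType(Test).configureEach {
  maxHeapSize = '4g'
  // -Xmn fixes the young generation size; keep a dump on OOM for later analysis.
  jvmArgs '-Xmn1g', '-XX:+HeapDumpOnOutOfMemoryError'
}
```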
@dreamer-89 I think I know what is happening (this is related to the S3 fixture failures, not OOM): since we now run against S3 (after #1763) and the tests use the same bucket, it is highly likely that a few builds running at the same time modify the same S3 bucket concurrently, hence these random failures. I will try to randomize this part.
Thanks @reta, this makes sense. I tried creating a randomized base path and it seems to work fine.
Exactly @dreamer-89, we already do that in a few places with
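A minimal sketch of what the base-path randomization could look like in the repository-s3 build script, assuming the build tooling exposes a per-run test seed; the BuildParams import and property are assumptions, not confirmed from this thread:

```groovy
// Hypothetical sketch: give each gradle-check run its own prefix inside the shared
// bucket so concurrent builds never write to the same S3 keys.
// BuildParams.testSeed is an assumption about the build tooling, not confirmed here.
import org.opensearch.gradle.info.BuildParams

String s3PermanentBasePath = System.getenv('amazon_s3_base_path') ?: 'base_path'
// Append the per-run test seed so two concurrent builds use different base paths.
s3PermanentBasePath = s3PermanentBasePath + '_' + BuildParams.testSeed
```

Any per-run unique suffix (a random string, a build number) would serve the same purpose; the key point is that concurrent gradle-check runs stop sharing the same base path in the bucket.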
Verified that in the latest runs there are no s3 repository test failures.
org.opensearch.repositories.s3.RepositoryS3ClientYamlTestSuiteIT/test {yaml=repository_s3/20_repository_permanent_credentials/Snapshot and Restore with repository-s3 using permanent credentials}
https://build.ci.opensearch.org/job/gradle-check/6782/
https://build.ci.opensearch.org/job/gradle-check/6779/
https://build.ci.opensearch.org/job/gradle-check/6778/
https://build.ci.opensearch.org/job/gradle-check/6766/
https://build.ci.opensearch.org/job/gradle-check/6751/
https://build.ci.opensearch.org/job/gradle-check/6750/