ci: tests: adds tooling to run very expensive tests #12234
base: master
7f536fb to 7b6ddb5
6d114c7 to ce6947c
Seems like the feature is working: the failing test is the one using very expensive tests.
Looks good :D
Setting this up for review. @rvagg, we have an open question for you!
@rvagg bumping this; the PR is ready for review.
2 things about very expensive tests:
You can see in here that
I'm not sure how long it takes on the CI machines we have, but we probably need to increase the
Next steps? We either need somewhere new to run these that can handle the resources, and/or a change in test timeout to make these runnable. It'd be nice if there were a way to couple the timeout specifically with the expensive tests (so we don't end up giving a long timeout to tests that really should finish quickly), but I can't think of a good way to achieve this: the expensive tests are marked within the code, but the timeout has to be set outside the code.
@rvagg thanks for taking a look.
abded1d to 578b05c
Even after 60m it still times out. That's not great at all; these machines must be pretty basic.
I could probably have a look at making that test even dumber and doing less work; this is all the fault of proofs for niporep being really expensive. What options do we have for introducing custom execution machines here? Can we throw our own hardware at this? We have a bare-metal machine which is usable for this; are we able to hook that up and have this test specifically run on there?
@rvagg We use custom runners on AWS: c5.4xlarge machines with 16 cores and 32GB of memory.
(no success since I rebased onto the latest master) On the last example, I noticed the log hung after ~25min; job log grouped per minute:
Same for that run:
We could allocate more resources to the job,
Edit: latest run succeeded in 24 min - https://github.com/filecoin-project/lotus/actions/runs/10054703843/job/27789842695?pr=12234
Edit: following run failed after 3 min - https://github.com/filecoin-project/lotus/actions/runs/10055630487/job/27792728538?pr=12234
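The per-minute grouping of the job log mentioned above can be reproduced with a small script. This is a minimal sketch, assuming each log line starts with an RFC 3339 timestamp followed by a space; the function name `linesPerMinute` and the sample lines are invented for illustration and are not taken from the actual runs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// linesPerMinute counts log lines per UTC minute. It assumes every line
// begins with an RFC 3339 timestamp followed by a space; lines that do not
// match are ignored.
func linesPerMinute(log string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(log, "\n") {
		stamp, _, found := strings.Cut(line, " ")
		if !found {
			continue
		}
		ts, err := time.Parse(time.RFC3339Nano, stamp)
		if err != nil {
			continue
		}
		counts[ts.UTC().Format("2006-01-02 15:04")]++
	}
	return counts
}

func main() {
	// Invented sample input, only to show the shape of the data.
	sample := "2024-01-01T10:00:01Z starting\n" +
		"2024-01-01T10:00:59Z still working\n" +
		"2024-01-01T10:02:00Z done"
	counts := linesPerMinute(sample)

	minutes := make([]string, 0, len(counts))
	for m := range counts {
		minutes = append(minutes, m)
	}
	sort.Strings(minutes)
	for _, m := range minutes {
		fmt.Printf("%s  %d lines\n", m, counts[m])
	}
}
```

A minute that is missing from the output, or a long run of minutes with no lines, is where the job stopped producing logs.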
Please read my comment above on why I suspect the issue is related to the test itself. A proposal to move this forward:
Co-authored-by: Steve Loeppky <biglep@protocol.ai>
cb5416c to ed60445
Thanks. Your proposal to move this forward makes sense to me:
How much variability should we expect on the selection of these machines, and do we have a dedicated c5.4xlarge for the specific test?
It happens at 14:02:29, and the test finishes up at 14:03:11. Total time is 1270.43s, which is decent. This is the kind of behaviour I'd expect from this test.
That looks like an OOM error. I hope we're not taking more than 32G to run this test, are we sure we're getting those resources?
This looks like the culprit, but I have no idea why it might get stuck there!
So yes, I think there might be test issues here, but the
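On the question of whether the job really gets the expected resources, one way to check would be to log CPU count and total memory at the start of the run. The sketch below is an assumption about how that could be done on a Linux runner, not something the PR contains; `memTotalKiB` is a hypothetical helper that parses `/proc/meminfo`-style text.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// memTotalKiB extracts the MemTotal value (in KiB) from text in the format
// of /proc/meminfo.
func memTotalKiB(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemTotal not found")
}

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU())

	// /proc/meminfo only exists on Linux; elsewhere this just reports the error.
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("cannot read /proc/meminfo:", err)
		return
	}
	kib, err := memTotalKiB(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("MemTotal: %.1f GiB\n", float64(kib)/(1024*1024))
}
```

Total memory alone does not prove or rule out an OOM kill, but it would confirm whether the runner matches the 16-core, 32GB machine described earlier.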
@rvagg The test gets a fresh EC2 instance; instances are not shared or reused between tests, so there are no noisy neighbors there.
👌 this is good for now I think; the flaky test can be dealt with separately by me
@laurentsenta good idea to have me test in AWS, if you could share the AMI then I'll give that a go and see. My dev machine is too overpowered to simulate this.
@rvagg Could you give us an account ID that you'd like to use the AMI in? We'll make it available for that account.
@rvagg: did you see the accountId comment (maybe it was handled offline)? Also, is anything else holding us back from merging this?
@BigLep @galargh yes, in Slack: https://filecoinproject.slack.com/archives/C06CY12V83S/p1721957023321019. I haven't checked yet whether I've got a new AMI to play with, but I didn't get a response to that comment yet.
I assigned this to @rvagg because I believe the ball is in his court for merging this functionality (but this isn't a top priority currently).
Related Issues
Closes #12136
Proposed Changes
- `LOTUS_RUN_EXPENSIVE_TESTS` flag unchanged, but make it explicit in CI
- `LOTUS_RUN_VERY_EXPENSIVE_TESTS` flag based on a few signals (wip)
- `need/very-expensive-tests`
- `very_expensive_tests` array in the `test.yml` settings (similar to how tests are configured for custom runners, etc.).
- `-timeout=60m`
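The relationship between the label and the flag can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the label is the only signal modelled (the other signals are still wip), and treating `"1"` as the enabling value is an assumption rather than something stated in this PR.

```go
package main

import (
	"fmt"
	"os"
)

const (
	envVeryExpensive   = "LOTUS_RUN_VERY_EXPENSIVE_TESTS"
	labelVeryExpensive = "need/very-expensive-tests"
)

// wantVeryExpensive reports whether the PR labels request very expensive tests.
func wantVeryExpensive(labels []string) bool {
	for _, l := range labels {
		if l == labelVeryExpensive {
			return true
		}
	}
	return false
}

// shouldSkip reports whether a test gated on the flag should be skipped,
// assuming (hypothetically) that only the value "1" enables it.
func shouldSkip(flagValue string) bool {
	return flagValue != "1"
}

func main() {
	// Labels are read from the command line here purely for demonstration.
	value := "0"
	if wantVeryExpensive(os.Args[1:]) {
		value = "1"
	}
	fmt.Printf("%s=%s\n", envVeryExpensive, value)
	fmt.Println("gated test skipped:", shouldSkip(value))
}
```

In a test, the second function would typically sit behind a call to `t.Skip`, which matches the `skipping test` lines that appear in regular runs.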
Additional Info
Currently the test using VERY EXPENSIVE is `niporep_manual_test.go`.
Example of a regular run: https://github.com/filecoin-project/lotus/actions/runs/9908341740/job/27439500093?pr=12225
You'll find `skipping test` logs in the `gotestsum` step.
Adding the `need/very-expensive-tests` label should toggle those on.
Checklist
Before you mark the PR ready for review, please make sure that:
<PR type>: <area>: <change being made>
fix: mempool: Introduce a cache for valid signatures
PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
area, e.g. api, chain, state, mempool, multisig, networking, paych, proving, sealing, wallet, deps
[skip changelog] to the PR title
skip/changelog to the PR