Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: tests: adds tooling to run very expensive tests #12234

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

laurentsenta
Copy link
Contributor

@laurentsenta laurentsenta commented Jul 15, 2024

Related Issues

Closes #12136

Proposed Changes

  • Keep the LOTUS_RUN_EXPENSIVE_TESTS flag unchanged, but make it explicit in CI
  • Adds the LOTUS_RUN_VERY_EXPENSIVE_TESTS flag based on a few signals (wip)
    • If a PR contains the label need/very-expensive-tests
    • During a nightly run
  • Adds a very_expensive_tests array in the test.yml settings (similar to how test are configured for custom runners, etc).
    • Tests in that array will run with -timeout=60m

Additional Info

Currently the test using VERY EXPENSIVE is niporep_manual_test.go.

Example of a regular run:
https://github.com/filecoin-project/lotus/actions/runs/9908341740/job/27439500093?pr=12225
You'll find skipping test logs in the gotestsum step.

Adding the need/very-expensive-tests label should toggle those on.

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • Update CHANGELOG.md or signal that this change does not need it.
    • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
    • If the change does not require a CHANGELOG.md entry, do one of the following:
      • Add [skip changelog] to the PR title
      • Add the label skip/changelog to the PR
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

@laurentsenta laurentsenta added the skip/changelog This change does not require CHANGELOG.md update label Jul 15, 2024
@laurentsenta laurentsenta marked this pull request as draft July 15, 2024 09:41
.github/workflows/test.yml Outdated Show resolved Hide resolved
.github/workflows/test.yml Outdated Show resolved Hide resolved
@laurentsenta laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 2 times, most recently from 7f536fb to 7b6ddb5 Compare July 15, 2024 11:44
@laurentsenta laurentsenta added need/very-expensive-tests skip/changelog This change does not require CHANGELOG.md update and removed skip/changelog This change does not require CHANGELOG.md update labels Jul 15, 2024
@laurentsenta laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 2 times, most recently from 6d114c7 to ce6947c Compare July 15, 2024 13:47
@laurentsenta laurentsenta requested a review from galargh July 15, 2024 14:04
@laurentsenta
Copy link
Contributor Author

Seems like the feature is working: the test failing is the one using very expensive tests
https://github.com/filecoin-project/lotus/actions/runs/9940645037/job/27457885144?pr=12234

Copy link
Contributor

@galargh galargh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :D

.github/workflows/test.yml Show resolved Hide resolved
.github/workflows/test.yml Show resolved Hide resolved
@laurentsenta laurentsenta marked this pull request as ready for review July 16, 2024 11:57
@laurentsenta laurentsenta requested a review from rvagg July 16, 2024 11:57
@laurentsenta
Copy link
Contributor Author

laurentsenta commented Jul 16, 2024

Setting this up for review, @rvagg we have an open question for you!

@laurentsenta
Copy link
Contributor Author

@rvagg bumping this, the PR is ready for review,
we need your input on the envs discussion above, and you might want to fix the very expensive test before merging.

@rvagg
Copy link
Member

rvagg commented Jul 19, 2024

2 things about very expensive tests:

  1. They'll take a lot longer to run
  2. They'll probably need a beefier machine to run on (or, they take a lot lot longer, which might be acceptable)

You can see in here that itest-niporep_manual where the only VERY_EXPENSIVE test is so far is timing out at the default golang 10m mark:

DONE 9 tests, 5 skipped, 3 failures in 673.657s

I'm not sure how long it takes in the CI machines we have, but we probably need to increase the go test timeout by quite a bit. Maybe 30m to start with and see how it goes? If it's something that we can easily bump in the config file then we could do that as we add more.

@rvagg
Copy link
Member

rvagg commented Jul 22, 2024

Next steps? We either need somewhere new to run these that can handle the resources, and/or a change in test timeout to make these runnable. It'd be nice if there were a way to couple the timeout specifically with the expensive tests (so we don't end up giving a long timeout to tests that really should finish quickly) but I can't think of a good way to achieve this - the expensive tests are marked within the code but the timeout needs to be done outside the code.

@laurentsenta
Copy link
Contributor Author

@rvagg thanks for taking a look,
I agree it'd make sense to wait for something like golang/go#48157 then set a timeout per test,
I'll check with timeout first and report back, that's the most manageable value to tweak.

@laurentsenta laurentsenta force-pushed the ipdx-with-very-expensive-flag branch 9 times, most recently from abded1d to 578b05c Compare July 22, 2024 14:10
@rvagg
Copy link
Member

rvagg commented Jul 23, 2024

Even after 60m it still times out, that's not great at all, these machines must be pretty basic.

DONE 9 tests, 5 skipped, 3 failures in 3670.053s

I could probably have a look at making that test even dumber and doing less work, this is all the fault of proofs for niporep being really expensive.

What options do we have for introducing custom execution machines here? Can we throw our own hardware at this? We have a bare-metal machine which is usable for this, are we able to hook that up and have this test specifically run on there?

@laurentsenta
Copy link
Contributor Author

laurentsenta commented Jul 23, 2024

@rvagg We use custom runners on AWS, c5.4xlarge machine with 16 cores and 32GB of memory,
I suspect the test is super flaky:

(no success since I rebased with latest master)

On the last example, I noticed the log hang'ed after ~25min,

job log grouped per minute:

2024-07-22T15:29 19
2024-07-22T15:30 2653
2024-07-22T15:32 755
2024-07-22T15:33 1086
2024-07-22T15:34 178
2024-07-22T15:35 142
2024-07-22T15:36 109
2024-07-22T15:37 196
2024-07-22T15:38 88
2024-07-22T15:39 208
2024-07-22T15:40 57
2024-07-22T15:41 164
2024-07-22T15:42 124
2024-07-22T15:43 222
2024-07-22T15:44 207
2024-07-22T15:45 189
2024-07-22T15:46 145
2024-07-22T15:47 167
2024-07-22T15:48 161
2024-07-22T15:49 181
2024-07-22T15:50 204
2024-07-22T15:51 102
2024-07-22T15:52 19
2024-07-22T15:53 199
2024-07-22T15:54 212
2024-07-22T15:55 220
2024-07-22T15:56 36 # no more logs until timeout
2024-07-22T16:32 14761 # timeout errors

Same for that run:

2024-07-22T18:20 78
2024-07-22T18:21 2594
2024-07-22T18:22 63
2024-07-22T18:23 1385
2024-07-22T18:24 664
2024-07-22T18:25 208
2024-07-22T18:26 105
2024-07-22T18:27 221
2024-07-22T18:28 57
2024-07-22T18:29 235
2024-07-22T18:30 194
2024-07-22T18:31 142
2024-07-22T18:32 240
2024-07-22T18:33 136
2024-07-22T18:34 244
2024-07-22T18:35 131
2024-07-22T18:36 174
2024-07-22T18:37 207
2024-07-22T18:38 257
2024-07-22T18:39 181
2024-07-22T18:40 167
2024-07-22T18:41 118
2024-07-22T18:43 217
2024-07-22T18:44 234
2024-07-22T19:22 14917 

We could allocate more resources to the job,

  • but could you double check nothing is wrong with the test itself? some sort of deadlock or flakiness.
  • if 16 cores and 32GB CPU is not enough, approx. how much would you need for the test to succeed consistently?

Edit: latest run succeeded in 24 min - https://github.com/filecoin-project/lotus/actions/runs/10054703843/job/27789842695?pr=12234

Edit: following run failed after 3 min - https://github.com/filecoin-project/lotus/actions/runs/10055630487/job/27792728538?pr=12234

@laurentsenta
Copy link
Contributor Author

laurentsenta commented Jul 23, 2024

Please read my comment above on why I suspect the issue is related to the test itself,

A proposal to move this forward:

  • I'll add the "create github issue on schedule job failure" feature requested by @BigLep, (edit: Done)
  • We merge this PR and let the lotus team investigate, thanks to the need/very-expensive-tests label, this will be easier as its own PR managed by the team
  • If you identify an issue with the runners we can fix that later on.

@laurentsenta laurentsenta force-pushed the ipdx-with-very-expensive-flag branch from cb5416c to ed60445 Compare July 23, 2024 09:50
@laurentsenta laurentsenta requested a review from BigLep July 23, 2024 12:53
@BigLep BigLep added area/expensive-test-failure Used in .github/workflows/test.yml to denote a VERY_EXPENSIVE_TEST failure. and removed area/expensive-test-failure Used in .github/workflows/test.yml to denote a VERY_EXPENSIVE_TEST failure. labels Jul 23, 2024
Copy link
Member

@BigLep BigLep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Your proposal to move this forward makes sense to me:

.github/workflows/test.yml Outdated Show resolved Hide resolved
.github/workflows/test.yml Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Jul 24, 2024

How much variability should we expect on the selection of these machines, and do we have a dedicated c5.4xlarge for the specific test?

I've seen a success after 24min: https://github.com/filecoin-project/lotus/actions/runs/10041916541/job/27751141605

Aggregating circuit proofs in the logs marks the end of the most expensive bit of the test, that happens somewhere around epoch 28257; and we have a blocktime of 2ms (but it ends up being much slower than that in practice, we just try and make it as quick as possible and let the block miner work hard to keep up and then we pause it now and again anyway).

It happens at 14:02:29, test finishes up @ 14:03:11. Total time is 1270.43s; which is decent. This is the kind of behaviour I'd expect from this test.

A fail at 3min: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27756581112

signal: killed

That looks like an OOM error. I hope we're not taking more than 32G to run this test, are we sure we're getting those resources?

That one failed after 1h: https://github.com/filecoin-project/lotus/actions/runs/10043543941/job/27757352123

Aggregating circuit proofs at around epoch 32517, at 15:56:01, we continue for a bit until there's nothing after 15:56:05 until at all, until a timeout at 16:32:16. Which is very weird, we should at least have some log spam in there about something even if it continues to spin.

Screenshot 2024-07-24 at 1 43 00 PM

This looks like the culprit, but I have no idea why it might get stuck there!

Mon, 22 Jul 2024 16:32:16 GMT
goroutine 107964 [select, 36 minutes]:
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/chain/stmgr.(*StateManager).WaitForMessage(0xc00084cdc0, {0x7c4fab8?, 0xc00ca90b90?}, {{0xc011844630?, 0x30?}}, 0x2, 0xffffffffffffffff, 0x1)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/chain/stmgr/searchwait.go:84 +0x9fd
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/node/impl/full.(*StateModule).StateWaitMsg(0xc009310040, {0x7c4fab8, 0xc00ca90b90}, {{0xc011844630?, 0xc001f36fd0?}}, 0xc001f37078?, 0x2?, 0xd0?)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/node/impl/full/state.go:607 +0x55
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).waitMessage(0xc00779a540, {{0xc011844630?, 0xc00ed954f0?}})
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:1405 +0x6a
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).SubmitMessage(0xc00779a540?, {0x7c2e300?, 0xc00ed954f0?}, 0x0?, 0x0?)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:1401 +0x48
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).submitProveCommit(0xc00779a540, 0xf, {0xc0004f2500, 0x7, 0x0?}, 0x0, 0xc00421a390)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:752 +0x14c7
Mon, 22 Jul 2024 16:32:16 GMT
github.com/filecoin-project/lotus/itests/kit.(*TestUnmanagedMiner).OnboardSectors(0xc00779a540, 0xf, 0x0, 0x7, {0xc001a67ed8, 0x3, 0x2?})
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/kit/node_unmanaged.go:283 +0x41d
Mon, 22 Jul 2024 16:32:16 GMT
command-line-arguments.TestManualNISectorOnboarding.func2(0xc00c875520)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/lotus/lotus/itests/niporep_manual_test.go:277 +0xc0c
Mon, 22 Jul 2024 16:32:16 GMT
testing.tRunner(0xc00c875520, 0xc00c103bf0)
Mon, 22 Jul 2024 16:32:16 GMT
	/home/runner/work/_tool/go/1.21.12/x64/src/testing/testing.go:1595 +0xff
Mon, 22 Jul 2024 16:32:16 GMT
created by testing.(*T).Run in goroutine 67

So yes, I think there might be test issues here, but the signal: killed makes me worried that we're not getting the resources we need for this test. I'll have to do some local testing to find out what the memory demands are.

Co-authored-by: Steve Loeppky <biglep@protocol.ai>
@laurentsenta
Copy link
Contributor Author

@rvagg The test gets a fresh EC2 instance; instances are not shared or reused between tests, no noisy neighbors there,

  • We can share the AMI we use on the runner if you want to try it out manually on your AWS account,
  • This is out of scope for this work, but maybe researching the use of GOMEMLIMIT might help.

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 this is good for now I think; the flaky test can be dealt with separately by me

@laurentsenta good idea to have me test in AWS, if you could share the AMI then I'll give that a go and see. My dev machine is too overpowered to simulate this.

@rvagg
Copy link
Member

rvagg commented Jul 25, 2024

#12307

@galargh
Copy link
Contributor

galargh commented Jul 25, 2024

@rvagg Could you give us an account ID that you'd like to use the AMI in? We'll make it available for that account.

@BigLep
Copy link
Member

BigLep commented Aug 12, 2024

@rvagg: did you see the accountId comment (maybe was handled offline)? Also, anything else holding us back from merging this?

@rvagg
Copy link
Member

rvagg commented Aug 13, 2024

@BigLep @galargh yes, in slack: https://filecoinproject.slack.com/archives/C06CY12V83S/p1721957023321019, I haven't checked yet whether I've got a new AMI to play with but I didn't get a response to that comment yet.

@BigLep
Copy link
Member

BigLep commented Sep 4, 2024

I assigned this to @rvagg because I believe the ball is in his court for merging this functionality (but this isn't a top priority currently).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/very-expensive-tests skip/changelog This change does not require CHANGELOG.md update
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DX Streamline] LOTUS_RUN_EXPENSIVE_TESTS periodically outside of PR workflow
4 participants