{ai}[foss/2022b] PyTorch v1.13.1 #18421
Conversation
…-1.13.1_fix-gcc-12-compilation.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
Test report by @Flamefire
Test report by @Flamefire
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
Test report by @boegelbot
…asyconfigs into 20230731133908_new_pr_PyTorch1131
@boegelbot please test @ jsc-zen2
@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)...
@casparvl Again it is Can you check the log for
Test report by @casparvl
Different test for me...
Thanks for that, this really helps: when used manually, that example reproduces the failure reliably on my machine too. And it makes me think we have a real bug here: the input can be further reduced to all. This is either a bug in PyTorch or in the compiler, and since it affects more than this EC, I think we/I should investigate further. For reference, the reduced test case is attached: reproduce_pytorch_quantization_fail.py. Reported upstream: pytorch/pytorch#107030
Test report by @branfosj
`test_sigmoid_non_observed`
`TestGradientsCPU.test_forward_mode_AD_linalg_det_singular_cpu_complex128`
Test report by @boegel
@boegel The failure you see is fixed by the updated ECs in #18413; I have now mentioned this in the PR description. @branfosj The quantization test failure is pytorch/pytorch#107030, which happens randomly due to the random test input. As there is no fix that I can see, I'll allow a couple of tests to fail to compensate.
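As a rough illustration of the "allow a couple of tests to fail" approach, the tolerance logic might look like the sketch below. Note that `KNOWN_FLAKY`, `MAX_ALLOWED_FAILURES`, and `check_failures` are hypothetical names invented for this example, not the actual EasyBuild easyblock API:

```python
# Hypothetical sketch: tolerate a small number of known-flaky test failures
# during an installation's test step. Names and thresholds are illustrative.
KNOWN_FLAKY = {"test_quantization"}  # fails randomly due to random test input
MAX_ALLOWED_FAILURES = 2

def check_failures(failed_suites):
    """Raise unless all failures are known-flaky and within the allowance."""
    unexpected = [s for s in failed_suites if s not in KNOWN_FLAKY]
    if unexpected:
        raise RuntimeError(f"Unexpected test failures: {unexpected}")
    if len(failed_suites) > MAX_ALLOWED_FAILURES:
        raise RuntimeError(f"Too many test failures: {failed_suites}")

check_failures(["test_quantization"])  # tolerated: known-flaky, within limit
print("ok")
```

The key design point is that the allowance is scoped to named suites, so an unrelated failure still aborts the build.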
…w 2 failing test (test_quantization may randomly fail)
The relevant lines are (from a different build to the test report):
It is not matching for the other type:
@branfosj I was referring to the old(er) regexes that are meant to match the counts of failures and tests for each test suite; they seem to be faulty, as only 1 failure is counted while there are at least 2.
This part would be for the new regex (i.e. the one you mention below), unless it is already matched elsewhere:
Was this really what you meant here? That (new) regex doesn't match what you quoted, so I'm confused...
This should be matched by https://github.com/easybuilders/easybuild-easyblocks/blob/3f95af4acb2d8c86728027ec0688ca357e6e1808/easybuild/easyblocks/p/pytorch.py#L325-L327 but likely isn't, due to the newly added time format. So from a first check, the 2nd regexp needs to be updated, but somehow the results (1 failure counted) seem to be the other way round. Can you attach the log? I think it makes sense to turn parts of that into a new test case and fix them.
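To illustrate how an added time format can break a summary-line regex: the lines and the `[TIME]` suffix below are fabricated for this sketch (the real format comes from PyTorch's test runner), but the mechanism is the same: a pattern anchored at end-of-line silently stops matching once a duration suffix is appended.

```python
import re

# Hypothetical summary lines; the "[TIME]" suffix is illustrative only.
old_line = "test_ops_gradients failed!"
new_line = "test_ops_gradients failed! [TIME] 1234.56 seconds"

# A pattern anchored at end-of-line stops matching once a suffix appears:
strict = re.compile(r"^(?P<suite>\S+) failed!$", re.M)

# Making the trailing time segment optional keeps both formats matching:
relaxed = re.compile(r"^(?P<suite>\S+) failed!(?: \[TIME\] [\d.]+ seconds)?$", re.M)

print(bool(strict.search(old_line)))            # True
print(bool(strict.search(new_line)))            # False
print(relaxed.search(new_line).group("suite"))  # test_ops_gradients
```

This is also why turning such log excerpts into unit-test cases (as suggested above) pays off: the format drift is invisible until a count quietly goes wrong.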
Sorry, I'd got confused between that and the changes in easybuilders/easybuild-easyblocks#2983. The log is at https://gist.github.com/branfosj/39d6c72617b71589101fd6bb5870d8ad. I think the issue is related to the sharding: we are running the test in two parts:
So there is one failure in each shard. Also, looking at that log, we are stopping at the first failure in each shard.
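The sharding problem can be sketched as follows, assuming a unittest-style `FAILED (failures=N)` summary per shard (the log excerpt is fabricated for illustration): a counter that only looks at the last summary undercounts, while summing across all per-shard summaries does not.

```python
import re

# Hypothetical log excerpt: each shard prints its own summary, so a regex
# that keeps only the last match undercounts the failures.
log = """\
Running test_ops_gradients (shard 1/2)
FAILED (failures=1)
Running test_ops_gradients (shard 2/2)
FAILED (failures=1)
test_ops_gradients failed!
"""

counts = re.findall(r"FAILED \(failures=(\d+)\)", log)

# Taking only the last summary sees a single failure:
last = int(counts[-1])

# Summing over every per-shard summary counts both:
total = sum(int(n) for n in counts)

print(last, total)  # 1 2
```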
@branfosj Thanks. That log is from a different run than #18421 (comment) though, isn't it? I see 4 valid failures in the log but only 2 reported by the old code. You are right, though, that one issue is the sharding. With my latest change I get 3 of the 4 errors counted (and all 4 reported by the newly introduced regex collecting single failures), but I don't see how we can count the last one without counting something twice under the current approach, which also attributes the failures to test suites.
I.e. it prints the logs of all shards and only then the "test_ops_gradients failed!" message, so our regex matches only the 2nd/last summary. I see 4 solutions:
Edit: As for 4.:
@branfosj I patched out the sharding, so our error counting should work; with the allowed failures, I'd say this should now also pass for you.
Test report by @branfosj
@boegelbot please test @ generoso
@branfosj: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
@boegelbot please test @ jsc-zen2
@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)...
Test report by @boegelbot
Test report by @boegelbot
Going in, thanks @Flamefire!
(created using `eb --new-pr`)

This was a bit of a struggle, as PyTorch 1.13 is compatible with neither GCC 12 nor Python 3.11. However, I think I was able to backport enough of the relevant changes from PyTorch 2 for this to work. I'm also a bit wary of jumping straight to PyTorch 2, and users might want to have both PyTorch 1.x and 2.x anyway.
IMPORTANT: Failures related to Abseil can be fixed by reinstalling it with the updated ECs from #18413.