{devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5 (CPU-only) #14460

Member

@boegel boegel commented Nov 29, 2021

(created using eb --new-pr)

@boegel boegel added the update label Nov 29, 2021
@boegel
Member Author

boegel commented Nov 29, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegel boegel changed the title from "{devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5" to "{devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5 (CPU-only)" Nov 29, 2021
@boegel boegel added this to the 4.x milestone Nov 29, 2021
@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=14460 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_14460 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7392

Test results coming soon (I hope)...

- notification for comment with ID 981834022 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux rocky linux 8.4, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/6217fb2836131925005550af9a2dccdd for a full test report.

@boegel
Member Author

boegel commented Nov 29, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3505.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/c1f5704e3926e68a8a8d7e817f6aff43 for a full test report.

@boegel
Member Author

boegel commented Nov 29, 2021

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3112.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/4defbecaf0e907f8ae053fdcc5926354 for a full test report.

@boegel
Member Author

boegel commented Nov 29, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node2621.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/f816c43472899874b726b4bb079bb66e for a full test report.

@migueldiascosta
Member

so, a couple of tests fail on skylake? or was there another reason for that failed test report?

@boegel
Member Author

boegel commented Nov 30, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3136.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/e0d7721680046c6d8d83ddc15cc104de for a full test report.

@migueldiascosta
Member

I guess that answers my question, even though I'm sure there's no causal relation there :)

what happened in the failed test then?

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsfc01.int.jusuf.sebastian.cluster - Linux rocky linux 8.4, x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.6.8
See https://gist.github.com/671c3911fa8da49d1408b58e98384df0 for a full test report.

@boegel
Member Author

boegel commented Nov 30, 2021

> I guess that answers my question, even though I'm sure there's no causal relation there :)
>
> what happened in the failed test then?

I'm not sure. I didn't make any changes; I just resubmitted the test with the same resources requested.
It's not the first time that the PyTorch test suite doesn't seem 100% stable...

For future reference, these are the tests that failed (I no longer have the details on how they failed):

distributed/algorithms/test_join failed!
distributed/rpc/test_tensorpipe_agent failed!

@migueldiascosta Since the tests have passed fine on a variety of other systems, I think we can assume it was some fluke, and go ahead and merge?

@boegel boegel modified the milestones: 4.x, next release (4.5.1?) Nov 30, 2021
Member

@migueldiascosta migueldiascosta left a comment

lgtm

@migueldiascosta
Member

Going in, thanks @boegel!

@migueldiascosta migueldiascosta merged commit d890c98 into easybuilders:develop Dec 1, 2021
@boegel boegel deleted the 20211129180101_new_pr_PyTorch1100 branch December 1, 2021 20:00