{ai}[foss/2022a] PyTorch v1.13.1 w/ CUDA 11.7.0 #18424

Merged

Flamefire
Contributor

@Flamefire Flamefire commented Jul 31, 2023

(created using eb --new-pr)

…hes: PyTorch-1.7.0_disable-dev-shm-test.patch, PyTorch-1.10.0_fix-kineto-crash.patch, PyTorch-1.11.1_skip-test_init_from_local_shards.patch, PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.12.1_fix-skip-decorators.patch, PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch, PyTorch-1.12.1_fix-test_wishart_log_prob.patch, PyTorch-1.12.1_fix-TestTorch.test_to.patch, PyTorch-1.12.1_fix-use-after-free-in-tensorpipe-agent.patch, PyTorch-1.12.1_fix-vsx-loadu.patch, PyTorch-1.12.1_fix-vsx-vector-funcs.patch, PyTorch-1.12.1_skip-test_round_robin.patch, PyTorch-1.13.1_fix-fsdp-fp16-test.patch, PyTorch-1.13.1_fix-kineto-crash-on-exit.patch, PyTorch-1.13.1_fix-pytest-args.patch, PyTorch-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_increase-tolerance-test_optim.patch, PyTorch-1.13.1_install-vsx-vec-headers.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-failing-grad-test.patch, PyTorch-1.13.1_skip-test-requiring-online-access.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
@casparvl
Contributor

Since #18421 is a draft, should I assume this is also a draft? Or do you already want me to look at it?

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml30 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/841d9706a7a221bc922074d6ebd58748 for a full test report.

@Flamefire
Contributor Author

Flamefire commented Aug 1, 2023

> Since #18421 is a draft, should I assume this is also a draft? Or do you already want me to look at it?

Yes, this one wasn't supposed to fail, so I have to investigate. It might be a patch I forgot to include in the PR.
#18421 is for a newer toolchain with many new issues, so it is independent of this one. Hence the draft state of that one.

@boegel boegel added the update label Aug 1, 2023
@boegel boegel modified the milestones: 4.8.0, 4.x Aug 1, 2023
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8033 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/3543f63568472699d6198bc43c3a4267 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/3f624760e0518ebc178a741603705c5e for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8010 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/d0c31e3b6e837c29242c3866a48168e9 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml27 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/a0d0a52ac646c224a169d55051be6649 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8017 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/6178e7e68d0bc16547d1d010b0f3fda0 for a full test report.

@casparvl
Contributor

casparvl commented Aug 9, 2023

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=18424 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_18424 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11417

Test results coming soon (I hope)...

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/5bd53a84735fb642d3b385836515bccf for a full test report.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn3.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/2476b7d33ea4b7573c0401981940718e for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8013 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/1dd8a4de1d2e1414ad2ee5f2145445d8 for a full test report.

@branfosj
Member

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0203u30a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/b63fa5220eb4ecf6d386264eeb42dfc3 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Aug 12, 2023
@easybuilders easybuilders deleted a comment from boegelbot Aug 12, 2023
@easybuilders easybuilders deleted a comment from boegelbot Aug 12, 2023
@easybuilders easybuilders deleted a comment from boegelbot Aug 12, 2023
@boegel
Member

boegel commented Sep 12, 2023

@Flamefire Any updates on this?

@boegel boegel modified the milestones: 4.x, next release (4.8.2?) Sep 12, 2023
@Flamefire
Contributor Author

Flamefire commented Sep 13, 2023

Test report by @Flamefire
1x FAILED 3x SUCCESS
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8003 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/a9bc2fe74afa37bfe58278261f414143 for a full test report.

@Flamefire
Contributor Author

@boegel I forgot about this and my (draft) answer is gone. The last run shows 1 failure, but the easyblock failed to count it correctly; see easybuilders/easybuild-easyblocks#3003

I also added a patch here and now allow some test failures so the build doesn't fail randomly. New reports are coming up (3x on the same nodes to check for flukes).
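
For readers unfamiliar with the idea of allowing some failures, here is a minimal Python sketch of that kind of check. It is not the easyblock's actual code: the function names, the `max_allowed` parameter, and the assumed `<suite> failed!` output format are hypothetical and chosen only for illustration.

```python
import re

# Assumption: each failing test suite is reported on its own line as
# "<suite name> failed!". The real test output may look different.
FAILED_TEST_RE = re.compile(r"^(?P<name>\S+) failed!", re.MULTILINE)


def failed_tests(output):
    """Return the sorted, de-duplicated names of suites reported as failed."""
    return sorted({m.group("name") for m in FAILED_TEST_RE.finditer(output)})


def check_failures(output, max_allowed):
    """Return (ok, names); ok is True when at most max_allowed suites failed."""
    names = failed_tests(output)
    return len(names) <= max_allowed, names


# Made-up sample output, not taken from any of the test reports in this PR.
sample_output = """\
test_nn passed
distributed/test_c10d_nccl failed!
test_optim failed!
test_optim failed!
"""

ok, names = check_failures(sample_output, max_allowed=2)
print(ok, names)
```

De-duplicating the suite names means a suite that is reported twice is only counted once. Miscounting in either direction would make a fixed failure threshold unreliable.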

@branfosj
Member

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0208u09a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/13c37a69b64060a9ab08682e864194fe for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8026 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/197796bb9c15f50d2abcc5a891da668c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8027 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/e81d3e6e1c0de48d1673413424b4817d for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
skl-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 535.104.05, Python 3.6.8
See https://gist.github.com/SebastianAchilles/de827bdbe0500d6c5ec7747f465efe1d for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 3 out of 3 (1 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.125.06, Python 3.10.6
See https://gist.github.com/akesandgren/206baa3368cd915505671a5f23f84f38 for a full test report.

@akesandgren
Contributor

@Flamefire Is there anything more you think needs to be done here? It looks good as far as testing goes.

@Flamefire
Contributor Author

@akesandgren No, it looks good with everything fixed so far. I'm even already using some of the patches for the 2.x version, so yes, please get this merged :)

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren akesandgren merged commit be30e2a into easybuilders:develop Sep 25, 2023
5 checks passed
@Flamefire Flamefire deleted the 20230731160602_new_pr_PyTorch1131 branch September 25, 2023 15:24