add patches to fix PyTorch-1.12.1 w/ foss/2022a + CUDA v11.7.0 on POWER #18494

Merged


Flamefire
Contributor

@Flamefire Flamefire commented Aug 8, 2023

(created using eb --new-pr)

Same as #18490 but for the CUDA version (only a single EC is changed because of the time it takes to test the CUDA version)

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/a73d4119146b3c1e6efabdbd1b48f77c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8004 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/71fb08e9a3211a2d7cff8208e5d721a2 for a full test report.

@Flamefire Flamefire force-pushed the 20230808143638_new_pr_PyTorch1121 branch from 63a0389 to 69c9d48 on August 10, 2023 08:43
@Flamefire
Contributor Author

Accidentally added a wrong EC, hence the force-push to remove that commit.

@boegel boegel added the bug fix label Aug 15, 2023
@boegel boegel added this to the next release (4.8.1?) milestone Aug 15, 2023
@boegel boegel changed the title from "Fix PyTorch-1.12.1-foss-2022a (CUDA) on POWER" to "add patches to fix PyTorch-1.12.1 w/ foss/2022a + CUDA v11.7.0 on POWER" Aug 15, 2023
@boegel
Member

boegel commented Aug 16, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3309.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 530.30.02, Python 3.6.8
See https://gist.github.com/boegel/55a22a5b7669c6a9a06d1fa8a2ae6b83 for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1601.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7413 24-Core Processor, 2 x NVIDIA NVIDIA A100 80GB PCIe, 525.125.06, Python 3.10.6
See https://gist.github.com/akesandgren/52d21905884f3017d2580a667196d739 for a full test report.

@akesandgren
Contributor

@boegel any idea on why it failed for you?

@Flamefire
Contributor Author

The updated easyblock would likely help in showing the error.

@branfosj
Member

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0208u23a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/3ee7f76c9445090f03195119d97c32de for a full test report.

@akesandgren
Contributor

akesandgren commented Aug 25, 2023

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
b-cn1501.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 470.199.02, Python 3.8.10
See https://gist.github.com/akesandgren/b1195a85a8a38ca7ace5ecc26adfd1b4 for a full test report.

Ignore this; the error was caused by manual actions during the build...

@Flamefire
Copy link
Contributor Author

@akesandgren Looks like an issue with your (non-)existent modules?

== 2023-08-25 11:46:36,693 modules.py:647 INFO Module magma/2.6.2-CUDA-11.7.0 not found via module avail/show, checking whether it is a wrapper
== 2023-08-25 11:46:36,697 modules.py:654 INFO Result for existence check of magma/2.6.2-CUDA-11.7.0 module: False
== 2023-08-25 11:46:37,083 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:126 in __init__): Can't get value from a non-existing module magma/2.6.2-CUDA-11.7.0 (at easybuild/tools/modules.py:743 in get_value_from_modulefile)
== 2023-08-25 11:46:37,084 easyblock.py:3519 WARNING Sanity check: loading fake module failed: "Can't get value from a non-existing module magma/2.6.2-CUDA-11.7.0"

@boegel Another instance of "EasyBuild crashed with an error" that is later "downgraded" to a warning, because EasyBuildError logs itself on construction.
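
To illustrate the pattern, here is a minimal sketch of how an exception that logs itself when constructed can produce an ERROR line even though the caller catches it and only reports a warning. This is not EasyBuild's actual code: `SelfLoggingError`, `get_value_from_modulefile`, `sanity_check`, `ListHandler` and `demo` are hypothetical stand-ins, and the messages only loosely mirror the log excerpt above.

```python
import logging


class SelfLoggingError(Exception):
    """Hypothetical stand-in: an exception that logs itself when constructed."""

    def __init__(self, msg, logger):
        super().__init__(msg)
        # The ERROR line is emitted here, before anyone decides how to handle it.
        logger.error("crashed with an error: %s", msg)


def get_value_from_modulefile(mod_name, available, logger):
    """Return the value stored for a module, or raise if the module is missing."""
    if mod_name not in available:
        raise SelfLoggingError(
            "Can't get value from a non-existing module %s" % mod_name, logger
        )
    return available[mod_name]


def sanity_check(mod_name, available, logger):
    """Catch the error and report it as a warning only (the 'downgrade')."""
    try:
        get_value_from_modulefile(mod_name, available, logger)
        return True
    except SelfLoggingError as err:
        logger.warning("Sanity check: loading fake module failed: %s", err)
        return False


class ListHandler(logging.Handler):
    """Collect (level, message) pairs so the order of log lines can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append((record.levelname, record.getMessage()))


def demo():
    logger = logging.getLogger("selflogging_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        ok = sanity_check("magma/2.6.2-CUDA-11.7.0", {}, logger)
    finally:
        logger.removeHandler(handler)
    return ok, handler.records


if __name__ == "__main__":
    ok, records = demo()
    for level, message in records:
        print(level, message)
```

In this sketch the log contains an ERROR entry followed by a WARNING entry for the same problem, which is the confusing sequence seen above. Logging at the point where the exception is finally handled, rather than in the constructor, would avoid the misleading ERROR line.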

@akesandgren
Contributor

The problem there was me changing things under the feet of the ongoing build...

@akesandgren
Contributor

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
SUCCESS
Build succeeded for 3 out of 3 (1 easyconfigs in total)
b-cn1501.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 470.199.02, Python 3.8.10
See https://gist.github.com/akesandgren/c92d5a86298af5052f89a6349a0a5db7 for a full test report.

@boegel
Member

boegel commented Aug 25, 2023

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3308.joltik.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 530.30.02, Python 3.6.8
See https://gist.github.com/boegel/7d17e0133f68237659a5dfdf9e3826da for a full test report.

Member

@boegel boegel left a comment


lgtm

@boegel
Member

boegel commented Aug 29, 2023

Going in, thanks @Flamefire!

@boegel boegel merged commit 495f5bc into easybuilders:develop Aug 29, 2023
@Flamefire Flamefire deleted the 20230808143638_new_pr_PyTorch1121 branch August 29, 2023 13:53