{lib}[foss/2021b] TensorFlow v2.8.4 w/ CUDA-11.4.1 and fix patches + extensions in easyconfig for TensorFlow 2.8.4 w/ foss/2021b #17058

Merged


Flamefire
Contributor

@Flamefire Flamefire commented Jan 6, 2023

This adds the new TF 2.8.4 EC with CUDA support and some changes for the non-CUDA version:

Once this is merged, we can finish TF 2.9 in #16620 and #16008, as they suffer from the same issues as the ones fixed here.

@Flamefire Flamefire force-pushed the 20230106094450_new_pr_TensorFlow284 branch from 3351849 to 5acbdd2 Compare January 6, 2023 15:12
@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml5 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/760de78429e2c0d951a3901327fdd7a0 for a full test report.

@jfgrimm
Member

jfgrimm commented Jan 11, 2023

Test report by @jfgrimm
FAILED
Build succeeded for 10 out of 11 (2 easyconfigs in total)
gpu02.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (skylake_avx512), 1 x NVIDIA Tesla V100-SXM2-32GB, 510.47.03, Python 3.6.8
See https://gist.github.com/753931b5a275c7cec40e69a5687a90d1 for a full test report.

@Flamefire
Contributor Author

Flamefire commented Jan 12, 2023

@jfgrimm The CUDA build requires the new easyblock from easybuilders/easybuild-easyblocks#2854

I've made that clearer in the description. Your test does verify that only the CUDA build requires the easyblock update (which doesn't hurt the CPU variant).

@jfgrimm
Member

jfgrimm commented Jan 12, 2023

yeah I read that, then forgot to add the --include-easyblocks-from-pr bit anyway before submitting 🤦
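
For reference, a minimal sketch of a test-report invocation that pulls in the easyblock PR. The PR numbers are the ones mentioned in this thread; `--from-pr`, `--include-easyblocks-from-pr` and `--upload-test-report` are existing `eb` options, but the exact command each tester ran is not shown here, so the combination below is an assumption. The command is only printed, not executed.

```shell
# PR numbers taken from this conversation
EC_PR=17058          # easyconfigs PR (this one)
EASYBLOCKS_PR=2854   # easybuilders/easybuild-easyblocks#2854

# Assumed invocation; printed rather than run so it can be reviewed first
cmd="eb --from-pr ${EC_PR} --include-easyblocks-from-pr ${EASYBLOCKS_PR} --upload-test-report"
echo "${cmd}"
```

Leaving out `--include-easyblocks-from-pr` corresponds to the earlier failed report, where the CUDA build did not succeed.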

@jfgrimm
Member

jfgrimm commented Jan 12, 2023

Test report by @jfgrimm
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2854
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
gpu01.pri.viking.alces.network - Linux CentOS Linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz (skylake_avx512), 2 x NVIDIA Tesla V100-SXM2-32GB, 510.47.03, Python 3.6.8
See https://gist.github.com/629db23d8a62faa3187616bd114d57c5 for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
b-cn1501.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 470.182.03, Python 3.8.10
See https://gist.github.com/akesandgren/7f5e9427ee0057f0eb74abe40d8887a6 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Jun 20, 2023
@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/ca4813f4055b017956b4fd91b6a8df7f for a full test report.

@Flamefire
Contributor Author

Flamefire commented Jun 29, 2023

After #17892 I'm not sure that replacing the fix-cuda-build patch with the upstream patch is a good idea, as it can cause regressions when using rpath wrappers or ccache. See #17892 (comment) for how the rpath issue can be fixed from easybuild-framework (though not the ccache use case).

Given the good state of this PR (the only failing report is due to lack of space), I'd suggest merging this anyway and then doing a bulk update of the TensorFlow ECs that (re-)adds the fix-cuda-build patch (potentially in addition to the upstream patch). A single test report with --stop=patch should suffice: that patch is already proven to work, so we just need to ensure that it applies.
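
A rough sketch of what such a patch-only check could look like. The follow-up PR number is not given in this comment, so `<PR>` is a placeholder. Combining `--stop=patch` with `--from-pr` and `--upload-test-report` is an assumption about how the report would be produced. The command is only printed, not executed.

```shell
# Placeholder: the number of the follow-up PR is not given in this comment
BULK_PR="<PR>"

# --stop=patch ends each build after the patch step,
# so this only checks that the patches apply cleanly
cmd="eb --from-pr ${BULK_PR} --stop=patch --upload-test-report"
echo "${cmd}"
```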

@branfosj
Member

Going in, thanks @Flamefire!

@branfosj branfosj merged commit 823409d into easybuilders:develop Jun 29, 2023
@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0203u29a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/branfosj/9e04416d6a2fc59dd271a5a6f3928797 for a full test report.

@Flamefire
Contributor Author

@branfosj Opened #18235 for the CUDA patch

@boegel boegel changed the title from "Add TensorFlow-2.8.4-foss-2021b-CUDA-11.4.1.eb and fix TensorFlow-2.8.4-foss-2021b.eb" to "{lib}[foss/2021b] TensorFlow v2.8.4 w/ CUDA-11.4.1 and fix patches + extensions in easyconfig for TensorFlow 2.8.4 w/ foss/2021b" Jul 7, 2023
5 participants