Add support for running TensorFlow CPU and GPU tests separately and enhance test failure reporting #2312

Merged

Conversation

Flamefire
Contributor

@Flamefire Flamefire commented Jan 19, 2021

(created using eb --new-pr)

Based on

@boegel boegel changed the title Add support for running CPU and GPU tests separately and enhance test failure reporting Add support for running TensorFlow CPU and GPU tests separately and enhance test failure reporting Jan 20, 2021
@boegel boegel added this to the next release (4.3.3?) milestone Jan 20, 2021
@boegel boegel added the change label Jan 20, 2021
@Flamefire Flamefire force-pushed the 20210119163542_new_pr_UgyaqaPRcl branch from ed75407 to 2b599dd Compare January 25, 2021 11:31
@boegel
Member

boegel commented Feb 2, 2021

@Flamefire #2293 merged, please rebase so we can get this in too before the next EasyBuild release, since it makes a couple of changes compared to #2263 (and #2292)

@Flamefire Flamefire force-pushed the 20210119163542_new_pr_UgyaqaPRcl branch from 2b599dd to 07cbbf9 Compare February 2, 2021 16:42
easybuild/easyblocks/t/tensorflow.py

# TF can only run tests explicitly marked as 'gpu' on GPUs and those don't work on CPU
# See https://github.com/tensorflow/tensorflow/issues/45664
if device == 'cpu':
Member

Shouldn't this be gpu?

If not, please clarify the comment, since this doesn't make sense at first glance...

Member

Oh, it's -1... 🤦

OK, please mention what -1 does in the comment; it's easy to misread it as 1, which strongly suggests "enable"...

Contributor Author

I changed the comment to say "Disable GPUs..." (or so), is that enough?
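
For readers following the thread, here is a minimal sketch of the idea under discussion, not the code from this PR. It assumes the setting in question is the `CUDA_VISIBLE_DEVICES` environment variable (the thread only mentions the value -1), and the helper name `env_for_test_device` is hypothetical:

```python
import os


def env_for_test_device(device, base_env=None):
    """Return a copy of the environment for running tests on `device` ('cpu' or 'gpu').

    Hypothetical helper for illustration; the variable name is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    if device == 'cpu':
        # Disable GPUs: '-1' is not a valid device index, so the CUDA runtime
        # exposes no devices at all. It does NOT mean "enable device 1".
        env['CUDA_VISIBLE_DEVICES'] = '-1'
    return env
```

Spelling out "disable" in the comment next to the value addresses the concern raised above that -1 is easily read as 1.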

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.3.1-foss-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.2.0-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.0.0-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusml28 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/5d5822c30ce87ca2c4482d33eccc2cbb for a full test report.

Member

@boegel boegel left a comment

@Flamefire Some minor tweaks done in Flamefire#5, please check/merge so we can (finally) get this show on the road :)

@boegel
Member

boegel commented Feb 12, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-1.15.2-foss-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-2.1.0-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 2 out of 2 (2 easyconfigs in total)
node2715.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/8ffd1ebd27024437e642d3ab176e5b22 for a full test report.

@Flamefire
Contributor Author

As discussed at easybuilders/easybuild-easyconfigs#11637 (comment), I added 2 more changes to a) allow running on non-GPU nodes and b) propagate CUDA cache variables, see also easybuilders/easybuild-framework#3569
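
A rough sketch of what propagating CUDA cache variables into the test run could look like, not the code from this PR. It assumes the variables meant are `CUDA_CACHE_DISABLE`, `CUDA_CACHE_PATH` and `CUDA_CACHE_MAXSIZE`, and that they are forwarded with Bazel's `--test_env=NAME` option, which passes a variable through from the client environment. The helper name `cuda_cache_test_opts` is hypothetical:

```python
import os

# CUDA JIT cache settings; which ones the easyblock actually forwards is an assumption
CUDA_CACHE_VARS = ('CUDA_CACHE_DISABLE', 'CUDA_CACHE_PATH', 'CUDA_CACHE_MAXSIZE')


def cuda_cache_test_opts(environ=None):
    """Build `bazel test` options that forward CUDA cache variables that are set.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    # '--test_env=NAME' (without '=value') makes Bazel inherit NAME from its own environment
    return ['--test_env=%s' % var for var in CUDA_CACHE_VARS if var in environ]
```

Only variables that are actually defined are forwarded, so the sandboxed tests see the same cache settings as the build environment without introducing new ones.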

@boegel
Member

boegel commented Feb 12, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS protobuf-3.10.0-GCCcore-8.3.0.eb
  • SUCCESS Zip-3.0-GCCcore-8.3.0.eb
  • SUCCESS git-2.23.0-GCCcore-8.3.0-nodocs.eb
  • SUCCESS Java-1.8.0_281.eb
  • SUCCESS Java-1.8.eb
  • SUCCESS Bazel-0.26.1-GCCcore-8.3.0.eb
  • SUCCESS pkgconfig-1.5.1-GCCcore-8.3.0-Python-3.7.4.eb
  • SUCCESS CUDA-10.1.243-GCC-8.3.0.eb
  • SUCCESS gcccuda-2019b.eb
  • SUCCESS cuDNN-7.6.4.38-gcccuda-2019b.eb
  • SUCCESS NCCL-2.4.8-gcccuda-2019b.eb
  • SUCCESS OpenMPI-3.1.4-gcccuda-2019b.eb
  • SUCCESS gompic-2019b.eb
  • SUCCESS HDF5-1.10.5-gompic-2019b.eb
  • SUCCESS FFTW-3.3.8-gompic-2019b.eb
  • SUCCESS ScaLAPACK-2.0.2-gompic-2019b.eb
  • SUCCESS fosscuda-2019b.eb
  • SUCCESS SciPy-bundle-2019.10-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS h5py-2.10.0-fosscuda-2019b-Python-3.7.4.eb
  • SUCCESS TensorFlow-1.15.5-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 20 out of 20 (1 easyconfigs in total)
node3406.kirlia.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 3.6.8
See https://gist.github.com/4c6aa56918521bafca9e736cf83b6989 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-2.4.1-fosscuda-2020b.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8021 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/6f20da0dd6c66f23b78ca8a47aae2dae for a full test report.

@boegel
Member

boegel commented Feb 12, 2021

@Flamefire This needs non-trivial merge conflict fixing after the merging of #2314...

@Flamefire Flamefire force-pushed the 20210119163542_new_pr_UgyaqaPRcl branch from 42aeda3 to 71582c9 Compare February 13, 2021 13:43
@Flamefire
Contributor Author

Rebased. Merge conflict was minor (self.with_cuda usage)

@Flamefire
Contributor Author

Also found 2 small issues when double-checking the code. I hope that's all, otherwise I'll check back on Monday morning

@boegel
Member

boegel commented Feb 13, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3525.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/034d4d42639626a17523682113ed3034 for a full test report.

edit: failed due to All mirrors are down: [GET returned 404 Not Found, connect timed out 🙄

@boegel
Member

boegel commented Feb 13, 2021

Test report by @boegel

Overview of tested easyconfigs (in order)

  • SUCCESS TensorFlow-1.15.5-fosscuda-2019b-Python-3.7.4.eb

Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3525.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/5619d14f59809e58cf5ea1414712447d for a full test report.

@boegel
Member

boegel commented Feb 13, 2021

This has been extensively tested, both with existing TensorFlow easyconfigs (which shouldn't be impacted at all by these changes), as well as with easybuilders/easybuild-easyconfigs#11637 where the TensorFlow tests are being run, which shows that this is working as intended.

So let's get this merged, thanks a lot for your efforts here @Flamefire!

@boegel boegel merged commit eec61d7 into easybuilders:develop Feb 13, 2021
@Flamefire Flamefire deleted the 20210119163542_new_pr_UgyaqaPRcl branch February 13, 2021 23:05