{bio,devel}[fosscuda/2019b] PyTorch v1.7.1, typing-extensions v3.7.4.3 w/ Python 3.7.4 #11636

Merged

Conversation

Flamefire
Contributor

(created using eb --new-pr)

@Flamefire Flamefire marked this pull request as draft November 9, 2020 08:16
@Micket Micket added the update label Nov 12, 2020
@Micket Micket added this to the 4.3.2 milestone Nov 12, 2020
@terjekv
Collaborator

terjekv commented Nov 28, 2020

On RHEL8:

/usr/bin/cmake3: /opt/uio/modules/rhel8/easybuild/software/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5: version `XZ_5.2' not found (required by /lib64/libarchive.so.13)

Testing if this is fixed by creating and adding libarchive/3.4.2-GCCcore-8.3.0.eb as a build dependency... Yup, that did it. However...

eb --update-pr 11636 --pr-commit-msg "Adds a libarchive easyconfig to fix missing symbols on RHEL8" \
/mn/sarpanitu/drift-u1/terjekv/projects/easybuild-develop/easybuild-easyconfigs/easybuild/easyconfigs/l/libarchive/libarchive-3.4.2-GCCcore-8.3.0.eb \
/tmp/eb-mb3fcdzq/files_pr11636/p/PyTorch/PyTorch-1.7.0-fosscuda-2019b-Python-3.7.4.eb

gave me...

== pushing branch '20201109091319_new_pr_PyTorch170' to remote 'github_Flamefire_KbNss' (git@github.com:Flamefire/easybuild-easyconfigs.git)
ERROR: Failed to push branch '20201109091319_new_pr_PyTorch170' to GitHub (git@github.com:Flamefire/easybuild-easyconfigs.git): Cmd('git') failed due to: exit code(128)
  cmdline: git push --porcelain github_Flamefire_KbNss 20201109091319_new_pr_PyTorch170
  stderr: 'fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.'

Which kinda makes sense?

@terjekv
Collaborator

terjekv commented Nov 28, 2020

Gah, the quantization tests fail again?

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-_o4gx9g0/tmpdlupb35n/lib/python3.7/site-packages/torch/testing/_internal/common_quantized.py", line 122, in test_fn
    qfunction(*args, **kwargs)
  File "/tmp/easybuild/build/PyTorch/1.7.0/fosscuda-2019b-Python-3.7.4/pytorch-1.7.0/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/tmp/easybuild/build/PyTorch/1.7.0/fosscuda-2019b-Python-3.7.4/pytorch-1.7.0/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/tmp/eb-_o4gx9g0/tmpdlupb35n/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-_o4gx9g0/tmpdlupb35n/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1136, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------
Ran 373 tests in 383.065s
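
To see how far outside the margin the reported element is, the numbers from the assertion message can be plugged into a closeness check by hand. This is a minimal sketch, assuming the comparison follows the usual `|a - b| <= atol + rtol * |b|` rule documented for `torch.allclose`. `within_tolerance` is a hypothetical helper, not the function the test suite calls, and which of the two values is the expected one is not stated in the message.

```python
def within_tolerance(a, b, rtol=1.3e-06, atol=1e-05):
    """Closeness check of the form |a - b| <= atol + rtol * |b|.

    The default tolerances are the ones printed in the assertion message.
    """
    return abs(a - b) <= atol + rtol * abs(b)


# The two values reported at index (3, 0, 6) in the failure above.
a = 4.41188467448228e-06
b = 0.9640479683876038

print("difference:", abs(a - b))
print("allowed:   ", 1e-05 + 1.3e-06 * abs(b))
print("close?     ", within_tolerance(a, b))
```

The difference is several orders of magnitude larger than the allowed margin in either argument order, so this is not a borderline rounding issue.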

@Flamefire
Contributor Author

@terjekv /usr/bin/cmake3 --> Why is it using the system cmake together with a library from an EasyBuild module? Mixing the two definitely won't work, so I think something in your setup is off.
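
The kind of mix-up described here can be checked for mechanically. This is a minimal sketch, assuming the EasyBuild software prefix is known. The paths are copied from the error message earlier in this thread, and `is_mixed_setup` is a hypothetical helper, not part of EasyBuild.

```python
import os


def is_mixed_setup(tool_path, library_path, eb_prefix):
    """Return True if exactly one of the tool and the library it loads
    lives under the EasyBuild software prefix (hypothetical helper)."""
    prefix = os.path.normpath(eb_prefix)

    def under_prefix(path):
        # commonpath() equals the prefix only if `path` is inside it.
        return os.path.commonpath([os.path.normpath(path), prefix]) == prefix

    return under_prefix(tool_path) != under_prefix(library_path)


# Paths taken from the error message in this thread.
EB_PREFIX = "/opt/uio/modules/rhel8/easybuild/software"
LIBLZMA = EB_PREFIX + "/XZ/5.2.4-GCCcore-8.3.0/lib/liblzma.so.5"

print(is_mixed_setup("/usr/bin/cmake3", LIBLZMA, EB_PREFIX))
```

A system binary paired with a system library, or a module binary paired with a module library, is not flagged; only the cross-over case is.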

…ing-extensions-3.7.4.3-fosscuda-2019b-Python-3.7.4.eb and patches: PyTorch-1.7.0_fix_altivec_defines.patch, PyTorch-1.7.0_fix_test_DistributedDataParallel.patch
@Flamefire Flamefire force-pushed the 20201109091319_new_pr_PyTorch170 branch from 939b2a1 to b7039c3 Compare November 30, 2020 08:34
@Flamefire Flamefire marked this pull request as ready for review December 1, 2020 08:15
@easybuilders easybuilders deleted a comment from boegelbot Dec 8, 2020
@easybuilders easybuilders deleted a comment from boegelbot Dec 8, 2020
@easybuilders easybuilders deleted a comment from boegelbot Dec 8, 2020
@boegel boegel modified the milestones: 4.3.2 (next release), 4.4.0 Dec 8, 2020
@Flamefire Flamefire changed the title {bio,devel}[fosscuda/2019b] PyTorch v1.7.0, typing-extensions v3.7.4.3 w/ Python 3.7.4 {bio,devel}[fosscuda/2019b] PyTorch v1.7.1, typing-extensions v3.7.4.3 w/ Python 3.7.4 Dec 10, 2020
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusa16 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz, Python 2.7.5
See https://gist.github.com/397ffae567aac41864260a3f16ff0cee for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml1 - Linux RHEL 7.6, POWER, 8335-GTX, Python 2.7.5
See https://gist.github.com/b6617b5c92624064535a61f41ddbf533 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0305u03a.bear.cluster - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 3.6.8
See https://gist.github.com/1acce27d4aba8e936176bc62fcf65138 for a full test report.

@branfosj
Member

branfosj commented Dec 11, 2020

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2259
FAILED
Build succeeded for 0 out of 1 (2 easyconfigs in total)
bear-pg0212u15b.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/9bc5042e632f3efe01a1a08217065761 for a full test report.

Error was

RuntimeError: test_nn failed! Received signal: SIGKILL

I think using /dev/shm for the build path is causing issues.
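
If that suspicion is right, one thing worth checking is how much room is left on the build path, since /dev/shm is RAM-backed and files there count against memory. This is a minimal sketch using only the standard library. The 20 GiB threshold is an arbitrary assumption, not a figure from this thread, and `has_enough_space` is a hypothetical helper.

```python
import shutil


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Hypothetical threshold; adjust to the size of the build.
REQUIRED = 20 * 1024**3

for build_path in ("/dev/shm", "/tmp"):
    try:
        print(build_path, has_enough_space(build_path, REQUIRED))
    except FileNotFoundError:
        print(build_path, "does not exist on this system")
```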

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
bear-pg0212u17a.bear.cluster - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/c4f2f9b5fd41a9acc156b9a796ef00c5 for a full test report.

@lexming
Contributor

lexming commented Dec 22, 2020

@Flamefire I think that PyTorch v1.7 fits better in fosscuda/2020a given that one of its main features is support for CUDA 11. All dependencies seem to be already available. Would it be too much work to move this to 2020a or do you still prefer to have it in 2019b?

@Flamefire
Contributor Author

CUDA 11 is an issue on some sites: for example, we need a driver update first, and IIRC CUDA 11 dropped support for some GPUs, which might be relevant for others. So I'm inclined to keep this one in 2019b and add another one for 2020b (I guess, as that is the latest). What would you say?

@boegel
Member

boegel commented Dec 29, 2020

Test report by @boegel
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/a0d8ae58a7fd7770041859a3d505858e for a full test report.

@boegel
Member

boegel commented Dec 29, 2020

Hmm, still a failing test?

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/eb-7cdl9ze7/tmphh33pz6r/lib/python3.7/site-packages/torch/testing/_internal/common_quantized.py", line 122, in test_fn
    qfunction(*args, **kwargs)
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/tmp/vsc40023/easybuild_build/PyTorch/1.7.1/fosscuda-2019b-Python-3.7.4/pytorch-1.7.1/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/tmp/eb-7cdl9ze7/tmphh33pz6r/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1162, in assertEqual
    exact_dtype=exact_dtype, exact_device=exact_device)
  File "/tmp/eb-7cdl9ze7/tmphh33pz6r/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py", line 1136, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Tensors failed to compare as equal! With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------
Ran 373 tests in 348.725s

FAILED (failures=1, skipped=12)
Traceback (most recent call last):
  File "run_test.py", line 745, in <module>
    main()
  File "run_test.py", line 728, in main
    raise RuntimeError(err_message)
RuntimeError: test_quantization failed!

@Flamefire
Contributor Author

This again? See pytorch/pytorch#43209
I disabled it again, as in 1.6, but I don't feel comfortable with that, as there might be a real issue on your system (neither I nor the PyTorch guys could reproduce this).
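
For readers unfamiliar with how a single test can be switched off, the sketch below uses the standard `unittest.skip` decorator on a stand-in test class. It illustrates the general mechanism only and does not show how this PR actually disables the test. `TestSerializationExample` and its methods are hypothetical.

```python
import unittest


class TestSerializationExample(unittest.TestCase):
    """Stand-in for a test class; not the real PyTorch test."""

    @unittest.skip("not reproducible upstream, see pytorch/pytorch#43209")
    def test_lstm(self):
        self.fail("never runs while the skip decorator is in place")

    def test_other(self):
        self.assertEqual(1 + 1, 2)


# Run the class programmatically and inspect the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSerializationExample)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

A skipped test is reported separately from failures, which is why the concern above still stands: skipping hides the symptom without explaining it.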

@boegel
Member

boegel commented Jan 8, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11636 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11636 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12395

Test results coming soon (I hope)...

- notification for comment with ID 756771900 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Jan 8, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3300.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/c6bbd7cfca5dd1c5a6efe21f96b10a6a for a full test report.

@boegel
Member

boegel commented Jan 8, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3525.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/451272463cc1cd99a531b2b22eb9b66c for a full test report.

@boegel
Member

boegel commented Jan 9, 2021

Test report by @boegel
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
easybuild2.novalocal - Linux centos linux 8.3.2011, POWER, IBM pSeries (emulated by qemu) (power9le), Python 3.6.8
See https://gist.github.com/463f87ed70a66057e6c591214e500d79 for a full test report.

@boegel
Member

boegel commented Jan 9, 2021

@Flamefire Installation failed on POWER9, needs the same patch as in #11936?

/usr/include/bits/floatn.h(79): error: identifier "__ieee128" is undefined
/usr/include/bits/floatn.h(82): error: invalid argument to attribute "__mode__"

@boegel
Member

boegel commented Jan 10, 2021

@Flamefire Installation failed on POWER9, needs the same patch as in #11936?

/usr/include/bits/floatn.h(79): error: identifier "__ieee128" is undefined
/usr/include/bits/floatn.h(82): error: invalid argument to attribute "__mode__"

fixed in Flamefire#1

add patch for PyTorch 1.7.1 to fix undefined __ieee128 identifier on CentOS 8
@Flamefire
Contributor Author

Uff, that again... Thanks!

@boegel
Member

boegel commented Jan 11, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11636 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11636 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12911

Test results coming soon (I hope)...

- notification for comment with ID 757684698 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Jan 11, 2021

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 3 (2 easyconfigs in total)
generoso-x-3 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/b4d1477562b0611788614c8715ffe608 for a full test report.

edit: ignore, failed due to lack of disk space...

@boegel
Member

boegel commented Jan 11, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11636 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11636 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13048

Test results coming soon (I hope)...

- notification for comment with ID 757998242 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).
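
The bot replies in this thread differ only in `--ntasks`, which suggests that a `CORE_CNT=16` line in the request overrides a default of 4. This is a minimal sketch of that mapping, assuming a simple `KEY=VALUE` parser. `parse_settings` and `build_test_command` are hypothetical helpers, not boegelbot's actual code, and the quoting of the overridden value is not reproduced.

```python
def parse_settings(comment):
    """Collect KEY=VALUE lines from a request comment (assumed format)."""
    settings = {}
    for line in comment.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.isidentifier():
            settings[key] = value
    return settings


def build_test_command(pr, comment, default_cores=4):
    """Assemble an sbatch command line like the ones echoed by the bot."""
    cores = parse_settings(comment).get("CORE_CNT", default_cores)
    return (
        f"EB_PR={pr} EB_ARGS= /apps/slurm/default/bin/sbatch "
        f"--job-name test_PR_{pr} --ntasks={cores} "
        "~/boegelbot/eb_from_pr_upload_generoso.sh"
    )


print(build_test_command(11636, "@boegelbot please test @ generoso"))
print(build_test_command(11636, "@boegelbot please test @ generoso\nCORE_CNT=16"))
```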

@boegel
Member

boegel commented Jan 11, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
easybuild2.novalocal - Linux centos linux 8.3.2011, POWER, IBM pSeries (emulated by qemu) (power9le), Python 3.6.8
See https://gist.github.com/c0616c60dacba79a088d67c1a6d4e048 for a full test report.

@boegel
Member

boegel commented Jan 11, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3115.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/4f509d703da103b0ed230745b19901b4 for a full test report.

@boegel
Member

boegel commented Jan 11, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3308.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), Python 3.6.8
See https://gist.github.com/7636cd59349427097ccf0a53cb18f745 for a full test report.

@boegel boegel added the new label Jan 11, 2021
Member

@boegel boegel left a comment


lgtm

@boegel
Member

boegel commented Jan 11, 2021

Going in, thanks @Flamefire!

@boegel boegel merged commit 633e363 into easybuilders:develop Jan 11, 2021
@Flamefire Flamefire deleted the 20201109091319_new_pr_PyTorch170 branch January 12, 2021 07:43
7 participants