{toolchain} intel/2020b #11337

Merged
merged 9 commits into from
Nov 9, 2020

Conversation

boegel
Member

@boegel boegel commented Sep 23, 2020

(created using eb --new-pr)

This is a candidate for intel/2020b...

@boegel boegel added the update label Sep 23, 2020
@boegel boegel added this to the next release (4.3.1) milestone Sep 23, 2020
@boegel
Member Author

boegel commented Sep 23, 2020

@boegelbot please test @ generoso

@easybuilders easybuilders deleted a comment from boegelbot Sep 23, 2020
@easybuilders easybuilders deleted a comment from boegelbot Sep 23, 2020
@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11337 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11337 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7877

Test results coming soon (I hope)...

- notification for comment with ID 697345114 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Sep 23, 2020

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 6 (6 easyconfigs in this PR)
generoso-x-5 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/95e4fe185f22b9baf64958f52eaf4ae9 for a full test report.

edit (by @boegel): the error setting up the bootstrap proxies was caused by srun not being available via $PATH, see also #11425 (comment)

@boegel
Member Author

boegel commented Sep 23, 2020

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in this PR)
node2633.swalot.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 2.7.5
See https://gist.github.com/d2bf8fcb45f6233ad636b1c6d017f099 for a full test report.

@lexming
Contributor

lexming commented Sep 24, 2020

Test report by @lexming
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in this PR)
node128.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/2b0051c80f8b9bfd47ce6e4f2ca315e4 for a full test report.

@lexming
Contributor

lexming commented Sep 24, 2020

Test report by @lexming
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in this PR)
node376.hydra.os - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, Python 2.7.5
See https://gist.github.com/361d250b9c5df62ceb1b04cbc0f2f5f6 for a full test report.

@boegel boegel added the 2020b issues & PRs related to 2020b label Sep 24, 2020
…und impi bug, no longer relevant for impi 2019 update 8

Co-authored-by: Alex Domingo <alex.domingo.toro@vub.be>
@boegel
Member Author

boegel commented Oct 5, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11337 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11337 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8031

Test results coming soon (I hope)...

- notification for comment with ID 703803892 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Oct 5, 2020

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 6 (6 easyconfigs in this PR)
generoso-x-3 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/ad9264af3c8e83ca7e353947fb5b5e09 for a full test report.

edit (by @boegel):

[1601922345.426920] [generoso-x-3:3774850:0]         select.c:444  UCX  ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(136)........: 
MPID_Init(1138)..............: 
MPIDI_OFI_mpi_init_hook(1541): OFI get address vector map failed

@jhein32
Collaborator

jhein32 commented Oct 9, 2020

Hmm, doing a dry-run, it wants two versions of bison

-bash-4.2$ eb HPL-2.3-intel-2020.09.eb --robot --dry-run --use-existing-modules --from-pr=11337
== temporary log file in case of crash /tmp/eb-7Oh6BJ/easybuild-QdalMh.log
== found valid index for /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs, so using it...
Dry run: printing build status of easyconfigs and dependencies
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/m/M4/M4-1.4.18.eb (module: Core | M4/1.4.18)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/Bison/Bison-3.7.1.eb (module: Core | Bison/3.7.1)
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/Bison/Bison-3.3.2.eb (module: Core | Bison/3.3.2)
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/z/zlib/zlib-1.2.11.eb (module: Core | zlib/1.2.11)
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/h/help2man/help2man-1.47.4.eb (module: Core | help2man/1.47.4)
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/f/flex/flex-2.6.4.eb (module: Core | flex/2.6.4)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/binutils/binutils-2.35.eb (module: Core | binutils/2.35)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/g/GCCcore/GCCcore-10.2.0.eb (module: Core | GCCcore/10.2.0)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/z/zlib/zlib-1.2.11-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | zlib/1.2.11)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/h/help2man/help2man-1.47.16-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | help2man/1.47.16)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/m/M4/M4-1.4.18-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | M4/1.4.18)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/Bison/Bison-3.7.1-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | Bison/3.7.1)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/f/flex/flex-2.6.4-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | flex/2.6.4)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/binutils/binutils-2.35-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | binutils/2.35)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/i/iccifort/iccifort-2020.3.275.eb (module: Core | iccifort/2020.3.275)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/p/pkg-config/pkg-config-0.29.2-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | pkg-config/0.29.2)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/l/libtool/libtool-2.4.6-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | libtool/2.4.6)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/e/expat/expat-2.2.9-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | expat/2.2.9)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/n/ncurses/ncurses-6.2-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | ncurses/6.2)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/l/libreadline/libreadline-8.0-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | libreadline/8.0)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/p/Perl/Perl-5.32.0-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | Perl/5.32.0)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/a/Autoconf/Autoconf-2.69-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | Autoconf/2.69)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/a/Automake/Automake-1.16.2-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | Automake/1.16.2)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/a/Autotools/Autotools-20200321-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | Autotools/20200321)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/n/numactl/numactl-2.0.13-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | numactl/2.0.13)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/u/UCX/UCX-1.9.0-GCCcore-10.2.0.eb (module: Compiler/GCCcore/10.2.0 | UCX/1.9.0)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/i/impi/impi-2019.8.254-iccifort-2020.3.275.eb (module: Compiler/intel/2020.3.275 | impi/2019.8.254)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/i/iimpi/iimpi-2020.09.eb (module: Core | iimpi/2020.09)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/i/imkl/imkl-2020.3.279-iimpi-2020.09.eb (module: MPI/intel/2020.3.275/impi/2019.8.254 | imkl/2020.3.279)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/i/intel/intel-2020.09.eb (module: Core | intel/2020.09)
 * [ ] /tmp/eb-7Oh6BJ/files_pr11337/h/HPL/HPL-2.3-intel-2020.09.eb (module: MPI/intel/2020.3.275/impi/2019.8.254 | HPL/2.3)

@boegel
Member Author

boegel commented Oct 9, 2020

Hmm, doing a dry-run, it wants two versions of bison

@jhein32 That's because of the bootstrapping mechanism that is done for binutils+GCCcore. We can consider bumping the Bison version that is used as an indirect build dep for the initial binutils, but that'll only fix the issue for recent toolchains.
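
To see at a glance which packages a dry run pulls in more than once, the output can be grouped by software name. The following is only a minimal sketch: the helper names (`versions_by_name`, `multi_version_software`) are hypothetical, and the regular expression assumes the `(module: <prefix> | <name>/<version>)` layout shown in the dry-run output above.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the dry-run output in this thread:
#  * [x] /path/to/Bison-3.3.2.eb (module: Core | Bison/3.3.2)
DRY_RUN_LINE = re.compile(
    r"^\s*\*\s*\[[ x]\]\s+\S+\s+"                 # status box and easyconfig path
    r"\(module:\s*[^|]+?\s*\|\s*"                 # module prefix, e.g. 'Core'
    r"(?P<name>[^/\s]+)/(?P<version>[^)\s]+)\)"   # software name and version
)


def versions_by_name(dry_run_output):
    """Map each software name to the set of versions listed in the dry run."""
    versions = defaultdict(set)
    for line in dry_run_output.splitlines():
        match = DRY_RUN_LINE.match(line)
        if match:
            versions[match.group("name")].add(match.group("version"))
    return versions


def multi_version_software(dry_run_output):
    """Keep only the software names that appear with more than one version."""
    return {
        name: sorted(found)
        for name, found in versions_by_name(dry_run_output).items()
        if len(found) > 1
    }


# Three lines copied from the dry-run output above
sample = """\
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/m/M4/M4-1.4.18.eb (module: Core | M4/1.4.18)
 * [ ] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/Bison/Bison-3.7.1.eb (module: Core | Bison/3.7.1)
 * [x] /sw/easybuild/software/EasyBuild/4.3.0/easybuild/easyconfigs/b/Bison/Bison-3.3.2.eb (module: Core | Bison/3.3.2)
"""

duplicates = multi_version_software(sample)
print(duplicates)
```

Note that software built both at Core level and on top of GCCcore with the same version (for example M4 1.4.18) is not flagged, since only differing versions are counted.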

@jhein32
Collaborator

jhein32 commented Oct 12, 2020

I reported in #10899 that this setup allows multi-node runs without any UCX issues. HPL performance is very good.

@zao
Contributor

zao commented Oct 16, 2020

Test report by @zao
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
freja - Linux Ubuntu 20.04, x86_64, Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz (skylake), Python 3.8.5
See https://gist.github.com/ed1de9ab3adf96a59f6ee68ba47690f6 for a full test report.

@jhein32
Collaborator

jhein32 commented Oct 26, 2020

One of our users reported a floating point exception in MKL with v2020 u1. The issue is still present in v2020 u3. She was advised by Intel that this is fixed in v2020 u4. We did a test install of that version, and v2020 u4 resolved her issues.

I am preparing a PR to move MKL to 2020 u4.

jhein32 and others added 2 commits October 26, 2020 16:25
MKL component released in Oct 2020
move imkl to v2020.4.304 + bump version to intel/2020.10
@boegel boegel changed the title from "{toolchain} intel/2020.09" to "{toolchain} intel/2020.10 (candidate for intel/2020b) [WIP]" Oct 26, 2020
@boegel
Member Author

boegel commented Oct 27, 2020

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
node3163.skitty.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/e9a89aec64220f4c4fe4d241e2df1b1e for a full test report.

@boegel
Member Author

boegel commented Oct 27, 2020

The problem that is causing the failing test on generoso can be fixed by defining $UCX_TLS, as follows:

export UCX_TLS=rc,ud,sm,self

See also the discussion in #10899 .

@bartoldeman Any idea if it's safe to always set this, or should I implement a hook on generoso to only inject that variable for specific UCX versions?

@akesandgren
Contributor

That's something that will be highly site specific, so please don't always set it.

@bartoldeman
Contributor

It's safe but not good for performance

@jhein32
Collaborator

jhein32 commented Oct 27, 2020

When does it fail? We had issues with multi-node running of MPI codes. Everything built fine.

I think UCX_TLS should be set in the Intel MPI module, since e.g. OpenMPI seems to handle this fine when UCX_TLS is unset. So setting this in the UCX modules for all MPI libraries does not seem like the right thing to me.

@akesandgren
Contributor

It should not be set at all upstream. As I said before, this is site configuration and must remain so; if it is set at all, it should be handled in the site hooks.

@boegel
Member Author

boegel commented Oct 28, 2020

@akesandgren Don't worry, I wasn't going to set it for everyone, just implement a hook on generoso to inject it into the impi module there...

generoso is a VM cluster that is only used for testing EasyBuild, MPI performance doesn't matter much there.

@jhein32 The problem occurs during the sanity check for impi, which is an mpirun of a trivial MPI hello world C program (the test/test.c that is included in the impi installation).
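
A site-local hook along the lines described in this comment could be sketched as follows. This is an illustration under assumptions, not the hook actually deployed on generoso: it relies on EasyBuild's `parse_hook` entry point and the `modextravars` easyconfig parameter, the constant name `UCX_TLS_WORKAROUND` is made up here, and the transport list is simply the value quoted earlier in this thread.

```python
# Hypothetical site hooks file, to be passed to EasyBuild via --hooks=<path>.
# Assumption: the transport list below is the workaround value quoted in this
# thread; it is site-specific and reported to be bad for performance.
UCX_TLS_WORKAROUND = 'rc,ud,sm,self'


def parse_hook(ec, *args, **kwargs):
    """Inject $UCX_TLS into the generated impi module only (site-specific)."""
    if ec.name == 'impi':
        # modextravars lists extra environment variables set by the module file
        extra_vars = dict(ec['modextravars'] or {})
        # do not override a value that the easyconfig already sets
        extra_vars.setdefault('UCX_TLS', UCX_TLS_WORKAROUND)
        ec['modextravars'] = extra_vars
```

Because the change lives in a hooks file rather than in the easyconfig, the upstream impi easyconfig stays untouched and other sites are unaffected.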

@akesandgren
Contributor

@boegel I know, it was more a comment to @jhein32

@jhein32
Collaborator

jhein32 commented Oct 28, 2020

@boegel I know, it was more a comment to @jhein32

I understood that disabling dc in general is a bad idea. We are singing from the same hymn sheet.

But for some of these things there are hints in easyconfigs about what one might want to do if things don't work. I suggested putting this into the impi module instead of the UCX module, but I realise that others here think differently.

@jhein32
Collaborator

jhein32 commented Oct 28, 2020

@jhein32 The problem occurs during the sanity check for impi, which is an mpirun of a trivial MPI hello world C program (the test/test.c that is included in the impi installation).

@boegel Hmm, I never had that fail when we had issues with intel/2020a, and I assume that test was already in there. On our system this test runs inside the build node and does not use our stone-age IB. So if the absence of dc makes it trip, I wonder whether it is going out of the (virtual) node. Does your "virtual cluster" have something resembling multiple nodes?

@lexming
Contributor

lexming commented Oct 29, 2020

Test report by @lexming
SUCCESS
Build succeeded for 8 out of 8 (6 easyconfigs in total)
node101.hydra.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, Python 2.7.5
See https://gist.github.com/f8468568e84b6b65cedd55a819ce6113 for a full test report.

@boegel boegel changed the title from "{toolchain} intel/2020.10 (candidate for intel/2020b) [WIP]" to "{toolchain} intel/2020b" Nov 6, 2020
@boegel
Member Author

boegel commented Nov 6, 2020

I tested OpenFOAM 8 on top of intel/2020b after updating impi to 2019 update 9, works fine.

CP2K 7.1 test installation is still under way...

@boegel
Member Author

boegel commented Nov 7, 2020

CP2K/7.1-intel-2020b worked fine on Intel Cascade Lake (CentOS 7), with good results for the regression test (correct: 3253 / 3270; new: 8; wrong: 8; failed: 1)
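
As a quick sanity check on those regression numbers, the four categories add up to the quoted total, and the share of correct tests can be computed directly. This is just arithmetic on the figures above; the variable names are illustrative.

```python
# Regression test counts quoted in the comment above
correct, new, wrong, failed = 3253, 8, 8, 1

# The four categories should account for every test that was run
total = correct + new + wrong + failed

# Share of tests that produced the expected result, in percent
pass_rate = 100.0 * correct / total

print(f"{correct} / {total} correct ({pass_rate:.2f}%)")
```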

@boegel
Member Author

boegel commented Nov 8, 2020

Also got WRF 3.9.1.1 to install with intel/2020b.

@boegel
Member Author

boegel commented Nov 8, 2020

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
node3404.kirlia.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 2.7.5
See https://gist.github.com/2d2567a33db9ad87f83f2b28970e0cde for a full test report.

@boegel
Member Author

boegel commented Nov 8, 2020

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
node3108.skitty.os - Linux centos linux 7.8.2003, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/4a9101ca528d0069c3d70c50a0ec5a58 for a full test report.

@Micket
Contributor

Micket commented Nov 8, 2020

Test report by @Micket
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
vera-c1 - Linux centos linux 7.8.2003, x86_64, Intel Xeon Processor (Skylake), Python 2.7.5
See https://gist.github.com/c93a2ef57c8e521481e966529e56bb31 for a full test report.

Contributor

@Micket Micket left a comment

lgtm

@Micket
Contributor

Micket commented Nov 9, 2020

Going in, thanks @boegel!

@Micket Micket merged commit 3ee8e5a into easybuilders:develop Nov 9, 2020
@boegel boegel deleted the 20200923141429_new_pr_HPL23 branch November 9, 2020 10:08
Labels
2020b issues & PRs related to 2020b update
8 participants