
{toolchain} gobff/2020.11 + gobff/2020.06-amd (toolchains with BLIS + libFLAME) #11761

Merged

Conversation

SebastianAchilles
Member

@SebastianAchilles SebastianAchilles commented Nov 25, 2020

(created using eb --new-pr)
depends on PR easybuilders/easybuild-framework#3505

…20a-amd.eb, BLIS-0.8.0-GCC-9.3.0.eb, BLIS-2.2-GCC-9.3.0-amd.eb, FFTW-3.3.8-gompi-2020a-amd.eb, HPL-2.3-gobff-2020a-amd.eb, HPL-2.3-gobff-2020a.eb, libFLAME-2.2-GCC-9.3.0-amd.eb, libFLAME-5.2.0-GCC-9.3.0.eb, make-4.3-GCC-9.3.0.eb, ScaLAPACK-2.1.0-gompi-2020a-bf.eb, ScaLAPACK-2.2-gompi-2020a-amd.eb
@SebastianAchilles
Member Author

SebastianAchilles commented Nov 25, 2020

This pull request adds two toolchains:

  • gobff/2020a
  • gobff/2020a-amd

Both toolchains combine the GNU compilers with OpenMPI, BLIS, libFLAME, ScaLAPACK and FFTW.

gobff/2020a uses the vanilla libraries, while gobff/2020a-amd uses AMD's forks of the libraries.

I tested the performance on JUSUF @ JSC. JUSUF uses modified toolchains; however, HPL and DGEMM performance relies almost entirely on the math libraries, so the following numbers can be seen as a comparison between imkl/2020.2.254 and BLIS/0.8.0 / BLIS/2.2-amd. The performance of BLIS/0.8.0 should be similar to that of BLIS/2.2-amd, since the microkernel for AMD Rome has been merged back into vanilla BLIS.

Performance on JUSUF @ JSC (theoretical peak: 4608 GFLOPS) with HPL, using 1 node with 32 ranks of 4 cores each and ~20% of the memory (51.2 GB out of 256 GB), pinned with OMP_PROC_BIND=TRUE and OMP_PLACES=cores:

gpsmkl/2020 (MKL)
with

  • GCC/9.3.0
  • psmpi/5.4.7-1
  • imkl/2020.2.254
T/V                N    NB     P     Q               Time                 Gflops 
-------------------------------------------------------------------------------- 
WR01L2R4       82824   232     4     8             144.70             2.6176e+03

gpsbff/2020 (BLIS)
with

  • GCC/9.3.0
  • psmpi/5.4.7-1
  • BLIS/2.2-amd
  • libFLAME/2.2-amd
  • ScaLAPACK/2.2-amd
  • FFTW/3.3.8-amd
T/V                N    NB     P     Q               Time                 Gflops 
-------------------------------------------------------------------------------- 
WR01L2R4       82824   232     4     8             130.08             2.9119e+03
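
As a sanity check, the Gflops values HPL reports can be reproduced from N and the wall time using the standard LU operation count of roughly 2/3·N³ (the lower-order terms are negligible at this problem size); a minimal Python sketch:

```python
def hpl_gflops(n, seconds):
    """Approximate HPL Gflops from problem size N and wall time,
    using the standard LU operation count ~2/3*N^3 + 3/2*N^2."""
    flops = (2.0 / 3.0) * n ** 3 + (3.0 / 2.0) * n ** 2
    return flops / seconds / 1e9

print(hpl_gflops(82824, 144.70))  # ~2617 Gflops (imkl run above)
print(hpl_gflops(82824, 130.08))  # ~2912 Gflops (BLIS run above)
```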

Performance on JUSUF @ JSC (theoretical peak: 4608 GFLOPS) with DGEMM (m=n=k=10240), using 1 node with all 128 cores, pinned with OMP_PROC_BIND=TRUE and OMP_PLACES=cores:

imkl/2020.2.254: 1784 GFLOPS
BLIS/2.2-amd:    3080 GFLOPS
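
For context, a DGEMM measurement like this just times one large dgemm call and converts the runtime to 2·m·n·k floating-point operations. A rough sketch of that methodology (not the driver used for the numbers above) via NumPy, which delegates to whatever BLAS it is linked against (BLIS, MKL, OpenBLAS, ...):

```python
import time
import numpy as np

m = n = k = 10240  # same problem size as above
a = np.random.rand(m, k)
b = np.random.rand(k, n)

a @ b  # warm-up, so threads and memory pages are initialized before timing

t0 = time.perf_counter()
c = a @ b  # double-precision matrix-matrix multiply (dgemm)
elapsed = time.perf_counter() - t0

# DGEMM performs 2*m*n*k floating-point operations
print(f"{2.0 * m * n * k / elapsed / 1e9:.0f} GFLOPS in {elapsed:.2f} s")
```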

@bartoldeman
Contributor

I'm just wondering what the benefit is of including libFLAME in a toolchain, i.e. why not simply use goblf plus an explicit dependency on libFLAME if you need it?

@SebastianAchilles
Member Author

I'm just wondering what the benefit is of including libFLAME in a toolchain, i.e. why not simply use goblf plus an explicit dependency on libFLAME if you need it?

@bartoldeman goblf uses the reference Netlib LAPACK. The idea here is to use libFLAME instead of LAPACK; the benefit is that it is faster.

@bartoldeman
Contributor

By the way my testing on a dual AMD 7452 showed this for HPL some months ago:

T/V                 N     NB    P    Q       Time       Gflops
--------------------------------------------------------------------------------
WR11C2R4       128000    384    8    8     678.88     2.059e+03   (7452 AMD Rome, MKL 2020.1)
WR12R2R4       177000    192    8    8    1528.47     2.419e+03   (7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5)
WR12R2R4       168960    232    4    4    1370.64    2.3461e+03   (7452, AMD BLIS)
WR12R2R4       177000    232    4    4    1629.23    2.2691e+03   (7452, OpenBLAS)

I agree that AMD BLIS is the way to go for optimal performance on AMD chips: MKL slightly edged it out here, but only with undocumented settings.
I did have to use threaded AMD BLIS, though, to get optimal HPL scores, with 4 threads on each chiplet.

@bartoldeman
Contributor

I'm just wondering what the benefit is of including libFLAME in a toolchain, i.e. why not simply use goblf plus an explicit dependency on libFLAME if you need it?

@bartoldeman goblf uses the reference Netlib LAPACK. The idea here is to use libFLAME instead of LAPACK; the benefit is that it is faster.

Ah yes, I missed that it replaces LAPACK if you use its included lapack2flame.

@SebastianAchilles
Member Author

By the way my testing on a dual AMD 7452 showed this for HPL some months ago:

T/V                 N     NB    P    Q       Time       Gflops
--------------------------------------------------------------------------------
WR11C2R4       128000    384    8    8     678.88     2.059e+03   (7452 AMD Rome, MKL 2020.1)
WR12R2R4       177000    192    8    8    1528.47     2.419e+03   (7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5)
WR12R2R4       168960    232    4    4    1370.64    2.3461e+03   (7452, AMD BLIS)
WR12R2R4       177000    232    4    4    1629.23    2.2691e+03   (7452, OpenBLAS)

I agree that AMD BLIS is the way to go for optimal performance on AMD chips: MKL slightly edged it out here, but only with undocumented settings.
I did have to use threaded AMD BLIS, though, to get optimal HPL scores, with 4 threads on each chiplet.

That is very interesting! Yes, I also used the threaded BLIS library.
Do you know which AMD BLIS version you used?
MKL_DEBUG_CPU_TYPE was removed in newer versions of MKL. My impression is that the performance of MKL on AMD CPUs is getting better with each new version. This is good news! But personally I would prefer to have a toolchain that uses BLIS.

@SebastianAchilles
Member Author

Ah yes, I missed that it replaces LAPACK if you use its included lapack2flame.

As far as I understood, this is the approach suggested by AMD; at least that is how I read the AMD Optimized CPU Libraries (AOCL) User Guide:
https://developer.amd.com/wp-content/resources/AOCL_User Guide_2.2.pdf

@migueldiascosta
Member

migueldiascosta commented Nov 25, 2020

@SebastianAchilles did you also compare FFTW (either vanilla or AMD's fork) with MKL DFT?

@SebastianAchilles
Member Author

@migueldiascosta That is a very interesting question. Do you have a specific benchmark in mind?

I used the 3D complex-to-complex benchmark from
https://github.com/project-gemmi/benchmarking-fft

These are the results on a dual AMD EPYC 7742 (JUSUF @ JSC):

                128x128x320  256x256x256  512x512x512
imkl/2020.2.254      41 ms      212 ms      1962 ms
FFTW/3.3.8           26 ms      136 ms      1303 ms
FFTW/3.3.8-amd       24 ms      103 ms       930 ms
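
The linked project-gemmi benchmark is a small C++ driver; as a rough illustration of the measurement (problem sizes and timing only), here is a Python sketch that uses SciPy's built-in pocketfft rather than FFTW, amd-fftw or MKL, so it does not reproduce the comparison above:

```python
import time
import numpy as np
from scipy import fft

# the three 3D complex-to-complex problem sizes from the table above
for shape in [(128, 128, 320), (256, 256, 256), (512, 512, 512)]:
    x = np.random.rand(*shape) + 1j * np.random.rand(*shape)
    fft.fftn(x, workers=-1)  # warm-up run
    t0 = time.perf_counter()
    fft.fftn(x, workers=-1)  # timed 3D c2c transform, using all cores
    print(shape, f"{(time.perf_counter() - t0) * 1e3:.0f} ms")
```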

@migueldiascosta
Member

migueldiascosta commented Nov 27, 2020

@SebastianAchilles I think that in imkl/2020.2.254 setting MKL_DEBUG_CPU_TYPE doesn't force AVX2 anymore; are you forcing it with, e.g., https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html?

Regarding FFT benchmarks, a while back we used gearshifft (https://github.com/mpicbg-scicomp/gearshifft, #8783).

More importantly for us, the FFT-related timings in (materials science) application benchmarks seemed to show that MKL (when forcing AVX2 with MKL_DEBUG_CPU_TYPE; we have been sticking to imkl/2019.5.281) was consistently faster than FFTW on AMD as well, but of course, your mileage may (will) vary.

And to be clear, this is a bit orthogonal to the PR - having this toolchain in EasyBuild would be very useful in any case.

@migueldiascosta
Member

A reminder that with EB's HierarchicalMNS, above gompi there will be module name collisions between modules using different math libraries (but this already happens with all goxxx toolchains).

@migueldiascosta
Member

Test report by @migueldiascosta
SUCCESS
Build succeeded for 12 out of 12 (12 easyconfigs in total)
sms - Linux centos linux 7.6.1810, x86_64, AMD EPYC 7601 32-Core Processor (zen), Python 2.7.5
See https://gist.github.com/df17aabcf20b56c77957d26c6b8228bd for a full test report.

@boegel boegel added the new label Nov 29, 2020
@boegel
Member

boegel commented Nov 29, 2020

@boegelbot please test @ generoso

@boegel boegel added this to the 4.3.2 (next release) milestone Nov 29, 2020
@easybuilders easybuilders deleted a comment from boegelbot Nov 29, 2020
@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11761 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11761 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11106

Test results coming soon (I hope)...

- notification for comment with ID 735396322 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 12 out of 12 (12 easyconfigs in total)
generoso-c1-s-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/d522a9818c8640520c44349e638b0dcc for a full test report.

@boegel
Member

boegel commented Nov 29, 2020

Test report by @boegel
FAILED
Build succeeded for 8 out of 22 (12 easyconfigs in total)
node3573.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/b0ae65ed183f5955401b5e8ed9730c77 for a full test report.

@boegel boegel modified the milestones: 4.3.2 (next release), 4.4.0 Dec 7, 2020
@SebastianAchilles
Member Author

I don't understand the dependency conflict. Is it not possible to distinguish the toolchains with a versionsuffix?

@easybuilders easybuilders deleted a comment from boegelbot Dec 8, 2020
@boegel
Member

boegel commented Dec 8, 2020

@SebastianAchilles The easyconfigs you're adding violate a policy we're trying to maintain where there's only a single version of a dependency in each "generation" of easyconfig files. We do this to minimize the amount of conflicts between easyconfigs from the same generation.
There are already exceptions to this rule though; one common one is Python (where we allow both a Python 2.x and 3.x version per generation of easyconfigs).

We should probably add exceptions for BLIS, FFTW, ScaLAPACK and libFLAME for the 2020a generation, since the check we have is a little bit too strict here...

Before we do that, we should agree on the naming scheme we'll use. The -amd ones obviously make sense, but I'm not sure about the -bf one for ScaLAPACK.

Can you post an overview of the toolchains you're adding here in a comment, and how they compare with standard foss, to make the discussion a bit easier?

@SebastianAchilles
Member Author

SebastianAchilles commented Dec 8, 2020

@boegel Sure, I'll try to elaborate.
The general idea of gobff is to use the GNU compilers, OpenMPI, BLIS, libFLAME, FFTW and ScaLAPACK. I had the idea to offer two toolchain variants:

  • gobff/2020a
  • gobff/2020a-amd

The gobff/2020a toolchain uses the official vanilla libraries:

  • BLIS/0.8.0
  • libFLAME/5.2.0
  • ScaLAPACK/2.1.0-bf
  • FFTW/3.3.8

ScaLAPACK/2.1.0-bf uses BLIS and libFLAME as its BLAS and LAPACK dependencies (hence the -bf suffix, though I am not sure that is a good suffix). The difference from ScaLAPACK/2.1.0 in foss is that the foss version uses OpenBLAS as its dependency.
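
For illustration, a gobff toolchain easyconfig is essentially just a bundle of these components. A rough sketch of what gobff-2020a.eb looks like, with versions taken from the easyconfig file names in this PR (the OpenMPI version is an assumption based on the usual gompi/2020a pairing; the actual files in the PR are authoritative):

```python
easyblock = 'Toolchain'

name = 'gobff'
version = '2020a'

homepage = '(none)'
description = """GCC-based compiler toolchain with OpenMPI, BLIS, libFLAME, ScaLAPACK and FFTW."""

toolchain = SYSTEM

local_gccver = '9.3.0'
local_comp_mpi_tc = ('gompi', version)

dependencies = [
    ('GCC', local_gccver),
    ('OpenMPI', '4.0.3', '', ('GCC', local_gccver)),  # assumed gompi/2020a OpenMPI version
    ('BLIS', '0.8.0', '', ('GCC', local_gccver)),
    ('libFLAME', '5.2.0', '', ('GCC', local_gccver)),
    ('FFTW', '3.3.8', '', local_comp_mpi_tc),
    # ScaLAPACK built on top of BLIS + libFLAME, hence the -bf versionsuffix
    ('ScaLAPACK', '2.1.0', '-bf', local_comp_mpi_tc),
]

moduleclass = 'toolchain'
```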

The gobff/2020a-amd toolchain on the other hand uses the AMD fork of the libraries:

  • BLIS/2.2-amd
  • libFLAME/2.2-amd
  • ScaLAPACK/2.2-amd
  • FFTW/3.3.8-amd

My initial idea was to offer both toolchains, so that people can choose which variant they prefer. However, I didn't measure a performance difference between the vanilla and AMD versions of BLIS and libFLAME, so I think it would be okay to just use the vanilla libraries. For FFTW I did measure a performance difference, but I guess it would also be possible to ship the changes as a patch and only apply that patch on AMD systems. What would be the best way to write a conditional easyconfig?
I didn't have time to test AMD's fork of ScaLAPACK, but I assume it performs better on AMD systems. Here too it might be possible to put the changes in a patch and use a conditional easyconfig.

What is your opinion? Do you want to add the exceptions? Or do we want to try making the optimization depend on the system where the toolchain is used, e.g. with something like a conditional easyconfig?

Regarding the naming scheme: I am not convinced that the names and suffixes I came up with are the best ones. An alternative idea is to rename gobff/2020a-amd to goamd/2020a. This would also avoid the dependency conflict.

@easybuilders easybuilders deleted a comment from boegelbot Dec 9, 2020
…s to avoid tests tripping over two BLIS/libFLAME variants
@boegel boegel changed the title {toolchain}[gobff/2020a] gobff v2020a, BLIS v0.8.0, BLIS v2.2, ... {toolchain} gobff/2020.11 + gobff/2020.06-amd (toolchains with BLIS + libFLAME) Dec 9, 2020
@easybuilders easybuilders deleted a comment from boegelbot Dec 9, 2020
…riable for amd-fftw version in source tarball)
@boegel
Member

boegel commented Dec 9, 2020

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=11761 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_11761 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12267

Test results coming soon (I hope)...

- notification for comment with ID 741991017 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel boegel modified the milestones: 4.4.0, 4.3.2 (next release) Dec 9, 2020
@boegel
Member

boegel commented Dec 9, 2020

Test report by @boegel
SUCCESS
Build succeeded for 12 out of 12 (12 easyconfigs in total)
node3501.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/12d91e70bed6426b3c0e411a79131127 for a full test report.

@boegel
Member

boegel commented Dec 9, 2020

This should be good to go now; the necessary exceptions have been added to the tests to allow the -amd variants of BLIS & libFLAME...

I'll squeeze this in for the upcoming EasyBuild v4.3.2, so we can start experimenting with this BLIS-based toolchain.

Thanks a lot for the contribution @SebastianAchilles !

@boegel
Member

boegel commented Dec 9, 2020

Test report by @boegel
SUCCESS
Build succeeded for 12 out of 12 (12 easyconfigs in total)
node2656.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/b5d8c273cdb4aecd1d5db0cb141ed13e for a full test report.

@boegel
Member

boegel commented Dec 9, 2020

Going in, thanks @SebastianAchilles!

@boegel boegel merged commit face658 into easybuilders:develop Dec 9, 2020
@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 12 out of 12 (12 easyconfigs in total)
generoso - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/e1256bbbe0255e1b0401f7b905a1fe87 for a full test report.
