
Add manually vectorized AVX2 implementation of nbody #115

Merged · 3 commits · Oct 30, 2020

Conversation

bernhardmgruber
Member

No description provided.

@bernhardmgruber bernhardmgruber marked this pull request as draft October 30, 2020 13:12
@bernhardmgruber bernhardmgruber marked this pull request as ready for review October 30, 2020 15:16
@bernhardmgruber bernhardmgruber merged commit acdabc0 into alpaka-group:develop Oct 30, 2020
@psychocoderHPC
Member

Did you measure the performance, comparing native vs. llama + alpaka?

@bernhardmgruber
Member Author

Here I changed -march=native to -mavx2 -mfma for a fair comparison and ran the llama-nbody executable on an i7-7820X:

16k particles (448kiB)
LLAMA
alloc took 3.36e-07s
init took 0.00629482s
update took 0.271607s   move took   8.488e-06s
update took 0.256883s   move took   6.671e-06s
update took 0.255247s   move took   6.73e-06s
update took 0.25479s    move took   6.749e-06s
update took 0.244397s   move took   6.747e-06s
AoS
alloc took 0.000149621s
init took 0.00166187s
update took 0.92677s    move took   2.1628e-05s
update took 0.897583s   move took   2.1552e-05s
update took 0.923078s   move took   2.1547e-05s
update took 0.916592s   move took   2.1502e-05s
update took 0.911556s   move took   2.1548e-05s
SoA
alloc took 0.000129929s
init took 0.00169066s
update took 0.248574s   move took   7.307e-06s
update took 0.247945s   move took   7.276e-06s
update took 0.248092s   move took   7.282e-06s
update took 0.248524s   move took   7.303e-06s
update took 0.248615s   move took   7.315e-06s
AoSoA
alloc took 1.4891e-05s
init took 0.00168645s
update took 0.130905s   move took   4.875e-06s
update took 0.130842s   move took   4.829e-06s
update took 0.132571s   move took   4.826e-06s
update took 0.139259s   move took   4.84e-06s
update took 0.138359s   move took   4.834e-06s
AoSoA tiled
alloc took 1.271e-05s
init took 0.00167707s
update took 0.138387s   move took   4.558e-06s
update took 0.138554s   move took   4.569e-06s
update took 0.138749s   move took   4.533e-06s
update took 0.138306s   move took   4.55e-06s
update took 0.13839s    move took   5.778e-06s
AoSoA AVX2 updating 8 particles from 1
alloc took 1.0699e-05s
init took 0.00171034s
update took 0.108093s   move took   4.468e-06s
update took 0.106618s   move took   4.436e-06s
update took 0.10808s    move took   4.446e-06s
update took 0.107489s   move took   4.424e-06s
update took 0.106915s   move took   4.446e-06s
AoSoA AVX2 updating 1 particle from 8
alloc took 9.851e-06s
init took 0.00166775s
update took 0.0961262s  move took   4.475e-06s
update took 0.0981653s  move took   4.42e-06s
update took 0.095975s   move took   4.454e-06s
update took 0.0959235s  move took   4.401e-06s
update took 0.0961023s  move took   4.804e-06s

The LLAMA version uses an SoA layout. You can see that AoSoA would still be almost twice as fast, but that also changes the loop structure (one loop becomes two nested loops, iterating over the blocks and over the elements within a block), and LLAMA cannot do that yet.

Furthermore, the hand-vectorized versions are still a fair bit faster than the auto-vectorized ones. I have not yet figured out why; the assembly is pretty similar.

Please ignore the AoSoA tiled version for now; it does not do traditional tiling, but instead tries to use three nested loops to iterate over the elements of the blocks in L1-sized chunks.
