
Add manually vectorized AVX2 implementation of nbody #115

Merged · 3 commits · Oct 30, 2020

Conversation

bernhardmgruber
Member

No description provided.

@bernhardmgruber bernhardmgruber marked this pull request as draft October 30, 2020 13:12
@bernhardmgruber bernhardmgruber marked this pull request as ready for review October 30, 2020 15:16
@bernhardmgruber bernhardmgruber merged commit acdabc0 into alpaka-group:develop Oct 30, 2020
@psychocoderHPC
Member

Did you measure the performance, comparing native vs. llama + alpaka?

@bernhardmgruber
Member Author

Here I changed -march=native to -mavx2 -mfma for a fair comparison and ran the llama-nbody executable on an i7-7820X:

16k particles (448kiB)
LLAMA
alloc took 3.36e-07s
init took 0.00629482s
update took 0.271607s   move took   8.488e-06s
update took 0.256883s   move took   6.671e-06s
update took 0.255247s   move took   6.73e-06s
update took 0.25479s    move took   6.749e-06s
update took 0.244397s   move took   6.747e-06s
AoS
alloc took 0.000149621s
init took 0.00166187s
update took 0.92677s    move took   2.1628e-05s
update took 0.897583s   move took   2.1552e-05s
update took 0.923078s   move took   2.1547e-05s
update took 0.916592s   move took   2.1502e-05s
update took 0.911556s   move took   2.1548e-05s
SoA
alloc took 0.000129929s
init took 0.00169066s
update took 0.248574s   move took   7.307e-06s
update took 0.247945s   move took   7.276e-06s
update took 0.248092s   move took   7.282e-06s
update took 0.248524s   move took   7.303e-06s
update took 0.248615s   move took   7.315e-06s
AoSoA
alloc took 1.4891e-05s
init took 0.00168645s
update took 0.130905s   move took   4.875e-06s
update took 0.130842s   move took   4.829e-06s
update took 0.132571s   move took   4.826e-06s
update took 0.139259s   move took   4.84e-06s
update took 0.138359s   move took   4.834e-06s
AoSoA tiled
alloc took 1.271e-05s
init took 0.00167707s
update took 0.138387s   move took   4.558e-06s
update took 0.138554s   move took   4.569e-06s
update took 0.138749s   move took   4.533e-06s
update took 0.138306s   move took   4.55e-06s
update took 0.13839s    move took   5.778e-06s
AoSoA AVX2 updating 8 particles from 1
alloc took 1.0699e-05s
init took 0.00171034s
update took 0.108093s   move took   4.468e-06s
update took 0.106618s   move took   4.436e-06s
update took 0.10808s    move took   4.446e-06s
update took 0.107489s   move took   4.424e-06s
update took 0.106915s   move took   4.446e-06s
AoSoA AVX2 updating 1 particle from 8
alloc took 9.851e-06s
init took 0.00166775s
update took 0.0961262s  move took   4.475e-06s
update took 0.0981653s  move took   4.42e-06s
update took 0.095975s   move took   4.454e-06s
update took 0.0959235s  move took   4.401e-06s
update took 0.0961023s  move took   4.804e-06s

The LLAMA version uses an SoA layout. You can see that AoSoA would still be almost twice as fast, but that also changes the loop structure (one loop becomes two nested loops, iterating over the blocks and over the elements within a block), and LLAMA cannot do that yet.

Furthermore, the hand-vectorized versions are still a fair bit faster than the auto-vectorized ones. I have not yet figured out why; the assembly is pretty similar.

Please ignore the AoSoA tiled version for now; it does not do traditional tiling, but instead tries to use three nested loops to iterate over the elements of the blocks in L1-sized chunks.
