Improve noncontiguous data transfers #13
@awnawab FYI
Hello, this is very interesting, but I do not want it to be merged now. We have an important missing feature to implement before we address issues like this.
field_RANKSUFF_module.fypp
${indent}$ & IWIDTH, IHEIGHT, CUDAMEMCPYHOSTTODEVICE, &
${indent}$ & STREAM)
${indent}$ ELSE
${indent}$ IRET = CUDAMEMCPY2D (DEV (${ar('DEV')}$), IDEV_PITCH, &
Two questions:
- CUDAMEMCPY2DASYNC seems to require a CUDAMEMCPYHOSTTODEVICE/CUDAMEMCPYDEVICETOHOST argument, while CUDAMEMCPY2D does not; is that expected?
- The direction of the transfers triggered by CUDAMEMCPY2DASYNC/CUDAMEMCPY2D is determined by the order of HST/DEV: H2D with DEV as the first argument, D2H with HST as the first argument. Correct?
- You are right, this can be done better. I had to fill in CUDAMEMCPYXTOY because I wanted to specify the stream, but the better way, of course, is to pass STREAM=STREAM.
- Yes, all memcpy calls have the form memcpy(dst, [...], src, [...]). So for H2D, DEV comes first, and for D2H, HST comes first. This is true for the plain C memcpy, and also for all cudaMemcpy variants.
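To make the dst-first convention and the pitch arguments concrete, here is a host-only C sketch that models the behaviour of cudaMemcpy2D (C signature: cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind)) with plain memcpy, so it runs without a GPU. The helpers copy_2d, pack_section and unpack_section are hypothetical, not FIELD_API routines; they assume a column-major (Fortran-order) array of doubles with leading dimension n1 and a section with 1-based bounds. In cudaMemcpy2D terms, each "row" here is one contiguous column segment of the Fortran section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only model of cudaMemcpy2D: copies `height` rows of `width` bytes.
   As with memcpy and every cudaMemcpy variant, the destination comes first. */
static void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
                    size_t width, size_t height)
{
    /* cudaMemcpy2D requires width to fit inside both pitches. */
    assert(width <= dpitch && width <= spitch);
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    for (size_t row = 0; row < height; ++row)
        memcpy(d + row * dpitch, s + row * spitch, width);
}

/* Host-to-device analogue for the Fortran section HST(I1:I2, J1:J2) of a
   column-major array with leading dimension N1 (1-based bounds): the
   compact DEV buffer is the destination, so it is passed first. */
static void pack_section(double *dev, const double *hst, size_t n1,
                         size_t i1, size_t i2, size_t j1, size_t j2)
{
    size_t nrows = i2 - i1 + 1;                /* contiguous run per column */
    copy_2d(dev, nrows * sizeof(double),       /* dst, dpitch */
            hst + (i1 - 1) + (j1 - 1) * n1,    /* src: first section element */
            n1 * sizeof(double),               /* spitch: one full column */
            nrows * sizeof(double),            /* width in bytes */
            j2 - j1 + 1);                      /* height: number of columns */
}

/* Device-to-host analogue: the same call with the roles swapped, so HST is
   now the destination and comes first. */
static void unpack_section(double *hst, const double *dev, size_t n1,
                           size_t i1, size_t i2, size_t j1, size_t j2)
{
    size_t nrows = i2 - i1 + 1;
    copy_2d(hst + (i1 - 1) + (j1 - 1) * n1, n1 * sizeof(double), /* dst */
            dev, nrows * sizeof(double),                         /* src */
            nrows * sizeof(double), j2 - j1 + 1);
}
```

Packing HST(2:4, 1:6) of a 5 × 6 array this way is one 2D copy of 6 rows of 24 bytes, where a loop of 1D copies would issue 6 separate calls.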
Thanks for doing the merge with master! I merged it back into my branch and made this minor change. I also slightly changed the test so that it states explicitly which function is supposed to be called.
Hi @lukasm91, thanks again for this amazing contribution, which addresses one of the main bottlenecks in FIELD_API! I would just like to clean up a few small things before approving this:
With your permission, could I please contribute commits to your PR to address the above?
Hi Ahmad, of course, that makes sense; feel free to contribute those commits here.
The branch was force-pushed from 3f5bf62 to 3bb735f.
@awnawab I don't think this is a good idea. Checking for CUDASUCCESS is always worthwhile and costs essentially nothing (it is a single comparison). Things can go wrong for many reasons, and at large scale especially they can go wrong in strange ways, so it is better to catch errors as early as possible. I recommend not removing the CUDASUCCESS checks.
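The pattern being defended here can be sketched in C. To keep the sketch runnable without a GPU, status_t, STATUS_SUCCESS and fake_copy_2d are hypothetical stand-ins for cudaError_t, cudaSuccess (CUDASUCCESS on the Fortran side) and a real runtime call; only the shape of the check matters: one comparison per call, and an immediate stop that names the failing call and its location.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical status codes; in the real CUDA runtime the role of
   STATUS_SUCCESS is played by cudaSuccess. */
typedef enum { STATUS_SUCCESS = 0, STATUS_INVALID_VALUE = 1 } status_t;

/* Hypothetical stand-in for a runtime call such as a 2D copy: it rejects a
   row width larger than the pitch instead of touching any memory. */
static status_t fake_copy_2d(size_t width, size_t pitch)
{
    return width <= pitch ? STATUS_SUCCESS : STATUS_INVALID_VALUE;
}

/* The check itself is a single integer comparison; on failure it reports
   the failing call and its source location, then aborts immediately. */
#define CHECK(call)                                                    \
    do {                                                               \
        status_t check_status_ = (call);                               \
        if (check_status_ != STATUS_SUCCESS) {                         \
            fprintf(stderr, "%s:%d: %s returned %d\n", __FILE__,       \
                    __LINE__, #call, (int)check_status_);              \
            abort();                                                   \
        }                                                              \
    } while (0)
```

A plain assert would not be a substitute for such a check, because assert is compiled out when NDEBUG is defined, whereas this check stays active in production builds.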
Thanks @lukasm91, I've restored them 👍
Thanks. I also removed the DEBUG flag again because it is not needed now.
Thanks again, Lukas. Good to go from my side. @pmarguinaud, if you are also happy, could you please approve and merge?
As promised in #7, this PR replaces a loop over 1D acc_copy[in|out] calls with cudaMemcpy2D. There might still be cases where we have to loop over one of the dimensions, but these are rare and unlikely in real cases, because they would require two non-contiguous dimensions.
- An AFTER argument to X_GET_LAST_CONTIGUOUS, which indicates after which dimension to look for the next last contiguous dimension
- COPY_2D{X}_{Y}_CONTIGUOUS, where X and Y are the two last contiguous dimensions
- [...] X(:, :, 4:8, 3:3, 3:3, 1:10) has 5 contiguous dimensions, because the 3:3 ranges do not yet make the subarray non-contiguous; only the final 1:10 does
- The -cuda flag (an ecbuild expert should say what the preferred way to do this is ...)
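The contiguity rule behind the X(:, :, 4:8, 3:3, 3:3, 1:10) example can be sketched as follows. This is a hypothetical C helper, leading_contiguous_dims, not the PR's X_GET_LAST_CONTIGUOUS (which is Fortran and also takes the AFTER argument, not modelled here); it assumes unit-stride sections of a column-major array, with the Fortran bounds stored in 0-based C arrays.

```c
#include <assert.h>

/* Counts how many leading dimensions of a unit-stride section
   X(lo(1):hi(1), ..., lo(n):hi(n)) of a column-major array with
   extents full(1..n) form one contiguous block of memory.
   Once a dimension is only partially covered, every later dimension
   must have extent 1 for the block to stay contiguous; an extent-1
   range such as 3:3 never breaks contiguity by itself. */
static int leading_contiguous_dims(int ndim, const int *full,
                                   const int *lo, const int *hi)
{
    int partial_seen = 0;
    for (int d = 0; d < ndim; ++d) {
        assert(lo[d] >= 1 && lo[d] <= hi[d] && hi[d] <= full[d]);
        int extent = hi[d] - lo[d] + 1;
        if (partial_seen && extent > 1)
            return d;        /* dimensions 1..d (1-based) are contiguous */
        if (extent < full[d])
            partial_seen = 1;
    }
    return ndim;             /* the whole section is contiguous */
}
```

For a hypothetical shape (4, 3, 10, 5, 5, 10), the section above gives 5: dimensions 1 and 2 are complete, 4:8 is partial, the two 3:3 ranges have extent 1, and 1:10 is the first dimension of extent greater than 1 after a partial one. The block of the first 5 dimensions can then be moved as the width of a 2D copy, with the last dimension as its height.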