Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve noncontiguous data transfers #13

Merged
merged 20 commits into from
Jan 22, 2024

Conversation

lukasm91
Copy link
Collaborator

As promised in #7, t his PR replaces a loop over 1D acc_copy[in|out] with cudaMemcpy2D. There might still be cases where we have to loop over one of the dimensions, but these are rare and not likely to happen in real case because it would require two non-contiguous dimensions.

  • adds optional AFTER argument to X_GET_LAST_CONTIGUOUS which indicates after which dimension we are looking for the next last contiguous dimension
  • adds COPY_2D{X}_{Y}_CONTIGUOUS where X and Y are the two last contiguous dimensions
  • slightly changes the strategy to find the last contiguous dimension, because for 2D copies, we should handle dimensions of size 1 as contiguous if they happen in the end (e.g. X(:, :, 4:8, 3:3, 3:3, 1:10) has 5 contiguous dimensions because 3:3 does not make the sub array non-contiguous, yet).
  • adds the -cuda flag (a ecbuild expert should tell what is the preferred way to do this ...)
  • adds a new pretty large test case to test all this. The new tests only checks for correctness, not whether the code is optimal.

@lukasm91
Copy link
Collaborator Author

@awnawab FYI

@pmarguinaud
Copy link
Collaborator

Hello,

This is very interesting, but I do not want this to be merged now. We have an important missing feature to implement before we address issues like this.

${indent}$ & IWIDTH, IHEIGHT, CUDAMEMCPYHOSTTODEVICE, &
${indent}$ & STREAM)
${indent}$ ELSE
${indent}$ IRET = CUDAMEMCPY2D (DEV (${ar('DEV')}$), IDEV_PITCH, &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions :

  • CUDAMEMCPY2DASYNC seems to require a CUDAMEMCPYHOSTTODEVICE/CUDAMEMCPYDEVICETOHOST argument, while CUDAMEMCPY2D does not; is it normal ?
  • the direction of transfers triggered by CUDAMEMCPY2DASYNC/CUDAMEMCPY2D is influenced by the order of HST/DEV : H2D with DEV as first argument, D2H with HST as first argument. Correct ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • You are right, I can do this better. I had to fill in CUDAMEMCPYXTOY because I want to specify the stream, but the better way of course is to do STREAM=STREAM
  • yes, all memcpy are memcpy(dst, [...], src, [...]). So for H2D, you have DEV first, and for D2H you have HST first. this is true for the normal C memcpy, but also for all cudaMemcpy variants.

Thanks for doing the merge with master! I merged this back with my branch and did this minor change. I also slightly change the test quoting explicitly which function is supposed to be called.

@lukasm91 lukasm91 changed the base branch from main to accord January 19, 2024 07:38
@lukasm91 lukasm91 changed the base branch from accord to main January 19, 2024 07:38
@pmarguinaud pmarguinaud self-assigned this Jan 19, 2024
@awnawab
Copy link
Collaborator

awnawab commented Jan 22, 2024

Hi @lukasm91,

Thanks again for this amazing contribution that addresses one of the main bottlenecks in FIELD_API! I would just like to clean up a few small things before approving this:

  • Remove the setting of -cuda flags in field_api_compile_flags.cmake as these are already set if CUDA is enabled
  • Extend the 2D copy to also work with unmapped device pointers
  • Revert to the old copy mechanism if ACC is enabled but not CUDA
  • The unit-tests should give us confidence in the 2D copy mechanism, and so we should only perform CUDASUCCESS checks for debug builds

With your permission, could I please contribute commits to your PR to address the above?

@lukasm91
Copy link
Collaborator Author

Hi Ahmad, of course, this makes sense and feel free to contribute those commits in here.

@lukasm91
Copy link
Collaborator Author

@awnawab
Just looked at your proposal again. Regarding

The unit-tests should give us confidence in the 2D copy mechanism, and so we should only perform CUDASUCCESS checks for debug builds

I don't think this is a good idea. Checking for CUDASUCCESS is always a good idea and doesn't cost anything (it is a simple comparison). Things can go wrong for many reasons, and especially at large scale, things can go wrong in weird ways, so it is better to catch errors as early as possible.
The place where a check would only make sense in a debug build is if you have to do an extra synchronization to check for correctness.

I recommend to not remove the CUDASUCCESS checks.

@awnawab
Copy link
Collaborator

awnawab commented Jan 22, 2024

Thanks @lukasm91, I've restored them 👍

@lukasm91
Copy link
Collaborator Author

thanks. I also removed the DEBUG flag again because this is not needed now

Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again Lukas. Good to go from my side. @pmarguinaud if you are also happy, could you please approve and merge?

@pmarguinaud pmarguinaud merged commit 521069a into ecmwf-ifs:main Jan 22, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants