This PR adds pinned memory for data transfer during a parallel SpMV. I think this would also be a good time to create a par_csr_matvec_device.c and move the device code there; that would be easier to work with than the current tangle of #ifdefs.
Pinned pointers are added to the par_csr_matrix class. These pointers are allocated on demand in par_csr_matvec.c and sized to the maximum needed so that the same buffers can be reused by both the standard matvec and matvecT. To me, they belong in the matrix class because the data is sized according to a particular matrix's parallel SpMV layout.
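A minimal sketch of that on-demand, max-sized allocation (member and function names here are hypothetical, and plain `double`/`int` stand in for `HYPRE_Complex`/`HYPRE_Int`; the actual allocation in the PR goes through memory.c):

```c
#include <cuda_runtime.h>

/* Hypothetical stand-in for the new pinned-buffer members on the matrix. */
typedef struct
{
   /* ... existing parallel CSR matrix members ... */
   double *send_data_pinned;   /* hypothetical member name */
   double *recv_data_pinned;   /* hypothetical member name */
} par_csr_matrix_sketch;

/* Allocate lazily, sized to the larger of the send/recv requirements so one
 * allocation serves both matvec and matvecT (where the roles are swapped). */
static void ensure_pinned_buffers(par_csr_matrix_sketch *A,
                                  int num_send_elmts, int num_recv_elmts)
{
   int n = (num_send_elmts > num_recv_elmts) ? num_send_elmts : num_recv_elmts;

   if (!A->send_data_pinned)
   {
      cudaHostAlloc((void **)&A->send_data_pinned, n * sizeof(double),
                    cudaHostAllocMapped);
   }
   if (!A->recv_data_pinned)
   {
      cudaHostAlloc((void **)&A->recv_data_pinned, n * sizeof(double),
                    cudaHostAllocMapped);
   }
}
```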
The pinned memory buffers are allocated in memory.c with cudaHostAlloc using the cudaHostAllocMapped flag. This lets the CUDA kernels (the gather kernel for matvec, the SpMV kernel for matvecT) write directly into pinned host memory. The catch is that one has to call cudaHostGetDevicePointer and pass the resulting device alias into the kernel launch (done in par_csr_matvec.c). This is a little wonky and needs to be cleaned up.
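A minimal sketch of that mapped pinned-memory pattern, assuming an illustrative gather kernel (not hypre's actual kernel or buffer names):

```c
#include <cuda_runtime.h>

/* Gather kernel writing straight into mapped pinned host memory. */
__global__ void gather_into_pinned(const double *x, const int *send_map,
                                   double *send_buf, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n)
   {
      send_buf[i] = x[send_map[i]];
   }
}

void launch_gather(const double *d_x, const int *d_send_map,
                   double *h_send_buf_pinned, int n)
{
   double *d_send_buf;

   /* Translate the mapped host pointer into the device-side alias the
    * kernel must use (the "wonky" step mentioned above). */
   cudaHostGetDevicePointer((void **)&d_send_buf, (void *)h_send_buf_pinned, 0);

   gather_into_pinned<<<(n + 255) / 256, 256>>>(d_x, d_send_map, d_send_buf, n);
}
```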
par_csr_communication.c has two new methods: hypre_ParCSRCommHandleCreate_v3 and hypre_ParCSRCommHandleDestroy_v3.
In the first method, hypre_ParCSRCommHandleCreate_v3, the pinned buffers are passed as input. Since the kernel has already written into mapped pinned memory, there is no memcpyDtoH; I simply device-synchronize to ensure the pinned data is ready on the host before the MPI communication starts.
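A sketch of that idea under the assumptions above (illustrative function and buffer names; the real handle carries the full send/recv maps):

```c
#include <mpi.h>
#include <cuda_runtime.h>

void comm_handle_create_v3_sketch(double *h_send_buf_pinned, int send_count, int dest,
                                  double *h_recv_buf_pinned, int recv_count, int src,
                                  MPI_Comm comm, MPI_Request reqs[2])
{
   /* No cudaMemcpy(DtoH): the gather kernel already wrote into the pinned
    * host buffer. Just make sure that kernel has finished. */
   cudaDeviceSynchronize();

   MPI_Irecv(h_recv_buf_pinned, recv_count, MPI_DOUBLE, src,  0, comm, &reqs[0]);
   MPI_Isend(h_send_buf_pinned, send_count, MPI_DOUBLE, dest, 0, comm, &reqs[1]);
}
```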
In the second method, hypre_ParCSRCommHandleDestroy_v3, I execute a cudaMemcpyAsync so that the host-to-device transfer of the received data can overlap with kernel execution. cudaMemcpyAsync is called via hypre_TMemcpyAsync, a new routine that is currently only implemented for NVIDIA and AMD architectures. Tested on Summit and Crusher.
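A sketch of the overlap idea; hypre_TMemcpyAsync presumably wraps something like the call below (its exact signature is in the PR), and the function and buffer names here are illustrative:

```c
#include <cuda_runtime.h>

void comm_handle_destroy_v3_sketch(double *d_recv_data,
                                   const double *h_recv_buf_pinned,
                                   int recv_count, cudaStream_t stream)
{
   /* Pinned source memory is what allows this copy to be truly asynchronous
    * and to overlap with independent kernel work on other streams. */
   cudaMemcpyAsync(d_recv_data, h_recv_buf_pinned,
                   recv_count * sizeof(double),
                   cudaMemcpyHostToDevice, stream);

   /* The local part of the SpMV can run concurrently; kernels that consume
    * d_recv_data must synchronize on 'stream' first. */
}
```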