add CUDA distance implementation #123

lkeegan · 2024-01-09T15:50:14Z

Python interface changes
- add from_fasta_to_lower_triangular
  - constructs a lower triangular matrix file directly from a fasta file
  - for now only works on GPU
  - faster and requires less RAM than calling from_fasta followed by dump_lower_triangular
- add use_gpu option to from_fasta
  - if True and include_x is False, the GPU is used to calculate the distances matrix
- add cuda_gpu_available() utility function
CUDA implementation
- each block of threads calculates a single element of the distances matrix
- a kernel is launched running on a grid of these blocks to calculate a subset of the distances matrix
- I/O is interleaved with computation: the CPU writes the previous kernel results as the next kernel is running
print basic timing info to cout
add libfmt library
build wheels using manylinux2014 image with CUDA installed from https://github.com/ameli/manylinux-cuda
resolves Add CUDA implementation of distance function for NVIDIA GPUs #111

- Python interface changes
  - add `from_fasta_to_lower_triangular`
    - constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster and requires less RAM than calling `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on GPU + 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`
    - if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched running on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel results as the next kernel is running
- print basic timing info to cout
- add libfmt library
- migrate to using v3 of catch2
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- bump version to 1.0.0
- resolves #111
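
As a back-of-the-envelope check on the memory figures quoted above, the arithmetic can be sketched as follows. This assumes the 1.5GB per 100k genomes figure scales linearly with the number of genomes and that the 1GB buffer is a fixed cost. The function names are hypothetical and not part of the library.

```python
def estimate_memory_gb(n_genomes, gb_per_100k=1.5, buffer_gb=1.0):
    """Estimated memory in GB: genome storage (assumed to scale
    linearly) plus a fixed buffer for the partial distances matrix."""
    return gb_per_100k * n_genomes / 100_000 + buffer_gb


def lower_triangular_elements(n_genomes):
    """Number of entries strictly below the diagonal of an n x n matrix."""
    return n_genomes * (n_genomes - 1) // 2


for n in (100_000, 1_000_000):
    print(n, estimate_memory_gb(n), lower_triangular_elements(n))
```

The estimate grows linearly with the number of genomes, while the number of matrix entries grows quadratically. That gap is why writing the lower triangular file from partial results takes far less memory than holding the full matrix.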

lkeegan force-pushed the cuda branch 3 times, most recently from 08f0f26 to bb33c84 Compare January 10, 2024 14:49

lkeegan force-pushed the cuda branch from bb33c84 to 0c7baae Compare January 10, 2024 15:07

lkeegan merged commit 5919c48 into main Jan 10, 2024
10 checks passed

lkeegan deleted the cuda branch January 10, 2024 15:33
