Add CUDA implementation of distance function for NVIDIA GPUs #111

Closed
lkeegan opened this issue Sep 29, 2023 · 0 comments · Fixed by #123

Labels
enhancement New feature or request

Comments

lkeegan (Member) commented Sep 29, 2023

Possible strategies:

  • naive implementation, assuming everything fits into GPU RAM
    • if we restrict to Linux and sm60+ we can use unified memory, which allows memory oversubscription
    • should be relatively straightforward to implement
    • likely very poor scaling for large datasets
  • assuming only the gene vectors fit into GPU RAM (sketched below)
    • split the distances matrix into n×n sub-matrices, where one sub-matrix fits into GPU RAM along with the gene vectors
    • calculate one sub-matrix at a time & copy each to the CPU when done
    • should scale better
  • assuming even the gene vectors don't fit into GPU RAM
    • also need to block the gene vectors, then sum all contributions
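
To make the second, tiled strategy concrete, here is a minimal NumPy sketch of the blocking structure only. `hamming_tile` is a CPU stand-in for the eventual GPU kernel, and the tile size, function names and generator interface are illustrative assumptions, not the implementation that ended up in #123.

```python
import numpy as np

def hamming_tile(genes: np.ndarray, i0: int, i1: int, j0: int, j1: int) -> np.ndarray:
    # CPU stand-in for the GPU kernel: Hamming distances between
    # genes[i0:i1] and genes[j0:j1], one sub-matrix of the full distances matrix
    return (genes[i0:i1, None, :] != genes[None, j0:j1, :]).sum(axis=2)

def blocked_distances(genes: np.ndarray, tile: int):
    # the full distances matrix never needs to fit in GPU RAM: only one
    # tile x tile sub-matrix exists at a time, and each is handed back to the
    # CPU (here: yielded) before the next one is computed
    n = genes.shape[0]
    for i0 in range(0, n, tile):
        for j0 in range(0, i0 + 1, tile):  # lower triangle only
            i1, j1 = min(i0 + tile, n), min(j0 + tile, n)
            yield (i0, j0), hamming_tile(genes, i0, i1, j0, j1)

# toy usage: 6 genomes with 10 sites each, processed in 4x4 tiles
genes = np.random.randint(0, 4, size=(6, 10))
for (i0, j0), sub in blocked_distances(genes, tile=4):
    print(f"tile at ({i0}, {j0}) has shape {sub.shape}")
```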
lkeegan added the enhancement (New feature or request) label on Sep 29, 2023
lkeegan added a commit that referenced this issue Jan 10, 2024
- python interface changes (usage sketch below)
  - add `from_fasta_to_lower_triangular`
    - this constructs the lower triangular matrix file directly from the fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on the GPU + a 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`
    - if True and include_x is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
- print basic timing info to cout
- add libfmt library
- migrate to catch2 v3
- build wheels using a manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- bump version to 1.0.0
- resolves #111
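
For reference, a rough usage sketch of the Python interface changes listed above. The function and option names come from the commit message; the module name `hammingdist`, the file paths, and the exact call signatures (argument order, `dump_lower_triangular` being a method on the returned object) are assumptions.

```python
import hammingdist  # assumed module name

if hammingdist.cuda_gpu_available():
    # GPU path: write the lower-triangular distances matrix file
    # directly from the fasta file (assumed signature: input, output)
    hammingdist.from_fasta_to_lower_triangular("genomes.fasta", "distances.txt")

    # or compute the in-memory distances matrix on the GPU;
    # use_gpu=True requires include_x=False
    data = hammingdist.from_fasta("genomes.fasta", use_gpu=True, include_x=False)
else:
    # CPU fallback: compute on the CPU, then dump the lower-triangular matrix
    data = hammingdist.from_fasta("genomes.fasta")
    data.dump_lower_triangular("distances.txt")
```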