add CUDA distance implementation #123

Merged Jan 10, 2024 (1 commit)

@lkeegan lkeegan commented Jan 9, 2024

@lkeegan lkeegan force-pushed the cuda branch 3 times, most recently from 08f0f26 to bb33c84 Compare January 10, 2024 14:49
- python interface changes
  - add `from_fasta_to_lower_triangular`
    - constructs a lower triangular matrix file directly from a FASTA file
    - currently only works on the GPU
    - faster and requires less RAM than calling `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5 GB RAM per 100k genomes on the GPU + a 1 GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`
    - if `True` and `include_x` is `False`, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched running on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel results as the next kernel is running
- print basic timing info to cout
- add libfmt library
- migrate to Catch2 v3
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- bump version to 1.0.0
- resolves #111
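
For anyone reading the CUDA implementation notes above, here is a rough CPU-only sketch of the same structure in plain Python. It is not the code in this PR: the function names, the `rows_per_launch` parameter, and the comma-separated one-line-per-row output format are all assumptions made for illustration. `compute_rows` stands in for a single kernel launch over a subset of the matrix, and a single writer thread stands in for the CPU writing the previous results while the next subset is being computed.

```python
from concurrent.futures import ThreadPoolExecutor
import io


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def compute_rows(seqs, start, stop):
    """Stand-in for one kernel launch: rows start..stop-1 of the lower triangle.

    On the GPU each element would be computed by its own block of threads;
    here a plain nested loop does the same work (hypothetical helper).
    """
    return [[hamming(seqs[i], seqs[j]) for j in range(i)] for i in range(start, stop)]


def write_rows(out, rows):
    """Write one comma-separated line per row (assumed output format)."""
    for row in rows:
        out.write(",".join(map(str, row)) + "\n")


def lower_triangular(seqs, out, rows_per_launch=2):
    """Write the lower triangle, overlapping each write with the next computation."""
    with ThreadPoolExecutor(max_workers=1) as writer:
        pending = None
        for start in range(1, len(seqs), rows_per_launch):
            stop = min(start + rows_per_launch, len(seqs))
            # This computation runs while the previous subset is still being written.
            rows = compute_rows(seqs, start, stop)
            if pending is not None:
                pending.result()  # wait for the previous write; re-raises its errors
            pending = writer.submit(write_rows, out, rows)
        if pending is not None:
            pending.result()  # flush the final subset
```

Calling `lower_triangular(seqs, io.StringIO())` with a small list of equal-length strings is enough to try it out. Because of the GIL, pure-Python computation will not overlap much with the writes, so this only illustrates the ordering of the steps, not the speed-up.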
@lkeegan lkeegan merged commit 5919c48 into main Jan 10, 2024
10 checks passed
@lkeegan lkeegan deleted the cuda branch January 10, 2024 15:33
Successfully merging this pull request may close these issues.

Add CUDA implementation of distance function for NVIDIA GPUs