Add CUDA implementation of distance function for NVIDIA GPUs #111
Labels: enhancement (New feature or request)
lkeegan added a commit that referenced this issue on Jan 9, 2024:
- python interface changes:
  - add `from_fasta_to_lower_triangular`: constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
  - add `use_gpu` option to `from_fasta`: if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation:
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
  - print basic timing info to cout
- add libfmt library
- build wheels using manylinux2014 image with CUDA installed from https://github.com/ameli/manylinux-cuda
- resolves #111
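The one-element-per-block decomposition described above can be sketched in plain Python (an illustration of the work layout, not the library's CUDA code): each entry d(i, j) with i > j of the lower triangular distance matrix is an independent Hamming-distance computation, which is what lets each GPU thread block own exactly one element.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def lower_triangular_distances(seqs):
    """Flat lower triangular matrix, row by row: d(1,0), d(2,0), d(2,1), ...
    Each (i, j) pair is an independent unit of work - on the GPU, one
    thread block per element."""
    return [hamming(seqs[i], seqs[j])
            for i in range(1, len(seqs))
            for j in range(i)]

seqs = ["ACGT", "ACGA", "TCGA"]
print(lower_triangular_distances(seqs))  # -> [1, 2, 1]
```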
lkeegan added a commit that referenced this issue on Jan 10, 2024:
- python interface changes:
  - add `from_fasta_to_lower_triangular`: constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on GPU + 2GB buffer to store distances
  - add `use_gpu` option to `from_fasta`: if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation:
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
  - print basic timing info to cout
- add libfmt library
- migrate to v3 of catch2
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- resolves #111
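The 1.5GB-per-100k-genomes figure is consistent with a back-of-envelope estimate, assuming genomes of roughly 30,000 sites bit-packed at two symbols per byte (both figures are assumptions for illustration, not stated in the commit):

```python
def packed_genome_bytes(n_genomes: int, n_sites: int,
                        symbols_per_byte: int = 2) -> int:
    # total bytes to hold n_genomes sequences of n_sites symbols,
    # bit-packed at symbols_per_byte symbols per byte
    return n_genomes * n_sites // symbols_per_byte

# assumed values: 100k genomes of ~30k sites each
gb = packed_genome_bytes(100_000, 30_000) / 1e9
print(f"{gb:.1f} GB")  # -> 1.5 GB
```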
lkeegan added a commit that referenced this issue on Jan 10, 2024:
- python interface changes:
  - add `from_fasta_to_lower_triangular`: constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on GPU + 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`: if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation:
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
  - print basic timing info to cout
- add libfmt library
- migrate to v3 of catch2
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- resolves #111
lkeegan added a commit that referenced this issue on Jan 10, 2024:
- python interface changes:
  - add `from_fasta_to_lower_triangular`: constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on GPU + 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`: if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation:
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
  - print basic timing info to cout
- add libfmt library
- migrate to v3 of catch2
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- resolves #111
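The interleaving of I/O with computation that the commit describes can be sketched in plain Python with a writer thread (a hedged illustration of the pattern, not the library's code; `compute_chunk` and `write_chunk` are hypothetical names): while chunk k+1 is being computed, chunk k is written out, and joining each writer before starting the next keeps the output in order.

```python
import threading

def interleave(chunks, compute_chunk, write_chunk):
    """Compute each chunk, overlapping the write of chunk k
    with the computation of chunk k+1."""
    writer = None
    for chunk in chunks:
        result = compute_chunk(chunk)      # "kernel" for this chunk
        if writer is not None:
            writer.join()                  # wait for previous write
        writer = threading.Thread(target=write_chunk, args=(result,))
        writer.start()                     # write overlaps next compute
    if writer is not None:
        writer.join()

out = []
interleave([1, 2, 3], lambda c: c * c, out.append)
print(out)  # -> [1, 4, 9]
```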
lkeegan added a commit that referenced this issue on Jan 10, 2024:
- python interface changes:
  - add `from_fasta_to_lower_triangular`: constructs a lower triangular matrix file directly from a fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on GPU + 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`: if True and `include_x` is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation:
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
  - print basic timing info to cout
- add libfmt library
- migrate to v3 of catch2
- build wheels using manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- bump version to 1.0.0
- resolves #111
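As an illustration of what a "lower triangular matrix file" means here, a minimal plain-Python sketch (the comma-separated row format is an assumption, not taken from the library): row i of the file holds the distances from genome i to genomes 0..i-1, so the symmetric square matrix is never materialised.

```python
import io

def dump_lower_triangular(rows, stream):
    # rows[i] holds the distances d(i+1, 0), ..., d(i+1, i);
    # each row is written as one comma-separated line (assumed format)
    for row in rows:
        stream.write(",".join(map(str, row)) + "\n")

buf = io.StringIO()
dump_lower_triangular([[1], [2, 1]], buf)  # 3 genomes -> 2 rows
print(buf.getvalue(), end="")
```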
possible strategies