Add CUDA implementation of distance function for NVIDIA GPUs #111

Closed
lkeegan opened this issue Sep 29, 2023 · 0 comments · Fixed by #123

Labels
enhancement New feature or request

Comments

lkeegan (Member) commented Sep 29, 2023

Possible strategies:

  • naive implementation, assuming everything fits into GPU RAM
    • if we restrict to Linux and sm60+ we can use unified memory, which allows memory oversubscription
    • should be relatively straightforward to implement
    • likely very poor scaling for large datasets
  • assuming only the gene vectors fit into GPU RAM (sketched below)
    • split the distances matrix into n×n sub-matrices, where one sub-matrix fits into GPU RAM along with the gene vectors
    • calculate one sub-matrix at a time & copy each to the CPU when done
    • should scale better
  • assuming even the gene vectors don't fit into GPU RAM
    • also need to block the gene vectors, then sum all contributions
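
To make the second, tiled strategy concrete, here is a minimal NumPy sketch of the blocking structure only. `hamming_tile` is a CPU stand-in for the eventual GPU kernel, and the tile size, function names and generator interface are illustrative assumptions, not the implementation that ended up in #123.

```python
import numpy as np

def hamming_tile(genes: np.ndarray, i0: int, i1: int, j0: int, j1: int) -> np.ndarray:
    # CPU stand-in for the GPU kernel: Hamming distances between
    # genes[i0:i1] and genes[j0:j1], one sub-matrix of the full distances matrix
    return (genes[i0:i1, None, :] != genes[None, j0:j1, :]).sum(axis=2)

def blocked_distances(genes: np.ndarray, tile: int):
    # the full distances matrix never needs to fit in GPU RAM: only one
    # tile x tile sub-matrix exists at a time, and each is handed back to the
    # CPU (here: yielded) before the next one is computed
    n = genes.shape[0]
    for i0 in range(0, n, tile):
        for j0 in range(0, i0 + 1, tile):  # lower triangle only
            i1, j1 = min(i0 + tile, n), min(j0 + tile, n)
            yield (i0, j0), hamming_tile(genes, i0, i1, j0, j1)

# toy usage: 6 genomes with 10 sites each, processed in 4x4 tiles
genes = np.random.randint(0, 4, size=(6, 10))
for (i0, j0), sub in blocked_distances(genes, tile=4):
    print(f"tile at ({i0}, {j0}) has shape {sub.shape}")
```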
lkeegan added the enhancement (New feature or request) label on Sep 29, 2023
lkeegan added a commit that referenced this issue Jan 10, 2024
- python interface changes (usage sketch below)
  - add `from_fasta_to_lower_triangular`
    - this constructs the lower triangular matrix file directly from the fasta file
    - for now only works on GPU
    - faster & requires less RAM than doing `from_fasta` followed by `dump_lower_triangular`
    - requires 1.5GB RAM per 100k genomes on the GPU + a 1GB buffer to store the partial distances matrix
  - add `use_gpu` option to `from_fasta`
    - if True and include_x is False, the GPU is used to calculate the distances matrix
  - add `cuda_gpu_available()` utility function
- CUDA implementation
  - each block of threads calculates a single element of the distances matrix
  - a kernel is launched on a grid of these blocks to calculate a subset of the distances matrix
  - I/O is interleaved with computation: the CPU writes the previous kernel's results while the next kernel is running
- print basic timing info to cout
- add libfmt library
- migrate to catch2 v3
- build wheels using a manylinux2014 image with CUDA 11.8 pre-installed from https://github.com/ameli/manylinux-cuda
- add a couple of performance plots
- bump version to 1.0.0
- resolves #111
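
For reference, a rough usage sketch of the Python interface changes listed above. The function and option names come from the commit message; the module name `hammingdist`, the file paths, and the exact call signatures (argument order, `dump_lower_triangular` being a method on the returned object) are assumptions.

```python
import hammingdist  # assumed module name

if hammingdist.cuda_gpu_available():
    # GPU path: write the lower-triangular distances matrix file
    # directly from the fasta file (assumed signature: input, output)
    hammingdist.from_fasta_to_lower_triangular("genomes.fasta", "distances.txt")

    # or compute the in-memory distances matrix on the GPU;
    # use_gpu=True requires include_x=False
    data = hammingdist.from_fasta("genomes.fasta", use_gpu=True, include_x=False)
else:
    # CPU fallback: compute on the CPU, then dump the lower-triangular matrix
    data = hammingdist.from_fasta("genomes.fasta")
    data.dump_lower_triangular("distances.txt")
```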