Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hillst/bionemo noodles #458

Merged
merged 41 commits into from
Nov 28, 2024
Merged

hillst/bionemo noodles #458

merged 41 commits into from
Nov 28, 2024

Conversation

skothenhill-nv
Copy link
Collaborator

@skothenhill-nv skothenhill-nv commented Nov 20, 2024

NvFaidx

Adds a memmapped faidx reader with python bindings.

Context

PyFaidx or any buffered read based index is not process safe, and therefore does not play nice with pytorch dataloaders.

Due to the order of operations, the underlying file handle is shared between processes, when seek() is called to perform random lookups, this can cause unexpected behavior in the forked processes.
Ref: mdshw5/pyfaidx#211

For a good solution we need three things:
1) Safe index creation, in multi-process or multi-node scenarios, this should be restricted to a single node where all workers block until it is complete (not implemented above)
2) Index object instantion must be fast.
3) Read-only use of the index object must be both thread safe and process safe with python.

Memmap backing

Using a backend of Memmaps provide the following benefits (Credit: @malcolmgreaves)

  • The kernel implements this mechanism by using page faults
  • Each read in a mmap'd file results in a page fault: there's nothing in memory to read!
  • The kernel handles this page fault by going to the disk, reading the file in the specified offset + index, then returning to the user process with what it just read, preventing penalties from context switching.

These three usecases are ideal for pytorch DataLoaders!

Perf Benchmarks

nvfaidx query faster by 5.2x
nvfaidx instantiation faster by 16x

Metric pyfaidx (s) nvfaidx (s)
Query/s 1.668e-05 3.239e-06
Instantiation Time 0.01001 0.0006313

TODO:

  • Dockerfile needs to be reworked to play nice with cargo.

Usage:

Import NvFaidx, pass in your fasta file path, and go fast.

from bionemo.noodles import NvFaidx
index = NvFaidx('sub-packages/bionemo-noodles/tests/bionemo/noodles/data/sample.fasta')
assert index['chr1'][0] == 'A'
assert index['chr1'][1:4] == 'CTG'
assert index['chr1'][1:10000] == 'CTGACTGACTG'
assert index['chr1'][0:-1] == 'ACTGACTGACT'
assert index['chr1'][:-1] == 'ACTGACTGACT'
assert index['chr1'][-1:] == 'G'

malcolmgreaves and others added 10 commits November 14, 2024 17:22
Initial commit!

- adds the PyO3 wrapper around noodles-fasta
- adds a python class that mimics the dict-like behavior of pyfaidx
- adds a long test (best used with hg38
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/), or use
whatever big reference.

Build:
```
cd sub-packages/bionemo-noodles/noodles_fasta_wrapper
maturin develop
```
This should install it into your local python environment, then you can
run the associated tests!

TODO:
- flesh out equality tests with pyfaidx
- profile performance
- get the build system working correctly
Dockerfile Outdated Show resolved Hide resolved
Annotated dockerfile with places we should change
expose the rust reader as a part of __init__.py
@skothenhill-nv skothenhill-nv changed the base branch from mg/bionemo_noodles_pyO3 to main November 22, 2024 23:24
Copy link
Collaborator

@malcolmgreaves malcolmgreaves left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First parital review.

Dockerfile Outdated Show resolved Hide resolved
skothenhill-nv and others added 3 commits November 25, 2024 16:42
Co-authored-by: Malcolm Greaves <malcolmgreaves@users.noreply.github.com>
Signed-off-by: Steven Kothen-Hill <148821680+skothenhill-nv@users.noreply.github.com>
Dockerfile Show resolved Hide resolved
Dockerfile Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Show resolved Hide resolved
Copy link
Collaborator

@edawson edawson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes requested, but overall in good shape. Recommend refactoring PyRecord to a more descriptive name, adding some small tests, and cleaning up a little branchiness. I also noticed a lot of errors get pushed down to the rust class - is this desired?

@skothenhill-nv
Copy link
Collaborator Author

/build-ci

@skothenhill-nv skothenhill-nv enabled auto-merge (squash) November 27, 2024 23:04
@skothenhill-nv
Copy link
Collaborator Author

/build-ci

@skothenhill-nv
Copy link
Collaborator Author

/build-ci

1 similar comment
@skothenhill-nv
Copy link
Collaborator Author

/build-ci

@skothenhill-nv
Copy link
Collaborator Author

/build-ci

@skothenhill-nv skothenhill-nv merged commit 47ca4a5 into main Nov 28, 2024
4 checks passed
@skothenhill-nv skothenhill-nv deleted the hillst/bionemo_noodles branch November 28, 2024 05:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants