hillst/bionemo noodles #458
Initial commit!
- adds the PyO3 wrapper around noodles-fasta
- adds a python class that mimics the dict-like behavior of pyfaidx
- adds a long test (best used with hg38 https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/), or use whatever big reference.

Build:
```
cd sub-packages/bionemo-noodles/noodles_fasta_wrapper
maturin develop
```
This should install it into your local python environment, then you can run the associated tests!

TODO:
- flesh out equality tests with pyfaidx
- profile performance
- get the build system working correctly
…ramework into mg/bionemo_noodles_pyO3
on process safety.
Annotated dockerfile with places we should change
expose the rust reader as a part of __init__.py
First partial review.
sub-packages/bionemo-noodles/src/bionemo/noodles/test_nvfaidx.py
… method to create faidx objects.
Co-authored-by: Malcolm Greaves <malcolmgreaves@users.noreply.github.com>
Signed-off-by: Steven Kothen-Hill <148821680+skothenhill-nv@users.noreply.github.com>
Some changes requested, but overall in good shape. Recommend refactoring PyRecord to a more descriptive name, adding some small tests, and cleaning up a little branchiness. I also noticed a lot of errors get pushed down to the rust class - is this desired?
sub-packages/bionemo-noodles/tests/bionemo/noodles/test_nvfaidx.py
/build-ci |
NvFaidx
Adds a memory-mapped (memmap) faidx reader with Python bindings.
Context
PyFaidx, or any index built on buffered reads, is not process safe and therefore does not play well with PyTorch DataLoaders.
Due to the order of operations, the underlying file handle is shared between processes. When `seek()` is called to perform random lookups, this can cause unexpected behavior in the forked processes.
Ref: mdshw5/pyfaidx#211
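The hazard described above can be reproduced without PyFaidx. The following is a minimal sketch (POSIX only, since it uses `os.fork`), not code from this PR. It shows that a file descriptor inherited across a fork shares one offset, so a seek in the child moves the parent's position too. The function name and the sample FASTA bytes are illustrative assumptions.

```python
import os
import tempfile


def offset_seen_by_parent_after_child_seek() -> int:
    """Return the parent's file offset after a forked child seeks the same fd."""
    # Hypothetical sample data, written to a throwaway file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b">chr1\nACGTACGTAC\n")
        path = tmp.name

    fd = os.open(path, os.O_RDONLY)  # offset starts at 0
    try:
        pid = os.fork()
        if pid == 0:
            # Child: move the offset of the inherited descriptor, then exit
            # immediately without running the parent's cleanup code.
            try:
                os.lseek(fd, 7, os.SEEK_SET)
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        # Parent: query the current offset without moving it. The parent never
        # seeked, yet the offset reflects the child's lseek call.
        return os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.close(fd)
        os.unlink(path)
```

Readers that address bytes by absolute position (for example `os.pread` or a memory map) do not depend on this shared offset, which is the motivation for the approach taken here.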
For a good solution we need three things:
1) Safe index creation. In multi-process or multi-node scenarios, this should be restricted to a single node, with all workers blocking until it is complete (not implemented above).
2) Index object instantiation must be fast.
3) Read-only use of the index object must be both thread safe and process safe in Python.
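Requirement 1 is listed as not implemented, so the following is only one possible sketch for the single-node case, not the approach used in this PR. It assumes POSIX `fcntl.flock` for blocking and `os.replace` for atomic publication. `ensure_index` and the `build` callback are hypothetical names, and advisory locks are not reliable across nodes on many network file systems, so a multi-node setup would need a different barrier.

```python
import fcntl
import os
from typing import Callable


def ensure_index(index_path: str, build: Callable[[str], None]) -> None:
    """Build the index at most once; concurrent callers block until it exists.

    `build` is a hypothetical callback that writes the index to the path it
    is given.
    """
    lock_path = index_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks while another process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not os.path.exists(index_path):
                tmp_path = f"{index_path}.tmp.{os.getpid()}"
                build(tmp_path)
                # Atomic rename: readers never observe a half-written index.
                os.replace(tmp_path, index_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Every worker calls `ensure_index` before opening the index. The first one to take the lock builds it, and the rest find it already present once they acquire the lock.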
Memmap backing
Using memmaps as the backend provides the following benefits (Credit: @malcolmgreaves)
These three use cases are ideal for PyTorch DataLoaders!
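To illustrate the idea of memmap-backed lookups, here is a pure-Python sketch using the standard `mmap` module. It is not the Rust/noodles implementation in this PR. `FaiEntry`, `build_index`, and `MmapFasta` are hypothetical names, coordinates are assumed to be zero-based and half-open, and the FASTA file is assumed to be well formed with a uniform line width per record.

```python
import mmap
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FaiEntry:
    length: int      # number of bases in the record
    offset: int      # byte offset of the record's first base
    line_bases: int  # bases per full line
    line_width: int  # bytes per full line, including the line terminator


def build_index(path: str) -> Dict[str, FaiEntry]:
    """Scan the FASTA file once and compute .fai-style entries."""
    entries: Dict[str, FaiEntry] = {}
    name, pos = None, 0
    length = offset = line_bases = line_width = 0
    with open(path, "rb") as fh:
        for line in fh:
            if line.startswith(b">"):
                if name is not None:
                    entries[name] = FaiEntry(length, offset, line_bases, line_width)
                name = line[1:].split()[0].decode()
                offset = pos + len(line)
                length = line_bases = line_width = 0
            elif name is not None:
                bases = len(line.rstrip(b"\r\n"))
                if line_bases == 0:
                    line_bases, line_width = bases, len(line)
                length += bases
            pos += len(line)
    if name is not None:
        entries[name] = FaiEntry(length, offset, line_bases, line_width)
    return entries


class MmapFasta:
    """Read-only FASTA access through a memory map; no seek() is involved."""

    def __init__(self, path: str) -> None:
        self._index = build_index(path)
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def fetch(self, name: str, start: int, stop: int) -> str:
        """Return bases [start, stop) of record `name` (zero-based, half-open)."""
        e = self._index[name]
        start, stop = max(0, start), min(stop, e.length)
        if start >= stop:
            return ""
        # Translate base coordinates to byte positions, skipping line terminators.
        first = e.offset + (start // e.line_bases) * e.line_width + start % e.line_bases
        end = stop - 1
        last = e.offset + (end // e.line_bases) * e.line_width + end % e.line_bases + 1
        raw = self._mm[first:last]  # absolute slice, independent of any file offset
        return raw.replace(b"\n", b"").replace(b"\r", b"").decode()

    def close(self) -> None:
        self._mm.close()
        self._file.close()
```

Because every lookup is an absolute slice of the mapping, no shared cursor is mutated, which is why this style of reader suits DataLoader workers.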
Perf Benchmarks
nvfaidx query faster by 5.2x
nvfaidx instantiation faster by 16x
TODO:
Usage:
Import NvFaidx, pass in your fasta file path, and go fast.