advice on deserializing cram from htsget #3

Open
cmdoret opened this issue Jul 11, 2024 · 2 comments
Comments

cmdoret commented Jul 11, 2024

Hey!

I've stumbled upon this project as we've started using your excellent htsget-rs server implementation :)

Now that we're trying to lazily consume the stream of CRAM/BCF on the client side (in python using a file-like interface), I think we're running into limitations of pysam, as it can only parse records from the filesystem (pysam-developers/pysam#1297).
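To make the "file-like interface" part concrete, here is a minimal pure-Python sketch that adapts any iterator of byte chunks to a read-only, non-seekable file object. `IterStream` and `open_chunks` are hypothetical names made up for illustration; they are not part of pysam, noodles, or htsget-rs, and the sketch does not parse CRAM/BCF at all.

```python
import io
from typing import Iterable, Iterator


class IterStream(io.RawIOBase):
    """Read-only, non-seekable file-like view over an iterator of byte chunks."""

    def __init__(self, chunks: Iterable[bytes]) -> None:
        super().__init__()
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull the next non-empty chunk only once the previous one is used up,
        # so the underlying stream is consumed lazily.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # end of stream
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def open_chunks(chunks: Iterable[bytes]) -> io.BufferedReader:
    """Wrap the raw stream so callers get the usual buffered read() behaviour."""
    return io.BufferedReader(IterStream(chunks))
```

A parser that accepts an arbitrary file object could read from `open_chunks(...)` directly; a parser that insists on a real file descriptor or a seekable file would still not work with it.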

I was thinking of (trying to) make python mappings for noodles-htsget to parse the stream lazily on the client side. As you've apparently been working on the problem, I would be interested to hear your thoughts on the matter.

The aim would be essentially to get a python library that exposes a lazy iterator over the htsget stream. In spirit, very similar (I think) to what you started in this repo, maybe some simple interface like:

con = HtsgetConnection.from_url(
  'http://localhost:8080/reads/file?format=CRAM&referenceName=chr1&start=103&end=1320'
)
with con.open() as stream:
  for record in stream:
    print(record.start)
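
For context, an htsget endpoint does not return the reads directly: it returns a JSON ticket whose `htsget.urls` list must be fetched in order and concatenated. Below is a rough sketch of how the byte-level part of such a connection could be made lazy, using only the standard library. `fetch_ticket` and `iter_ticket_chunks` are hypothetical helper names, and turning the resulting bytes into records is left out.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Iterator


def fetch_ticket(url: str) -> dict:
    """Download and decode the htsget JSON ticket (performs a network request)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_ticket_chunks(ticket: dict, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the payload described by a ticket, one block at a time, in order."""
    for entry in ticket["htsget"]["urls"]:
        url = entry["url"]
        if url.startswith("data:"):
            # Inline block: decode it locally, no request needed.
            header, _, payload = url.partition(",")
            if header.endswith(";base64"):
                yield base64.b64decode(payload)
            else:
                yield urllib.parse.unquote_to_bytes(payload)
        else:
            # Remote block: the request is only issued when the consumer gets here.
            req = urllib.request.Request(url, headers=entry.get("headers", {}))
            with urllib.request.urlopen(req) as resp:
                while chunk := resp.read(chunk_size):
                    yield chunk
```

Because this is a generator, no block is downloaded until the consumer asks for it, which is the property the record iterator above would build on.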
brainstorm added a commit that referenced this issue Jul 15, 2024
…l test files and test out the iterator example for CRAM asked in #3
brainstorm (Member) commented Jul 15, 2024

Hi @cmdoret, great question!

I was planning to put together a Rust noodles CRAM+Crypt4GH iterator example for you, but then I realised that perhaps the Python side is more important to you? If that's the case, I'd look into the noodles-htsget crate and PyO3/maturin. Here are some resources:

https://pyo3.github.io/pyo3/v0.20.0/getting_started.html

I don't have plans to put together and support those Python bindings myself, but do keep me posted, totally interested and happy to help if you get stuck!

Thanks again for poking into that repo, reminded me I should tilt it back and move it forward to completion :)

/cc @mmalenic

cmdoret (Author) commented Jul 17, 2024

Hi @brainstorm, thanks for your answer!

I've started something at https://github.com/cmdoret/htslurp to try and make python bindings.
I got the server to stream bytes from rust to python, but I can't yet figure out how to get a Record iterator on that stream in the noodles api. If you have an idea, an example would be very welcome :))

Then my plan was to make a struct that wraps noodles' CRAM/BCF records and defines python mappings.
My impression is that it's probably easier to implement everything in Rust and only expose to Python:

  • a function that takes the htsget URL
  • a record iterator

This would minimize the interface between the two languages, and it also avoids passing the async client to Python, which is hard.
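
As a Python-only analogy for hiding the async client behind a plain iterator, the sketch below drives an async iterator from synchronous code by running one step of a private event loop per item. `iterate_sync` is a hypothetical helper; a Rust extension would instead block on its own async runtime inside the iterator's next method, which is not shown here.

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the async stream


def iterate_sync(async_iter: AsyncIterator[T]) -> Iterator[T]:
    """Expose an async iterator as a regular, blocking iterator."""
    loop = asyncio.new_event_loop()

    async def step():
        # Translate end-of-stream into a sentinel so it crosses the loop cleanly.
        try:
            return await async_iter.__anext__()
        except StopAsyncIteration:
            return _DONE

    try:
        while True:
            item = loop.run_until_complete(step())
            if item is _DONE:
                break
            yield item
    finally:
        # Finalize any async generators that were left half-consumed.
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()
```

The caller only ever sees a normal `for` loop, so nothing async leaks across the language boundary; the cost is that each item blocks the calling thread while it is fetched.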

I guess from python it could look something like (not yet implemented):

import htslurp
iterator = htslurp.stream('https://localhost/htsget/reads/file?format=CRAM')
for rec in iterator:
  type(rec) # -> htslurp.AlignmentRecord

where htslurp.AlignmentRecord would have ~ the same fields as noodles::cram::Record
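
To give an idea of the shape on the Python side, here is a pure-Python mock of such a record type and iterator. The field names are placeholders chosen for illustration and have not been checked against `noodles::cram::Record`; `RecordIterator` is equally hypothetical and just converts pre-parsed tuples.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[int], Optional[int], str]


@dataclass(frozen=True)
class AlignmentRecord:
    # Placeholder fields, not the actual noodles record layout.
    name: Optional[str]
    reference_name: Optional[str]
    start: Optional[int]
    mapping_quality: Optional[int]
    sequence: str


class RecordIterator:
    """Iterator protocol the extension module would need to implement."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: Iterator[Row] = iter(rows)

    def __iter__(self) -> "RecordIterator":
        return self

    def __next__(self) -> AlignmentRecord:
        # StopIteration from the underlying iterator ends the for-loop as usual.
        return AlignmentRecord(*next(self._rows))
```

Keeping the record an immutable value object means Python code never holds a reference into Rust-owned buffers, at the price of copying each record once.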

Does this approach make sense to you?
