Considering other libraries for models, is Disq too tightly bound to htsjdk? #87

Open
heuermh opened this issue Feb 18, 2019 · 2 comments

@heuermh
Contributor

heuermh commented Feb 18, 2019

As a thought exercise (I'm not suggesting we actually implement these suggestions at this time), how might Disq be extended to provide distributed RDDs for models from libraries other than htsjdk?

E.g. RDD<SamRecord> from
https://github.com/heuermh/dishevelled-bio/tree/master/alignment/src/main/java/org/dishevelled/bio/alignment/sam

RDD<VcfRecord> from
https://github.com/heuermh/dishevelled-bio/tree/master/variant/src/main/java/org/dishevelled/bio/variant/vcf

RDD<Fastq> from
https://github.com/biojava/biojava/tree/master/biojava-genome/src/main/java/org/biojava/nbio/genome/io/fastq

etc.

Are there changes we can make to the public APIs to make this possible? Note that even non-htsjdk Disq APIs such as ReadsFormatWriteOption are tightly bound to htsjdk.

Is there any implementation code that can be reused to support these other libraries?

@tomwhite
Member

Good questions. Definitely something to think about, especially as we might want to support htsjdk 3 as well one day.

Disq's public API has quite a small surface area, so I wonder how easy it would be to try an experiment where a different model is exposed, mapped to/from htsjdk types to start with. So it would use the Disq htsjdk implementation classes, but convert between htsjdk types and another model (such as dishevelled), so the user only sees the other model.
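A minimal sketch of what such a mapping layer could look like, in plain Java so it runs without Spark or htsjdk on the classpath. Every name here (`BackendRead`, `OtherRead`, `MappedReads`) is a hypothetical stand-in invented for illustration; none of them are Disq, htsjdk, or dishevelled classes. In an actual experiment the same pair of conversion functions would presumably be applied with a `map` over the RDD rather than over a `List`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a backend (htsjdk-style) record; NOT the real htsjdk SAMRecord.
final class BackendRead {
    final String name;
    final String bases;

    BackendRead(String name, String bases) {
        this.name = name;
        this.bases = bases;
    }
}

// Hypothetical stand-in for a record type from another model library.
final class OtherRead {
    final String id;
    final String sequence;

    OtherRead(String id, String sequence) {
        this.id = id;
        this.sequence = sequence;
    }
}

// Facade generic over the backend type B and the user-facing model type M.
// Callers only ever see M; B stays an implementation detail.
final class MappedReads<B, M> {
    private final Function<B, M> toModel;
    private final Function<M, B> toBackend;

    MappedReads(Function<B, M> toModel, Function<M, B> toBackend) {
        this.toModel = toModel;
        this.toBackend = toBackend;
    }

    // Convert records produced by the backend reader into the exposed model.
    List<M> afterRead(List<B> backendRecords) {
        List<M> out = new ArrayList<>(backendRecords.size());
        for (B record : backendRecords) {
            out.add(toModel.apply(record));
        }
        return out;
    }

    // Convert user-supplied model records back before handing them to the backend writer.
    List<B> beforeWrite(List<M> modelRecords) {
        List<B> out = new ArrayList<>(modelRecords.size());
        for (M record : modelRecords) {
            out.add(toBackend.apply(record));
        }
        return out;
    }
}
```

The cost of this design is one extra object conversion per record in each direction, and any field that the exposed model cannot represent is lost on a round trip.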

Reusing the implementation code in Disq is harder, since it is closely entwined with htsjdk classes. There may be some parts that would be valuable to share or reuse though, like BgzfBlockGuesser and BamRecordGuesser: although they use htsjdk, they generally depend not on the model classes but on utility classes like SeekableStream and BlockCompressedInputStream.
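To illustrate why this kind of code needs no model classes, here is a rough sketch of the header check at the core of finding a BGZF block boundary, written against a plain byte array with no htsjdk dependency. This is not Disq's BgzfBlockGuesser; `BgzfHeaderCheck` and `blockSizeAt` are hypothetical names, and the sketch assumes the `BC` extra subfield comes first in the gzip extra field, which the BGZF format permits but does not strictly require.

```java
// Hypothetical helper, not part of Disq or htsjdk.
final class BgzfHeaderCheck {
    // Fixed gzip header (12 bytes) plus the 6-byte "BC" extra subfield.
    static final int HEADER_LENGTH = 18;

    private BgzfHeaderCheck() {
    }

    // Read an unsigned 16-bit little-endian value.
    private static int u16(byte[] buf, int pos) {
        return (buf[pos] & 0xff) | ((buf[pos + 1] & 0xff) << 8);
    }

    // Returns the total block size in bytes if a BGZF header starts at offset, else -1.
    static int blockSizeAt(byte[] buf, int offset) {
        if (offset < 0 || buf.length - offset < HEADER_LENGTH) {
            return -1;
        }
        // gzip magic bytes ID1 and ID2
        if ((buf[offset] & 0xff) != 0x1f || (buf[offset + 1] & 0xff) != 0x8b) {
            return -1;
        }
        // CM = 8 (deflate), FLG = 4 (FEXTRA set)
        if (buf[offset + 2] != 8 || buf[offset + 3] != 4) {
            return -1;
        }
        // XLEN must leave room for the 6-byte BC subfield
        if (u16(buf, offset + 10) < 6) {
            return -1;
        }
        // Assumption: the BC subfield is the first extra subfield
        if (buf[offset + 12] != 'B' || buf[offset + 13] != 'C') {
            return -1;
        }
        // SLEN = 2: the subfield payload is the 2-byte BSIZE
        if (u16(buf, offset + 14) != 2) {
            return -1;
        }
        // BSIZE stores the total block size minus one
        return u16(buf, offset + 16) + 1;
    }
}
```

A header match alone can be a false positive, because the same bytes may occur inside compressed data, so a real guesser would need further validation, for example checking that another header follows at `offset + blockSize`.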

@heuermh
Contributor Author

heuermh commented Feb 19, 2019

> Disq's public API has quite a small surface area, so I wonder how easy it would be to try an experiment where a different model is exposed, mapped to/from htsjdk types to start with. So it would use the Disq htsjdk implementation classes, but convert between htsjdk types and another model (such as dishevelled), so the user only sees the other model.

That is the approach taken in ADAM

https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/converters

and in downstream libraries

https://github.com/bigdatagenomics/convert
https://github.com/heuermh/dishevelled-bio/tree/master/convert
https://github.com/heuermh/dishevelled-bio/tree/master/convert-htsjdk
https://github.com/heuermh/biojava-adam/tree/master/src/main/java/org/biojava/nbio/adam/convert
https://github.com/heuermh/biojava-legacy-adam/tree/master/src/main/java/org/biojava/adam/convert

I haven't been following the htsjdk 3 conversation, so I don't know where that might be headed. I do know that working with the current htsjdk model APIs is rather frustrating, especially around attributes. Looking forward, in ADAM we're likely to skip the two-step path of parsing into model classes in Java/Scala and then converting, and instead read straight into Dataset/DataFrame with Spark SQL.

> Reusing the implementation code in Disq is harder, since it is closely entwined with htsjdk classes. There may be some parts that it would be valuable to share or reuse though, like BgzfBlockGuesser and BamRecordGuesser, which although they use htsjdk, it's generally not the model classes, but utility classes like SeekableStream and BlockCompressedInputStream.

I've suggested elsewhere that it would be nice to have BGZF-related functionality pulled from htsjdk and donated to Apache Commons IO. I've done some adapting here, but it still depends on the whole htsjdk jar.
