Considering other libraries for models, is Disq too tightly bound to htsjdk? #87
Comments
Good questions. Definitely something to think about, especially as we might want to support an htsjdk 3 as well one day. Disq's public API has quite a small surface area, so I wonder how easy it would be to try an experiment where a different model is exposed, mapped to/from htsjdk types to start with. It would use the Disq htsjdk implementation classes but convert between htsjdk types and another model (such as dishevelled), so the user only sees the other model. Reusing the implementation code in Disq is harder, since it is closely entwined with htsjdk classes. There may be some parts that it would be valuable to share or reuse though, like
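To make the "mapped to/from htsjdk types" experiment concrete, here is a minimal sketch of what such a conversion layer could look like. Everything in it is hypothetical: `HtsLikeRead` and `OtherModelRead` are stand-ins rather than real htsjdk or dishevelled classes, `ModelConverter` and `ConvertingSource` are not part of Disq, and a plain `List` stands in for the distributed collection.

```java
import java.util.List;
import java.util.stream.Collectors;

// Stand-in for an htsjdk-side record (hypothetical; not the real htsjdk type).
final class HtsLikeRead {
    final String name;
    final int start;
    HtsLikeRead(String name, int start) { this.name = name; this.start = start; }
}

// Stand-in for a record from another model library (hypothetical).
final class OtherModelRead {
    final String qname;
    final int pos;
    OtherModelRead(String qname, int pos) { this.qname = qname; this.pos = pos; }
}

// Two-way mapping between the htsjdk-side type H and the user-facing type M.
interface ModelConverter<H, M> {
    M toModel(H htsRecord);
    H fromModel(M modelRecord);
}

final class ReadConverter implements ModelConverter<HtsLikeRead, OtherModelRead> {
    @Override
    public OtherModelRead toModel(HtsLikeRead r) { return new OtherModelRead(r.name, r.start); }

    @Override
    public HtsLikeRead fromModel(OtherModelRead r) { return new HtsLikeRead(r.qname, r.pos); }
}

// Facade: the internals keep producing H, the caller only ever sees M.
final class ConvertingSource<H, M> {
    private final ModelConverter<H, M> converter;

    ConvertingSource(ModelConverter<H, M> converter) { this.converter = converter; }

    // A List stands in here for the distributed collection (an RDD in Disq).
    List<M> read(List<H> htsRecords) {
        return htsRecords.stream().map(converter::toModel).collect(Collectors.toList());
    }
}
```

The trade-off of this shape is that every record is still materialised as the htsjdk-side type before being converted, so it hides htsjdk from the user without removing the dependency or the parsing cost.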
That is the approach taken in ADAM and in downstream libraries (https://github.com/bigdatagenomics/convert). I haven't been following the htsjdk 3 conversation, so I don't know where that might be headed. I do know that working with the current htsjdk model APIs is rather frustrating, especially around attributes. Looking forward, in ADAM we're likely to bypass model parsing in Java/Scala followed by conversion altogether, and go straight to Dataset/DataFrame with Spark SQL.
I've suggested elsewhere that it would be nice to have BGZF-related functionality pulled from htsjdk and donated to Apache Commons IO. I've done some adapting here, but it still depends on the whole htsjdk jar.
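As an illustration of how little of BGZF actually needs htsjdk, here is a minimal sketch that reads the total block size from a BGZF block header using only the JDK. It follows the block layout described in the SAM/BAM specification (a gzip header with the FEXTRA flag and a `BC` extra subfield holding BSIZE, the block size minus one). The class and method names are hypothetical, and this is not the htsjdk implementation or a proposed Commons IO API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper; depends only on the JDK.
final class BgzfHeaderSketch {

    // Returns the total size in bytes of the BGZF block starting at buf[0],
    // or -1 if the bytes do not look like a BGZF block header.
    static int blockSize(byte[] buf) {
        // 12-byte fixed gzip header plus the 6-byte BC subfield.
        if (buf.length < 18) {
            return -1;
        }
        ByteBuffer b = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        boolean gzipMagic = (b.get(0) & 0xff) == 31 && (b.get(1) & 0xff) == 139;
        boolean deflate = b.get(2) == 8;
        boolean hasExtra = (b.get(3) & 4) != 0; // FEXTRA flag
        if (!gzipMagic || !deflate || !hasExtra) {
            return -1;
        }
        int xlen = b.getShort(10) & 0xffff;
        int end = Math.min(12 + xlen, buf.length);
        int pos = 12;
        // Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
        while (pos + 4 <= end) {
            int si1 = b.get(pos) & 0xff;
            int si2 = b.get(pos + 1) & 0xff;
            int slen = b.getShort(pos + 2) & 0xffff;
            if (si1 == 'B' && si2 == 'C' && slen == 2 && pos + 6 <= end) {
                // BSIZE stores the total block size minus 1.
                return (b.getShort(pos + 4) & 0xffff) + 1;
            }
            pos += 4 + slen;
        }
        return -1;
    }
}
```

Knowing the block size is enough to step from block to block and hand each one to `java.util.zip.Inflater`, which is the part of BGZF handling that splittable readers depend on.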
As a thought exercise (I'm not suggesting we actually implement these suggestions at this time), how might Disq be extended to provide distributed RDDs for models from libraries other than htsjdk?

E.g.
RDD<SamRecord> from https://github.com/heuermh/dishevelled-bio/tree/master/alignment/src/main/java/org/dishevelled/bio/alignment/sam
RDD<VcfRecord> from https://github.com/heuermh/dishevelled-bio/tree/master/variant/src/main/java/org/dishevelled/bio/variant/vcf
RDD<Fastq> from https://github.com/biojava/biojava/tree/master/biojava-genome/src/main/java/org/biojava/nbio/genome/io/fastq
etc.

Are there changes we can make to the public APIs to make this possible? Note that even non-htsjdk Disq APIs such as ReadsFormatWriteOption are tightly bound to htsjdk.

Is there any implementation code that can be reused to support these other libraries?
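One possible shape for such an API change, sketched purely as a thought exercise: parameterise the read/write entry points by record type and push all model knowledge into a pluggable codec. The names below (`RecordCodec`, `NamedRead`, `TabCodec`, `GenericReadsStorage`) are hypothetical and do not exist in Disq, a `List` stands in for an RDD, and the two-column tab-separated line is a toy format rather than SAM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical codec: how one text line becomes a record of type T and back.
interface RecordCodec<T> {
    T decode(String line);
    String encode(T record);
}

// Hypothetical minimal record; real models carry far more fields.
final class NamedRead {
    final String name;
    final String sequence;
    NamedRead(String name, String sequence) { this.name = name; this.sequence = sequence; }
}

// Toy codec for "name<TAB>sequence" lines.
final class TabCodec implements RecordCodec<NamedRead> {
    @Override
    public NamedRead decode(String line) {
        String[] fields = line.split("\t", -1);
        return new NamedRead(fields[0], fields[1]);
    }

    @Override
    public String encode(NamedRead record) {
        return record.name + "\t" + record.sequence;
    }
}

// Storage entry points that never mention a concrete model type.
final class GenericReadsStorage {
    static <T> List<T> read(List<String> lines, RecordCodec<T> codec) {
        List<T> records = new ArrayList<>();
        for (String line : lines) {
            records.add(codec.decode(line));
        }
        return records;
    }

    static <T> List<String> write(List<T> records, RecordCodec<T> codec) {
        List<String> lines = new ArrayList<>();
        for (T record : records) {
            lines.add(codec.encode(record));
        }
        return lines;
    }
}
```

With this shape, splitting and block handling could stay shared while each model library supplies only its own codec. Binary formats such as BAM would need a byte-oriented variant of the codec interface.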