Support using Spark-BAM to load BAM files #1683
Pasting some context here from emails:
After 1.4.4 I started publishing under my own groupId. My fork of ADAM core has a most recent release, "0.23.2" (no real relation to upstream versions at this point), that uses the latest versions of lots of the pageant libraries. I'd recommend seeing if you can bump ADAM up to those versions; I'm not surprised that there are conflicts with older versions of e.g. genomic-loci. I can help with that some, or with trying to get it to work on the older versions if that's necessary for some reason.
haha, I'd forged ahead only supporting Spark 2.x and Scala 2.11, but will think about what it would take to add back in Spark 1.x and Scala 2.10, though I am not that excited about that idea… is ADAM still going to support those for a while?
@ryan-williams if you're going to fork and publish packages under your own groupId, can you at least relocate the classes that you are publishing? I was able to get things compiling after moving to
What appears to be going on is that you're depending on
We've committed to support Spark 1.x and Scala 2.10 for this next release and then will drop them.
I'd be curious to know what you had to register, if it's easy to pass along.
Open to discussing this, but it seems to me that the way conflicts like this should ideally be resolved is for the library linking against the conflicting versions (i.e. ADAM in this case) to do the relocating. Of course, if the tooling (sbt-assembly / maven-shade-plugin) doesn't support this, then that would complicate matters; I'm pretty sure sbt-assembly has APIs that imply it can handle this kind of "relocation within a dependency", but I've not used them personally. I'd like to see whether it's supported by each tool before starting down the "tell users to link against an assembly JAR with everything defensively relocated" path. Either way, sorry that the forked hadoop-bam is causing this issue / that it was surprising.
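(For concreteness, here's a minimal sketch of the sbt-assembly shade-rule API I'm referring to; the coordinates and target package below are illustrative guesses, not ADAM's or spark-bam's actual build config. maven-shade-plugin's `<relocations>` element does the analogous thing on the Maven side.)

```scala
// build.sbt: relocate Hadoop-BAM's packages within a single dependency,
// so a forked copy and the upstream jar can coexist on one classpath.
// Coordinates/versions are placeholders.
assemblyShadeRules in assembly := Seq(
  ShadeRule
    .rename("org.seqdoop.hadoop_bam.**" -> "shaded.org.seqdoop.hadoop_bam.@1")
    .inLibrary("org.seqdoop" % "hadoop-bam" % "7.8.0")
)
```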
Ah, I see your registrations on #1686. FWIW you should be able to
Thanks! Good to know; will switch to that.
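(For context, registrations like these live in a Spark KryoRegistrator; a minimal sketch follows, with placeholder classes rather than the actual list from #1686.)

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch of a Spark Kryo registrator; the classes registered here are
// placeholders, not the real registrations from #1686.
class BamKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Registering via class literals fails at compile time if a class
    // disappears, unlike registering by String name via Class.forName.
    kryo.register(classOf[htsjdk.samtools.SAMRecord])
    kryo.register(classOf[Array[String]])
  }
}
```

It's enabled by setting spark.serializer to org.apache.spark.serializer.KryoSerializer and spark.kryo.registrator to the class above.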
Relocating artifacts isn't a great approach in the first place; IMO, relocation really should be a last-resort way to deal with runtime dependency conflicts that come in from transitive dependencies. Beyond that, there are a couple of more specific reasons I dislike this strategy:
Your thoughts?
Our license is explicitly chosen to allow forking the code base. Yeah, I mean, I'd love to have @ryan-williams contributing more here instead of off on his own fork, but I think he's got a different development style and different architectural goals, and I encourage him—or anyone—who feels like they're better served by a fork to go and fork.
I think that release cadence is one of the reasons for @ryan-williams forking, but ultimately a minor one in a sea of reasons (e.g., choosing to only take adam-core, focusing only on Spark 2.x/Scala 2.11, tighter integration with Pageant, etc.). That said, the release cadence issues are a much larger problem than a single person, and they are a persistent, hard-to-fix problem: look at how far 0.23.0 has slipped. It's absurd to suggest that @ryan-williams would've been able to singlehandedly fix that.
Thanks for this discussion; just a quick note before I go offline for the day, sorry! @fnothaft, possible quick fix: what if you link ADAM against
To summarize this discussion, we need at least one of:
Duly noted; may go this route when I have a bit more time at keyboard / we've hashed out the couple of alternatives more.
I was talking specifically about Hadoop-BAM. It is not the greatest codebase I've ever seen, but it is a shared integration point between us, Hammer Lab, and the GATK4 team, so any work done there to minimize or eliminate BAM split-guessing issues helps us all.
@heuermh it's really hard to fully fix the split-picking code in Hadoop-BAM without either relying on an index (really the .splitting-bai; even the .bai codepath relies on the split guesser in places) or adding extensive validation to the split guesser. The current split guesser is expensive on a slow file system, and validation makes it too expensive (IIRC) to really work with Hadoop-BAM's current approach.
I mostly disagree with this; spark-bam uses more checks than hadoop-bam, and that's enough to be correct at every uncompressed position in about 10TB of BAMs that I've run it on (from different sequencing platforms, aligners, insert lengths, etc.). For comparison, hadoop-bam has something like 1e7 false positives on this same corpus, though the vast majority of them would never result in a bad split because they don't come before the first true positive in their respective BGZF blocks (a common special case of this is BAMs where reads are aligned to the starts of BGZF blocks). It would be pretty simple to add some of those checks to hadoop-bam, and I imagine someone will, but that still leaves us with…
My belief/hope is that spark-bam fixes this. Most slowness I've seen comes from:
The actual finding of a record start from an arbitrary (compressed) BAM offset is very quick in both hadoop-bam and spark-bam (well under 1s, probably more like 10ms otomh?), and basically instantaneous relative to the processing of an e.g. 32MB compressed BAM split (which spark-bam amortizes the split-computation into), so my view is that spark-bam is a big step forward for this particular annoying task, not that the task is particularly unsolvable. I would very much like to test on some of your pathological cases @fnothaft, which afaict involve
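(To make the "more checks" point concrete, here's a hypothetical sketch of the kind of per-position sanity checks a split guesser can apply at a candidate record start. Field offsets follow the BAM spec, but the function, names, and thresholds are illustrative, not spark-bam's or hadoop-bam's actual code.)

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Check a handful of BAM-record invariants at a candidate record start.
// Assumes `buf` holds at least the fixed fields plus the read name.
def plausibleRecordStart(buf: ByteBuffer, numRefs: Int): Boolean = {
  buf.order(ByteOrder.LITTLE_ENDIAN)
  val blockSize = buf.getInt(0)            // record length, excluding this field
  val refId     = buf.getInt(4)
  val pos       = buf.getInt(8)
  val lReadName = buf.get(12) & 0xff
  val nCigarOps = buf.getShort(16) & 0xffff
  val lSeq      = buf.getInt(20)
  val nextRefId = buf.getInt(24)

  blockSize >= 32 &&                       // at least the fixed-size fields
  refId >= -1 && refId < numRefs &&        // valid reference index
  nextRefId >= -1 && nextRefId < numRefs &&
  pos >= -1 && pos < (1 << 29) &&          // plausible coordinate
  lReadName > 0 &&
  buf.get(36 + lReadName - 1) == 0 &&      // read name is NUL-terminated
  32 + lReadName + 4 * nCigarOps + (lSeq + 1) / 2 + lSeq <= blockSize
}
```

A guesser built on checks like these still has to confirm a run of consecutive valid records, since any single position can pass by coincidence.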
@heuermh I would certainly love for other folks to use it! I've only had time for the Feature Work and not the Merge Work, and I also think that some modestly sophisticated tooling would obviate the need for a lot of the Merge Work (e.g. the "relocating" that needs to happen for ADAM's and spark-bam's respective hadoop-bams (used only "privately" on each side) to coexist on one classpath is conceptually trivial, though difficult or maybe impossible with current tools, afawk).
I'm still waiting on approval from a client to share some of their BAM files that fail, hopefully not more than another week out. In those cases the files were downloaded from s3 to HDFS before accessing with Hadoop-BAM through ADAM.
... is the part I'm not so excited about. How far is your fork of Hadoop-BAM from the current version? Is it possible to pull request the difference and get it in before the next (7.8.1 I believe) release? It is quite likely that we'll bump to that version before our next (0.23.0) release.
@ryan-williams my apologies if it was unclear, there was supposed to be a "more" in front of "extensive"; anyways, the "extensive validation" I was referring to is the validation that spark-bam adds.
+1, I'm pretty sure that it has the architecture correct. OOC, WRT:
Is this something that can be resolved more simply by using
Yeah, I believe I have the ADAM integration working now. I got it working, but then we had some cluster maintenance, so I haven't tested it on any of the pathological cases that I have locally yet. I'll report back once I have, though (hopefully tonight). Unfortunately, I can't share most of the BAMs, but I believe I have a small one I can share. WRT:
Here's the diff. I know
We need Hadoop-BAM 7.8.1 in ADAM 0.23.0 for various critical fixes (e.g., to the BGZF codec split guesser, plus header-read issues fixed in HTSJDK 2.11.0).
Back from vacation and taking a fresh look at this. One thing that occurs to me is that the forked hadoop-bam dep is really only used by my benchmarking/checking code, which is logically pretty separate from the BAM-loading stuff that downstream libs / ADAM would be interested in. I will think about splitting a couple of separate artifacts out: one for BAM-loading and another with all of the analysis functionality; that may cleanly resolve the dep issues.
Hi @ryan-williams! I hope you had a relaxing vacation. Thanks for looking into splitting the validation code out; if we could get off the forked Hadoop-BAM, that would make it much lower risk on our end.
JFYI, I should have a spark-bam published this weekend that doesn't use my hadoop-bam fork, for your linking convenience! The unexpected hang-up was finding a good way to share some test
and then have modules depend on that published JAR. However, when you depend on such an artifact, you end up opening
Anyway, I may just make a git repo with the test
ADAM does something so that e.g. CLI can see core's
The pattern I use is to put test resources in a test jar artifact, and then copy resources from the classpath to tmp for unit tests. I may have introduced that in places in the ADAM build.
+1, I've come to like this pattern pretty well.
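(In sbt terms, the pattern is roughly the following; ADAM itself builds with Maven, where maven-jar-plugin's test-jar goal does the equivalent. Coordinates below are placeholders.)

```scala
// build.sbt (publisher): also publish this module's test classes and
// resources as a secondary "tests"-classified jar.
publishArtifact in Test := true

// build.sbt (consumer): depend on that test jar in test scope only.
libraryDependencies +=
  "org.example" %% "core-module" % "1.0.0" % "test" classifier "tests"
```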
Interesting; when do you do the copying? Per-suite? Per-test-case? |
As needed; it's done with the copyResource function in SparkFunSuite.
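(Roughly, such a helper might look like the following; this is a sketch of the pattern, not ADAM's actual copyResource implementation.)

```scala
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Copy a classpath resource to a temp file, so code that needs a real
// filesystem path (e.g. Hadoop input formats) can read it during a test.
// Sketch only; not ADAM's actual implementation.
def copyResource(name: String): File = {
  val in = Thread.currentThread().getContextClassLoader.getResourceAsStream(name)
  require(in != null, s"resource not found: $name")
  val tmp = File.createTempFile("resource-", "-" + name.replace('/', '_'))
  tmp.deleteOnExit()
  try Files.copy(in, tmp.toPath, StandardCopyOption.REPLACE_EXISTING)
  finally in.close()
  tmp
}
```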
I just published. Lmk if you have a chance to try that out… I'm still trying to get Travis to be happy with it and get a proper merge+release out.
The docs microsite is coming along at http://www.hammerlab.org/spark-bam/; no new content there atm if you read the README previously, but it's better organized now.
Closing as WontFix.
CC @ryan-williams, test against hammerlab/spark-bam#5 for now.