Support using Spark-BAM to load BAM files #1683

Closed
fnothaft opened this issue Aug 25, 2017 · 25 comments

@fnothaft
Member

CC @ryan-williams, test against hammerlab/spark-bam#5 for now.

@fnothaft fnothaft self-assigned this Aug 25, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Aug 25, 2017
@ryan-williams
Member

Pasting some context here from emails:

I’ve been working on getting Spark-BAM integrated in with ADAM:

https://github.com/fnothaft/adam/tree/issues/1683-spark-bam

This compiles, but it fails tests due to runtime method access exceptions:

*** RUN ABORTED ***
  java.lang.IllegalAccessError: tried to access method org.hammerlab.genomics.reference.ContigName$.map()Lscala/collection/immutable/Map; from class org.hammerlab.genomics.reference.ContigName$Strict$
  at org.hammerlab.genomics.reference.ContigName$Strict$.normalize(ContigName.scala:122)
  at org.hammerlab.genomics.reference.ContigName$Factory$class.apply(ContigName.scala:92)
  at org.hammerlab.genomics.reference.ContigName$Strict$.apply(ContigName.scala:103)
  at org.hammerlab.bam.header.Header$$anonfun$1.apply(Header.scala:49)
  at org.hammerlab.bam.header.Header$$anonfun$1.apply(Header.scala:42)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.Range.foreach(Range.scala:160)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at org.hammerlab.bam.header.Header$.apply(Header.scala:42)
  at org.hammerlab.bam.header.Header$.apply(Header.scala:21)
  at org.hammerlab.bam.spark.load.CanLoadBam$class.loadReadsAndPositions(CanLoadBam.scala:219)
  at org.hammerlab.bam.spark.package$LoadBamSparkContext.loadReadsAndPositions(package.scala:8)
  at org.hammerlab.bam.spark.load.CanLoadBam$class.loadBam(CanLoadBam.scala:167)
  at org.hammerlab.bam.spark.package$LoadBamSparkContext.loadBam(package.scala:8)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadBam$1.org$bdgenomics$adam$rdd$ADAMContext$$anonfun$$sparkBamRead$1(ADAMContext.scala:1516)

Ryan, have you seen this before? I’m guessing there’s a class conflict with something coming in from org.hammerlab:genomic-loci? BTW, it looks like the genomic-loci repo is tagged with 2.0.0, but the latest version in Maven Central is 1.4.4. Not sure if you've been tagging but not publishing, but having that release in Maven Central would be useful to us.

After 1.4.4 I started publishing to org.hammerlab.genomics:loci, but I hadn't updated the badge on the README; thanks for pointing that out. The latest version is 2.0.1; I started trying to adhere to semver more closely, and since I technically removed some (minor, IIRC) API, I made that release 2.0.0.

The most recent release of my ADAM-core fork, "0.23.2" (no real relation to upstream versions at this point), uses the latest versions of many of the pageant libraries, including o.h.genomics:loci:2.0.1, in case that's useful to look at.

I'd recommend seeing if you can bump ADAM up to those versions; I'm not surprised that there are conflicts with older versions of e.g. genomic-loci. I can help with that some, or with trying to get it to work on the older versions if that's necessary for some reason.

BTW, would it be possible to get Scala 2.10 and Spark 1 versions of Spark-BAM? If the tests resolve the BAM split problems, then having those would be useful for us.

haha, I'd forged ahead only supporting Spark 2.x and Scala 2.11, but will think about what it would take to add back in Spark 1.x and Scala 2.10, though I am not that excited about that idea… is ADAM still going to support those for a while?

@fnothaft
Member Author

@ryan-williams if you're going to fork and publish packages under your own groupId, can you at least relocate the classes that you are publishing? I was able to get things compiling after moving to org.hammerlab.genomics:loci and registering a bunch of serializers, but now I'm running into a new issue:

*** RUN ABORTED ***
  java.lang.NoSuchMethodError: org.seqdoop.hadoop_bam.util.SAMHeaderReader.readSAMHeaderFromStream(Ljava/io/InputStream;Lorg/apache/hadoop/conf/Configuration;)Lhtsjdk/samtools/SAMFileHeader;
  at org.hammerlab.bam.header.ContigLengths$.apply(ContigLengths.scala:31)
  at org.hammerlab.bam.spark.load.CanLoadBam$class.loadReadsAndPositions(CanLoadBam.scala:220)
  at org.hammerlab.bam.spark.package$LoadBamSparkContext.loadReadsAndPositions(package.scala:8)
  at org.hammerlab.bam.spark.load.CanLoadBam$class.loadBam(CanLoadBam.scala:167)
  at org.hammerlab.bam.spark.package$LoadBamSparkContext.loadBam(package.scala:8)

What appears to be going on is that you're depending on org.hammerlab:hadoop_bam:7.9.0, which adds this method. But the org.hammerlab:hadoop_bam artifact leaves all of the org.seqdoop.hadoop_bam classes at org.seqdoop instead of moving them to org.hammerlab! I mean, I can "fix" this by relocating them on my end, but that's crufty. It's your prerogative if you want to fork everything and publish your own artifacts, but it makes the new artifacts that you are building difficult to integrate.

@fnothaft
Member Author

haha, I'd forged ahead only supporting Spark 2.x and Scala 2.11, but will think about what it would take to add back in Spark 1.x and Scala 2.10, though I am not that excited about that idea… is ADAM still going to support those for a while?

We've committed to support Spark 1.x and Scala 2.10 for this next release and then will drop them.

@ryan-williams
Member

I was able to get things compiling after moving to org.hammerlab.genomics:loci and registering a bunch of serializers

I'd be curious to know what you had to register, if it's easy to pass along

But, the org.hammerlab:hadoop_bam artifact leaves all of the org.seqdoop.hadoop_bam classes at org.seqdoop instead of moving them to org.hammerlab! I mean, I can "fix" this by relocating them on my end, but that's crufty. Its your prerogative if you want to fork everything and publish your own artifacts, but it makes the new artifacts that you are building difficult to integrate.

Open to discussing this, but it seems to me that the way conflicts like this should ideally be resolved is for the library linking against the conflicting versions (i.e. ADAM in this case) to do the relocating.

Of course, if the tooling (sbt-assembly / maven-assembly-plugin) doesn't support this then that would complicate matters; I'm pretty sure sbt-assembly has APIs that imply that it can handle this kind of "relocation within a dependency", but I've not used them personally; I'd like to see whether it's supported by each tool before starting down the "tell users to link against an assembly JAR with everything defensively relocated" path.
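
For reference, I believe the sbt-assembly hooks look roughly like this in a build.sbt (a sketch only; the shaded package name and the choice to scope the rule to the forked hadoop_bam are illustrative, and I haven't verified this against either project's build):

  // Sketch (sbt-assembly 0.14.x, assuming the plugin is enabled): relocate the
  // org.seqdoop.hadoop_bam classes that come in via the forked org.hammerlab:hadoop_bam.
  assemblyShadeRules in assembly := Seq(
    ShadeRule
      .rename("org.seqdoop.hadoop_bam.**" -> "shaded.seqdoop.hadoop_bam.@1")
      .inLibrary("org.hammerlab" % "hadoop_bam" % "7.9.0")
  )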

Either way, sorry that the forked hadoop-bam is causing this issue / that it was surprising.

@ryan-williams
Member

I'd be curious to know what you had to register, if it's easy to pass along

Ah I see your registrations on #1686

FWIW you should be able to new org.hammerlab.bam.Registrar().registerClasses(kryo) and have it take care of the regs it needs
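
e.g., wiring that in would look something like this (a sketch; MyKryoRegistrator just stands in for whatever registrator you configure via spark.kryo.registrator):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator

  // Sketch: delegate to spark-bam's Registrar from the application's own registrator.
  class MyKryoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // ... existing ADAM-side registrations ...
      new org.hammerlab.bam.Registrar().registerClasses(kryo)
    }
  }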

@fnothaft
Member Author

FWIW you should be able to new org.hammerlab.bam.Registrar().registerClasses(kryo) and have it take care of the regs it needs

Thanks! Good to know; will switch to that.

Open to discussing this, but it seems to me that the way conflicts like this should ideally be resolved is for the library linking against the conflicting versions (i.e. ADAM in this case) to do the relocating.

Of course, if the tooling (sbt-assembly / maven-assembly-plugin) doesn't support this then that would complicate matters; I'm pretty sure sbt-assembly has APIs that imply that it can handle this kind of "relocation within a dependency", but I've not used them personally; I'd like to see whether it's supported by each tool before starting down the "tell users to link against an assembly JAR with everything defensively relocated" path.

Relocating artifacts isn't a great approach in the first place; IMO, relocation really should be a last-resort way to deal with runtime dependency conflicts that come in from transitive dependencies. Beyond that, there are a couple of more specific reasons I dislike this strategy:

  1. Mostly, all you need to do in your fork to eliminate my problem is to run find . -type f -exec sed -i -e "s/org.seqdoop/org.hammerlab/g" {} \;. Ideally you'd rename the directories as well, e.g. git mv src/main/org/seqdoop src/main/org/hammerlab; git mv src/test/org/seqdoop src/test/org/hammerlab. As long as you did these two steps in two separate commits, you should still be able to git apply patches from org.seqdoop:hadoop-bam to your fork, and you wouldn't need to create an überjar for distribution.
  2. Anyways, even if I am relocating your artifacts, it appears that I need to generate an überjar with the shaded artifacts before I build adam-core, because the conflicts I get show up in test, and they conflict with a dependency that I need to have compile-scoped in adam-core. This is awkward, because now I need to publish a fully dependency-resolved JAR that contains your artifacts (but again, with just org.seqdoop -> org.hammerlab renamed). Maybe I'm missing something? If not, then I think this is a pretty bad practice.
  3. Last but not least, Scala doesn't appear to play well with relocating artifacts. This is a bit less important here, as all I need to relocate is your fork of hadoop-bam. However, I'd ideally relocate all of the classes that refer to the relocated classes, to make it clear that I've been messing with things. I've spent about 10 hours this last week trying to do this for two different libraries (spark-bam, and in #1518 ([ADAM-1517] Move to Parquet 1.8.2 in preparation for moving to Spark 2.2.0) to clear a Spark/Avro/Parquet dependency hell), and all I can tell is that something in "companion object"-land gets broken. It's not really clear what magic fields Scala puts into the companion object / the class it accompanies, but you need something more than the changes maven-shade-plugin can do. Perhaps sbt-assembly does this better? It looks like the approach it uses is identical, so I'm doubtful. As a result, I can rewrite org.hammerlab:spark-bam to rename the org.seqdoop.hadoop_bam classes in the transitive org.hammerlab:hadoop-bam:7.9.0 dependency, but I can't rename the remaining classes in "org.hammerlab.spark-bam" to make it clear that I am messing with the classes. To me, this smells pretty bad, and it means that I now have to publish classes in your namespace that conflict with your artifacts, which, again, I'm not happy about.

Your thoughts?

@heuermh
Member

heuermh commented Aug 28, 2017

Your thoughts?

I may be old school, but don't fork.

If you need to modify an upstream repository, do it by pull request and wait for a release.

If releases come too slowly, help get the release out the door.

Get off my lawn!

@fnothaft
Member Author

Our license is explicitly chosen to allow forking the code base. Yeah, I mean, I'd love to have @ryan-williams contributing more here instead of off on his own fork, but I think he's got a different development style and different architectural goals, and I encourage him—or anyone—who feels like they're better served by a fork to go and fork.

If releases come too slowly, help get the release out the door.

I think that release cadence is one of the reasons for @ryan-williams forking, but ultimately a minor one in a sea of reasons (e.g., choosing to only take adam-core, focus only on Spark 2.x/Scala 2.11, tighter integration with Pageant, etc).

That said, the release cadence issues are a much larger problem than a single person, and they are a persistent, hard-to-fix problem: look at how far 0.23.0 has slipped. It's absurd to suggest that @ryan-williams would've been able to singlehandedly fix that.

@ryan-williams
Member

thanks for this discussion, just a quick note before I go offline for the day, sorry!

@fnothaft possible quick-fix: what if you link ADAM against org.hammerlab:hadoop_bam:7.9.0, i.e. the hadoop-bam that spark-bam is linked against? not necessarily a long-term solution but seems worth exploring here. In that release I only did some minor reworking of BAMSplitGuesser APIs to make them accessible to spark-bam, IIRC; seems like it might Just Work as a drop-in replacement for the version ADAM is directly using atm.

To summarize this discussion, we need at least one of:

  • a single hadoop-bam release that has all the functionality you + your deps (including spark-bam) need
  • shading capabilities – that we (as a field) may or may not have yet – deployed in ADAM and/or spark-bam
… all you need to do in your fork to eliminate my problem is to run …

duly noted; may go this route when I have a bit more time at keyboard / we've hashed out the couple alternatives more.

@heuermh
Member

heuermh commented Aug 29, 2017

I was talking specifically about Hadoop-BAM. It is not the greatest codebase I've ever seen, but it is a shared integration point between us, Hammer Lab, and the GATK4 team, so any work done there to minimize or eliminate BAM split-guessing issues helps us all.

@fnothaft
Member Author

@heuermh it's really hard to fully fix the split-picking code in Hadoop-BAM without either relying on an index (really the .splitting-bai; even the .bai codepath relies on the split guesser in places) or adding extensive validation to the split guesser. The current split guesser is expensive on a slow file system, and validation makes it too expensive (IIRC) to really work with Hadoop-BAM's current approach.

@ryan-williams
Member

ryan-williams commented Aug 30, 2017

it's really hard to fully fix the split-picking code in Hadoop-BAM without either relying on an index (really the .splitting-bai; even the .bai codepath relies on the split guesser in places) or adding extensive validation to the split guesser

I mostly disagree with this; spark-bam uses more checks than hadoop-bam, and that's enough to be correct at every uncompressed position in about 10TB of BAMs that I've run it on (from different sequencing platforms, aligners, insert lengths, etc.).

For comparison, hadoop-bam has something like 1e7 false positives on this same corpus, though the vast majority of them would never result in a bad split because they don't come before the first true-positive in their respective BGZF blocks (a common special case of this is BAMs where reads are aligned to the start of BGZF blocks).

It would be pretty simple to add some of those checks to hadoop-bam, and I imagine someone will, but that still leaves us with…

The current split guesser is expensive on a slow file system, and validation makes it too expensive (IIRC) to really work with Hadoop-BAM's current approach.

My belief/hope is that spark-bam fixes this. Most slowness I've seen comes from:

  1. lack of buffering/caching in the FileSystem-adapter libraries
    • the channels library (which spark-bam uses) solves this for spark-bam (and presumably in general)
    • GCS connectors were buffering 8MB of data underneath each position hadoop-bam checked, but after BAMSplitGuesser moved ahead by e.g. 36 bytes (reading the BAM-record-candidate's "fixed block"), ruled out the position it was inspecting, then moved back by 35 bytes to start over at the next position, the rewind caused the GCS-connector to flush its buffer
    • CachingChannel (cf. its use in spark-bam) puts a block-LRU cache (64KB blocks by default) over the underlying filesystem API and dramatically improved split-computation latency against gs:// URLs in my apps; a rough sketch of this idea follows the list
  2. not parallelizing, which spark-bam also solves
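
To illustrate the CachingChannel idea from the list above, here's a minimal sketch; it isn't the channels library's actual API, and the block size and RandomAccessFile-based access are assumptions for the example:

  import java.io.RandomAccessFile
  import java.util.{ LinkedHashMap => JLinkedHashMap, Map => JMap }

  // Sketch of a block-granular LRU cache over a seekable file: repeated small
  // forward/backward seeks within a cached block never re-read the underlying file.
  class BlockLruReader(path: String, blockSize: Int = 64 * 1024, maxBlocks: Int = 128) {
    private val file = new RandomAccessFile(path, "r")

    // Access-ordered LinkedHashMap that evicts the least-recently-used block.
    private val cache = new JLinkedHashMap[Long, Array[Byte]](16, 0.75f, true) {
      override def removeEldestEntry(e: JMap.Entry[Long, Array[Byte]]): Boolean =
        size() > maxBlocks
    }

    private def block(idx: Long): Array[Byte] = {
      var buf = cache.get(idx)
      if (buf == null) {
        val start = idx * blockSize
        buf = new Array[Byte](math.min(blockSize.toLong, file.length() - start).toInt)
        file.seek(start)
        file.readFully(buf)
        cache.put(idx, buf)
      }
      buf
    }

    /** Read the byte at an absolute file offset, served from the block cache. */
    def readByte(pos: Long): Byte = block(pos / blockSize)((pos % blockSize).toInt)

    def close(): Unit = file.close()
  }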

The actual finding of a record-start from an arbitrary (compressed) BAM offset is very quick in both hadoop-bam and spark-bam (well under 1s, probably more like 10ms otomh?), and basically instantaneous relative to the processing of e.g. a 32MB compressed BAM split (which spark-bam amortizes the split-computation into), so my view is that spark-bam is a big step forward for this particular annoying task, not that the task is particularly unsolvable.

I would very much like to test on some of your pathological cases @fnothaft, which afaict involve s3a:// and/or "huge BAMs"; lmk how we can make that happen (I suppose a proper ADAM integration is the main/only blocker?).

[hadoop-bam] is a shared integration point between us, Hammer Lab, and the GATK4 team, so any work done there to minimize or eliminate BAM split-guessing issues helps us all.

@heuermh I would certainly love for other folks to use it! I've only had time for the Feature Work and not the Merge Work, and I also think that some modestly sophisticated tooling would obviate the need for a lot of the Merge Work (e.g. the "relocating" that needs to happen for ADAM's and spark-bam's respective hadoop-bams (used only "privately" on each side) to coexist on one classpath is conceptually trivial, though difficult or maybe impossible with current tools, afawk).

@heuermh
Member

heuermh commented Aug 30, 2017

I would very much like to test on some of your pathological cases @fnothaft, which afaict involve s3a:// and/or "huge BAMs"; lmk how we can make that happen (I suppose a proper ADAM integration is the main/only blocker?).

I'm still waiting on approval from a client to share some of their BAM files that fail, hopefully not more than another week out. In those cases the files were downloaded from s3 to HDFS before accessing with Hadoop-BAM through ADAM.

(e.g. the "relocating" that needs to happen for ADAM's and spark-bam's respective hadoop-bams (used only "privately" on each side) to coexist on one classpath is conceptually trivial, though difficult or maybe impossible with current tools, afawk).

... is the part I'm not so excited about.

How far is your fork of Hadoop-BAM from the current version? Is it possible to pull request the difference and get it in before the next (7.8.1 I believe) release? It is quite likely that we'll bump to that version before our next (0.23.0) release.

@fnothaft
Member Author

@ryan-williams my apologies if it was unclear; there was supposed to be a "more" in front of "extensive". Anyways, the "extensive validation" I was referring to is the validation that spark-bam adds.

@fnothaft
Member Author

My belief/hope is that spark-bam fixes this. Most slowness I've seen comes from:

+1, I'm pretty sure that it has the architecture correct. OOC, WRT:

GCS connectors were buffering 8MB of data underneath each position hadoop-bam checked, but after BAMSplitGuesser moved ahead by e.g. 36 bytes (reading the BAM-record-candidate's "fixed block"), ruled out the position it was inspecting, then moved back by 35 bytes to start over at the next position, the rewind caused the GCS-connector to flush its buffer

Is this something that can be resolved more simply by using mark/reset on the InputStream?
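
Concretely, I'm imagining something like the following; purely illustrative, not hadoop-bam's actual code, and the 36-byte fixed block is just the example from your description above:

  import java.io.{ BufferedInputStream, InputStream }

  // Illustrative only: mark before reading a candidate record's fixed-length block,
  // then reset instead of seeking the underlying (e.g. GCS-backed) stream backwards
  // and losing its buffer.
  def probeCandidate(raw: InputStream): Array[Byte] = {
    val in = new BufferedInputStream(raw, 64 * 1024)
    in.mark(64)                  // remember the candidate start; valid while <= 64 bytes are read
    val fixedBlock = new Array[Byte](36)
    in.read(fixedBlock)          // read the candidate's 36-byte fixed block
    in.reset()                   // rewind to the mark without touching `raw`
    in.skip(1)                   // advance one byte to the next candidate position
    fixedBlock
  }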

I would very much like to test on some of your pathological cases @fnothaft, which afaict involve s3a:// and/or "huge BAMs"; lmk how we can make that happen (I suppose a proper ADAM integration is the main/only blocker?).

Yeah, I believe I have the ADAM integration working now, but then we had some cluster maintenance, so I haven't tested it against any of the pathological cases that I have locally yet. I'll report back once I have, though (hopefully tonight).

Unfortunately, I can't share most of the BAMs, but I believe there's a small one I can.

WRT:

How far is your fork of Hadoop-BAM from the current version? Is it possible to pull request the difference and get it in before the next (7.8.1 I believe) release?

Here's the diff. I know spark-bam needs HadoopGenomics/Hadoop-BAM@9d48c8b. I'm not 100% sure whether spark-bam needs the changes to the split guessers as well. If it's possible to get spark-bam back on org.seqdoop:Hadoop-BAM, that would be my preference. If you're open to it, @ryan-williams, I'd be glad to do the legwork to make that happen.

It is quite likely that we'll bump to that version before our next (0.23.0) release.

We need Hadoop-BAM 7.8.1 in ADAM 0.23.0 for various critical fixes (e.g., to the BGZF codec split guesser, plus the header-read issues fixed in HTSJDK 2.11.0).

@ryan-williams
Member

Back from vacation and taking a fresh look at this, one thing that occurs to me is that the forked hadoop-bam dep is really only used by my benchmarking/checking code, which is logically pretty separate from the BAM-loading functionality that downstream libs / ADAM would be interested in.

I will think about splitting a couple of separate artifacts out: one for BAM-loading and another that has all of these analysis functionalities; that may cleanly resolve the dep issues.

@fnothaft
Member Author

fnothaft commented Sep 3, 2017

Hi @ryan-williams! I hope you had a relaxing vacation. Thanks for looking into splitting the validation code out; if we could get off the forked Hadoop-BAM, that would make it much lower risk on our end.

@ryan-williams
Member

JFYI, I should have a spark-bam published this wknd that doesn't use my hadoop-bam fork, for your linking convenience!

The unexpected hang-up was finding a good way to share some test .bam files across a few of the modules; I feel like the best way to do that is to publish an artifact that has:

  • the .bam files in question in src/main/resources
  • a singleton with variables that point to each .bam for easy, string-literal-free referencing of them from tests

and then have modules depend on that published JAR.

However, when you depend on such an artifact, you end up opening InputStreams and Channels on files that are inside a JAR that is on the classpath, and NIO and Hadoop both have issues handling such files! (JDK's default JarFileSystemProvider throws an UnsupportedException in the newByteChannel method, which I was able to work around, and Hadoop Paths appear to not parse URIs correctly, such that java.net.URIs representing files in JARs can't be used to instantiate Hadoop Paths, which I am not able to work around thus far).

Anyway, I may just make a git repo with the test .bams, and add it as a submodule in multiple spark-bam modules, which is probably better than just copy-pasting the files into a few of the modules.

ADAM does something so that e.g. CLI can see core's src/test/resources, right?

@heuermh
Member

heuermh commented Sep 9, 2017

The pattern I use is to put test resources in a test jar artifact, and then copy resources from the classpath to tmp for unit tests. I may have introduced that in places in the ADAM build.
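
Something along these lines (a generic sketch of the pattern, not ADAM's actual helper):

  import java.io.File
  import java.nio.file.{ Files, Path, StandardCopyOption }

  // Sketch: copy a test resource from the classpath (possibly inside a test jar)
  // to a temp file, so tests can hand a real filesystem path to code that can't
  // read from inside a JAR.
  def copyResourceToTmp(name: String): Path = {
    val in = getClass.getClassLoader.getResourceAsStream(name)
    require(in != null, s"resource not found on classpath: $name")
    val tmp = Files.createTempFile("test-", "-" + new File(name).getName)
    try Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
    tmp.toFile.deleteOnExit()
    tmp
  }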

@fnothaft
Member Author

fnothaft commented Sep 9, 2017

The pattern I use is to put test resources in a test jar artifact, and then copy resources from the classpath to tmp for unit tests. I may have introduced that in places in the ADAM build.

+1, I've come to like this pattern pretty well.

@ryan-williams
Member

Interesting; when do you do the copying? Per-suite? Per-test-case?

@fnothaft
Member Author

As needed; it's done with the copyResource function in SparkFunSuite.

@ryan-williams
Member

I just published org.hammerlab.bam:load:1.0.0-SNAPSHOT, which you should be able to link against for the same sc.loadReads-style API as before, but backed by org.seqdoop:hadoop-bam:7.9.0.
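
Usage on your side should look roughly like this (a sketch; the %% cross-versioning of the dependency and the argument you pass to loadBam are assumptions beyond what's written above):

  // build.sbt (or the Maven equivalent):
  //   libraryDependencies += "org.hammerlab.bam" %% "load" % "1.0.0-SNAPSHOT"

  import org.hammerlab.bam.spark._   // enriches SparkContext with loadBam (per the stack traces above)

  val reads = sc.loadBam(path)       // sc: your SparkContext; path: whatever path type loadBam accepts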

lmk if you have a chance to try that out… I'm still trying to get Travis to be happy with it and get a proper merge+release out

@ryan-williams
Member

Docs microsite coming along at http://www.hammerlab.org/spark-bam/; no new content there atm if you've read the README previously, but it's better organized now.

fnothaft added a commit to fnothaft/adam that referenced this issue Dec 20, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Dec 23, 2017
@heuermh
Member

heuermh commented Jan 6, 2020

Closing as WontFix

@heuermh heuermh closed this as completed Jan 6, 2020
@heuermh heuermh added this to the 0.31.0 milestone Jan 6, 2020