
Add a scan for intervals of high depth and exclude reads from those regions from evidence #4438

Merged
cwhelan merged 2 commits into master from cw_regions_with_high_depth on Feb 27, 2018

Conversation

@cwhelan (Member) commented Feb 22, 2018

This PR attempts to eliminate long-running, useless assemblies that significantly extend runtime on some samples:

  • Scans the genome for intervals of excessive depth, defined as intervals whose coverage exceeds a lower factor times the sample's average coverage and that contain a coverage peak exceeding an upper factor times the average (sketched below).
  • Merges high-coverage regions that lie within one read length of each other.
  • Excludes reads that map entirely within high-coverage regions from evidence gathering.
  • Excludes reads that map entirely within high-coverage regions from QName finding for seeding assemblies.

In addition, after observing that many long-running assemblies occur on non-primary reference contigs, we also exclude reads that map to non-primary contigs (as defined by the "cross-contig to ignore set") from evidence gathering.

With this change, runtime on the CHM mix sample is approximately 38 minutes, and our NA19238 snapshot now takes only 22 minutes, a significant reduction. There are a few changes in the resulting call set, but they appear to be minimal.
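
For illustration, here is a minimal sketch of the high-depth test described in the first bullet above; the method name, signature, and the per-base reading of the flanking threshold are assumptions made for this sketch, not the PR's actual code.

// Hypothetical sketch, not the PR's implementation: an interval qualifies as high depth
// if coverage exceeds flankingFactor * meanCoverage throughout and at least one base
// peaks above peakFactor * meanCoverage.
static boolean isHighDepthInterval(final int[] perBaseCoverage,
                                   final double meanCoverage,
                                   final double flankingFactor,
                                   final double peakFactor) {
    boolean hasPeak = false;
    for (final int depth : perBaseCoverage) {
        if (depth <= flankingFactor * meanCoverage) return false;  // lower bound must hold throughout
        hasPeak |= depth > peakFactor * meanCoverage;              // upper bound must hold somewhere
    }
    return perBaseCoverage.length > 0 && hasPeak;
}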

@tedsharpe (Contributor) commented:

It bothers me a bit that we're doing a shuffle (the reduceByKey operation at FBES line 880) on the big int arrays of coverage counts. It would've been so much nicer to process each partition all the way to high-coverage intervals independently, but I understand why it's done this way: to handle counts that cross partition boundaries. Since it's a pretty quick step, and since I can't think of a straightforward way to handle partition-boundary crossing any better than this, I'm giving it the thumbs up. I'll add a few niggles on particular lines and then another general comment with the approval indication.
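
For context, an illustrative, self-contained sketch (not the PR's code) of the kind of merge the shuffle performs: per-partition coverage arrays keyed by interval index are combined element-wise with reduceByKey, so counts from reads that straddle a partition boundary end up in a single array.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public final class CoverageMergeSketch {
    // Element-wise sum of two partial coverage arrays for the same interval.
    static int[] addCoverage(final int[] a, final int[] b) {
        final int[] sum = Arrays.copyOf(a, a.length);
        for (int i = 0; i < b.length; i++) sum[i] += b[i];
        return sum;
    }

    public static void main(final String[] args) {
        try (final JavaSparkContext ctx = new JavaSparkContext("local[2]", "coverage-merge-sketch")) {
            // Two partitions contribute partial counts for the same interval (key 0),
            // e.g. because reads near the boundary were counted in different partitions.
            final JavaPairRDD<Integer, int[]> partialCoverage = ctx.parallelizePairs(Arrays.asList(
                    new Tuple2<>(0, new int[]{3, 3, 1, 0}),
                    new Tuple2<>(0, new int[]{0, 1, 2, 2})), 2);
            // The shuffle: reduceByKey merges the partials into one coverage array per interval.
            final int[] merged = partialCoverage.reduceByKey(CoverageMergeSketch::addCoverage)
                    .collectAsMap().get(0);
            System.out.println(Arrays.toString(merged)); // [3, 4, 3, 2]
        }
    }
}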

@tedsharpe (Contributor) left a comment:

Couple of very minor suggestions. Good to go.

final FindBreakpointEvidenceSparkArgumentCollection params,
final JavaSparkContext ctx,
final Broadcast<ReadMetadata> broadcastMetadata,
final List<SVInterval> intervals,
final JavaRDD<GATKRead> unfilteredReads,
final SVReadFilter filter,
final Logger logger) {

final int minHighCovFactor = params.highDepthCoverageFactor;
final int maxHighCovFactor = params.highDepthCoveragePeakFactor;
@tedsharpe (Contributor):

Calling this a "max" is misleading. It's actually a minPeakHighCoverageFactor. (I guess the previous line defines a minFlankingHighCoverageFactor.) Anyway, I think it would be good to get rid of the "max" here, 3 lines below, and in the code that receives this value as an arg (which also uses the "max" name).

@cwhelan (Member Author):
Renamed these guys to minFlankingHighCoverageValue and minPeakHighCoverageValue in this method and methods that receive these values as args.

final Logger logger,
final Broadcast<ReadMetadata> broadcastMetadata) {
final long nRefBases = broadcastMetadata.getValue().getNRefBases();
final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
@tedsharpe (Contributor):

This is actually an underestimate of the necessary capacity.

Here's a slight overestimate:

final int dictSize = header.getSequenceDictionary().getSequences().size();
final int capacity = (int) ((nRefBases + DEPTH_WINDOW_SIZE - 1) / DEPTH_WINDOW_SIZE) + dictSize;

Or you could just stream the refSeq records, and sum the exact size necessary for each:

final int capacity =  refSeqs.stream().mapToInt(seqRec ->
       (seqRec.getSequenceLength() + DEPTH_WINDOW_SIZE - 1)/DEPTH_WINDOW_SIZE).sum();

@cwhelan (Member Author):

Done, went with the exact size calculation, thanks.
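
For reference, a small self-contained sketch of that exact-size computation; the window size and sequence records here are illustrative, using htsjdk's SAMSequenceRecord.

import htsjdk.samtools.SAMSequenceRecord;
import java.util.Arrays;
import java.util.List;

public final class DepthWindowCapacityDemo {
    private static final int DEPTH_WINDOW_SIZE = 100_000;   // illustrative value

    public static void main(final String[] args) {
        final List<SAMSequenceRecord> refSeqs = Arrays.asList(
                new SAMSequenceRecord("chr1", 250_000),      // ceil(250000/100000) = 3 windows
                new SAMSequenceRecord("chr2", 100_000));     // ceil(100000/100000) = 1 window
        // Ceiling division per contig, summed: exactly one list slot per depth window.
        final int capacity = refSeqs.stream()
                .mapToInt(seqRec -> (seqRec.getSequenceLength() + DEPTH_WINDOW_SIZE - 1) / DEPTH_WINDOW_SIZE)
                .sum();
        System.out.println(capacity);                        // prints 4
    }
}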

@@ -53,6 +66,19 @@ public boolean isTemplateLenTestable( final GATKRead read ) {
read.getStart() - allowedShortFragmentOverhang <= read.getMateStart();
}

public boolean containedInRegionToIgnore(final SVInterval interval, final SVIntervalTree<SVInterval> regionsToIgnore) {
if (regionsToIgnore.hasOverlapper(interval)) {
@tedsharpe (Contributor):

Unnecessary to call this. Just create the overlappers iterator, and if it's empty there aren't any overlappers. (You're doing the binary search twice, whereas if you left this line out you'd get the same results.)

@cwhelan (Member Author):

Done
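
A rough sketch of the suggested shape follows; the SVIntervalTree.overlappers(...) iterator and the containment check are assumptions inferred from the quoted diff and the comment above, not the merged code.

// Sketch only (requires java.util.Iterator): a single overlappers() query replaces the
// separate hasOverlapper() search, so the binary search happens once.
// The containment test is a guess at the method's intent.
public boolean containedInRegionToIgnore(final SVInterval interval,
                                         final SVIntervalTree<SVInterval> regionsToIgnore) {
    final Iterator<SVIntervalTree.Entry<SVInterval>> overlappers = regionsToIgnore.overlappers(interval);
    while (overlappers.hasNext()) {
        if (overlappers.next().getInterval().contains(interval)) {
            return true;
        }
    }
    return false;
}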

int intervalsIndex = 0;
final int intervalsSize = intervals.size();
@tedsharpe (Contributor):

Comment refers to the following line, not this one:
Should we filter using isMappedToPrimary, rather than isMapped?

@cwhelan (Member Author):

Do you mean isMappedPrimary or isMappedToPrimaryContig?

@cwhelan (Member Author):

You're right, this could be isMappedToPrimaryContig. Changed it to that to save some unnecessary work.

int intervalsContainingReadIndex = intervalsIndex;
SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);

while (intervalsContainingReadIndex < intervals.size() && indexedInterval.overlaps(readInterval)) {
@tedsharpe (Contributor):

while (intervalsContainingReadIndex < intervals.size()) {
   final SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);
   if ( !indexedInterval.overlaps(readInterval) ) break;

And now you can get rid of lines 59-61, too.

intervalCoverage[intervalsContainingReadIndex] = new int[indexedInterval.getLength()];
}
for (int i = overlapInterval.getStart(); i < overlapInterval.getEnd(); i++) {
intervalCoverage[intervalsContainingReadIndex][i - indexedInterval.getStart()] += 1;
@tedsharpe (Contributor):

I'm sure that the JVM is smart enough to optimize by pulling out the loop invariant, but too many years programming in C makes me suggest pulling out the constant reference to the int array in question:

final int[] coverageArray = intervalCoverage[intervalsContainingReadIndex];
for ( int i = ... ) {
    coverageArray[i - indexedInterval.getStart()] += 1;
}

@cwhelan (Member Author):

done

final SVInterval readInterval = new SVInterval(readContigId, readStart, read.getUnclippedEnd()+1);
if ( indexedInterval.isDisjointFrom(readInterval) ) return noName;
final SVInterval unclippedReadInterval = new SVInterval(readContigId, read.getUnclippedStart(), read.getUnclippedEnd());
final SVInterval clippedReadInterval = new SVInterval(readContigId, read.getStart(), read.getEnd());
@tedsharpe (Contributor):

Make no interval before its time. Move this line below the next one.

@cwhelan (Member Author):

done

final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
for (final SAMSequenceRecord sequenceRecord : header.getSequenceDictionary().getSequences()) {
for (int i = 1; i < sequenceRecord.getSequenceLength(); i = i + DEPTH_WINDOW_SIZE) {
depthIntervals.add(new SVInterval(readMetadata.getContigID(sequenceRecord.getSequenceName()), i, Math.min(sequenceRecord.getSequenceLength(), i + DEPTH_WINDOW_SIZE)));
@tedsharpe (Contributor):

readMetadata.getContigID(sequenceRecord.getSequenceName()) is a loop invariant (and involves a hashed lookup). Pull it out as a temporary variable.
(So is sequenceRecord.getSequenceLength(), but that's fast to compute, so it doesn't matter.)

@cwhelan (Member Author):

done
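
For illustration, the quoted loop with the invariants hoisted; this is a sketch based on the code shown above (reusing its fields and constants), not necessarily the merged version.

// Sketch: look up the contig ID (a hashed lookup) and the sequence length once per contig.
for (final SAMSequenceRecord sequenceRecord : header.getSequenceDictionary().getSequences()) {
    final int contigID = readMetadata.getContigID(sequenceRecord.getSequenceName());
    final int sequenceLength = sequenceRecord.getSequenceLength();
    for (int i = 1; i < sequenceLength; i += DEPTH_WINDOW_SIZE) {
        depthIntervals.add(new SVInterval(contigID, i, Math.min(sequenceLength, i + DEPTH_WINDOW_SIZE)));
    }
}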

…mbly qnames from reads that align solely within those high-depth intervals
@tedsharpe (Contributor) commented Feb 27, 2018 via email

@cwhelan (Member Author) commented Feb 27, 2018

Thanks for the comments, @tedsharpe; there's some good code cleanup here.

I've addressed your comments and will merge after tests pass.

I agree about the shuffle -- I'd originally implemented this without it but then found some places where high-depth intervals were getting clipped at the boundaries of depth window partitions due to overlapping counts. In practice it doesn't seem to take a discernible amount of runtime, at least on our development clusters.

@codecov-io commented:

Codecov Report

Merging #4438 into master will decrease coverage by 0.064%.
The diff coverage is 65.789%.

@@               Coverage Diff               @@
##              master     #4438       +/-   ##
===============================================
- Coverage     79.156%   79.093%   -0.064%     
- Complexity     16583     16606       +23     
===============================================
  Files           1049      1049               
  Lines          59510     59663      +153     
  Branches        9747      9785       +38     
===============================================
+ Hits           47106     47189       +83     
- Misses          8620      8678       +58     
- Partials        3784      3796       +12
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...te/hellbender/tools/spark/sv/utils/SVInterval.java | 84.783% <0%> (-4.106%) | 29 <0> (-1) |
| ...ools/spark/sv/evidence/ExtractSVEvidenceSpark.java | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| ...ellbender/tools/spark/sv/evidence/QNameFinder.java | 95.833% <100%> (+0.379%) | 10 <0> (+1) ⬆️ |
| ...tructuralVariationDiscoveryArgumentCollection.java | 96.875% <100%> (+0.101%) | 0 <0> (ø) ⬇️ |
| ...spark/sv/evidence/FindBreakpointEvidenceSpark.java | 69.729% <55.172%> (-4.11%) | 60 <7> (+6) |
| ...llbender/tools/spark/sv/evidence/SVReadFilter.java | 70.588% <71.429%> (-0.84%) | 26 <8> (+3) |
| ...bender/tools/spark/sv/evidence/ReadClassifier.java | 85.542% <75%> (-1.3%) | 35 <1> (+2) |
| ...ools/spark/sv/evidence/IntervalCoverageFinder.java | 79.012% <79.104%> (-9.449%) | 19 <11> (+11) |
| ...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø) |
| ... and 2 more | | |

cwhelan merged commit 63f6f20 into master on Feb 27, 2018
cwhelan deleted the cw_regions_with_high_depth branch on February 27, 2018 at 19:53