
Add a scan for intervals of high depth and exclude reads from those regions from evidence #4438

Merged
cwhelan merged 2 commits into master from cw_regions_with_high_depth on Feb 27, 2018

Conversation

@cwhelan (Member) commented Feb 22, 2018

This PR attempts to eliminate long-running, useless assemblies that significantly extend runtime on some samples:

  • Scans the genome for intervals of excessive depth, defined as intervals whose coverage exceeds a lower factor times the sample's average coverage and that contain a coverage peak exceeding an upper factor times the average (sketched below).
  • Merges high-coverage regions that lie within one read length of each other.
  • Excludes reads that map entirely within high-coverage regions from evidence gathering.
  • Excludes reads that map entirely within high-coverage regions from QName finding for seeding assemblies.

In addition, after observing that many long-running assemblies occur on non-primary reference contigs, we also exclude reads that map to non-primary contigs (as defined by the "cross-contig to ignore set") from evidence gathering.

With this change, runtime on the CHM mix sample is approximately 38 minutes, and our NA19238 snapshot now takes only 22 minutes, a significant reduction. There are a few changes in the resulting call set, but they appear to be minimal.
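
For illustration, here is a minimal sketch of the high-depth test described in the first bullet above; the method name, signature, and the per-base reading of the flanking threshold are assumptions made for this sketch, not the PR's actual code.

// Hypothetical sketch, not the PR's implementation: an interval qualifies as high depth
// if coverage exceeds flankingFactor * meanCoverage throughout and at least one base
// peaks above peakFactor * meanCoverage.
static boolean isHighDepthInterval(final int[] perBaseCoverage,
                                   final double meanCoverage,
                                   final double flankingFactor,
                                   final double peakFactor) {
    boolean hasPeak = false;
    for (final int depth : perBaseCoverage) {
        if (depth <= flankingFactor * meanCoverage) return false;  // lower bound must hold throughout
        hasPeak |= depth > peakFactor * meanCoverage;              // upper bound must hold somewhere
    }
    return perBaseCoverage.length > 0 && hasPeak;
}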

@tedsharpe (Contributor) commented:

It bothers me a bit that we're doing a shuffle (the reduceByKey operation at FBES line 880) on the big int arrays of coverage counts. It would've been so much nicer to process each partition all the way to high-coverage intervals independently, but I understand why it's done this way: to handle counts that cross partition boundaries. Since it's a pretty quick step, and since I can't think of a straightforward way to handle partition-boundary crossing any better than this, I'm giving it the thumbs up. I'll add a few niggles on particular lines and then another general comment with the approval indication.
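
For context, an illustrative, self-contained sketch (not the PR's code) of the kind of merge the shuffle performs: per-partition coverage arrays keyed by interval index are combined element-wise with reduceByKey, so counts from reads that straddle a partition boundary end up in a single array.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public final class CoverageMergeSketch {
    // Element-wise sum of two partial coverage arrays for the same interval.
    static int[] addCoverage(final int[] a, final int[] b) {
        final int[] sum = Arrays.copyOf(a, a.length);
        for (int i = 0; i < b.length; i++) sum[i] += b[i];
        return sum;
    }

    public static void main(final String[] args) {
        try (final JavaSparkContext ctx = new JavaSparkContext("local[2]", "coverage-merge-sketch")) {
            // Two partitions contribute partial counts for the same interval (key 0),
            // e.g. because reads near the boundary were counted in different partitions.
            final JavaPairRDD<Integer, int[]> partialCoverage = ctx.parallelizePairs(Arrays.asList(
                    new Tuple2<>(0, new int[]{3, 3, 1, 0}),
                    new Tuple2<>(0, new int[]{0, 1, 2, 2})), 2);
            // The shuffle: reduceByKey merges the partials into one coverage array per interval.
            final int[] merged = partialCoverage.reduceByKey(CoverageMergeSketch::addCoverage)
                    .collectAsMap().get(0);
            System.out.println(Arrays.toString(merged)); // [3, 4, 3, 2]
        }
    }
}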

@tedsharpe (Contributor) left a comment:

Couple of very minor suggestions. Good to go.

final FindBreakpointEvidenceSparkArgumentCollection params,
final JavaSparkContext ctx,
final Broadcast<ReadMetadata> broadcastMetadata,
final List<SVInterval> intervals,
final JavaRDD<GATKRead> unfilteredReads,
final SVReadFilter filter,
final Logger logger) {

final int minHighCovFactor = params.highDepthCoverageFactor;
final int maxHighCovFactor = params.highDepthCoveragePeakFactor;
@tedsharpe (Contributor):

Calling this a "max" is misleading. It's actually a minPeakHighCoverageFactor. (I guess the previous line defines a minFlankingHighCoverageFactor.) Anyway, I think it would be good to get rid of the "max" here, 3 lines below, and in the code that receives this value as an arg (which also uses the "max" name).

@cwhelan (Member Author):
Renamed these guys to minFlankingHighCoverageValue and minPeakHighCoverageValue in this method and methods that receive these values as args.

final Logger logger,
final Broadcast<ReadMetadata> broadcastMetadata) {
final long nRefBases = broadcastMetadata.getValue().getNRefBases();
final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
@tedsharpe (Contributor):

This is actually an underestimate of the necessary capacity.

Here's a slight overestimate:

final int dictSize = header.getSequenceDictionary().getSequences().size();
final int capacity = (int) ((nRefBases + DEPTH_WINDOW_SIZE - 1) / DEPTH_WINDOW_SIZE) + dictSize;

Or you could just stream the refSeq records, and sum the exact size necessary for each:

final int capacity =  refSeqs.stream().mapToInt(seqRec ->
       (seqRec.getSequenceLength() + DEPTH_WINDOW_SIZE - 1)/DEPTH_WINDOW_SIZE).sum();

@cwhelan (Member Author):

Done, went with the exact size calculation, thanks.
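
For reference, a small self-contained sketch of that exact-size computation; the window size and sequence records here are illustrative, using htsjdk's SAMSequenceRecord.

import htsjdk.samtools.SAMSequenceRecord;
import java.util.Arrays;
import java.util.List;

public final class DepthWindowCapacityDemo {
    private static final int DEPTH_WINDOW_SIZE = 100_000;   // illustrative value

    public static void main(final String[] args) {
        final List<SAMSequenceRecord> refSeqs = Arrays.asList(
                new SAMSequenceRecord("chr1", 250_000),      // ceil(250000/100000) = 3 windows
                new SAMSequenceRecord("chr2", 100_000));     // ceil(100000/100000) = 1 window
        // Ceiling division per contig, summed: exactly one list slot per depth window.
        final int capacity = refSeqs.stream()
                .mapToInt(seqRec -> (seqRec.getSequenceLength() + DEPTH_WINDOW_SIZE - 1) / DEPTH_WINDOW_SIZE)
                .sum();
        System.out.println(capacity);                        // prints 4
    }
}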

@@ -53,6 +66,19 @@ public boolean isTemplateLenTestable( final GATKRead read ) {
read.getStart() - allowedShortFragmentOverhang <= read.getMateStart();
}

public boolean containedInRegionToIgnore(final SVInterval interval, final SVIntervalTree<SVInterval> regionsToIgnore) {
if (regionsToIgnore.hasOverlapper(interval)) {
@tedsharpe (Contributor):

Unnecessary to call this. Just create the overlappers iterator, and if it's empty there aren't any overlappers. (You're doing the binary search twice, whereas if you left this line out you'd get the same results.)

@cwhelan (Member Author):

Done
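
A rough sketch of the suggested shape follows; the SVIntervalTree.overlappers(...) iterator and the containment check are assumptions inferred from the quoted diff and the comment above, not the merged code.

// Sketch only (requires java.util.Iterator): a single overlappers() query replaces the
// separate hasOverlapper() search, so the binary search happens once.
// The containment test is a guess at the method's intent.
public boolean containedInRegionToIgnore(final SVInterval interval,
                                         final SVIntervalTree<SVInterval> regionsToIgnore) {
    final Iterator<SVIntervalTree.Entry<SVInterval>> overlappers = regionsToIgnore.overlappers(interval);
    while (overlappers.hasNext()) {
        if (overlappers.next().getInterval().contains(interval)) {
            return true;
        }
    }
    return false;
}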

int intervalsIndex = 0;
final int intervalsSize = intervals.size();
@tedsharpe (Contributor):

Comment refers to the following line, not this one:
Should we filter using isMappedToPrimary, rather than isMapped?

@cwhelan (Member Author):

Do you mean isMappedPrimary or isMappedToPrimaryContig?

@cwhelan (Member Author):

You're right, this could be isMappedToPrimaryContig. Changed it to that to save some unnecessary work.

int intervalsContainingReadIndex = intervalsIndex;
SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);

while (intervalsContainingReadIndex < intervals.size() && indexedInterval.overlaps(readInterval)) {
@tedsharpe (Contributor):

while (intervalsContainingReadIndex < intervals.size()) {
   final SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);
   if ( !indexedInterval.overlaps(readInterval) ) break;

And now you can get rid of lines 59-61, too.

intervalCoverage[intervalsContainingReadIndex] = new int[indexedInterval.getLength()];
}
for (int i = overlapInterval.getStart(); i < overlapInterval.getEnd(); i++) {
intervalCoverage[intervalsContainingReadIndex][i - indexedInterval.getStart()] += 1;
@tedsharpe (Contributor):

I'm sure that the JVM is smart enough to optimize by pulling out the loop invariant, but too many years programming in C makes me suggest pulling out the constant reference to the int array in question:

final int[] coverageArray = intervalCoverage[intervalsContainingReadIndex];
for ( int i = ... ) {
    coverageArray[i - indexedInterval.getStart()] += 1;
}

@cwhelan (Member Author):

done

final SVInterval readInterval = new SVInterval(readContigId, readStart, read.getUnclippedEnd()+1);
if ( indexedInterval.isDisjointFrom(readInterval) ) return noName;
final SVInterval unclippedReadInterval = new SVInterval(readContigId, read.getUnclippedStart(), read.getUnclippedEnd());
final SVInterval clippedReadInterval = new SVInterval(readContigId, read.getStart(), read.getEnd());
@tedsharpe (Contributor):

Make no interval before its time. Move this line below the next one.

@cwhelan (Member Author):

done

final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
for (final SAMSequenceRecord sequenceRecord : header.getSequenceDictionary().getSequences()) {
for (int i = 1; i < sequenceRecord.getSequenceLength(); i = i + DEPTH_WINDOW_SIZE) {
depthIntervals.add(new SVInterval(readMetadata.getContigID(sequenceRecord.getSequenceName()), i, Math.min(sequenceRecord.getSequenceLength(), i + DEPTH_WINDOW_SIZE)));
@tedsharpe (Contributor):

readMetadata.getContigID(sequenceRecord.getSequenceName()) is a loop invariant (and involves a hashed lookup). Pull it out as a temporary variable.
(So is sequenceRecord.getSequenceLength(), but that's fast to compute, so it doesn't matter.)

@cwhelan (Member Author):

done
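
For illustration, the quoted loop with the invariants hoisted; this is a sketch based on the code shown above (reusing its fields and constants), not necessarily the merged version.

// Sketch: look up the contig ID (a hashed lookup) and the sequence length once per contig.
for (final SAMSequenceRecord sequenceRecord : header.getSequenceDictionary().getSequences()) {
    final int contigID = readMetadata.getContigID(sequenceRecord.getSequenceName());
    final int sequenceLength = sequenceRecord.getSequenceLength();
    for (int i = 1; i < sequenceLength; i += DEPTH_WINDOW_SIZE) {
        depthIntervals.add(new SVInterval(contigID, i, Math.min(sequenceLength, i + DEPTH_WINDOW_SIZE)));
    }
}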

…mbly qnames from reads that align solely within those high-depth intervals
@tedsharpe (Contributor) commented Feb 27, 2018 via email

@cwhelan (Member Author) commented Feb 27, 2018

Thanks for the comments, @tedsharpe; there's some good code cleanup here.

I've addressed your comments and will merge after tests pass.

I agree about the shuffle -- I'd originally implemented this without it but then found some places where high-depth intervals were getting clipped at the boundaries of depth window partitions due to overlapping counts. In practice it doesn't seem to take a discernible amount of runtime, at least on our development clusters.

@codecov-io commented:

Codecov Report

Merging #4438 into master will decrease coverage by 0.064%.
The diff coverage is 65.789%.

@@               Coverage Diff               @@
##              master     #4438       +/-   ##
===============================================
- Coverage     79.156%   79.093%   -0.064%     
- Complexity     16583     16606       +23     
===============================================
  Files           1049      1049               
  Lines          59510     59663      +153     
  Branches        9747      9785       +38     
===============================================
+ Hits           47106     47189       +83     
- Misses          8620      8678       +58     
- Partials        3784      3796       +12
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...te/hellbender/tools/spark/sv/utils/SVInterval.java | 84.783% <0%> (-4.106%) | 29 <0> (-1) |
| ...ools/spark/sv/evidence/ExtractSVEvidenceSpark.java | 0% <0%> (ø) | 0 <0> (ø) ⬇️ |
| ...ellbender/tools/spark/sv/evidence/QNameFinder.java | 95.833% <100%> (+0.379%) | 10 <0> (+1) ⬆️ |
| ...tructuralVariationDiscoveryArgumentCollection.java | 96.875% <100%> (+0.101%) | 0 <0> (ø) ⬇️ |
| ...spark/sv/evidence/FindBreakpointEvidenceSpark.java | 69.729% <55.172%> (-4.11%) | 60 <7> (+6) |
| ...llbender/tools/spark/sv/evidence/SVReadFilter.java | 70.588% <71.429%> (-0.84%) | 26 <8> (+3) |
| ...bender/tools/spark/sv/evidence/ReadClassifier.java | 85.542% <75%> (-1.3%) | 35 <1> (+2) |
| ...ools/spark/sv/evidence/IntervalCoverageFinder.java | 79.012% <79.104%> (-9.449%) | 19 <11> (+11) |
| ...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø) |
| ... and 2 more | | |

cwhelan merged commit 63f6f20 into master on Feb 27, 2018
cwhelan deleted the cw_regions_with_high_depth branch on February 27, 2018 at 19:53