Adds a scan for intervals of high depth and excludes reads from those regions from evidence #4438
Conversation
It bothers me a bit that we're doing a shuffle (the reduceByKey operation at FBES line 880) on the big int arrays of coverage counts. It would've been so much nicer to process each partition all the way to high-coverage intervals independently, but I understand why it's done this way: to handle counts that cross partition boundaries. Since it's a pretty quick step, and since I can't think of a straightforward way to handle partition-boundary crossing any better than this, I'm giving it the thumbs up. I've added a few niggles on particular lines, and another general comment with the approval indication.
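For context, here's roughly what that merge step does (a minimal sketch in the spirit of the FBES code, not the code itself; the method name, key type, and the RDD's provenance are assumptions):

import org.apache.spark.api.java.JavaPairRDD;

// Hypothetical illustration of the reduceByKey step: per-window coverage
// arrays produced in different partitions are merged element-wise, which is
// what stitches together counts for windows whose reads straddle a
// partition boundary.
static JavaPairRDD<SVInterval, int[]> mergeWindowCoverage(
        final JavaPairRDD<SVInterval, int[]> windowCoverage ) {
    return windowCoverage.reduceByKey(( arr1, arr2 ) -> {
        final int[] merged = new int[arr1.length]; // arrays for the same window have equal length
        for ( int i = 0; i != arr1.length; ++i ) {
            merged[i] = arr1[i] + arr2[i];
        }
        return merged;
    });
}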
Couple of very minor suggestions. Good to go.
final FindBreakpointEvidenceSparkArgumentCollection params,
final JavaSparkContext ctx,
final Broadcast<ReadMetadata> broadcastMetadata,
final List<SVInterval> intervals,
final JavaRDD<GATKRead> unfilteredReads,
final SVReadFilter filter,
final Logger logger) {

final int minHighCovFactor = params.highDepthCoverageFactor;
final int maxHighCovFactor = params.highDepthCoveragePeakFactor;
Calling this a "max" is misleading. It's actually a minPeakHighCoverageFactor. (I guess the previous line defines a minFlankingHighCoverageFactor.) Anyway, I think it would be good to get rid of the "max" here, 3 lines below, and in the code that receives this value as an arg (which also uses the "max" name).
Renamed these guys to minFlankingHighCoverageValue and minPeakHighCoverageValue in this method and the methods that receive these values as args.
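Presumably the assignments above now read along these lines (a sketch; only the new names are confirmed by the comment, and the right-hand sides are just carried over from the original lines):

final int minFlankingHighCoverageValue = params.highDepthCoverageFactor;      // was minHighCovFactor
final int minPeakHighCoverageValue = params.highDepthCoveragePeakFactor;      // was maxHighCovFactor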
final Logger logger,
final Broadcast<ReadMetadata> broadcastMetadata) {
final long nRefBases = broadcastMetadata.getValue().getNRefBases();
final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
This is actually an underestimate of the necessary capacity.
Here's a slight overestimate:
final int dictSize = header.getSequenceDictionary().getSequences().size();
final int capacity = (int) ((nRefBases + DEPTH_WINDOW_SIZE - 1) / DEPTH_WINDOW_SIZE) + dictSize;
Or you could just stream the refSeq records, and sum the exact size necessary for each:
final int capacity = refSeqs.stream().mapToInt(seqRec ->
(seqRec.getSequenceLength() + DEPTH_WINDOW_SIZE - 1)/DEPTH_WINDOW_SIZE).sum();
Done, went with the exact size calculation, thanks.
@@ -53,6 +66,19 @@ public boolean isTemplateLenTestable( final GATKRead read ) {
        read.getStart() - allowedShortFragmentOverhang <= read.getMateStart();
}

public boolean containedInRegionToIgnore(final SVInterval interval, final SVIntervalTree<SVInterval> regionsToIgnore) {
    if (regionsToIgnore.hasOverlapper(interval)) {
Unnecessary to call this. Just create the overlappers iterator, and if it's empty there aren't any overlappers. (You're doing the binary search twice, whereas if you left this line out you'd get the same results.)
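Concretely, that suggestion looks something like this (a sketch; it assumes SVIntervalTree exposes an overlappers(interval) iterator of entries, and the containment check is an assumption about the unshown remainder of the method body):

import java.util.Iterator;

public boolean containedInRegionToIgnore( final SVInterval interval,
                                          final SVIntervalTree<SVInterval> regionsToIgnore ) {
    // One binary search instead of two: build the overlappers iterator directly,
    // and treat an empty iterator as "no overlappers".
    final Iterator<SVIntervalTree.Entry<SVInterval>> overlappers = regionsToIgnore.overlappers(interval);
    while ( overlappers.hasNext() ) {
        if ( overlappers.next().getInterval().contains(interval) ) return true; // assumed check
    }
    return false;
}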
Done
int intervalsIndex = 0;
final int intervalsSize = intervals.size();
Comment refers to the following line, not this one:
Should we filter using isMappedToPrimary, rather than isMapped?
Do you mean isMappedPrimary or isMappedToPrimaryContig?
You're right, this could be isMappedToPrimaryContig. Changed to this to save some unnecessary work.
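For what it's worth, the resulting check presumably amounts to something like (hypothetical; the exact signature of isMappedToPrimaryContig and its call site are assumptions):

// skip reads on alt/decoy contigs before accumulating coverage
if ( !filter.isMappedToPrimaryContig(read, readMetadata) ) continue;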
int intervalsContainingReadIndex = intervalsIndex;
SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);

while (intervalsContainingReadIndex < intervals.size() && indexedInterval.overlaps(readInterval)) {
while (intervalsContainingReadIndex < intervals.size()) {
final SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);
if ( !indexedInterval.overlaps(readInterval) ) break;
And now you can get rid of lines 59-61, too.
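Put together, the restructured loop would read roughly like this (a sketch combining the suggestion with the diff context below; the index advance is assumed):

while ( intervalsContainingReadIndex < intervals.size() ) {
    final SVInterval indexedInterval = intervals.get(intervalsContainingReadIndex);
    if ( !indexedInterval.overlaps(readInterval) ) break;
    // ... accumulate coverage for this interval, as in the snippet below ...
    intervalsContainingReadIndex += 1;
}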
intervalCoverage[intervalsContainingReadIndex] = new int[indexedInterval.getLength()];
}
for (int i = overlapInterval.getStart(); i < overlapInterval.getEnd(); i++) {
    intervalCoverage[intervalsContainingReadIndex][i - indexedInterval.getStart()] += 1;
I'm sure that the JVM is smart enough to optimize by pulling out the loop invariant, but too many years programming in C makes me suggest pulling out the constant reference to the int array in question:
final int[] coverageArray = intervalCoverage[intervalsContainingReadIndex];
for ( int i = ... ) {
    coverageArray[i - indexedInterval.getStart()] += 1;
}
done
final SVInterval readInterval = new SVInterval(readContigId, readStart, read.getUnclippedEnd()+1);
if ( indexedInterval.isDisjointFrom(readInterval) ) return noName;
final SVInterval unclippedReadInterval = new SVInterval(readContigId, read.getUnclippedStart(), read.getUnclippedEnd());
final SVInterval clippedReadInterval = new SVInterval(readContigId, read.getStart(), read.getEnd());
Make no interval before its time. Move this line below the next one.
done
final List<SVInterval> depthIntervals = new ArrayList<>((int) ((float) nRefBases / DEPTH_WINDOW_SIZE));
for (final SAMSequenceRecord sequenceRecord : header.getSequenceDictionary().getSequences()) {
    for (int i = 1; i < sequenceRecord.getSequenceLength(); i = i + DEPTH_WINDOW_SIZE) {
        depthIntervals.add(new SVInterval(readMetadata.getContigID(sequenceRecord.getSequenceName()), i, Math.min(sequenceRecord.getSequenceLength(), i + DEPTH_WINDOW_SIZE)));
readMetadata.getContigID(sequenceRecord.getSequenceName()) is a loop invariant (and involves a hashed lookup). Pull it out as a temporary variable.
(So is sequenceRecord.getSequenceLength(), but that's fast to compute, so it doesn't matter.)
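i.e., something like (a sketch based on the loop above; the temporary variable names are illustrative):

final int contigID = readMetadata.getContigID(sequenceRecord.getSequenceName()); // hoisted: one hashed lookup per contig
final int sequenceLength = sequenceRecord.getSequenceLength();
for (int i = 1; i < sequenceLength; i += DEPTH_WINDOW_SIZE) {
    depthIntervals.add(new SVInterval(contigID, i, Math.min(sequenceLength, i + DEPTH_WINDOW_SIZE)));
}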
done
…mbly qnames from reads that align solely within those high-depth intervals
Sorry. isMappedToPrimaryContig.
Since we’re not looking for evidence on alts, we probably don’t need info on pile-ups in alts, do we?
Force-pushed from 16bd967 to fe02a8a.
Thanks for the comments @tedsharpe, some good code cleanup here. I've addressed your comments and will merge after tests pass. I agree about the shuffle: I'd originally implemented this without it, but then found some places where high-depth intervals were getting clipped at the boundaries of depth-window partitions due to overlapping counts. In practice it doesn't seem to add a discernible amount of runtime, at least on our development clusters.
Codecov Report
@@             Coverage Diff              @@
##             master     #4438     +/-   ##
=============================================
- Coverage     79.156%    79.093%   -0.064%
- Complexity   16583      16606     +23
=============================================
  Files        1049       1049
  Lines        59510      59663    +153
  Branches     9747       9785     +38
=============================================
+ Hits         47106      47189    +83
- Misses       8620       8678     +58
- Partials     3784       3796     +12
This PR attempts to eliminate long-running, useless assemblies that significantly extend runtime on some samples: it adds a scan for intervals of unusually high depth and excludes reads from those regions from evidence gathering.
In addition, after observing that many long-running assemblies occur on non-primary reference contigs, we also exclude reads that map to non-primary contigs (as defined by the "cross-contig to ignore set") from evidence gathering.
Runtime on the CHM mix sample with this change is approximately 38 minutes, and our NA19238 snapshot now takes only 22 minutes, a significant drop in runtime. There are a few changes in the resulting call set, but they appear to be minor.