-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR series for complex SV, part 3 #3457
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3457 +/- ##
===============================================
- Coverage 80.073% 79.899% -0.174%
- Complexity 17798 17814 +16
===============================================
Files 1194 1198 +4
Lines 64540 64743 +203
Branches 10021 10056 +35
===============================================
+ Hits 51679 51729 +50
- Misses 8859 8998 +139
- Partials 4002 4016 +14
|
Step 4 towards #2703 |
* its name | ||
* its sequence as produced by the assembler (no reverse complement like in the SAM record if it maps to '-' strand), and | ||
* its stripped-down alignment information. | ||
* its name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When formatted as javadoc this will all appear as a single paragraph of text. To write a list in javadoc you need to use some basic HTML:
/**
* Locally assembled contig:
* <ul><li>its name
* <li>its sequence ...
* <li>its stripped-down ...
* </ul>
*/
Note that you don't need need </li>
as it's automatically closed by <li>
.
|
||
public final class FileUtils { | ||
|
||
public static void writeLinesToSingleFile(final Iterator<String> linesToWrite, final String fileName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this can be replaced with Files.write()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look, @pshapiro4broad. This would also us to write to local FS, hadoop, and google buckets, i.e. more general than Files.write()
.
public static boolean createDirInBucketToWriteTo(final String pathString) { | ||
try { | ||
Utils.nonNull(pathString); | ||
if ( java.nio.file.Files.exists(java.nio.file.Paths.get(pathString)) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, you should modify your imports so a fully qualified name isn't required for java.io.file.
symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to deal with name clashes between nio Files and hadoop Files.
@@ -36,9 +37,11 @@ | |||
public final int mismatches; | |||
public final int alnScore; | |||
|
|||
// if this is true, fields "mapQual", "mismatches", "alnScore" should be viewed with care as they were simply copied from the | |||
// original alignment (not for "mismatches"), which after the split are wrong (we didn't recompute them because that would require expensive SW re-alignment) | |||
// if any of the following boolean fields are true, fields "mapQual", "mismatches", "alnScore" should be viewed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like at least part of this should be put in a javadoc for this field.
novelAdjacencyReferenceLocations.complication, it.next(), contigAlignments, broadcastReference); | ||
result.add(mateRecord); | ||
} | ||
return result.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not return List<>
here? A caller can easily convert it to an iterator if needed. Returning List
provides more flexibility to the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's constraint by API in Spark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only being invoked in a lambda as opposed to via method reference right now. I do think that @pshapiro4broad is right in that returning lists is more flexible if we have them. Why not just return the list and then call iterator
on it in the lambda in dealWithSimpleStrandSwitchBkpts
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
68a5e44
to
12bed02
Compare
@cwhelan this is now ready. Please review. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good but I think it could use a little more commenting and/or descriptions for some of the meaty logic, and I'm a little unsure that I understood all of it.
* Computes overlap between reference span of the two input alignment intervals. | ||
*/ | ||
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) { | ||
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you rename the sortAlignments()
method to alignmentIntervalComparator()
or something similar? That method doesn't actually sort the alignments so the name doesn't seem right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
*/ | ||
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) { | ||
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0, | ||
"assumption that first input AI reside a place earlier than second input is violated: \n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message might read more clearly: "assumption that the first input AI is upstream of the second is violated". But, why have this requirement? Why not swap the order of comparison depending on which AI actually comes first? The signature of the method doesn't specify that the intervals need to be ordered so this could be frustrating to others trying to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the assertion because it is really unnecessary when computing overlaps (order doesn't matter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also added test
if ( !one.referenceSpan.getContig().equals(two.referenceSpan.getContig()) ) return 0; | ||
|
||
// dummy number for chr to be used in constructing SVInterval, since input CA has 2 AI & both map to the same chr | ||
final int dummyChr = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if you made this -1 instead of 1? That way it can't be mistaken for a real contig ID by someone who doesn't know what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. though this is a temp var that calling functions couldn't reach
private static final long serialVersionUID = 1L; | ||
|
||
public static Iterator<VariantContext> produceMultipleAnnotatedVcFromNovelAdjacency(final NovelAdjacencyReferenceLocations novelAdjacencyReferenceLocations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment to this method describing what the parameters should be? It's not clear what should be in the Iterable
s from the signature. And why pass in Iterables
instead of eg lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added and changed assertion to size 2 exactly
final List<VariantContext> result = new ArrayList<>(); | ||
result.add(record); | ||
// hack for now because up to this point inferredType would have max of 2 only | ||
while (it.hasNext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand why you're iterating over this if there are only going to be two things in the list. If there were more inferredTypes
, you'd create a bunch of duplicate records that differed only in their type field. Why not: make inferredTypes
a list, validate that there are only two of them, and then just grab the second one and remove the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to asserting only 2 (and checked)
final List<CigarElement> resultCEs = threeSections._1(); | ||
final int a = readBasesConsumed + ce.getLength() - clipLength; | ||
final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S; | ||
if (clipFrom3PrimeEnd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code might be a little more concise if you put these things in the sublist in the same order all the time, and then reverse the sublist if clipFrom3PrimeEnd
with Collections.reverse()
, and then append the sublist to resultCEs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
resultCEs.addAll(cigarElements.subList(idx+1, cigarElements.size())); | ||
} | ||
if (!threeSections._3().isEmpty()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't really need this if check, you could still add it in if it's empty.
} | ||
|
||
/** | ||
* Taking advantage of the fact that for input read, we know it has only two alignments that map to the same reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add an assertion to this method too if that's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
final JavaPairRDD<ChimericAlignment, byte[]> simpleStrandSwitchBkpts = | ||
longReads | ||
.mapToPair(SimpleStrandSwitchVariantDetector::convertAlignmentIntervalToChimericAlignment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me a more straightforward flow would be to filter
them on splitPairStrongEnoughEvidenceForCA
and then convert the remaining ones to ChimericAlignment
s, rather than filtering on nonNull
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
import java.io.IOException; | ||
|
||
public class VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest extends BaseTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth adding a test for extractCigarElements
as well. Perhaps some of the other methods in the class, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test for extractCigarElements
for now. I am planning to get the prototype code in first, then test the code extensively, just in case I need to leave for several days...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented requested changes (mostly, but not all) in 3 commits. @cwhelan please take another look. Thanks!
* Computes overlap between reference span of the two input alignment intervals. | ||
*/ | ||
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) { | ||
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
*/ | ||
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) { | ||
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0, | ||
"assumption that first input AI reside a place earlier than second input is violated: \n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the assertion because it is really unnecessary when computing overlaps (order doesn't matter).
*/ | ||
static int overlapOnRefSpan(final AlignmentInterval one, final AlignmentInterval two) { | ||
Utils.validateArg(AlignedContig.sortAlignments().compare(one, two) < 0, | ||
"assumption that first input AI reside a place earlier than second input is violated: \n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also added test
if ( !one.referenceSpan.getContig().equals(two.referenceSpan.getContig()) ) return 0; | ||
|
||
// dummy number for chr to be used in constructing SVInterval, since input CA has 2 AI & both map to the same chr | ||
final int dummyChr = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. though this is a temp var that calling functions couldn't reach
private static final long serialVersionUID = 1L; | ||
|
||
public static Iterator<VariantContext> produceMultipleAnnotatedVcFromNovelAdjacency(final NovelAdjacencyReferenceLocations novelAdjacencyReferenceLocations, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added and changed assertion to size 2 exactly
.noneMatch(op -> op.equals(CigarOperator.N) || op.isPadding()), | ||
"Input alignment contains padding or skip operations, which is currently unsupported: " + input.toPackedString()); | ||
|
||
final Tuple3<List<CigarElement>, List<CigarElement>, List<CigarElement>> threeSections = extractCigarElements(input); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
else { // enough read bases would be clipped | ||
|
||
if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I)) | ||
throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if ( !ce.getOperator().isAlignment() && !ce.getOperator().equals(CigarOperator.I)) | ||
throw new GATKException.ShouldNeverReachHereException("Logic error, should not reach here"); | ||
|
||
// dead with cigar first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should have been "deal with"....
|
||
import java.io.IOException; | ||
|
||
public class VariantDetectorFromLongReadAlignmentsForSimpleStrandSwitchUnitTest extends BaseTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added test for extractCigarElements
for now. I am planning to get the prototype code in first, then test the code extensively, just in case I need to leave for several days...
final List<CigarElement> resultCEs = threeSections._1(); | ||
final int a = readBasesConsumed + ce.getLength() - clipLength; | ||
final CigarOperator op = ce.getOperator().isAlignment() ? CigarOperator.M : CigarOperator.S; | ||
if (clipFrom3PrimeEnd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ping @cwhelan . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good now, just a few minor comments.
@@ -67,7 +72,8 @@ public boolean test(final AlignedContig contig) { | |||
|
|||
/** | |||
* Removes overlap from a designated alignment interval, so that the inverted duplicated reference span is minimal. | |||
* If the two alignment intervals are NOT overlapping, return the original read. | |||
* If the two alignment intervals are NOT overlapping, return the original aligned contig. | |||
* For algorithm {@see <a href="https://github.com/broadinstitute/dsde-methods-sv/pull/8"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is our private repo so I'm not sure we should reference it in comments in the public code -- it would just be frustrating for someone external.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed and removed.
But we are more and more in need of a central place for documenting the algorithms used in the various stages of the pipeline.
@@ -77,6 +83,8 @@ private static AlignedContig removeOverlap(final AlignedContig contig) { | |||
} else { | |||
final AlignmentInterval one = contig.alignmentIntervals.get(0), | |||
two = contig.alignmentIntervals.get(1); | |||
// js is for "the shoot-off reference location of a jump that linked to alignment intervals", and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be "linked two alignment intervals"?
I'm not sure I understand what "shoot-off" and "landing" are -- or, I guess I can imagine, but maybe something like "jumpStart" and "jumpEnd" might be more intuitive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
!OoO!, yes, it should be "two" instead of "to".
I have in mind a cartoonish depiction of a tiny anime figure sliding along the reference guided by the alignments of the read, sometimes jumping back and forth. Sorry for this too-pictorial naming scheme.
|
||
// deal with cigar first | ||
newMiddleSection.add( new CigarElement(clipLengthOnRead, CigarOperator.S) ); | ||
final int a = readBasesConsumed + ce.getLength() - clipLengthOnRead; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a more descriptive name for a
be basesRemainingOnCigarElementAfterClipping
? Just trying to make sure I understand the algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's right and updated with documentation
@@ -90,7 +90,7 @@ static VariantContext produceAnnotatedVcFromInferredTypeAndRefLocations(final Si | |||
.id(inferredType.getInternalVariantId()) | |||
.attribute(GATKSVVCFConstants.SVTYPE, inferredType.toString()); | |||
|
|||
if (inferredType instanceof SimpleSVType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's anything wrong with using instanceof
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strangely it failed... Plus it might be better if an alt route exists that doesn't use reflection and has equal line of code works
897ef07
to
c10d064
Compare
* added BreakEndVariantType for outputing BND formatted VCF records, and a corresponding method in AnnotatedVariantProducer (no SVLEN for such records); * added code path SimpleStrandSwitchVariantDetector for dealing with simple strand switch BND variants which we now emit; also added utility classes utility classes RDDUtils
c10d064
to
967283e
Compare
This PR deals with long reads with exactly two alignments (no other equally good alignment configuration), mapped to the same chromosome with strand switch, but NOT significantly overlapping each other.
We used to call inversions from such alignments, but it is more appropriate to emit BND records because a lot of times such signal is actually generated from inverted segmental duplications, or simply inverted mobile element insertions. To confidently interpret and distinguish between such events, we need other types of evidence, and is better to be dealt with downstream logic units.
Inverted duplications are NOT dealt with in this PR and is going to be in the next.
NEEDS TO WAIT UNTIL PART 1 & 2 ARE IN.