-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PathSeq Illumina adapter trimming and simple repeat masking #3354
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3354 +/- ##
===============================================
+ Coverage 80.472% 80.501% +0.029%
- Complexity 17507 17535 +28
===============================================
Files 1173 1175 +2
Lines 63376 63451 +75
Branches 9878 9894 +16
===============================================
+ Hits 51000 51079 +79
+ Misses 8428 8423 -5
- Partials 3948 3949 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any major problems so I'm pre-approving. The biggest issue is the comment I made in IlluminaAdapterTrimTransformer.java, which could be a logic error or could be down to lack of domain knowledge on my part.
if (++numMismatches > maxMismatches) break; | ||
} | ||
} | ||
if (numMismatches <= maxMismatches) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with adaptor sequence matching, but this seems like a potential problem. Someone would naively think that increasing maxMismatches will allow more poor matches through without affecting the already passing reads, but in principle this could cause a poorer match to be selected if a "good enough" match is found before the best one. Maybe better to keep a bestNumMismatches variable and a bestAlignmentStart, then update read on bestAlignmentStart if bestNumMismatches < maxMismatches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you are saying is that if some adapter ends with the sequence of another adapter, or if a particular adapter ends with itself e.g.:
AGCAT
GAGCAT
Both end in AGCAT
but if we had the read ending in GAGCAT
and (2) happens to be tested before (1), it would only trim 4 bases instead of 5. Also if you had an adapter AAAAA
and minimum clip length of 4, it would only clip 4 bases.
This is a very good point I didn't consider. I don't believe this is the case for any standard adapter sequences (Picard's MarkIlluminaAdapters also doesn't seem to account for it), but since this class exposes the use of custom adapters we should check for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of this, I'm also renaming the class to AdapterTrimTransformer
final int windowSize, threshAT, threshGC; | ||
|
||
public SimpleRepeatMaskTransformer(final int threshAT, final int threshGC, final int windowSize) { | ||
this.threshAT = threshAT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe good to put in an assert that threshAT and threshGC are less than windowSize (if not, it indicates a typo or logic error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import org.broadinstitute.hellbender.transformers.BaseQualityClipReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.BaseQualityReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.DUSTReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about this, but maybe better to import the classes individually so that dependencies can more easily be managed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intellij has this option of "optimizing the imports", under Menu section "Code". Try that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what IntelliJ does
|
||
//Apply simple repeat masking | ||
//See "Low-complexity DNA and simple repeats" at http://www.repeatmasker.org/webrepeatmaskerhelp.html | ||
reads = reads.map(new ReadTransformerSparkifier(new SimpleRepeatMaskTransformer(29, 29, 30))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these numbers may need to be tweaked, it would be good to add them to filterArgs. If they're final numbers, it may still be good to name them as constants (e.g. shortWindowLen, maxShortWindowGCCounts or similar, so that it's easier to figure out what they do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @TedBrookings . Generally we prefer to have no magic numbers baked in the code. And we do a favor for our future selfs not to have a magic number hiding here and there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added these as constants to the PSFilter class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A improvement comment, and a question on CIGAR in IlluminaAdapterTrimTransformer
that could potentially be a more serious issue.
import org.broadinstitute.hellbender.transformers.BaseQualityClipReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.BaseQualityReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.DUSTReadTransformer; | ||
import org.broadinstitute.hellbender.transformers.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intellij has this option of "optimizing the imports", under Menu section "Code". Try that?
|
||
//Apply simple repeat masking | ||
//See "Low-complexity DNA and simple repeats" at http://www.repeatmasker.org/webrepeatmaskerhelp.html | ||
reads = reads.map(new ReadTransformerSparkifier(new SimpleRepeatMaskTransformer(29, 29, 30))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @TedBrookings . Generally we prefer to have no magic numbers baked in the code. And we do a favor for our future selfs not to have a magic number hiding here and there.
@@ -47,8 +49,9 @@ | |||
* | |||
*/ | |||
@CommandLineProgramProperties(summary = "First, low-quality and repetitive sequences reads are filtered: \n\t(1) Remove secondary and " + | |||
"supplementary reads; \n\t(2) mask repetitive sequences using the sDUST algorithm; \n\t(3) hard clip according to masked " + | |||
"and low-quality bases; \n\t(4) remove reads below minimum length; \n\t(5) mask low-quality bases; \n\t(6) filter reads with" + | |||
"supplementary reads; \n\t(2) trim adapter sequences; \n\t(3) mask sequences with excessive A/T or G/C content; " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a technically-more correct spelling is "secondary and supplementary reads" -> "secondary and supplementary alignments", but may not make more sense in your application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. Done
@@ -47,8 +49,9 @@ | |||
* | |||
*/ | |||
@CommandLineProgramProperties(summary = "First, low-quality and repetitive sequences reads are filtered: \n\t(1) Remove secondary and " + | |||
"supplementary reads; \n\t(2) mask repetitive sequences using the sDUST algorithm; \n\t(3) hard clip according to masked " + | |||
"and low-quality bases; \n\t(4) remove reads below minimum length; \n\t(5) mask low-quality bases; \n\t(6) filter reads with" + | |||
"supplementary reads; \n\t(2) trim adapter sequences; \n\t(3) mask sequences with excessive A/T or G/C content; " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, this summary is a bit too detailed. It exposes implementation detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to leave it as is for now until we get official documentation up, otherwise it's impossible to tell what the parameters do.
* Trims adapter sequences from read ends. If the adapter sequence ends before the end of the read, trims the remaining | ||
* bases as well. | ||
*/ | ||
public class IlluminaAdapterTrimTransformer implements ReadTransformer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you intend this class to be final
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will also make SimpleRepeatMaskTransformer final.
* bases as well. | ||
*/ | ||
public class IlluminaAdapterTrimTransformer implements ReadTransformer { | ||
public static final long serialVersionUID = 1L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops - done.
public static final long serialVersionUID = 1L; | ||
private static final byte MASKED_BASE_QUAL = 2; | ||
|
||
final int windowSize, threshAT, threshGC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make them private? And, I am personally OK with declaring several primitives on the same line, but it most likely goes against our coding style. So your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made private and put each variable on a separate line
* Masks read bases with a supra-threshold number of A/T's or G/C's within a given window size. | ||
*/ | ||
public class SimpleRepeatMaskTransformer implements ReadTransformer { | ||
public static final long serialVersionUID = 1L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
@Override | ||
public GATKRead apply(final GATKRead read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A suggestion for how to implement it.
I'll start with what my understanding is:
This function, rolls along the provided sequence with a fixed-size window in a base--by-base fashion, calculates the AT/GC (with N) content within the window and mask out the bases and changes the qual accordingly if GC content is high or low.
Now my suggestion:
Since there are many reads to be checked against, I would imagine this could be a hot spot. So making it fast is probably a worthy thing to do:
- instead of summing up base counts for each window, why not store a
previousATNcount
andpreviousGCNcount
, considering that each time we step into a new window, these two variables are going to change only +/-1. This would get rid of thebasesCount
array; - as you have already did in the current implementation, the changes in bases count would be +/-1, why not, instead of setting bases and quals in each window, simply mark if the current window defined by
start
andend
should be masked, and finally maximally extend thestart
andend
, then go change those bases and quals - in
maskWindow
, instead of closed interval, why not make it semi-open [start, end)? this would also get rid of some of the "-1"'s in theapply()
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of summing up base counts for each window, why not store a previousATNcount and previousGCNcount, considering that each time we step into a new window, these two variables are going to change only +/-1. This would get rid of the basesCount array;
We could use two variables instead of an array, but I'm not sure if it's faster. I prefer using the array because it avoids a couple of extra if
statements.
as you have already did in the current implementation, the changes in bases count would be +/-1, why not, instead of setting bases and quals in each window, simply mark if the current window defined by start and end should be masked, and finally maximally extend the start and end, then go change those bases and quals
This could save a lot of time. Done.
in maskWindow, instead of closed interval, why not make it semi-open [start, end)? this would also get rid of some of the "-1"'s in the apply() function.
Done.
//We have a match. Clip from the beginning of the adapter to the end of the read. | ||
final String basesString = read.getBasesString(); | ||
final String qualString = new String(read.getBaseQualities()); | ||
read.setBases(basesString.substring(0, alignmentStart).getBytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part confuses me: by "clipping", do you mean set the bases to 'N' and qual to 2? In addition, a read should have a CIGAR, and clipping operations should update the CIGAR (either soft or hard), otherwise the object is in a incoherent state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard clip. Good point about cigars. I found a handy ReadClipper class to use that will take care of this.
cafde17
to
09d2929
Compare
👍 |
Adds additional filtering steps to the PathSeq filter to 1) trim adapter sequences and 2) mimic a simple filter used in RepeatMasker that masks windows with excessive A/T or G/C content.