Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PathSeq Illumina adapter trimming and simple repeat masking #3354

Merged
merged 4 commits into from
Aug 2, 2017

Conversation

mwalker174
Copy link
Contributor

Adds additional filtering steps to the PathSeq filter to 1) trim adapter sequences and 2) mimic a simple filter used in RepeatMasker that masks windows with excessive A/T or G/C content.

@codecov-io
Copy link

codecov-io commented Jul 24, 2017

Codecov Report

Merging #3354 into master will increase coverage by 0.029%.
The diff coverage is 94.667%.

@@               Coverage Diff               @@
##              master     #3354       +/-   ##
===============================================
+ Coverage     80.472%   80.501%   +0.029%     
- Complexity     17507     17535       +28     
===============================================
  Files           1173      1175        +2     
  Lines          63376     63451       +75     
  Branches        9878      9894       +16     
===============================================
+ Hits           51000     51079       +79     
+ Misses          8428      8423        -5     
- Partials        3948      3949        +1
Impacted Files Coverage Δ Complexity Δ
...bender/tools/spark/pathseq/PathSeqFilterSpark.java 70.968% <ø> (ø) 7 <0> (ø) ⬇️
...itute/hellbender/tools/spark/pathseq/PSFilter.java 92.617% <100%> (+0.531%) 33 <1> (+1) ⬆️
...ools/spark/pathseq/PSFilterArgumentCollection.java 80% <100%> (+1.429%) 2 <0> (ø) ⬇️
...ellbender/transformers/AdapterTrimTransformer.java 92.857% <92.857%> (ø) 12 <12> (?)
...nder/transformers/SimpleRepeatMaskTransformer.java 94.286% <94.286%> (ø) 11 <11> (?)
...nstitute/hellbender/utils/clipping/ClippingOp.java 84.365% <0%> (+1.629%) 91% <0%> (+2%) ⬆️
...stitute/hellbender/utils/clipping/ReadClipper.java 71.038% <0%> (+1.639%) 75% <0%> (+2%) ⬆️

Copy link
Contributor

@TedBrookings TedBrookings left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any major problems so I'm pre-approving. The biggest issue is the comment I made in IlluminaAdapterTrimTransformer.java, which could be a logic error or could be down to lack of domain knowledge on my part.

if (++numMismatches > maxMismatches) break;
}
}
if (numMismatches <= maxMismatches) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with adaptor sequence matching, but this seems like a potential problem. Someone would naively think that increasing maxMismatches will allow more poor matches through without affecting the already passing reads, but in principle this could cause a poorer match to be selected if a "good enough" match is found before the best one. Maybe better to keep a bestNumMismatches variable and a bestAlignmentStart, then update read on bestAlignmentStart if bestNumMismatches < maxMismatches

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you are saying is that if some adapter ends with the sequence of another adapter, or if a particular adapter ends with itself e.g.:

  1. AGCAT
  2. GAGCAT

Both end in AGCAT but if we had the read ending in GAGCAT and (2) happens to be tested before (1), it would only trim 4 bases instead of 5. Also if you had an adapter AAAAA and minimum clip length of 4, it would only clip 4 bases.

This is a very good point I didn't consider. I don't believe this is the case for any standard adapter sequences (Picard's MarkIlluminaAdapters also doesn't seem to account for it), but since this class exposes the use of custom adapters we should check for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of this, I'm also renaming the class to AdapterTrimTransformer

final int windowSize, threshAT, threshGC;

public SimpleRepeatMaskTransformer(final int threshAT, final int threshGC, final int windowSize) {
this.threshAT = threshAT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe good to put in an assert that threshAT and threshGC are less than windowSize (if not, it indicates a typo or logic error).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import org.broadinstitute.hellbender.transformers.BaseQualityClipReadTransformer;
import org.broadinstitute.hellbender.transformers.BaseQualityReadTransformer;
import org.broadinstitute.hellbender.transformers.DUSTReadTransformer;
import org.broadinstitute.hellbender.transformers.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this, but maybe better to import the classes individually so that dependencies can more easily be managed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intellij has this option of "optimizing the imports", under Menu section "Code". Try that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what IntelliJ does


//Apply simple repeat masking
//See "Low-complexity DNA and simple repeats" at http://www.repeatmasker.org/webrepeatmaskerhelp.html
reads = reads.map(new ReadTransformerSparkifier(new SimpleRepeatMaskTransformer(29, 29, 30)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these numbers may need to be tweaked, it would be good to add them to filterArgs. If they're final numbers, it may still be good to name them as constants (e.g. shortWindowLen, maxShortWindowGCCounts or similar, so that it's easier to figure out what they do).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @TedBrookings . Generally we prefer to have no magic numbers baked in the code. And we do a favor for our future selfs not to have a magic number hiding here and there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added these as constants to the PSFilter class.

Copy link
Contributor

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A improvement comment, and a question on CIGAR in IlluminaAdapterTrimTransformer that could potentially be a more serious issue.

import org.broadinstitute.hellbender.transformers.BaseQualityClipReadTransformer;
import org.broadinstitute.hellbender.transformers.BaseQualityReadTransformer;
import org.broadinstitute.hellbender.transformers.DUSTReadTransformer;
import org.broadinstitute.hellbender.transformers.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intellij has this option of "optimizing the imports", under Menu section "Code". Try that?


//Apply simple repeat masking
//See "Low-complexity DNA and simple repeats" at http://www.repeatmasker.org/webrepeatmaskerhelp.html
reads = reads.map(new ReadTransformerSparkifier(new SimpleRepeatMaskTransformer(29, 29, 30)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @TedBrookings . Generally we prefer to have no magic numbers baked in the code. And we do a favor for our future selfs not to have a magic number hiding here and there.

@@ -47,8 +49,9 @@
*
*/
@CommandLineProgramProperties(summary = "First, low-quality and repetitive sequences reads are filtered: \n\t(1) Remove secondary and " +
"supplementary reads; \n\t(2) mask repetitive sequences using the sDUST algorithm; \n\t(3) hard clip according to masked " +
"and low-quality bases; \n\t(4) remove reads below minimum length; \n\t(5) mask low-quality bases; \n\t(6) filter reads with" +
"supplementary reads; \n\t(2) trim adapter sequences; \n\t(3) mask sequences with excessive A/T or G/C content; " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a technically-more correct spelling is "secondary and supplementary reads" -> "secondary and supplementary alignments", but may not make more sense in your application.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. Done

@@ -47,8 +49,9 @@
*
*/
@CommandLineProgramProperties(summary = "First, low-quality and repetitive sequences reads are filtered: \n\t(1) Remove secondary and " +
"supplementary reads; \n\t(2) mask repetitive sequences using the sDUST algorithm; \n\t(3) hard clip according to masked " +
"and low-quality bases; \n\t(4) remove reads below minimum length; \n\t(5) mask low-quality bases; \n\t(6) filter reads with" +
"supplementary reads; \n\t(2) trim adapter sequences; \n\t(3) mask sequences with excessive A/T or G/C content; " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, this summary is a bit too detailed. It exposes implementation detail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to leave it as is for now until we get official documentation up, otherwise it's impossible to tell what the parameters do.

* Trims adapter sequences from read ends. If the adapter sequence ends before the end of the read, trims the remaining
* bases as well.
*/
public class IlluminaAdapterTrimTransformer implements ReadTransformer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you intend this class to be final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will also make SimpleRepeatMaskTransformer final.

* bases as well.
*/
public class IlluminaAdapterTrimTransformer implements ReadTransformer {
public static final long serialVersionUID = 1L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops - done.

public static final long serialVersionUID = 1L;
private static final byte MASKED_BASE_QUAL = 2;

final int windowSize, threshAT, threshGC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make them private? And, I am personally OK with declaring several primitives on the same line, but it most likely goes against our coding style. So your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made private and put each variable on a separate line

* Masks read bases with a supra-threshold number of A/T's or G/C's within a given window size.
*/
public class SimpleRepeatMaskTransformer implements ReadTransformer {
public static final long serialVersionUID = 1L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

@Override
public GATKRead apply(final GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion for how to implement it.

I'll start with what my understanding is:
This function, rolls along the provided sequence with a fixed-size window in a base--by-base fashion, calculates the AT/GC (with N) content within the window and mask out the bases and changes the qual accordingly if GC content is high or low.

Now my suggestion:
Since there are many reads to be checked against, I would imagine this could be a hot spot. So making it fast is probably a worthy thing to do:

  • instead of summing up base counts for each window, why not store a previousATNcount and previousGCNcount, considering that each time we step into a new window, these two variables are going to change only +/-1. This would get rid of the basesCount array;
  • as you have already did in the current implementation, the changes in bases count would be +/-1, why not, instead of setting bases and quals in each window, simply mark if the current window defined by start and end should be masked, and finally maximally extend the start and end, then go change those bases and quals
  • in maskWindow, instead of closed interval, why not make it semi-open [start, end)? this would also get rid of some of the "-1"'s in the apply() function.

Copy link
Contributor Author

@mwalker174 mwalker174 Jul 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of summing up base counts for each window, why not store a previousATNcount and previousGCNcount, considering that each time we step into a new window, these two variables are going to change only +/-1. This would get rid of the basesCount array;

We could use two variables instead of an array, but I'm not sure if it's faster. I prefer using the array because it avoids a couple of extra if statements.

as you have already did in the current implementation, the changes in bases count would be +/-1, why not, instead of setting bases and quals in each window, simply mark if the current window defined by start and end should be masked, and finally maximally extend the start and end, then go change those bases and quals

This could save a lot of time. Done.

in maskWindow, instead of closed interval, why not make it semi-open [start, end)? this would also get rid of some of the "-1"'s in the apply() function.

Done.

//We have a match. Clip from the beginning of the adapter to the end of the read.
final String basesString = read.getBasesString();
final String qualString = new String(read.getBaseQualities());
read.setBases(basesString.substring(0, alignmentStart).getBytes());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part confuses me: by "clipping", do you mean set the bases to 'N' and qual to 2? In addition, a read should have a CIGAR, and clipping operations should update the CIGAR (either soft or hard), otherwise the object is in a incoherent state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard clip. Good point about cigars. I found a handy ReadClipper class to use that will take care of this.

@SHuang-Broad
Copy link
Contributor

👍

@mwalker174 mwalker174 merged commit c5b3c9a into master Aug 2, 2017
@mwalker174 mwalker174 deleted the mw_adapter_repeat branch August 2, 2017 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants