-
Notifications
You must be signed in to change notification settings - Fork 606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correctly parsing paired-end info with Cutadapt module #1198
Comments
On a side note, an additional regex could be provided for the number of reads written: For single-end data: For paired-end data: with the possibility of having an additional barplot that shows the proportion of reads trimmed and written. |
Happy to write more regexes to support more fields 👍 So after looking into making a bar plot showing the proportion of reads handled, I have settled upon a fairly simple one that shows the number of reads or pairs that were written to the file or removed due to various filtering reasons. I haven't bothered duplicating the pair counts, I've just scraped both. If you generate a report with a mixture of SE and PE data you'll get a lot of categories, but usually there will be just one data type. Then the irrelevant categories will just be hidden. So basically doing either SE or PE will show either reads or pairs (with the same colours etc). This essentially mimics the Cutadapt output. For your example files, you don't have any filtered reads or pairs, so you get a fairly dull plot: The SAMPLE1_PE_2.trim:
bp_processed: 14363465
bp_written: 14171979
pairs_processed: 24135
pairs_written: 24135
percent_trimmed: 1.3331462846882698
r1_with_adapters: 10092
r2_with_adapters: 9863
SAMPLE2_PE_2.trim:
bp_processed: 11431354
bp_written: 11290725
pairs_processed: 19202
pairs_written: 19202
percent_trimmed: 1.2302042260260684
r1_with_adapters: 7598
r2_with_adapters: 7481
SAMPLE3_SE.trim:
bp_processed: 13923021
bp_written: 13738774
percent_trimmed: 1.3233263097139623
r_processed: 46659
r_with_adapters: 18941
r_written: 46659 Showing the proportion of reads with adapter contamination is tricky (I think impossible). That is reported separately to the filtered categories, so we don't know which of the filtering categories the adapter contamination counts go into. I could plot the proportion of reads with adapter contamination, but that would be for all raw reads so I'm not sure how interesting it is.. All make sense? Does this now do what you want it to do? Phil |
Yep! This is perfect! Thanks. I can just multiply the number of reads by 2 after parsing the yaml because I am reporting everything by read and not pairs. Just makes things easier for both SE and PE metrics. |
Cutadapt outputs the info for number of reads processed differently for single-end and paired-end data so the
r_processed
value is not being read in and dumped inmultiqc_data/multiqc_cutadapt.yaml
I think these two lines:
https://github.com/ewels/MultiQC/blob/80b88964d5483cbb416a2a1ffd093a4828626c0d/multiqc/modules/cutadapt/cutadapt.py#L67-L68
should be:
where
r_with_adapters
would ber1_with_adapters
+r2_with_adapters
andr_processed
would have to multiplied by 2 to report total reads and not read pairs to keep things consistent with the reporting for single-end reads.Files to test and fix the bug are below for paired-end (
PE
) and single-end (SE
) data:SAMPLE1_PE.cutadapt.log
SAMPLE2_PE.cutadapt.log
SAMPLE3_SE.cutadapt.log
Using
MultiQC v1.8
.Peace!
The text was updated successfully, but these errors were encountered: