Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly parsing paired-end info with Cutadapt module #1198

Closed
drpatelh opened this issue May 26, 2020 · 3 comments
Closed

Correctly parsing paired-end info with Cutadapt module #1198

drpatelh opened this issue May 26, 2020 · 3 comments

Comments

@drpatelh
Copy link
Contributor

drpatelh commented May 26, 2020

Cutadapt outputs the info for number of reads processed differently for single-end and paired-end data so the r_processed value is not being read in and dumped in multiqc_data/multiqc_cutadapt.yaml

SAMPLE1_PE_2:
  bp_processed: 14363465
  bp_written: 14171979
  percent_trimmed: 1.3331462846882698
SAMPLE2_PE_2:
  bp_processed: 11431354
  bp_written: 11290725
  percent_trimmed: 1.2302042260260684
SAMPLE3_SE:
  bp_processed: 13923021
  bp_written: 13738774
  percent_trimmed: 1.3233263097139623
  r_processed: 46659
  r_with_adapters: 18941

I think these two lines:
https://github.com/ewels/MultiQC/blob/80b88964d5483cbb416a2a1ffd093a4828626c0d/multiqc/modules/cutadapt/cutadapt.py#L67-L68

should be:

'r_processed': "Total read pairs processed:\s*([\d,]+)",
'r1_with_adapters': "Read 1 with adapter:\s*([\d,]+)",
'r2_with_adapters': "Read 2 with adapter:\s*([\d,]+)",

where r_with_adapters would be r1_with_adapters + r2_with_adapters and r_processed would have to multiplied by 2 to report total reads and not read pairs to keep things consistent with the reporting for single-end reads.

Files to test and fix the bug are below for paired-end (PE) and single-end (SE) data:
SAMPLE1_PE.cutadapt.log
SAMPLE2_PE.cutadapt.log
SAMPLE3_SE.cutadapt.log

Using MultiQC v1.8.

Peace!

@drpatelh drpatelh changed the title Correclty parsing paired-end info with Cutadapt module Correctly parsing paired-end info with Cutadapt module May 26, 2020
@drpatelh
Copy link
Contributor Author

drpatelh commented May 26, 2020

On a side note, an additional regex could be provided for the number of reads written:

For single-end data:
'r_written': "Reads written (passing filters):\s*([\d,]+)",

For paired-end data:
'r_written': "Pairs written (passing filters):\s*([\d,]+)",

with the possibility of having an additional barplot that shows the proportion of reads trimmed and written.

@ewels
Copy link
Member

ewels commented May 26, 2020

Happy to write more regexes to support more fields 👍

So after looking into making a bar plot showing the proportion of reads handled, I have settled upon a fairly simple one that shows the number of reads or pairs that were written to the file or removed due to various filtering reasons.

I haven't bothered duplicating the pair counts, I've just scraped both. If you generate a report with a mixture of SE and PE data you'll get a lot of categories, but usually there will be just one data type. Then the irrelevant categories will just be hidden. So basically doing either SE or PE will show either reads or pairs (with the same colours etc). This essentially mimics the Cutadapt output.

For your example files, you don't have any filtered reads or pairs, so you get a fairly dull plot:

cutadapt_filtered_reads_plot

The multiqc_data/multiqc_cutadapt.yaml output file is a little more interesting though:

SAMPLE1_PE_2.trim:
  bp_processed: 14363465
  bp_written: 14171979
  pairs_processed: 24135
  pairs_written: 24135
  percent_trimmed: 1.3331462846882698
  r1_with_adapters: 10092
  r2_with_adapters: 9863
SAMPLE2_PE_2.trim:
  bp_processed: 11431354
  bp_written: 11290725
  pairs_processed: 19202
  pairs_written: 19202
  percent_trimmed: 1.2302042260260684
  r1_with_adapters: 7598
  r2_with_adapters: 7481
SAMPLE3_SE.trim:
  bp_processed: 13923021
  bp_written: 13738774
  percent_trimmed: 1.3233263097139623
  r_processed: 46659
  r_with_adapters: 18941
  r_written: 46659

Showing the proportion of reads with adapter contamination is tricky (I think impossible). That is reported separately to the filtered categories, so we don't know which of the filtering categories the adapter contamination counts go into. I could plot the proportion of reads with adapter contamination, but that would be for all raw reads so I'm not sure how interesting it is..

All make sense? Does this now do what you want it to do?

Phil

@drpatelh
Copy link
Contributor Author

Yep! This is perfect! Thanks. I can just multiply the number of reads by 2 after parsing the yaml because I am reporting everything by read and not pairs. Just makes things easier for both SE and PE metrics.

ewels added a commit to MultiQC/test-data that referenced this issue May 26, 2020
@ewels ewels closed this as completed May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants