Correctly parsing paired-end info with Cutadapt module #1198

drpatelh · 2020-05-26T07:07:09Z

Cutadapt outputs the info for number of reads processed differently for single-end and paired-end data so the r_processed value is not being read in and dumped in multiqc_data/multiqc_cutadapt.yaml

SAMPLE1_PE_2:
  bp_processed: 14363465
  bp_written: 14171979
  percent_trimmed: 1.3331462846882698
SAMPLE2_PE_2:
  bp_processed: 11431354
  bp_written: 11290725
  percent_trimmed: 1.2302042260260684
SAMPLE3_SE:
  bp_processed: 13923021
  bp_written: 13738774
  percent_trimmed: 1.3233263097139623
  r_processed: 46659
  r_with_adapters: 18941

I think these two lines:
https://github.com/ewels/MultiQC/blob/80b88964d5483cbb416a2a1ffd093a4828626c0d/multiqc/modules/cutadapt/cutadapt.py#L67-L68

should be:

'r_processed': "Total read pairs processed:\s*([\d,]+)",
'r1_with_adapters': "Read 1 with adapter:\s*([\d,]+)",
'r2_with_adapters': "Read 2 with adapter:\s*([\d,]+)",

where r_with_adapters would be r1_with_adapters + r2_with_adapters and r_processed would have to multiplied by 2 to report total reads and not read pairs to keep things consistent with the reporting for single-end reads.

Files to test and fix the bug are below for paired-end (PE) and single-end (SE) data:
SAMPLE1_PE.cutadapt.log
SAMPLE2_PE.cutadapt.log
SAMPLE3_SE.cutadapt.log

Using MultiQC v1.8.

Peace!

The text was updated successfully, but these errors were encountered:

drpatelh · 2020-05-26T07:14:38Z

On a side note, an additional regex could be provided for the number of reads written:

For single-end data:
'r_written': "Reads written (passing filters):\s*([\d,]+)",

For paired-end data:
'r_written': "Pairs written (passing filters):\s*([\d,]+)",

with the possibility of having an additional barplot that shows the proportion of reads trimmed and written.

ewels · 2020-05-26T09:04:17Z

Happy to write more regexes to support more fields 👍

So after looking into making a bar plot showing the proportion of reads handled, I have settled upon a fairly simple one that shows the number of reads or pairs that were written to the file or removed due to various filtering reasons.

I haven't bothered duplicating the pair counts, I've just scraped both. If you generate a report with a mixture of SE and PE data you'll get a lot of categories, but usually there will be just one data type. Then the irrelevant categories will just be hidden. So basically doing either SE or PE will show either reads or pairs (with the same colours etc). This essentially mimics the Cutadapt output.

For your example files, you don't have any filtered reads or pairs, so you get a fairly dull plot:

The multiqc_data/multiqc_cutadapt.yaml output file is a little more interesting though:

SAMPLE1_PE_2.trim:
  bp_processed: 14363465
  bp_written: 14171979
  pairs_processed: 24135
  pairs_written: 24135
  percent_trimmed: 1.3331462846882698
  r1_with_adapters: 10092
  r2_with_adapters: 9863
SAMPLE2_PE_2.trim:
  bp_processed: 11431354
  bp_written: 11290725
  pairs_processed: 19202
  pairs_written: 19202
  percent_trimmed: 1.2302042260260684
  r1_with_adapters: 7598
  r2_with_adapters: 7481
SAMPLE3_SE.trim:
  bp_processed: 13923021
  bp_written: 13738774
  percent_trimmed: 1.3233263097139623
  r_processed: 46659
  r_with_adapters: 18941
  r_written: 46659

Showing the proportion of reads with adapter contamination is tricky (I think impossible). That is reported separately to the filtered categories, so we don't know which of the filtering categories the adapter contamination counts go into. I could plot the proportion of reads with adapter contamination, but that would be for all raw reads so I'm not sure how interesting it is..

All make sense? Does this now do what you want it to do?

Phil

See #1198

drpatelh · 2020-05-26T09:11:47Z

Yep! This is perfect! Thanks. I can just multiply the number of reads by 2 after parsing the yaml because I am reporting everything by read and not pairs. Just makes things easier for both SE and PE metrics.

See MultiQC/MultiQC#1198

drpatelh changed the title ~~Correclty parsing paired-end info with Cutadapt module~~ Correctly parsing paired-end info with Cutadapt module May 26, 2020

ewels added the module: change label May 26, 2020

ewels added a commit that referenced this issue May 26, 2020

Cutadapt: New bar plot showing filtered reads or pairs.

e7ed105

See #1198

ewels added a commit to MultiQC/test-data that referenced this issue May 26, 2020

More cutadapt logs.

18706d7

See MultiQC/MultiQC#1198

ewels closed this as completed May 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Correctly parsing paired-end info with Cutadapt module #1198

Correctly parsing paired-end info with Cutadapt module #1198

drpatelh commented May 26, 2020 •

edited by ewels

Loading

drpatelh commented May 26, 2020 •

edited

Loading

ewels commented May 26, 2020

drpatelh commented May 26, 2020

Correctly parsing paired-end info with Cutadapt module #1198

Correctly parsing paired-end info with Cutadapt module #1198

Comments

drpatelh commented May 26, 2020 • edited by ewels Loading

drpatelh commented May 26, 2020 • edited Loading

ewels commented May 26, 2020

drpatelh commented May 26, 2020

drpatelh commented May 26, 2020 •

edited by ewels

Loading

drpatelh commented May 26, 2020 •

edited

Loading