Add UMI tools #435

grst · 2020-07-06T14:48:46Z

Adds a umi_tools extract step before trimming and a umi_tools dedup step before QC/quantification.

Closes #73 (Add optional UMI handling).
Fixes missing reference file channels when using gzipped reference and --skip_rsem. See commit message 14b337a for more details.

PR checklist

PR is to dev rather than master
This comment contains a description of changes (with reason)
If you've fixed a bug or added code that should be tested, add tests!
~~If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo~~
Ensure the test suite passes (nextflow run . -profile test,docker).
Make sure your code lints (nf-core lint .).
Documentation in docs is updated
CHANGELOG.md is updated
README.md is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md

apeltzer · 2020-07-07T11:31:20Z

Could you have a look at this one too while you're at it? ;-)

#432

The pipeline used to fail with `--skip_rsem` and gzipped reference data. The code block that decompresses the GTF and FASTA files was only executed when RSEM needed them to build its reference. With `--skip_rsem`, the references were therefore not extracted, and the process combining the normal and additional references failed with a missing channel. This commit ensures that the references are unzipped whenever an additional FASTA file is used. I added `--skip_rsem` as an additional test case to the CI script.

grst · 2020-07-09T10:04:04Z

main.nf

+    */
+    if(params.with_umi) {
+        // preseq does not work on deduplicated BAM file. Pass it the raw BAM file. 
+        bam.into {bam_umitools_dedup; bam_preseq}


I'm a bit unsure which of the QC tools should be executed on the deduplicated BAM and which on the raw BAM.

ewels · 2020-07-09T13:27:32Z

Looking great! We've been planning to look into something similar for a while, though our UMIs are part of the barcode read so would be in a third FastQ file per sample. I guess this code is for inline UMIs that are part of the main read - do you see any overlap between this PR and what we will need to implement for barcode UMIs? Just trying to think about future-proofing 😄

grst · 2020-07-09T17:56:52Z

I don't know how exactly you'd run UMI tools with an index read. Possibly one would have to run the extract step twice, once with R1+index and once with R2+index. In any case, it should only affect the extract step; the dedup step should remain identical.

ewels · 2020-07-09T18:18:30Z

Yeah, I figured we might need a separate step to get the UMI into the read name in the FastQ files. But it's great that you think that the dedup step should work the same.

amayer21 · 2020-07-13T11:06:21Z

This is great!!! I need the UMI tools and was planning to try to add them into the RNAseq pipeline this week :-)
I've cloned this branch and made the corresponding Singularity container to test it on my data hopefully today

kkamieniecka · 2020-07-13T11:15:42Z

> This is great!!! I need the UMI tools and was planning to try to add them into the RNAseq pipeline this week :-)
> I've cloned this branch and made the corresponding Singularity container to test it on my data hopefully today

I have the same plan for this week, and I hope I can make it work on my data.

TomKellyGenetics

Looks great! Thanks for doing this.

apeltzer · 2020-07-14T08:20:51Z

@kmurat1 @amayer21 - once you've successfully tested, can you ping here?

amayer21 · 2020-07-14T15:43:55Z

Sorry I commented on the test I did on the slack channel and didn't copy the comment here... Here it is:

I gave it a try on a small fastq file and ended up with an error message that comes from RSEM and not from UMI tools...

Error executing process > 'merge_rsem_genes (GEO-417-CTRL24_NGS20-N105_AHCVGYDSXY_S443_L003_R1_001_umi_extracted_deduplicated.genes)'

(full job log here: nfcorernaseq_withUMItools_5335509.log)

This message doesn't appear and all goes well when using the option --skip_rsem.

I'm planning to look into it (I have never used RSEM before), but I'm also trying to solve some major problems with our cluster together with our sysadmin, so I haven't been able to spend enough time on this yet...

amayer21 · 2020-07-15T13:58:53Z

So, the error message from the merge_rsem_genes step was coming from the fact that I was processing only one fastq file. It ran without problems when I just duplicated my fastq file and ran it on both together. I guess it's pretty unusual to run a pipeline on only 1 sample... but maybe we still want to fix that just in case somebody needs to do so?

grst · 2020-07-15T14:10:42Z

it should absolutely work with only one sample... do you want to try to fix it?

amayer21 · 2020-07-15T21:22:58Z

I will have a look tomorrow morning. I'm a newbie in Nextflow but really keen to learn and put it into practice.

amayer21 · 2020-07-16T19:06:04Z

How should I proceed? I don't think I can commit anything in the PR, can I? Do I need to fork grst:add-umi-tools and then open a PR on that repository? Or shall I let you know what to change? There are only a couple of changes in one section of the main.nf script.

grst · 2020-07-16T19:36:13Z

Hi @amayer21,

yes a PR against my fork should work.

ewels · 2020-07-16T20:24:31Z

Pasted from Slack for future googlers:

If it's just for small changes, it's best to propose them in comments on the pull request. See the GitHub docs

Basically, make a comment on one or more specific lines, then click the +- icon in the toolbar and write the code that you would like to see. The comment will show a diff which @gregor Sturm or anyone can then easily accept to commit into the PR.

This is what it looks like:

amayer21 · 2020-07-16T21:12:53Z

Thanks Phil. As explained on Slack, I realised when trying to add comments to the code of this PR that the changes are in code that hasn't been modified in this PR, so I can't comment on it. This means that the problem I saw when trying this branch is already present on the dev branch of the pipeline.

To summarise:
(1) I've tested the UMI-tools steps on single-end data, and they worked when processing several fastq files.
(2) When I tested with only one input fastq file, it failed with the default options but worked with the --skip_rsem option. This can be fixed by 2 minor changes in the code of the merge_rsem_genes process.
(3) While looking at this section of the code, I noticed that the output of "merge_rsem_genes" is called "rsem_tpm_gene.txt", although it doesn't contain the TPM (transcripts per million) column of the RSEM output but the "expected counts" one. I propose to change the name of this output (this needs to be done in the code and in the documentation).

So UMI-tools seems to work for me (at least for single-end reads; I could test paired-end ones tomorrow if needed).
And for the RSEM issues, I can fork the dev branch and make the corrections there. NB: in the meantime I've also made a PR to Gregor's fork, so that could be an alternative (but I did change the name of rsem output without changing the documentation yet so maybe I should finish that first)?

Sorry I'm very new to collaborating on GitHub so I made things more complicated than they were...

Thank you very much for your help!

merge_rsem_gene was crashing when the pipeline had only 1 sample as input => now fixed. I've also corrected the name of the output file of that same process, as the column of RSEM output we keep is the "expected count" column and not the "TPM" one. I think it was confusing to call the output "rsem_tpm_gene.txt" as this wasn't "transcripts per million".

grst · 2020-07-17T06:01:46Z

> while looking at this section of the code, I've noticed that the outputs of "merge_rsem_genes" is called "rsem_tpm_gene.txt" while it doesn't contain the TPM (transcript per million) column of RSEM output

Oh really!? 😕
It should be the TPM column though, we need to fix that.
(or even store both)

amayer21 · 2020-07-17T07:23:15Z

It's easy to fix if we want to get the TPMs (just need to change -f 5 into -f 6).

$ cat GEO_160_TCR24_S162_L001_R1_umi_extracted_deduplicated.genes.results | head -n 1
gene_id	transcript_id(s)	length	effective_length	expected_count	TPM	FPKM
$ cat GEO_160_TCR24_S162_L001_R1_umi_extracted_deduplicated.genes.results | cut -f 5 | head -n 1
expected_count

Having said that, I've never used RSEM, but it looks to me that TPMs are easy to compute from counts, and several tools start from raw counts rather than counts per million and perform their own normalisation.
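To make the "easy to compute from counts" point concrete, here is a minimal sketch assuming the usual definition of TPM: each gene's expected count is divided by its effective length, and the resulting rates are rescaled so that they sum to one million. The table and its values are invented for illustration, and RSEM's own reported TPMs may differ in detail from this simplified calculation.

```shell
# Invented example table with three tab-separated columns:
# gene_id, effective_length, expected_count.
tbl=$(mktemp)
printf 'geneA\t900\t90\n'    > "$tbl"
printf 'geneB\t1900\t285\n' >> "$tbl"

# Two passes over the same file: the first pass sums the length-normalised
# rates, the second pass rescales each gene's rate to "per million".
tpm=$(awk -F '\t' '
    NR == FNR { if ($2 > 0) total += $3 / $2; next }
    {
        rate = ($2 > 0) ? $3 / $2 : 0
        printf "%s\t%.0f\n", $1, ((total > 0) ? 1e6 * rate / total : 0)
    }
' "$tbl" "$tbl")

printf '%s\n' "$tpm"
rm -f "$tbl"
```

Genes with an effective length of zero are given a TPM of zero here, which is a choice made for this sketch rather than something taken from RSEM.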

In the tutorial (https://github.com/bli25broad/RSEM_tutorial#single), they explain (looking at .isoforms.results):

The sixth column gives the expression level for each isoform in TPM (Transcript per Million). TPM is a relative measure of expression levels. It represents the number of copies each isoform should have supposing the whole transcriptome contains exactly 1 million transcripts. The fifth column provides the expected read count in each transcript, which can be utilized by tools like EBSeq, DESeq and edgeR for differential expression analysis. The format of gene-level result file, LPS_6h.genes.results, is very similar.

amayer21 · 2020-07-17T07:27:57Z

If we keep the expected counts in the final output (instead of, or together with, the TPMs), we also need to update this page:
https://github.com/grst/rnaseq/blob/add-umi-tools/docs/output.md#rsem

grst · 2020-07-17T07:44:26Z

I would keep both.
If I need counts rather than TPM, I would just go with featureCounts instead.
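Keeping both could be as simple as merging column 5 (expected_count) and column 6 (TPM) separately across samples. The following is only a sketch under stated assumptions, not the pipeline's actual merge_rsem_genes code: `make_sample` and `merge_column` are hypothetical helpers, the sample files are invented, and every file is assumed to list the same genes in the same order. It is written so that a single input file works just as well as several.

```shell
# Hypothetical helper: write a tiny RSEM-like *.genes.results file.
# Arguments: path, expected_count for geneA, expected_count for geneB.
make_sample() {
    printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n' > "$1"
    printf 'geneA\ttxA1\t1000\t900\t%s\t400000\t1.0\n' "$2" >> "$1"
    printf 'geneB\ttxB1\t2000\t1900\t%s\t600000\t1.5\n' "$3" >> "$1"
}

# Hypothetical helper: merge one column from one or more files into a
# gene-by-sample table. Arguments: column number, then the input files.
merge_column() {
    col=$1; shift
    merged=$(mktemp)
    # Gene IDs come from the first file (its header line is replaced).
    { printf 'gene_id\n'; tail -n +2 "$1" | cut -f 1; } > "$merged"
    for f in "$@"; do
        next=$(mktemp)
        # One new column per sample, headed by the sample name.
        { basename "$f" .genes.results; tail -n +2 "$f" | cut -f "$col"; } \
            | paste "$merged" - > "$next"
        mv "$next" "$merged"
    done
    cat "$merged"
    rm -f "$merged"
}

dir=$(mktemp -d)
make_sample "$dir/s1.genes.results" 90 285
make_sample "$dir/s2.genes.results" 180 570

# A single sample goes through the same loop as several samples.
single=$(merge_column 5 "$dir/s1.genes.results")
both_counts=$(merge_column 5 "$dir/s1.genes.results" "$dir/s2.genes.results")
both_tpm=$(merge_column 6 "$dir/s1.genes.results" "$dir/s2.genes.results")

printf '%s\n\n%s\n' "$both_counts" "$both_tpm"
rm -rf "$dir"
```

Because the gene ID column is written once up front and each sample only appends one column, the loop has no special case for the first or only input.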

amayer21 · 2020-07-22T10:25:12Z

Hello,

I've made changes in my fork to keep both RSEM counts and TPMs.

Before making a PR, I wanted to compare the RSEM outputs before and after the changes.
Surprisingly, I found some differences in the counts for a few genes (see tsv table and plot). But when I ran it twice on identical samples and compared the RSEM outputs, that also gave a difference (see plot with original or final pipeline). So it looks to me like the difference is more likely to come from the way RSEM estimates these corrected counts than from the modification I made to the pipeline.
My test file is really small (100,000 reads and only 70% mapping).
I've never used RSEM, don't need to use it and don't have time to go deeper into that now...
If you think it's ok, I'll do a PR on Gregor's fork.

amayer21 · 2020-07-22T14:08:17Z

Also, I said I was going to test on paired-end data, but my only paired-end dataset with UMIs has UMIs only on read 2. To make it go through the pipeline, I had to swap the names of R1 and R2 (I don't think it would be difficult to allow UMIs on read 2 in the pipeline, though). But then my "pseudo-read1" only contained the UMI (so I ended up with empty reads after UMI extraction), and STAR didn't want to process them... The possibility to start with paired-end reads and move to single-end after UMI extraction seems more complicated to implement, but may be something we could think of at some point...

apeltzer · 2020-07-23T07:47:31Z

Maybe you can do a PR on Gregor's fork and we all have a look - looks like you invested quite some time now :-)

amayer21 · 2020-07-23T09:14:38Z

done :-)

apeltzer · 2020-07-27T09:48:01Z

Are we good here now? :-)

amayer21 · 2020-07-27T12:37:50Z

I think so (as I said in my previous comment, I'm not sure about the fact that the RSEM output is slightly different each time I run it on a given input file. But it seems to me it's coming from RSEM itself and not from any modification done to the pipeline here).

@grst can you confirm?

apeltzer

Code looks good to me - @grst please merge if you're happy too :-)

grst · 2020-07-27T12:46:43Z

tbh, I don't know if RSEM results are expected to have some randomness - in any case, I can't see how the pipeline (and in particular your PR) would be causing this. Therefore, I'm merging this now.

grst added 8 commits July 8, 2020 16:03

Draft UMI-tools extract step.

3334f16

Handle file publishing for UMI extract step

ae29eb2

Fix barcode pattern in gz test config

dd7a930

Add umitools dedup step.

40f32db

The transcriptome BAM for RSEM is handled in a separate process.

Don't run UMI tools on transcriptome BAM when there is none

4cc881e

Don't attempt to run RSEM when no transcriptome BAM available

1c44020

Reformat misaligned-indentation and overly long lines

81cbe14

Fix nf-core#432

c5188f7

grst force-pushed the add-umi-tools branch from 77c2a53 to 7764c93 Compare July 8, 2020 14:29

Fix rebase

dc18540

grst force-pushed the add-umi-tools branch from 7764c93 to dc18540 Compare July 8, 2020 14:32

grst added 2 commits July 8, 2020 17:34

Add umi_tools to scrape software version

b5c6df1

Update docs

be37bc1

grst force-pushed the add-umi-tools branch from 4236586 to be37bc1 Compare July 9, 2020 11:28

grst marked this pull request as ready for review July 9, 2020 11:32

grst requested review from ewels and a team July 9, 2020 11:55

apeltzer requested review from a team and removed request for a team and ewels July 9, 2020 12:34

TomKellyGenetics approved these changes Jul 14, 2020

Alice Mayer and others added 4 commits July 27, 2020 11:22

merge RSEM counts and TPMs - add description count output in doc

863afba

typo in output.md

897860d

Update CHANGELOG.md

dcfb8e2

Merge branch 'dev' into add-umi-tools

8c3b6ae

apeltzer approved these changes Jul 27, 2020

grst merged commit 4a0eeb4 into nf-core:dev Jul 27, 2020

drpatelh mentioned this pull request Aug 18, 2020

Add optional UMI handling #73

Closed
