Add UMI tools #435
Could you have a look at this here too while you're on it? ;-)
The transcriptome BAM for RSEM is handled in a separate process.
The pipeline used to fail with `--skip_rsem` and gzipped reference data. The code block that decompresses the GTF and FASTA files was only executed when RSEM needed them to build its reference. With `--skip_rsem`, the references were therefore not extracted, and the process that combines the normal and additional references failed with a missing channel. This commit ensures that the references are unzipped whenever an additional FASTA file is used. I added `--skip_rsem` as an additional test case to the CI script.
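The condition described in this commit message can be sketched in a few lines. This is only an illustration in Python, not the pipeline's Nextflow code; the function name and its three arguments are hypothetical stand-ins for the pipeline's parameters.

```python
def needs_decompression(is_gzipped, skip_rsem, has_additional_fasta):
    """Decide whether a gzipped reference has to be unpacked (hypothetical helper).

    Before the fix, as described above, only RSEM triggered decompression.
    After the fix, an additional FASTA file triggers it as well.
    """
    if not is_gzipped:
        return False
    return (not skip_rsem) or has_additional_fasta
```

With `--skip_rsem` and an additional FASTA file, the sketch returns `True`, which is the case that previously ended in a missing channel.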
*/
if (params.with_umi) {
    // preseq does not work on deduplicated BAM file. Pass it the raw BAM file.
    bam.into { bam_umitools_dedup; bam_preseq }
I'm a bit unsure which of the QC tools should be executed on the deduplicated BAM and which on the raw BAM.
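A toy illustration of why the code comment above sends the raw BAM to preseq: library-complexity estimates are built from how often each molecule was observed, and that information disappears after deduplication. The reads below are made up, and the exact-match grouping is a simplification. As far as I know, `umi_tools dedup` uses a network-based method by default that also tolerates UMI sequencing errors.

```python
from collections import Counter

# Hypothetical aligned reads: (chromosome, position, strand, UMI).
reads = [
    ("chr1", 100, "+", "ACGT"),
    ("chr1", 100, "+", "ACGT"),  # PCR duplicate of the read above
    ("chr1", 100, "+", "ACGT"),  # another duplicate
    ("chr1", 100, "+", "TTGA"),  # same position, different UMI: distinct molecule
    ("chr2", 250, "-", "ACGT"),
]

def duplicate_histogram(reads):
    """Map 'times a molecule was observed' to 'number of such molecules'."""
    per_molecule = Counter(reads)
    return dict(Counter(per_molecule.values()))

def dedup_exact(reads):
    """Keep one read per (chrom, pos, strand, UMI), preserving input order."""
    return list(dict.fromkeys(reads))

raw_hist = duplicate_histogram(reads)
dedup_hist = duplicate_histogram(dedup_exact(reads))
```

In the raw data one molecule is seen three times and two are seen once. After deduplication every molecule is seen exactly once, so a tool that extrapolates from duplicate counts has nothing left to work with.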
Looking great! We've been planning to look into something similar for a while, though our UMIs are part of the barcode read, so they would be in a third FastQ file per sample. I guess this code is for inline UMIs that are part of the main read. Do you see any overlap between this PR and what we will need to implement for barcode UMIs? Just trying to think about future-proofing 😄
I don't know how exactly you'd run UMI tools with an index read. Possibly one would have to run the
Yeah, I figured we might need a separate step to get the UMI into the read name in the FastQ files. But it's great that you think that the dedup step should work the same.
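To make the "UMI into the read name" idea concrete, here is a minimal sketch, assuming the `name_UMI` convention (UMI appended to the read identifier with an underscore) that the dedup step reads, to my understanding. Both function names are hypothetical, and the second one only illustrates what a separate step for index-read UMIs might do. Neither reproduces `umi_tools extract` itself.

```python
def extract_inline_umi(record, umi_len):
    """Move the first `umi_len` bases of a FASTQ record into its read name.

    `record` is a (header, sequence, plus, quality) tuple of strings.
    """
    header, seq, plus, qual = record
    umi, rest_seq, rest_qual = seq[:umi_len], seq[umi_len:], qual[umi_len:]
    # Append the UMI to the read identifier (the part before any whitespace)
    # and leave the rest of the header untouched.
    name, sep, comment = header.partition(" ")
    return (f"{name}_{umi}{sep}{comment}", rest_seq, plus, rest_qual)

def umi_from_index_read(record, index_record):
    """Variant for a UMI stored in a separate index read: copy it into the name."""
    header, seq, plus, qual = record
    name, sep, comment = header.partition(" ")
    return (f"{name}_{index_record[1]}{sep}{comment}", seq, plus, qual)
```

In both cases the main read ends up with the UMI in its name, which is why the dedup step would not need to care where the UMI originally came from.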
This is great!!! I need the UMI tools and was planning to try to add them to the RNAseq pipeline this week :-)
I have the same plan for this week; I hope I can make it work on my data.
Looks great! Thanks for doing this.
@kmurat1 @amayer21 - once you've successfully tested, can you ping here?
Sorry, I commented on the test I did on the Slack channel and didn't copy the comment here... Here it is: I've given it a try on a small FASTQ file and ended up with an error message coming from the use of RSEM and not from UMI tools...
(full job log here: nfcorernaseq_withUMItools_5335509.log) This message doesn't appear and all goes well when using the option I'm planning to look into it (I have never used RSEM before), but I'm also trying to solve some major problems with our cluster with our sys-admin, so I haven't been able to put enough time into this yet...
So, the error message from the
It should absolutely work with only one sample... do you want to try to fix it?
I will have a look tomorrow morning. I'm a newbie in Nextflow but really keen to learn and put it into practice.
How do I need to proceed? I don't think I can commit anything to the PR, can I? Do I need to fork grst:add-umi-tools and then open a PR against that repository? Or shall I let you know what to change? There are only a couple of changes in one section of the main.nf script.
Hi @amayer21, yes, a PR against my fork should work.
Pasted from Slack for future googlers:
This is what it looks like:
Thanks Phil. As explained on Slack, I realised when trying to add comments to the code of this PR that the changes are in code that hasn't been modified in this PR, so I can't comment on it. This means that the problem I saw when trying this branch is already present on the dev branch of the pipeline. To summarise: So UMI-tools seems to work for me (at least for single-end reads; I could test paired-end ones tomorrow if needed). Sorry, I'm very new to collaborating on GitHub, so I made things more complicated than they were... Thank you very much for your help!
merge_rsem_gene was crashing when the pipeline was given only 1 sample as input => now fixed. I've also corrected the name of the output file of that same process, as the column of RSEM output we keep is the "expected count" column and not the "TPM" one. I think it was confusing to call the output "rsem_tpm_gene.txt" as this wasn't "transcripts per million".
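For readers who want to see the shape of the problem, here is a minimal sketch of a merge that behaves the same for one sample as for many and can emit either column. It assumes the usual RSEM `*.genes.results` layout with `gene_id`, `expected_count` and `TPM` columns. The function name and the toy values are made up, and this is not the code of the pipeline's merge process.

```python
import csv
import io

def merge_rsem_genes(samples, column):
    """Merge one column of several RSEM gene tables into a single table.

    `samples` maps sample name -> file-like object with tab-separated RSEM
    output. A single sample is handled the same way as several.
    """
    names = list(samples)
    merged = {}  # gene_id -> {sample name -> value}
    for name in names:
        for row in csv.DictReader(samples[name], delimiter="\t"):
            merged.setdefault(row["gene_id"], {})[name] = row[column]
    header = ["gene_id"] + names
    rows = [[gene] + [vals.get(n, "NA") for n in names] for gene, vals in merged.items()]
    return [header] + rows

# Toy input with invented numbers, for illustration only.
example = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
    "geneA\ttxA1\t1000.00\t900.00\t90.00\t500000.00\t400000.00\n"
    "geneB\ttxB1\t2000.00\t1800.00\t180.00\t500000.00\t400000.00\n"
)
counts = merge_rsem_genes({"sample1": io.StringIO(example)}, "expected_count")
tpms = merge_rsem_genes({"sample1": io.StringIO(example)}, "TPM")
```

Passing the column name explicitly also keeps the output file name honest: a table built from `expected_count` is a count table, not a TPM table.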
Oh really!? 😕
It's easy to fix if we want to get the TPMs (just need to change
Having said that, I've never used RSEM, but it looks to me as though TPMs are easy to compute from counts, and several tools start from raw counts rather than counts per million and perform their own normalisation. In the tutorial (https://github.com/bli25broad/RSEM_tutorial#single), they explain (looking at .isoforms.results):
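On the remark that TPMs are easy to compute from counts: the standard definition divides each count by the feature's effective length and rescales the resulting rates so that they sum to one million, so the lengths are needed in addition to the counts. A minimal sketch, with a hypothetical function name:

```python
def tpm_from_counts(expected_counts, effective_lengths):
    """Convert counts to TPM: rate_i = count_i / length_i, scaled to sum to 1e6."""
    rates = [
        c / l if l > 0 else 0.0
        for c, l in zip(expected_counts, effective_lengths)
    ]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]

# Two equally long features, one with twice the counts of the other.
example_tpm = tpm_from_counts([10.0, 20.0], [1000.0, 1000.0])
```

Downstream tools that do their own normalisation generally expect the counts rather than these rescaled values, which is one argument for keeping both in the output.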
If we keep the expected counts in the final output (instead of, or alongside, the TPMs), we also need to update this page:
I would keep both.
Hello, I've made changes in my fork to keep both RSEM counts and TPMs. Before making a PR, I wanted to compare RSEM outputs before and after making the changes.
Also, I said I was going to test on paired-end data, but my only paired-end dataset with UMIs has UMIs only on read 2. To make it go through the pipeline, I had to swap the names of R1 and R2 (I don't think it would be difficult to allow UMIs on read 2 in the pipeline, though). But then my "pseudo-read1" only contained the UMI (so I ended up with empty reads after UMI extraction), and STAR didn't want to process them... The possibility of starting with paired-end reads and moving to single-end after UMI extraction seems more complicated to implement, but may be something we could think about at some point...
Maybe you can do a PR on Gregor's fork and we all have a look - looks like you invested quite some time now :-)
done :-)
Are we good here now? :-)
I think so (as I said in a previous comment, I'm not sure about the fact that the RSEM output is slightly different each time I run it on a given input file, but it seems to me that this comes from RSEM itself and not from any modification made to the pipeline here). @grst can you confirm?
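One way to put a number on "slightly different" is to compare two runs gene by gene and report the largest relative difference. The sketch below assumes each run has already been loaded into a `{gene_id: value}` mapping. The function name is hypothetical, and it makes no claim about where run-to-run differences in RSEM come from.

```python
def max_relative_difference(run_a, run_b):
    """Largest relative difference between two {gene_id: value} mappings."""
    if run_a.keys() != run_b.keys():
        raise ValueError("runs report different gene sets")
    worst = 0.0
    for gene, a in run_a.items():
        b = run_b[gene]
        scale = max(abs(a), abs(b))
        if scale > 0:  # genes that are zero in both runs contribute nothing
            worst = max(worst, abs(a - b) / scale)
    return worst
```

Running this on the outputs from before and after a change, and again on two runs of the unchanged pipeline, would show whether the change adds anything beyond the run-to-run variation.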
Code looks good to me - @grst please merge if you're happy too :-)
tbh, I don't know if RSEM results are expected to have some randomness - in any case, I can't see how the pipeline (and in particular your PR) would be causing this. Therefore, I'm merging this now.
Adds a `umi_tools extract` step before trimming and a `umi_tools dedup` step before QC/quantification.
`--skip_rsem`. See commit message 14b337a for more details.

PR checklist
- `dev` rather than `master`
- If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repo
- `nextflow run . -profile test,docker`
- `nf-core lint .`
- `docs` is updated
- `CHANGELOG.md` is updated
- `README.md` is updated

Learn more about contributing: https://github.com/nf-core/rnaseq/tree/master/.github/CONTRIBUTING.md