Running Salmon in alignment mode on UMI-deduplicated BAM file #684

Closed
grst opened this issue Jul 15, 2021 · 1 comment


grst commented Jul 15, 2021

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
bulk

Describe the bug
I am working with UMI-tagged Lexogen QuantSeq data. Since salmon does not (yet?) support handling UMIs with bulk RNA-seq directly (see #306), I am using umi_tools + STAR to generate a deduplicated transcriptome BAM file and run Salmon in alignment mode as implemented in the nf-core/rnaseq pipeline.

Unfortunately, Salmon does not seem to handle the deduplicated BAM well. Many genes that should have reads are assigned zero counts.

For instance, for ENSMUSG00000029657 I get the following results (the last column denotes counts in all cases):

# Salmon on transcriptome BAM, without umi_tools dedup: quant.genes.no_umi.sf
ENSMUSG00000029657.15   3803.74 3650.23 17.3078 438
# Salmon on deduplicated transcriptome BAM: quant.genes.sf
ENSMUSG00000029657.15   1947.36 1614.62 0       0   
# Feature counts on genome BAM, without umi_tools dedup: 
ENSMUSG00000029657.15   [...]  7266     415
# Feature counts on deduplicated genome BAM:
ENSMUSG00000029657.15   [...]  7266    289

Here's a scatterplot of log1p(counts) of the Salmon quant results for a single sample, with and without umi_tools dedup:
[image: scatterplot of log1p(counts), with vs. without umi_tools dedup]
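
The comparison behind the scatterplot can be reproduced roughly as follows. This is only a sketch, not the script used for the plot above: it assumes whitespace-separated tables with the gene ID in the first column and the count in the last column (as in the rows quoted above), and the helper names `parse_counts` and `lost_after_dedup` are made up for illustration.

```python
import math


def parse_counts(text):
    """Map gene ID (first column) to count (last column).

    Blank lines, '#' comments and rows whose last field is not numeric
    (e.g. a header row) are skipped.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        try:
            counts[fields[0]] = float(fields[-1])
        except ValueError:
            continue  # header or malformed row
    return counts


def lost_after_dedup(before, after):
    """Genes with a non-zero count before dedup and a zero (or missing) count after."""
    return sorted(g for g, n in before.items() if n > 0 and after.get(g, 0.0) == 0.0)


# The two Salmon rows quoted in this issue.
no_dedup = parse_counts("ENSMUSG00000029657.15   3803.74 3650.23 17.3078 438")
dedup = parse_counts("ENSMUSG00000029657.15   1947.36 1614.62 0       0   ")

lost = lost_after_dedup(no_dedup, dedup)
for gene in lost:
    # log1p(counts) are the coordinates used in the scatterplot
    print(gene, math.log1p(no_dedup[gene]), math.log1p(dedup[gene]))
```

Genes returned by `lost_after_dedup` are the ones that sit on the axis of the scatterplot: quantified without dedup, but zero afterwards.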

To Reproduce
Run Salmon quant on the aligned transcriptome BAM file. I provide subsampled versions of both the deduplicated and non-deduplicated BAM files. If you need the full BAM files, LMK and we can arrange a transfer.

Specifically, please provide at least the following information:

  • Which version of salmon was used? 1.5.1
  • How was salmon installed (compiled, downloaded executable, through bioconda)? bioconda
  • Which reference (e.g. transcriptome) was used? gencode.vM25.primary_assembly.annotation.gtf
  • Which read files were used? Lexogen QuantSeq 3' UMI
  • Which program options were used? --noLengthCorrection

Expected behavior
Correctly quantify results on deduplicated BAM.

Desktop (please complete the following information):

  • OS: CentOS 7
  • Version 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Additional context
There's already an issue with RSEM described in the UMI-tools repository (CGATOxford/UMI-tools#465), maybe that's related.

CC @chripla


grst commented Jul 22, 2021

Closing this as it's not an issue with Salmon.
The main problem was that the nf-core/rnaseq pipeline didn't call umi_tools with the --paired flag.

Some more fine-tuning of the UMI-tools output might be necessary to make sure unpaired reads are properly counted by Salmon. For this, see CGATOxford/UMI-tools#465.
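
For anyone hitting the same symptom, a quick way to look for reads that lost their mate is sketched below. It rests on an assumption: that deduplicating paired-end alignments without --paired leaves records that are still flagged as paired but whose mate has been dropped. The function name `orphaned_read_names` and the toy records are hypothetical; the input is plain SAM text, for example the output of `samtools view`.

```python
from collections import Counter

# Bits of the SAM FLAG field (from the SAM specification).
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80


def orphaned_read_names(sam_lines):
    """Return read names whose number of read1 and read2 records differ.

    Only records flagged as paired are considered; header lines ('@...')
    and blank lines are skipped.
    """
    first, second = Counter(), Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if not flag & PAIRED:
            continue
        if flag & READ1:
            first[qname] += 1
        if flag & READ2:
            second[qname] += 1
    return sorted(q for q in set(first) | set(second) if first[q] != second[q])


# Toy records (made up): r1 has both mates, r2 only has read1 left.
toy_sam = [
    "@SQ\tSN:tx1\tLN:1000",
    "r1\t99\ttx1\t100\t255\t50M\t=\t200\t150\t*\t*",
    "r1\t147\ttx1\t200\t255\t50M\t=\t100\t-150\t*\t*",
    "r2\t99\ttx1\t300\t255\t50M\t=\t400\t150\t*\t*",
]
orphans = orphaned_read_names(toy_sam)
print(orphans)
```

A large number of orphans in the deduplicated transcriptome BAM would be consistent with the missing --paired flag described above; it does not by itself show how Salmon treats such records.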

grst closed this as completed Jul 22, 2021