Question on UMI deduplication / quantification of 3' tag data from bulk samples #306

tomsing1 · 2018-10-26T16:29:05Z

tl;dr: 3' tag sequencing methods for bulk RNA samples produce reads that contain known sample indices and UMIs and thus resemble sc-RNA-seq read formats. Do you have a recommendation on how to use Salmon and/or Alevin to quantify gene expression for this data type?

Congratulations on the recent alevin preprint! The new algorithm to deduplicate UMIs looks awesome. I am wondering if you had a recommendation on how to leverage it for 3' tag sequencing of bulk samples.

There are a number of protocols that focus on the 3' ends of transcripts to allow for cheap quantification of gene expression, e.g.

These methods combine conventional (known) sample indices, which label samples (or wells), with unique molecular identifiers (UMIs). (I found one question on this topic in the salmon issue tracker from back in 2016.)

Here is the Drug-seq approach, for example:

The resulting read data resembles that of single-cell approaches and requires deduplication of UMIs and quantification based on reads with a strong 3' bias. It seems analysis of this data could benefit a lot from the algorithms implemented in Alevin.

Can this data be analyzed with Salmon and/or Alevin? Are there any pitfalls that I should be aware of?

Many thanks for any feedback - and thanks again for making these great tools available to the community.

k3yavi · 2018-11-16T21:16:41Z

Hi @tomsing1 ,
Apologies for the slow response, I was out of the country for a while.

Thanks for your kind words and for the very interesting suggestion.
It's fascinating to see how methods used in single-cell RNA-seq are coming full circle back to bulk RNA-seq experiments. We have to do some more digging before we can speak clearly about the caveats of using Alevin with the 3' bulk RNA-seq protocols you mention, but based on our understanding of the shared image we don't see any obvious showstoppers. That said, the following concerns should be kept in mind when using Alevin for bulk data deduplication:

Alevin solves the problem pretty well for protocols where fragmentation of the cDNA molecule happens after PCR amplification. There might be some concerns about over-deduplication of UMIs if fragmentation happens before amplification. In its current form, Alevin can take the Illumina sample indices as an external whitelist, but the user should be aware that Alevin performs a sequence correction step before starting any optimizations.
Alevin is designed for droplet-based protocols, where one end of the paired-end read is just the CB/UMI (i.e. no read sequence). Alevin therefore can't make optimal use of the full paired-end information of a bulk 3' protocol if both ends contain read sequence, for example to resolve ambiguous mappings based on a previously or empirically known approximate fragment length.
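To make the whitelist caveat concrete, here is a minimal Python sketch of what a sequence correction step against a whitelist can look like. It is an illustration only and not Alevin's actual implementation: the function names, the Hamming-distance criterion, and the `max_dist=1` threshold are all assumptions chosen for the example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_to_whitelist(barcode, whitelist, max_dist=1):
    """Map an observed barcode to a whitelisted one (hypothetical helper).

    Returns the barcode itself if it is whitelisted, the single whitelisted
    barcode within max_dist mismatches if exactly one exists, and None if
    there is no candidate or the choice is ambiguous.
    """
    if barcode in whitelist:
        return barcode
    candidates = [
        w for w in whitelist
        if len(w) == len(barcode) and hamming(w, barcode) <= max_dist
    ]
    return candidates[0] if len(candidates) == 1 else None


# Toy sample indices (assumed, not from any real kit).
sample_indices = {"AAAA", "CCCC"}
corrected = correct_to_whitelist("AAAT", sample_indices)    # one mismatch to AAAA
unassigned = correct_to_whitelist("AACC", sample_indices)   # two mismatches to both
```

With sample indices used as the whitelist, a correction step of this kind reassigns reads with a near-miss index to a sample, so it is worth checking that the indices in a run are far enough apart for that to be safe.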

We would be more than happy to help and to discuss how the results look for bulk 3' tagged protocols, or to hear any particular suggestions you have about improvements that could be made to Alevin.

antgomo · 2018-12-11T11:35:28Z

I am also interested in this approach. I have paired-end bulk RNA-seq data with UMIs to avoid counting duplicates. I have three FASTQ files per sample: one with the UMIs and two with the paired-end reads. Can I use alevin in this way?

salmon alevin -l ISR -1 UMI.fq.gz -2 Sample_read_1.fq.gz Sample_read_2.fq.gz

Thanks in advance

ChenfuShi · 2019-07-22T15:24:20Z

Is there any plan to support this in salmon? We also have data generated using quant-seq with UMIs, and we have quite a few duplicates. What would you do?
Thanks!

nsmackler · 2019-10-11T19:25:52Z

I second this. Any chance this will be possible? All it requires is passing a UMI fastq and an R1 (or R2) fastq from the 3' sequence. The additional bells and whistles for cellular barcodes can be dropped, so basically it's like a combination of salmon align and alevin to remove duplicate UMIs from reads mapped to the same gene/transcript.
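The deduplication described here (collapsing reads that share a UMI and map to the same gene) can be sketched in a few lines of Python. This is a deliberately naive illustration, not what alevin does: it assumes each read has already been assigned to exactly one gene, and it ignores sequencing errors in the UMI and multi-mapping reads. The function name and the toy input are hypothetical.

```python
from collections import defaultdict


def count_unique_umis(assignments):
    """Count distinct UMIs per gene.

    assignments: iterable of (gene_id, umi) pairs, one per mapped read.
    Reads with the same gene and the same UMI are treated as PCR duplicates
    and counted once.
    """
    umis_per_gene = defaultdict(set)
    for gene_id, umi in assignments:
        umis_per_gene[gene_id].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Toy input: four reads, one of which is a duplicate.
reads = [
    ("geneA", "ACGT"),
    ("geneA", "ACGT"),  # same gene and UMI as the read above
    ("geneA", "TTGA"),
    ("geneB", "ACGT"),  # same UMI, different gene: kept
]
counts = count_unique_umis(reads)
print(counts)
```

Exact matching like this overcounts when a UMI picks up a sequencing error, and it has no answer for reads that map to several genes, which is why a dedicated tool is attractive for this step.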

karl616 · 2020-11-17T15:55:17Z

I would also be interested in a feature like this.

k3yavi assigned rob-p Oct 26, 2018

k3yavi added the alevin label (issue is primarily related to alevin) Oct 26, 2018

k3yavi self-assigned this Nov 16, 2018

k3yavi added the question label Nov 30, 2018

gnaisha mentioned this issue Dec 29, 2020: deduplicated counts from bulk RNA-Seq data with UMIs #610 (Open)

grst mentioned this issue Jul 15, 2021: Running Salmon in alignment mode on UMI-deduplicated BAM file #684 (Closed)
