Question on UMI deduplication / quantification of 3' tag data from bulk samples #306

Open
tomsing1 opened this issue Oct 26, 2018 · 5 comments
Assignees
Labels
alevin issue is primarily related to alevin question

Comments

@tomsing1

tl;dr: 3' tag sequencing methods for bulk RNA samples include known sample indices and UMIs, so their reads resemble scRNA-seq read formats. Do you have a recommendation on how to use Salmon and/or Alevin to quantify gene expression for this data type?

Congratulations on the recent alevin preprint! The new algorithm to deduplicate UMIs looks awesome. I am wondering if you had a recommendation on how to leverage it for 3' tag sequencing of bulk samples.

There are a number of protocols that focus on the 3' ends of transcripts to allow for cheap quantification of gene expression, e.g.

These methods combine conventional (known) sample-indices to label samples (or wells) with unique molecular identifiers (UMIs). (I found one question on this topic in the salmon issue tracker from back in 2016)

Here is the Drug-seq approach, for example:

[Figure: Drug-seq]

The resulting read data resembles that of single-cell approaches and requires deduplication of UMIs and quantification based on reads with a strong 3' bias. It seems analysis of this data could benefit a lot from the algorithms implemented in Alevin.

Can this data be analyzed with Salmon and/or Alevin? Are there any pitfalls that I should be aware of?

Many thanks for any feedback - and thanks again for making these great tools available to the community.

@k3yavi k3yavi added the alevin issue is primarily related to alevin label Oct 26, 2018
@k3yavi k3yavi self-assigned this Nov 16, 2018
@k3yavi
Member

k3yavi commented Nov 16, 2018

Hi @tomsing1 ,
Apologies for the slow response, I was out of the country for a while.

Thanks for your kind words and for raising a very interesting suggestion.
It's fascinating to see how methods used in single-cell RNA-seq are coming full circle back to bulk RNA-seq experiments. We need to do some more digging before we can speak clearly about the caveats of using Alevin with the 3' bulk RNA-seq experiments you mention, but based on our understanding of the shared image we don't see any obvious show stoppers. That said, the following concerns should be kept in mind when using Alevin for bulk data deduplication:

Alevin handles protocols well where fragmentation of the cDNA molecule happens after PCR amplification. There may be some concern about over-deduplication of UMIs if fragmentation happens before amplification.
In its current form, Alevin can take the Illumina sample indices as an external whitelist, but users should be aware that Alevin performs a sequence correction step before starting any optimization.
Alevin is designed for droplet-based protocols, where one end of the paired-end read is just the CB/UMI (i.e. no read sequence). It therefore can't make optimal use of the full paired-end information of a bulk 3' protocol in which both ends contain read sequence, for example to resolve ambiguous mappings based on a previously or empirically known approximate fragment length.

We would be more than happy to help and to discuss how the results look for bulk 3' tagged protocols, or to hear any particular suggestions you have for improving Alevin.
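
To make the whitelist caveat above more concrete, here is a minimal Python sketch of what correcting an observed sample index against a known whitelist could look like. It is an illustration only and not Alevin's actual implementation: the function names, the one-mismatch threshold, and the example indices are all assumptions made for this sketch.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_to_whitelist(
    barcode: str, whitelist: Iterable[str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the whitelisted index for `barcode`, or None if it cannot be rescued.

    Hypothetical rule (an assumption, not Alevin's): accept an exact match,
    otherwise accept the single whitelist entry within `max_mismatches`;
    ambiguous or distant barcodes are rejected.
    """
    entries = set(whitelist)
    if barcode in entries:
        return barcode
    candidates = [
        w for w in entries
        if len(w) == len(barcode) and hamming(barcode, w) <= max_mismatches
    ]
    return candidates[0] if len(candidates) == 1 else None


# Made-up sample indices, for illustration only.
sample_indices = {"ACGTAC", "TTGCAA", "GGATCC"}
exact = correct_to_whitelist("ACGTAC", sample_indices)
rescued = correct_to_whitelist("ACGTAA", sample_indices)
rejected = correct_to_whitelist("CCCCCC", sample_indices)
```

The practical point for bulk data is that a correction step like this can reassign reads between samples if two sample indices are close in sequence, so it is worth checking how far apart the indices in a given kit are.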

@antgomo

antgomo commented Dec 11, 2018

I am also interested in this approach. I have paired-end bulk RNA-seq data with UMIs, which I would like to use to avoid duplicates. I have three FASTQ files per sample: one with the UMIs and two with the paired-end reads. Can I use alevin in this way?

salmon alevin -l ISR -1 UMI.fq.gz -2 Sample_read_1.fq.gz Sample_read_2.fq.gz

Thanks in advance
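
For a three-file layout like this, one common generic workaround (independent of Salmon or Alevin) is to carry the UMI along in the read name, so that it survives mapping and can be used for deduplication afterwards. The following is a rough Python sketch under several assumptions: the UMI file and the read file list the same reads in the same order, read IDs match up to the first whitespace, and the helper names `read_fastq` and `tag_reads_with_umi` are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def tag_reads_with_umi(umi_handle: TextIO, read_handle: TextIO) -> Iterator[FastqRecord]:
    """Append each read's UMI to its read ID, e.g. '@r1' becomes '@r1_ACGT'.

    Note: zip() stops at the shorter file, so truncated inputs are not detected.
    """
    for (u_head, umi, _), (r_head, seq, qual) in zip(
        read_fastq(umi_handle), read_fastq(read_handle)
    ):
        u_id, r_id = u_head.split()[0], r_head.split()[0]
        if u_id != r_id:
            raise ValueError(f"read ID mismatch: {u_id} vs {r_id}")
        yield f"{r_id}_{umi}", seq, qual


# Tiny in-memory example standing in for UMI.fq and Sample_read_1.fq.
umi_fq = io.StringIO("@r1 1:N:0\nACGT\n+\nIIII\n@r2 1:N:0\nTTGA\n+\nIIII\n")
read_fq = io.StringIO("@r1 2:N:0\nGGGCCC\n+\nFFFFFF\n@r2 2:N:0\nAAATTT\n+\nFFFFFF\n")
tagged = list(tag_reads_with_umi(umi_fq, read_fq))
```

Real files would be opened with `gzip.open(path, "rt")` instead of `io.StringIO`; the second read file would be tagged the same way so that both mates keep matching IDs.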

@ChenfuShi

ChenfuShi commented Jul 22, 2019

Is there any plan to support this in salmon? We also have data generated using quant-seq with UMIs, and we see quite a few duplicates. What would you do?
Thanks!

@nsmackler

I second this. Any chance this will be possible? All it requires is passing a UMI FASTQ and an R1 (or R2) FASTQ from the 3' sequence. The additional bells and whistles for cellular barcodes can be dropped, so basically it's like a combination of salmon align and alevin to remove duplicate UMIs from reads mapped to the same gene/transcript.
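
The simplest version of the idea described here can be sketched in a few lines of Python: once every read has been assigned to a gene, count distinct UMIs per gene instead of reads. This is a naive illustration and not the deduplication algorithm Alevin implements; it ignores UMI sequencing errors and multi-mapping reads, and the function name and input format are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_unique_umis(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct UMIs per gene from (gene, umi) pairs.

    Reads sharing both gene and UMI are treated as PCR duplicates
    and counted once.
    """
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, umi in records:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Made-up mapped reads: two geneA reads share a UMI and collapse into one count.
records = [
    ("geneA", "AACGT"),
    ("geneA", "AACGT"),
    ("geneA", "TTGCA"),
    ("geneB", "AACGT"),
]
counts = count_unique_umis(records)
```

With these made-up records, `geneA` gets a count of 2 and `geneB` a count of 1. For bulk samples the same counting would be done separately per sample index.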

@karl616

karl616 commented Nov 17, 2020

I would also be interested in a feature like this.

7 participants