Question on UMI deduplication / quantification of 3' tag data from bulk samples #306
Comments
Hi @tomsing1, thanks for your kind words and for starting a very interesting discussion. Alevin solves the problem pretty well for protocols where fragmentation of the cDNA molecule happens after PCR amplification. If fragmentation happens before amplification, there may be some concerns about over-deduplication of the UMIs. In its current form, Alevin can take the Illumina sample indices as an external whitelist, but users should be aware that Alevin performs a sequence correction step before starting any optimizations. We would be more than happy to help and to discuss how the results look for bulk 3' tagged protocols, or any particular suggestions you have for improving Alevin.
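Because of the sequence correction step mentioned above, it may be worth checking how far apart the sample indices are before passing them as a whitelist. The following is a minimal sketch in Python, assuming that correction can merge barcodes within a small Hamming distance of each other; the threshold `max_dist=1`, the helper names, and the example indices are all hypothetical and are not taken from Alevin itself.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def close_pairs(barcodes, max_dist=1):
    """Return (barcode_a, barcode_b, distance) for every pair of barcodes
    whose Hamming distance is at most max_dist (an assumed threshold)."""
    pairs = []
    for a, b in combinations(barcodes, 2):
        d = hamming(a, b)
        if d <= max_dist:
            pairs.append((a, b, d))
    return pairs


# Hypothetical sample indices; the first two differ at a single position.
sample_indices = ["ACGTACGT", "ACGTACGA", "TTGCAAGC"]
print(close_pairs(sample_indices))
```

An empty result would suggest that no two indices in the list are close enough to be confused under the assumed threshold; any pair that is reported deserves a closer look before quantification.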
I am also interested in this approach. I have paired-end bulk RNA-seq data with UMIs, which I would like to use to remove duplicates. I have three FASTQ files per sample: one with the UMIs and two with the paired-end reads. Could I use alevin in this way?
salmon alevin -l ISR -1 UMI.fq.gz -2 Sample_read_1.fq.gz Sample_read_2.fq.gz
Thanks in advance.
Is there any plan to support this in salmon? We also have data generated using Quant-seq with UMIs, and we see quite a few duplicates. What would you do?
I second this. Any chance this will be possible? All it requires is passing a UMI FASTQ and an R1 (or R2) FASTQ from the 3' sequence. The additional bells and whistles for cellular barcodes can be dropped, so basically it's like a combination of salmon align and alevin to remove duplicate UMIs from reads mapped to the same gene/transcript.
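To make the request concrete, here is a minimal sketch in Python of the simplest form of the idea: reads that share both a gene and a UMI are counted once. It assumes the reads have already been mapped and reduced to hypothetical `(gene, umi)` pairs, and it collapses exact UMI matches only. It is not Alevin's deduplication algorithm, and it ignores UMI sequencing errors and reads that map to several genes.

```python
from collections import defaultdict


def count_unique_umis(records):
    """Count distinct UMIs per gene from an iterable of (gene, umi) pairs.

    Exact-match collapsing only: two reads are treated as duplicates
    when both the gene and the UMI sequence are identical.
    """
    umis_per_gene = defaultdict(set)
    for gene, umi in records:
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Hypothetical mapped reads: GeneA has one duplicated UMI.
records = [
    ("GeneA", "AACGT"),
    ("GeneA", "AACGT"),  # duplicate of the read above
    ("GeneA", "TTGCA"),
    ("GeneB", "AACGT"),  # same UMI, different gene: counted separately
]
print(count_unique_umis(records))
```

The same UMI seen on two different genes is kept as two molecules here, since the grouping key is the gene and UMI together.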
I would also be interested in a feature like this.
tl;dr: 3' tag sequencing methods for bulk RNA samples use known sample indices and UMIs, so their read formats resemble those of scRNA-seq. Do you have a recommendation on how to use Salmon and / or Alevin to quantify gene expression for this data type?
Congratulations on the recent alevin preprint! The new algorithm to deduplicate UMIs looks awesome. I am wondering if you had a recommendation on how to leverage it for 3' tag sequencing of bulk samples.
There are a number of protocols that focus on the 3' ends of transcripts to allow for cheap quantification of gene expression, e.g.
These methods combine conventional (known) sample-indices to label samples (or wells) with unique molecular identifiers (UMIs). (I found one question on this topic in the salmon issue tracker from back in 2016)
Here is the Drug-seq approach, for example:
The resulting read data resembles that of single-cell approaches and requires deduplication of UMIs and quantification based on reads with a strong 3' bias. It seems analysis of this data could benefit a lot from the algorithms implemented in Alevin.
Can this data be analyzed with Salmon and / or Alevin? Are there any pitfalls that I should be aware of?
Many thanks for any feedback - and thanks again for making these great tools available to the community.