-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arriba for plate-based scRNA-seq #206
Comments
Hi Silvia, I haven't run Arriba run on SMART-Seq2 data. I will give this a try and report back. Arriba automatically adjusts to the sequencing depth and has quite sensitive filters in general, so you might be able to run it with the default parameters. One important note upfront: If your library has UMIs, you should use the flag Regards, |
Hi Sebastian, Thanks so much for getting back to me! I went ahead and tried Arriba on my Smart-seq2 data (no UMIs), using the default settings. As suggested in the Arriba manual, I undertook further filtering to refine the results. Specifically, I:
I also analysed matched bulk RNA-seq data using Arriba. I was quite encouraged to see significant overlap in the fusion findings between the single-cell and bulk RNA-seq datasets. However, two observations caught my attention: a) There were quite a lot of novel transcripts forming fusion genes in the single-cell data. Interestingly, these particular fusions involving novel transcripts or TMSB4X were absent from the bulk RNA-seq data. What's more, the TMSB4X fusions are shared across the three different experimental conditions I have. However, since they are absent in the bulk data, and these fusions haven't been reported in the literature or in fusion databases, I'm kind of scratching my head trying to figure out if these fusions are real or artifacts caused by the Smart-seq2 protocol. Do you have any idea why this might have happened? Any insights would be awesome! If you have a chance to run Arriba on another Smart-seq2 dataset, I'd love to hear what you find. Thanks again, S. |
Hi Sebastian, I am also interested in using Arriba to find fusions for my single cell dataset since I could not find any reliable option for single cell data right now and in some benchmarking articles (here and here) I have seen people used arriba on single cell data to benchmark with their method and it shows very faire and reliable outcomes. I appreciate your input and help, |
Hi Silvia, I have seen you used Arriba for the single cell data, I am also interested in using it since the single cell algorithms did not work previously like scFusion, ... Best, |
Hi Sogi, Do you have 10X Genomics scRNA-seq data or full-length transcripts? You're using Cell Ranger for alignment which suggests the former. The problem with 10X data is that the method only captures the 3' ends of transcripts. Most fusion breakpoints are closer to the 5' end, however. You will miss the vast majority of fusions with this technology, regardless of which tool you use for fusion calling. The evidence for fusions is simply not captured by the technology. In contrast, if you're using full-length data, then you should be able to use the standard Arriba workflow as in the script Let me know if you need more detailed instructions. Best regards, |
If it's a 5' library, things are slight better, albeit not much I guess. While it is true that there are more fusion breakpoints near the 5' end, the main limitation of 10X remains: you only get coverage around the start/end of the transcript, but not the full transcript; unless you have long reads, which I'm guessing you haven't. As a quick check whether you can detect fusions from the data at all, you can run the following command to search the FastQ files for evidence of BCR-ABL1 fusions directly:
If this returns nothing, then I doubt there is any way to detect fusions at all. Note that there are several variants of BCR-ABL1 with different breakpoints. The above command searches for only one of them (and its reverse complement). You may want to search for others, too. Frankly, I do not know if the alignments from STARsolo or Cell Ranger can be used for fusion detection. I'm pretty sure Cell Ranger's cannot, because as far as I know Cell Ranger does not make use of the STAR parameters for chimeric alignments. Perhaps, STARsolo could be run manually with the parameters enabled. Specifically, you will need to set the parameters listed here: https://arriba.readthedocs.io/en/latest/workflow/#demo-script |
Hi Sebastian, Sorry I accidentally deleted my previous message! Thanks a lot for the useful explanations and the very handy quick test, yes my ph+ AML sample returned reads that had the sequence so I guess this is promising. Thanks so much for your helps and valuable tool you provided. Best, |
Yes, it should be fine to use the regular STAR for fusion calling. Just make sure NOT to pass the FastQ containing the UMIs+barcodes to STAR. Only pass the transcript read, which would effectively be a regular single-end alignment workflow then. In a more sophisticated version of this workflow, you could use the UMIs+barcodes for more accurate duplicate marking, which should boost sensitivity. |
Thanks! in the second part (sophisticated version) you mean using STARsolo? since UMIs are involved? |
What I mean by that is that the alignments in the BAM file need to be marked as duplicates (flag When duplicates are marked this way, you can disable duplicate detection by Arriba using the parameter This should be more accurate than having Arriba mark the duplicates, because Arriba only inspects the mapping coordinates. It ignores UMIs/cell barcodes. |
Hi Sebastian, Thanks a lot for your explanations. I already tried regular STAR + Arriba and it gave me some fusions as expected and some new (that can be either novel or false positive) using only fastqs containing no barcodes (R2) as you suggested. Thanks, |
Probably, the reads need trimming of 10X adapters before alignment. Otherwise, STAR may discard them, because only part of the reads can be aligned. STARsolo will probably deal with this automatically, but I don't know if it is compatible with the chimeric alignment parameters. |
Yes probably is the case! Thanks for your thoughts on this, it is very helpful! I’ll probably check STRsolo and will report back how it works! Unfortunately I did not get any responses from STAR github regarding my questions. best |
Hi Sebastian, Thanks again for providing such a nice tool. I could successfully run STAR(regular) and Arriba on my scRNA-seq (from 10X) data and I got nice fusion genes that actually many of them were in agreement with the cytogenetic information of my samples. In the next step as we discussed I am interested in adding those fusion genes from Arriba output to my reference using cellranger make reference function so I can use the fusion genes in my downstream analysis and navigate them on cells. I very appreciate your input and suggestion on this. Thanks for your help. |
Is your plan to append the sequences of the fusion genes as separate contigs to the FastA file and then define GTF records for these new contigs. |
yes I wanna generate one .fa file that contains all selected fusions and create the .gtf file for them and will add them to my regular .fa and .gtf file that is being used to cellranger as described here: https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-mr#add-marker-gene |
In that case, it would be the easiest to run Arriba with This approach should be much better than building the fusion sequence yourself from the breakpoints and strand info. Building the GTF file is also a no-brainer then: All you need to do is define the entire fusion transcript as one big exon and you can always use the + strand. The sequence represents the spliced transcript. So there is no need to annotate introns and you don't have to worry about which strand it is, because the sequence always represents the sense (+) strand. |
Hi Sebastian, Thanks so much this was very helpful. I will rerun the Arriba with -I and use its transcript after some cleaning. Just one last question, what did you mean by: "define the entire fusion transcript as one big exon" in the .gtf file? chromosome/contig: I choose this as name of fusion (gene1_gene2) source:unknown Feature:exon start:1 end:922 source: . strand: + Frame: . attributes: gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding"; Thanks a lot, |
Let's say one of the fusion sequences is 1000 bases long. You first need to add the sequence as an additional contig to the FastA file. Then, you need to add a GTF record with start and end coordinates 1 and 1000. Actually, you need three: one for the gene, one for the transcript and one for the exon. All three will have the same coordinates (1-1000). Does that make sense? |
Dear Sebastian, Yes I guess it makes a lot sense but in the cell ranger, it is noted that cell ranger count only uses rows with Features as exon and it won't read other lines, I should only specify exon.
The 10X told me it is safe to choose gene_biotype "protein_coding" so it won't filter it or ... Best, |
Yes, this looks good to me! |
Thanks so much Sebastian! Best, |
Hiya,
Can Arriba be used to detect fusions in single-cell RNAseq data (e.g. Smart-seq2). I expect to see a greater number of false positives compared to bulk RNAseq, and with a lower number of reads supporting the fusions. I'm wondering what parameters should be optimized for scRNAseq data, and how the list of candidate fusions should be further filtered.
Thank you,
S.
The text was updated successfully, but these errors were encountered: