Feature request - Rhapsody whole genome sequences #106

stela2502 · 2023-02-14T11:55:56Z

Hi, with the help of Rob I managed to get a Rust BD Rhapsody analysis tool to work (https://github.com/stela2502/Rustody).
It supports all versions of BD's beads at the moment, but not the (new) whole genome part.

Rhapsody data consists of a variety of data: Antibody tags that are coded in R2 as well as Sample Tags coded there, too. In a new version they also have a TCR and BCR sequencing options. I have coded the targeted sequence matches using 32bp kmers from the provided fasta files ( only ~ 500 fasta sequences). But I am not very interested to blow this up to whole genome analysis.
Hence my question: Is there any way you could implement the BD cell ids in alevin fry? Could be as simple as to use my cellids class. Or could I use your library to map all sequences my tool can not handle?
The second option would of cause be my favorite. Like only use your genome representation and the mapper. It would be absolutely epic if you could help me here!

Thank you!

rob-p · 2023-02-15T17:58:53Z

Hi, with the help of Rob I managed to get a Rust BD Rhapsody analysis tool to work (https://github.com/stela2502/Rustody).

Awesome!

Regarding the actual feature request, could I ask for a bit more clarification? I'm not sure what the requested feature is exactly, or what the requirement would be for alevin-fry to do. Specifically, I think it would be most useful if you could give an idea of what the provided input would be, and what the requested output would be in either of these cases to help us understand the request better. Also, I'm looping in @DongzeHE so he's in the know ;P.

--Rob

stela2502 · 2023-02-16T07:41:15Z

Hi Rob, what I would dream up would be a way to use likely salmon here https://github.com/stela2502/Rustody/blob/8befa2e774caba0f5037b57ceb23bac7d18bac8d/src/bin/quantify_rhapsody.rs#L348.
I have seen that salmon is actually a cpp library. The point in my program is where I have tried to match the R2 read to any gene the tool knows of. As there was no hit it now should look genome wide. And I would like to NOT implement a genome wide search :-D
As I know of no Rust library that could help me here I tried to think further with storing my gene data as some kind of Index file, but failed horribly. I am not even able to read the data I wrote before :-(. The test here https://github.com/stela2502/Rustody/blob/4f36750ceaf8068c90813a94182f5f6a0f381d0e/this/src/geneids.rs#L580 fails. I seams that (1) the km (u64) ids from the file are not the same that I wrote and (2) I am also unable to read any gene name back (not utf8 formated). Although my Linux system shows the gene names just fine both with a zcat and vim.
I tried yesterday and even asked ChatGPG but could not fix that. Possible you spot my error immediately? If you can please help me. Otherwise I pause for some time now. Would be cool to get a genome wide mapper in Rust, but I do not have the time for something like that at the moment. I have never done that either so it would be quite a mission for me. If you (or anybody else) have interest in that I would be very happy for any help I can get.

stela2502 · 2023-02-16T07:44:05Z

So to sum the long one up once more: I need to somehow map the R2 read to a genome wide index and do not have the time to implement that as I fail at the most basic stuff. I am no trained informatics guy after al :-(
And I would like to utilize whatever alevin-fry uses. I fear this will be complicated as the mapper is coded in cpp (if I am not mistaken). Hence I also think about implementing a genome wide mapping functionality. But I fear that is too complicated for me.

jashapiro · 2023-06-13T19:30:52Z

I just found this issue when investigating whether we would easily be able to adapt our pipeline to accommodate BD Rhapsody data. It looks like the R1 sequence structure (as described in their doc (pdf) could be accommodated by salmon --bc-geometry for the "Original" version, but the "Enhanced 3'" files have variable positions (the "diversity insert") that I am not sure how we would specify using that flag, if it is possible at all.

rob-p · 2023-06-13T20:48:52Z

Hi @jashapiro,

Is your use-case for single-cell transcriptomics? We are currently working on a "general" solution to such problems — with increasingly complicated barcoding mechanisms. Currently the simpleaf -> alevin-fry pipeline has somewhat more generic support due to it's ability to specify the geometry with the fragment geometry description language. However, there are even more involved solutions necessary in some cases. Our general purpose approach isn't ready yet, but, in the meantime, might it be possible to use a tool like Interstellar to transform the data into an appropriately "normalized" format prior to processing with existing single-cell tools?

stela2502 · 2023-06-14T09:38:44Z

Hi, at the end I just (re-)implemented a whole genome enabled mapper. That thing now uses a u16 representation of a 8bp fragment of the read to identify a most likely region in a u16::MAX long vector of 8pb-32bp downstream mappers. This does work on the targeted approach and should be able to scale it up to whole genome. You can look at it here: https://github.com/stela2502/Rustody/blob/new_mapper/this/src/fast_mapper.rs.
For the cell barcodes I simply use a partial match and get the highest probability for a sequence to be linking to one cell. Not a full length match as that would also generate some issues with sequencing errors. This allows then for some fuzziness in the matching regions, too. From a first glance at interstellar - are you sure it does implement a way to convert from a variable to a fixed format?

stela2502 · 2023-06-14T09:40:53Z

I normally see up to 80% PCR duplicates in the data. So I am not sure if thinking about catching each and every read is even worth it. I would not assume that the final counts change in a meaningful way.

colindaven · 2023-08-15T11:10:53Z

@jashapiro I would be interested in any approach for dealing with variable bases or "diversity inserts" in modern (2023) BD Rhapsody data. The library structure is detailed here as well: https://teichlab.github.io/scg_lib_structs/methods_html/BD_Rhapsody.html

The official CWL pipeline has not been too satisfactory for us.

rob-p · 2023-08-15T14:20:52Z

cc @noahcape & @Daniel-Liu-c0deb0t: Could this modern BD Rhapsody data be a usecase for seqproc+ANTISEQUENCE? Can we see what would be required to perform this transformation?

stela2502 · 2024-12-04T07:56:29Z

HI all. After some time developing a BAM output and therfore also implementing a CIRAG class my mapping speed dropped significantly - to a value I dropped the project.
Hence I am back asking the same question - have you figured the BD primer structures out? Or is Alevin-Fry not supporting BS's data? Thank you for your support so far!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request - Rhapsody whole genome sequences #106

Feature request - Rhapsody whole genome sequences #106

stela2502 commented Feb 14, 2023

rob-p commented Feb 15, 2023

stela2502 commented Feb 16, 2023

stela2502 commented Feb 16, 2023

jashapiro commented Jun 13, 2023

rob-p commented Jun 13, 2023

stela2502 commented Jun 14, 2023

stela2502 commented Jun 14, 2023

colindaven commented Aug 15, 2023

rob-p commented Aug 15, 2023

stela2502 commented Dec 4, 2024

Feature request - Rhapsody whole genome sequences #106

Feature request - Rhapsody whole genome sequences #106

Comments

stela2502 commented Feb 14, 2023

rob-p commented Feb 15, 2023

stela2502 commented Feb 16, 2023

stela2502 commented Feb 16, 2023

jashapiro commented Jun 13, 2023

rob-p commented Jun 13, 2023

stela2502 commented Jun 14, 2023

stela2502 commented Jun 14, 2023

colindaven commented Aug 15, 2023

rob-p commented Aug 15, 2023

stela2502 commented Dec 4, 2024