Storing sub-obs (variable length per observation data) #609
Comments
Yes @keller-mark! We really want a way to store this spot-level data in
Hey, I have been talking about this a bit with @hspitzer and @giovp. I was wondering if something like Awkward Array would be useful here? E.g. a data structure for ragged arrays, allowing variable-length arrays per observation? This should be easy-ish to support since they've already got tooling for serialization.
@ivirshup Do you know how this would work with zarr? Would each observation have its own zarr array (as opposed to one shared one)?
Zarr allows for ragged arrays, which is how I would probably try to do it if I were doing it from scratch. There have been implementations relying on
🤯 I had no idea!! This looks extremely cool!
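To make the ragged-array idea concrete, here is a minimal sketch in plain Python of the "flat values plus offsets" layout that ragged-array libraries commonly use (as far as I understand, Awkward Array's list-offset layout follows the same principle). The data and the helper names `to_ragged` and `get_row` are hypothetical, made up for illustration; this is not the Awkward Array or anndata API.

```python
# Hypothetical example data: x-coordinates of transcripts, grouped per cell.
# Each cell (observation) has a different number of transcripts.
per_cell = [[1.0, 2.5], [], [0.3, 0.7, 4.2]]


def to_ragged(lists):
    """Flatten a list of lists into one flat value list plus row offsets."""
    values = []
    offsets = [0]
    for item in lists:
        values.extend(item)
        offsets.append(len(values))  # end position of this row
    return values, offsets


def get_row(values, offsets, i):
    """Recover the variable-length entry for observation i."""
    return values[offsets[i]:offsets[i + 1]]


values, offsets = to_ragged(per_cell)
third_cell = get_row(values, offsets, 2)
```

With this layout, all observations share one flat array (which is easy to serialize as a regular array), and the offsets array has one more entry than there are observations.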
@ilan-gold thanks a lot for pointing me to ome/ngff#64 (review) as well as the hackathon, it looks like really interesting work! From what I gathered, the effort over at ome/ngff is to use anndata for tabular representations of FISH-based spatial data. In that case, observations are measured molecules instead of cells. Besides being an interesting approach, and definitely useful for storing annotations as coordinates of the decoded molecules (alongside the image in zarr), I think what I have in mind here is slightly different. I might be completely wrong, so I'll explain:
What we are missing in the current anndata/squidpy analysis toolkit is a way to represent the original (processed) data from FISH-based assays in the cell-level anndata representation. The original data, after processing/decoding, is essentially in the form described before (and in the PR). However, there is no direct way to index/slice/subset the decoded molecules by cell-level observation.
From my understanding of the problem, the cell-level observation is the basic unit for downstream analysis, and the one most useful for analysts. However, for EDA it could be desirable to eventually go back to the original molecule-level representation (e.g. given a clustering result, visualize all molecules of gene X in cell type Y across the tissue). For this, we essentially need a lookup table between cells (obs) and molecules (sub-obs).
We already have something like this working, but it's really ugly (essentially we store the sub-obs info as a Pandas series of lists, see this section of the tangram tutorial https://squidpy.readthedocs.io/en/latest/external_tutorials/tutorial_tangram.html#Deconvolution-and-mapping ). This lookup table can then be used to index/slice/subset the sub-obs annotation table (which we could store actually in
Ciao!
P.S. I'll reply to the rest of the email later on.
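A rough sketch of the lookup described in this comment, using only plain Python so it stays self-contained (in practice this would more likely be a pandas groupby). All gene names, cell IDs, cell-type labels and the helpers `build_lookup` and `molecules_of` are hypothetical; this is not how squidpy or anndata implement it.

```python
from collections import defaultdict

# Hypothetical molecule-level table (sub-obs): one record per decoded molecule.
molecules = [
    {"gene": "GeneX", "cell_id": "cell_0", "x": 1.0, "y": 2.0},
    {"gene": "GeneY", "cell_id": "cell_0", "x": 1.5, "y": 2.2},
    {"gene": "GeneX", "cell_id": "cell_1", "x": 8.0, "y": 3.0},
    {"gene": "GeneX", "cell_id": "cell_2", "x": 4.0, "y": 9.0},
]

# Hypothetical cell-level annotation (obs), e.g. the result of a clustering.
cell_type = {"cell_0": "TypeA", "cell_1": "TypeB", "cell_2": "TypeA"}


def build_lookup(molecules):
    """Map each cell ID to the row indices of its molecules."""
    lookup = defaultdict(list)
    for row, mol in enumerate(molecules):
        lookup[mol["cell_id"]].append(row)
    return dict(lookup)


def molecules_of(gene, ctype, molecules, cell_type, lookup):
    """Row indices of all molecules of `gene` in cells labelled `ctype`."""
    rows = []
    for cell, label in cell_type.items():
        if label != ctype:
            continue
        for row in lookup.get(cell, []):
            if molecules[row]["gene"] == gene:
                rows.append(row)
    return rows


lookup = build_lookup(molecules)
hits = molecules_of("GeneX", "TypeA", molecules, cell_type, lookup)
coords = [(molecules[r]["x"], molecules[r]["y"]) for r in hits]
```

The coordinates in `coords` are what one would then plot across the tissue.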
@giovp You're totally right about this, I got a little carried away. And there's no rush on the email - I need to start coding and stop emailing you all so much 😄 Let me collect my thoughts on this and I'll post soon!
Closed by #647
Hi,
We have a use case related to #237 but slightly different.
We would like to store a second obs array ("sub-obs", where an observation is an individual transcript in a MERFISH experiment), but related to the first obs array (where an observation is an individual cell).
Has your team thought about how to deal with this use case?
I am thinking something like this:
where the transcript ID and cell ID columns in the sub-obs can be like foreign keys into the main obs dataframe.
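As a rough illustration of the foreign-key idea (not the original figure from this issue, and not an existing AnnData feature), the relationship could be sketched in plain Python as below. The column names `transcript_id` and `cell_id`, the example values, and the helpers `dangling_keys` and `subset` are all assumptions made for this sketch.

```python
# Hypothetical main obs table: one entry per cell, keyed by cell ID.
obs = {
    "cell_0": {"area": 120.0},
    "cell_1": {"area": 95.5},
}

# Hypothetical sub-obs table: one row per transcript.
# "cell_id" acts like a foreign key into the main obs table.
sub_obs = [
    {"transcript_id": "t0", "cell_id": "cell_0"},
    {"transcript_id": "t1", "cell_id": "cell_0"},
    {"transcript_id": "t2", "cell_id": "cell_1"},
]


def dangling_keys(sub_obs, obs):
    """Cell IDs referenced by sub-obs that do not exist in obs."""
    return sorted({row["cell_id"] for row in sub_obs} - set(obs))


def subset(obs, sub_obs, keep):
    """Subset obs by cell ID and carry the selection over to sub-obs."""
    keep = set(keep)
    new_obs = {cell: ann for cell, ann in obs.items() if cell in keep}
    new_sub = [row for row in sub_obs if row["cell_id"] in keep]
    return new_obs, new_sub


missing = dangling_keys(sub_obs, obs)
obs_small, sub_small = subset(obs, sub_obs, ["cell_0"])
```

The point of the sketch is that slicing the cell-level table has to propagate to the transcript-level table, which is the behaviour a shared store would need to guarantee.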
Is it possible to add a differently-shaped obs and obsm to the same AnnData store?
cc @ilan-gold