New multiple table in-memory design #298
Thanks @LucaMarconato for this detailed post. I will try to answer point by point. While I agree that the current implementation is not optimal, I am still not convinced by the multiple table idea; I agree nonetheless that it is probably worth exploring alternatives.

Response to multiple tables arguments
This is partially true. While this is still very useful and indeed currently complicated to implement, another big motivation for the single table was that in practical use cases the user will do most of the molecular analyses at the single-table level, e.g. `sc.pp.normalize_total(sdata.table)` and not `for table in sdata.tables: sc.pp.normalize_total(table)`, which in fact has a drastically different outcome.
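To make the difference concrete, here is a small self-contained sketch (`sdata.tables` above is the hypothetical multi-table container): with the default `target_sum=None`, `normalize_total` scales each cell to the median total count of whatever `AnnData` it receives, so per-region calls use per-region medians.

```python
import anndata as ad
import numpy as np
import scanpy as sc

rng = np.random.default_rng(0)
t1 = ad.AnnData(rng.poisson(1.0, (50, 20)).astype(np.float32))  # low-count region
t2 = ad.AnnData(rng.poisson(5.0, (50, 20)).astype(np.float32))  # high-count region

# single-table world: one call, one shared median target for all cells
joint = ad.concat([t1, t2])
sc.pp.normalize_total(joint)   # target = median over *all* cells

# multi-table world: each region is normalized to its own median
for t in (t1, t2):
    sc.pp.normalize_total(t)   # target = median within each region only
```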
This is also not completely true, as we implicitly discourage storing additional annotations in the geodataframe. For example, no
Again, I feel pretty strongly against this, and while it's true that for points we don't have an alternative, it's also true that for points the type of annotation is "different" from what we want for regions (shapes, labels).
I agree that in the above point having a single mudata object would be useful. The data duplication, while maybe not impactful from a memory perspective, is still quite confusing IMHO.

Response to spatialdata_attrs arguments
While I understand the critique, I would be very interested to see a different proposal on this, in particular in the case where regions are labels. We thought quite hard and long about this while discussing the ngff proposal. It is true that it is suboptimal in the case where regions are shapes, but I really don't see an alternative in the labels case.

Separate proposal

I feel like a lot of the complexity that is raised in the above points, as well as in practical experience with spatialdata, is indeed due to the fact that we don't have a global index. If we were able to store the shapes region as a geopandas dataframe directly into anndata, then a lot of the indexing confusion would be resolved, and also situations like
Labels would still be stored separately since they are raster types, but I feel there is less of a problem if the index mismatch happens with labels. I am not sure how coordinate systems and transformations could be handled in this case yet. Nevertheless, my very short proposal is the following: store shapes information directly into anndata (potentially obsm).
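A minimal sketch of what that could look like for circles (toy data; the `obsm`/`obs` keys are illustrative, not an agreed spec):

```python
import numpy as np
from anndata import AnnData

rng = np.random.default_rng(0)
n_cells = 100
adata = AnnData(X=rng.poisson(1.0, size=(n_cells, 50)).astype(np.float32))

# the shapes live directly in the AnnData instead of a separate GeoDataFrame
adata.obsm["spatial"] = rng.uniform(0, 1000, size=(n_cells, 2))  # circle centers
adata.obs["radius"] = rng.uniform(5.0, 10.0, size=n_cells)       # circle radii
```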
Thanks for adding your comments to the proposal. I answer point by point.

Response to multiple tables arguments

On concatenating tables
I agree, the user would need to concatenate and deconcatenate the table; it would be one of the drawbacks of the proposal, a cost that I think is still worth paying. But I also see a solution to this: the users, instead of concatenating and deconcatenating the table, can instead concatenate both the shapes elements and the table.
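A toy sketch of that workaround, keeping the shapes element and the table aligned through the shared index (all names illustrative):

```python
import anndata as ad
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import Point

s1 = gpd.GeoDataFrame({"geometry": [Point(0, 0)]}, index=["v1-0"])
s2 = gpd.GeoDataFrame({"geometry": [Point(1, 1)]}, index=["v2-0"])
t1 = ad.AnnData(np.ones((1, 3), dtype=np.float32), obs=pd.DataFrame(index=["v1-0"]))
t2 = ad.AnnData(np.ones((1, 3), dtype=np.float32), obs=pd.DataFrame(index=["v2-0"]))

shapes = pd.concat([s1, s2])   # concatenated shapes element (still a GeoDataFrame)
table = ad.concat([t1, t2])    # concatenated table with the same shared index
```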
Extra columns in dataframes

While I was reluctant to have this, I actually realized when working on the data and talking to users (see for instance here #311) that it would be convenient to provide this flexibility. The
Another aspect I didn't find when skimming through the proposal:

Parallel table updates

As far as I remember, one of the strong points for Zarr as a storage backend was parallel write access. Imagine a SpatialData object with n images. A user creates n parallel processes where each segments an image and stores a label (no write conflict, except maybe

Currently, when "appending" a table, I have to check whether the current table is None and create one, or do
Whether we have a single table or multiple tables, there should be a thread-safe way to append a batch of new rows to it (and only write the new data).
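The "append" workaround described above, sketched (hedged: the exact spatialdata setter semantics may differ):

```python
import anndata as ad

def append_rows(sdata, new_rows: ad.AnnData) -> None:
    """Append a batch of rows to the single global table."""
    if sdata.table is None:
        sdata.table = new_rows  # first batch: just assign
    else:
        # no in-place append: concatenate into a brand-new table
        sdata.table = ad.concat([sdata.table, new_rows], join="outer")
```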
The next release of anndata will have support for concatenating stored anndata objects.
User Interface
Hard agree. For me it was impossible to add a table without referring to

I really like the approach where each element can be equipped with a table. This organisation leads to a very obvious mental model of the contents of a spatialdata object. This is less relevant for homogeneous sdata objects, so let me give you an example for a combined Visium + Xenium experiment:
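A rough sketch of what that could look like (element and table names are made up for illustration):

```
SpatialData
├── images
│   ├── visium_image
│   └── xenium_image
├── shapes
│   └── visium_spots       ← table: spot × gene counts
├── labels
│   └── xenium_cells       ← table: cell × gene counts
└── points
    └── xenium_transcripts
```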
In this multitable world the relationship between table entries and spatial elements is plain obvious. "There should be one — and preferably only one — obvious way to do it."
Users repeatedly storing additional annotation in the geodataframe points to unmet needs on the user side. An obvious and flexible user interface is the best preventative measure against users experiencing the need to bend spatialdata to their will by falling back on the flexibility of the basal objects. I think multiple tables can provide both the flexibility and the obviousness to guide users to the one way to do things.
Whether I prefer one table/multiple tables depends on the datasets I am using.
When having multiple fields of view in Xenium that match one Visium experiment, in the ideal world you would be able to group the Xenium experiments in one table and have the Visium in another table. However, I think this will become hard to keep intuitive, so I would prefer to move everything to different tables. A way to overcome the first issue would be (as @berombau already suggested somewhere) to have a function sdata.table that concatenates all the different sdata tables. Ideally, for me, you would be able to perform analysis steps on this concatenated table and have them also updated on the subtables (but to me, this sounds like a lot of work to program). The indices, however, possibly become a mess, as cells with the same index can end up in the same table. You would need a key (how different from region_key?) or a multi-index to keep them separated.
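A hedged sketch of that suggested convenience (the `sdata.tables` container name is hypothetical): concatenate all per-element tables while keeping a key so duplicated cell indices from different fields of view stay separable.

```python
import anndata as ad

def concatenated_table(sdata) -> ad.AnnData:
    return ad.concat(
        dict(sdata.tables),   # {element_name: AnnData}
        label="element",      # records each row's subtable of origin
        index_unique="-",     # disambiguates duplicated indices, e.g. "42-fov1"
    )
```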
If the different tables and shapes refer to different images, with different labels layers, and contain duplicated indices, it will be difficult without the key (that you would want to deprecate) to match the cells back to the correct image, which is necessary for all downstream spatial analysis (plotting, neighborhood analysis, ...).

Matching tables to images

In the whole discussion, you mention matching the shapes, labels and table layer
I was one of them, because when you start analysing data, you start with one field of view and one analysis pipeline. In that case, you don't need region_key, because all your cells belong to the same image and coordinate system. I think they are useful for analyses that go beyond one image. Maybe make it possible to not use them, so you only have to think about it when working with multiple images? Or explain clearly why they are there?
Make sure, if you do this, that when you update one table, the duplicated table is updated as well.

shapes in anndata
Although this is intuitive, it is not always practical, for sure not if you want to include shapes layers that aren't cells (veins, regions, ...); these you don't want to save in your shapes layer. When taking subsets of the data, you might still want to plot all other shapes (plot the shapes that are in the anndata in colour, and all the other ones just as outlines, for example). Also, if you would do it, solve the issues with interoperability.

shapes annotation

I wouldn't necessarily use this to annotate the shapes of the cells, but to annotate shapes of all objects that aren't cells, like veins, or necrotic layers, or annotated layers. Now, if you want to annotate your regions, they all need a separate layer. I think these things are still very interesting to save in the same sdata object, as you use them to calculate features of the sdata tables, and you visualize them on the same images.

mudata

I am currently not yet working on this type of data, but this will be very handy for multiome data. However, not all cells will always match up. The reason why I don't think this will solve the issue with transcriptomics is related to the fixed dimensions. It happens that a cell is picked up based on membrane staining but has no nucleus. In this case, the cell will only be present in the anndata of the whole cell and not in the one of the nucleus, and I don't know how well mudata can handle that (it happens in both directions btw, because the membrane stainings are not yet perfect).
Changing the on-disk format seems out of scope for this in-memory discussion. Nevertheless, it might be useful to imagine the ideal on-disk format, and incrementally build towards it by first having an in-memory format that reflects it. (Also, as a user interested in accessing the data from JS for visualization, the on-disk representation is the primary way I am interacting with SpatialData.) Apologies if any of this is redundant with things already discussed above.

Idea

Generalization of MuData to allow for storing separate tables by region. E.g., index values would only need to be unique per-region. If index values are shared across elements (or modalities) within a region, they would be assumed to refer to the same entity. SpatialData could provide helper functions for augmenting index values with region/element IDs to account for this (if using multi-index is not feasible).
APIs could be region-scoped or modality-agnostic. If we omit region or element information, we can concatenate the tables by adding extra columns, and a functional form could allow using custom names for these appended columns. A constructor syntax could build the nested structure directly; see the hedged sketch below.
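A purely hypothetical illustration of these ideas (none of these names exist; this is my sketch of the concept, not the proposal's original code):

```python
# region-scoped, per-modality access (names hypothetical)
rna = sdata.tables["region1"]["rna"]

# modality-agnostic access
table = sdata.tables["region1"]

# concatenation appending region/element columns, with customizable names
merged = sdata.concatenate_tables(region_col="region", element_col="element")

# constructor building the nested structure directly
sdata = SpatialData(tables={"region1": {"rna": rna_adata, "protein": prot_adata}})
```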
On-disk format

The on-disk format could have a deterministic structure, hierarchical like
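For illustration, such a deterministic hierarchy might look like this (a hypothetical layout, not something specified anywhere):

```
sdata.zarr
├── region1
│   ├── images/
│   ├── shapes/
│   └── tables/
│       ├── rna
│       └── protein
└── region2
    └── ...
```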
This splitting of the tables on-disk, while out of scope here, would be more favorable to the use case of loading subsets of data via HTTP requests.
Related: from the perspective of any one SpatialElement, is the user able to easily understand which other subset of data it is related to? Currently, there is a way to "navigate" from a table to the other elements. I can check

Once there are multiple tables, can I still "navigate" from

Perhaps this question only becomes relevant with a different on-disk format though, since in-memory it should be easy to iterate over multiple tables' metadata. A solution to using string keys to match region information could be to use the extent of the coordinates of the region. This could help allow matching data between regions that do not 100% overlap. For example, if the tables for each region are stored separately on-disk, then the extents of their coordinates can also be stored easily on-disk. Then users can use "collision detection" to understand which subsets of data overlap across elements. E.g., if a user of our visualization tool specifies that they are interested in viewing the Visium spot shapes, then we look up the extent of the coordinates of the region containing those spots, and then check if there is any tabular data that both annotates shapes and overlaps with this region in space.
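A minimal sketch of that "collision detection" idea, assuming each extent is stored as an axis-aligned bounding box (the per-table extents here are made-up):

```python
def extents_overlap(a, b):
    """Axis-aligned bounding-box overlap between two extents (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# hypothetical per-table extents, e.g. read from disk next to each table
table_extents = {
    "xenium_rep1_cells": (0, 0, 500, 500),
    "visium_spots": (400, 400, 900, 900),
}
visium_extent = table_extents["visium_spots"]
overlapping = [name for name, ext in table_extents.items()
               if extents_overlap(ext, visium_extent)]
print(overlapping)  # both entries: their bounding boxes overlap
```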
Howdy @LucaMarconato @giovp @ivirshup @kevinyamauchi @Others! I'd like to share my thoughts on why using multiple tables could be a great idea for the future:
Here are just a few reasons I could name, but I'm sure there are more. I'd be happy to hear why people are rooting for one table only, and to discuss further!
Thank you everyone for taking the time to write your very precious feedback! I have discussed the content of this issue and all the proposals with @berombau @SilverViking @lopollar @melonora @ivirshup @ArneDefauw in person, and all the participants agreed that adding multiple table support is a priority. The proposal that I added in the first post was well received; I summarize it here very briefly for those who didn't read the full discussion:
The main technical challenge for implementing the proposal is the fact that AnnData does not support integer indices, but I will try to work around this, at worst representing the same integer as strings in AnnData and as string/int elsewhere.
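A minimal sketch of that worst-case workaround (toy data; not the final implementation):

```python
import anndata as ad
import geopandas as gpd
import numpy as np
from shapely.geometry import Point

shapes = gpd.GeoDataFrame({"geometry": [Point(0, 0), Point(1, 1)]}, index=[10, 11])
table = ad.AnnData(np.zeros((2, 3), dtype=np.float32))

table.obs_names = shapes.index.astype(str)         # integer ids stored as strings in AnnData
matched = shapes.loc[table.obs_names.astype(int)]  # cast back to index the GeoDataFrame
```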
The three main design points that were highlighted as delicate were the following:

The proposal from @keller-mark would help with 2 because by requiring the user to specify

Another advantage of the proposal from Mark would be a tighter integration of Muon. Since Muon would be supported in a follow-up PR and not in the initial implementation of the multiple table support, it will be better to postpone the Muon-like APIs to that moment. The issue will remain open for additional comments and feedback. I plan a few weeks of work to merge the other PRs and finalize other work across the framework; if no blockers or strong objections arise, I plan to start implementing this after that.
Hey @LucaMarconato, great to see that you're actively working on this! It wasn't exactly clear to me whether the preferred format for multi-modal data is multiple AnnDatas or a single MuData. Would be nice to get your suggestion on this :) Daniel
Hi @dbdimitrov, ultimately we want to support
Hi,

Principles:
Use case: One SpatialData for all wells:
Single-table for all elements:
Multiple-table (1 table per element):
In my view, the new proposal should work for us, but since it is about the in-memory representation without changing the single on-disk table, it still needs incremental table updates. The need to add some rows/columns without Pandas operations that return a new (merged) table is pressing for us.
With the single-table design, I see a huge limitation. I don't know whether this has already been considered, or how a new in-memory API on top of the single table on disk solves this. When concatenating SpatialData tables (for example multiple tables for different regions), not all columns might be available in all tables. There will be missing values, and a single table must support missing values for any data type. With multiple tables, we can simply avoid empty columns and minimize the chance of missing values. As it is now, Pandas/AnnData fill them with NaN, changing the data type if necessary (!). For example, an integer column concatenated with an empty column becomes a column containing floats or NaN. That means whoever wants to read values from the supposed integer column must defensively convert back to int to avoid follow-up errors.
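A two-line demonstration of the dtype problem in plain pandas:

```python
import pandas as pd

a = pd.DataFrame({"count": [1, 2]})   # int64 column
b = pd.DataFrame({"other": [0.5]})    # no "count" column at all
merged = pd.concat([a, b])

print(merged["count"].dtype)    # float64: NaN forced an int -> float cast
print(merged["count"].tolist()) # [1.0, 2.0, nan]
```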
Hey @aeisenbarth! I agree the single-table design has serious limitations. Thank you for sharing your thoughts and use case. I just wanted to mention that @melonora is going to take on adding multiple table support starting next week.
@aeisenbarth thank you for sharing your multi-well use case in your comment.
I agree with you on the benefits that the new multiple table design would have. In addition, the last point should not be a problem either, because the new design will allow singleton tables (not associated to elements) to be stored; so it will be possible to store, for instance, one such table, containing only the obsp, for each well. One note on this: if an index is changed in one "regionprops" table, the change will not be forwarded to the corresponding "obsp" table. Anyway, I think this should not be a problem because, in the context of constructing data incrementally, the indices would not change after constructing the object. One additional note relative to your use case is that we are going to allow for nested NGFF hierarchies. So it will be possible to open and write a single
I also comment on the benefits of the other two approaches that you mentioned: "One SpatialData for all wells" and "Single-table for all elements". I will not comment on the limitations, since the multiple table design would relax them.
Concatenation and deconcatenation will be required; this will introduce some overhead, but I believe it is acceptable considering all the other implied benefits of the multiple table design. We could also mitigate this by providing some convenient helper functions.
The reality is that also with a single table one needs to choose what to annotate (for instance labels or shapes), and to switch, one would have to update the table metadata. We do this for example in the screenshot here (from the Xenium Visium notebook). With the new design we will also not allow one table to annotate multiple elements, but it will be easier to map one table to another element, since the table will now be matched to the shapes/labels by table.obs.index and not by the combination of the
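For illustration, the re-mapping could then be as simple as reindexing by the shared index (a sketch; element and container names are made up):

```python
# the table annotates the circles; re-target it to the matching polygons element
circles = sdata.shapes["cells_circles"]
polygons = sdata.shapes["cells_polygons"]
table = sdata.tables["cells_circles"]

# both elements share the same instance indices, so matching is a reindex
aligned = table[polygons.index.astype(str)]
```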
As discussed in my previous message, storing tables containing
I'm glad to see the progress on this issue! Just documenting another point that can/will be solved through multiple tables:

Per-region variable/feature aggregation metrics

```python
table.var["sum"] = table[table.obs.region == "region1"].X.sum(axis=0)  # ⚡
table.var["sum"] = table[table.obs.region == "region2"].X.sum(axis=0)  # ⚡ overwrites the line above
```

These are metrics computed for each matrix column (feature), like
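With one table per region, each table carries its own `var`, so the two writes no longer collide. A minimal sketch (the `sdata.tables` container name is hypothetical):

```python
import numpy as np

for name, t in sdata.tables.items():  # one AnnData per region
    # np.asarray(...).ravel() also covers the sparse-X case
    t.var["sum"] = np.asarray(t.X.sum(axis=0)).ravel()
```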
Is there a plan for the multiple tables to be reflected in the on-disk format (or is that not intended)? I did not see any issues related to this, but I only did a quick search.
In the end, the original idea brainstormed in the first post didn't make it into the final implementation. This came from first trying a draft implementation where we tried avoiding region, region_key and instance_key, but we then realized that indices were not suitable for the safe storage of the relationship between the tables and instances (because reindexing operations are common). Long story short, the multiple table in-memory design now still matches 100% the original storage format; we simply allow for multiple tables to be present (which was never a restriction on-disk).
Two more comments. First, to improve the user experience (which was one of the goals of the original design proposed in this issue), we implemented a series of APIs, such as the

Second, in-memory we realized we can actually drop
Scope of the proposal
After receiving feedback from @EliHei2 @lopollar @melonora @LLehner @AdemSaglamRB, hearing users asking for multiple table support as an important requirement for the adoption of `SpatialData` in their tools (Maks, Louis), and having personally examined the development implications of the current table specification, I propose a new in-memory design for supporting multiple tables in `SpatialData`, which I have been maturing in the past months. Importantly, the on-disk table specification will not change.

Timeline and feedback
I propose to implement the new in-memory representation after the following work: aggregation and query refactoring (me), io refactoring (@giovp), a new pipeline for daily testing of all the notebooks (me), synthetic datasets for R/JS interoperability (me), napari bugfix (me and others). So ideally we would start in 1-1.5 months, enough time for receiving feedback and improving the design. In particular I kindly ask for feedback from @kevinyamauchi @ivirshup @gtca @ilia-kats. I put some usernames in `raw` so as not to spam them now, but I will tag everybody after the first draft is refined, and additionally @timtreis @berombau @joshmoore.

Disk specification
The disk storage will remain the one described in ome/ngff#64. What I propose to change is the in-memory representation of the data: instead of directly representing in-memory the metadata that we have on-disk, I propose to put the data in a form that is more ergonomic and that follows the `(geo)pandas`, `anndata` and `muon` indexing common practices.

In-memory representation
I propose the following:

- Not having a global table anymore in the `SpatialData` object, and instead allowing each object to have either an `AnnData` table or a `MuData` table associated (or also having no table associated to it). To keep the implementation smaller, I would start with just an `AnnData` table, and allow using a `MuData` table in a follow-up PR. On disk, the `MuData` table would be saved as a list of tables, so that the current specs are still valid.
- Dropping `region`, `region_key`, `instance_key` in-memory, and instead relying on the same shared indexing between the table and the (single) corresponding shapes/labels element. Code examples below. This may require allowing `AnnData` to use custom indices and not just strings (or at least integers), or would force the (geo)dataframes to have string indices. We will find a way around this.
- Since points are `DaskDataFrame` (and soon also `DataFrame`) and shapes are `GeoDataFrame` (and soon maybe also `DaskGeoDataFrame`), the user can choose to store metadata as additional columns of the dataframe, without having to add a new table. I wrote a helper function (`get_values()`), similar to `scanpy.get.obs_df()`, that allows retrieving a column from the dataframe, from `obs` or from `var`; so having information located either in the dataframe or in the table is not a problem. Further note: the implementation of `get_values()` would become much simpler with the new table representation.

Code example 1: 3 Visium slides, one common table
Let's see how the current code and the new code would compare.
Current approach
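A hedged sketch of the current single-table construction, with toy stand-ins for the three per-slide tables (the real notebook code differs in details):

```python
import anndata as ad
import numpy as np
from spatialdata.models import TableModel

# toy per-slide AnnData tables (illustrative only)
adatas = {name: ad.AnnData(np.zeros((10, 5), dtype=np.float32))
          for name in ["visium1", "visium2", "visium3"]}

# one global table carrying region/region_key/instance_key metadata
table = ad.concat(adatas, label="region")
table.obs["instance_id"] = np.concatenate(
    [np.arange(a.n_obs) for a in adatas.values()]
)
table = TableModel.parse(
    table,
    region=["visium1", "visium2", "visium3"],
    region_key="region",
    instance_key="instance_id",
)
```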
Common operations on the table, such as finding all the rows that correspond to an element (such as `visium1`), sorted in the same order, are available via helper functions.
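Such a helper might look roughly like this (an illustrative sketch, not the actual spatialdata implementation; it assumes every element row is annotated):

```python
import pandas as pd

def table_rows_for_element(sdata, element_name: str):
    """Subset the global table to `element_name`, ordered like the element."""
    table = sdata.table
    attrs = table.uns["spatialdata_attrs"]
    region_key, instance_key = attrs["region_key"], attrs["instance_key"]
    sub = table[table.obs[region_key] == element_name]
    element = sdata.shapes[element_name]
    # position of each element index within the subset's instance ids
    order = pd.Index(sub.obs[instance_key]).get_indexer(element.index)
    return sub[order]
```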
But even a simple operation like this one requires laborious code (see here and the function called internally here). This requires a lot of work for us to implement and maintain, and hinders the ability of the user to easily extend the codebase or write a custom implementation to cover specific use cases.
In addition, some users told me that they don't find the `region`, `region_key`, `instance_key` approach natural, and they end up not using that metadata.

New approach (pseudocode)
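A hedged sketch of the idea (the constructor and attribute names are illustrative pseudocode, not a final API):

```python
# one table per element, matched purely by the shared index; no region /
# region_key / instance_key metadata anywhere
sdata = SpatialData(
    shapes={"visium1": shapes1, "visium2": shapes2, "visium3": shapes3},
    tables={"visium1": table1, "visium2": table2, "visium3": table3},
)
# invariant: table_i.obs_names mirrors shapes_i.index
```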
which is much simpler than the previous approach.
Getting the table corresponding to `visium1`, making the order of the rows match, does not require helper functions and would be as simple as
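the following (a sketch, assuming string obs names mirroring the shapes index):

```python
shapes = sdata.shapes["visium1"]
table = sdata.tables["visium1"]
aligned = table[shapes.index.astype(str)]  # rows reordered via the shared index
```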
Concatenating and splitting

If the user needs to concatenate the various subtables into a unique one, a way to do this would be something like the following (still a bit laborious, but way less than using `region`, `region_key` and `instance_key`, and not introducing a knowledge entry barrier for the user). Also, we could bundle it in a helper function.

Merging the table into a global one:
Splitting back the merged table:
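A hedged sketch covering both operations, built on `anndata.concat`:

```python
import anndata as ad

# merging the per-element tables into a global one
merged = ad.concat(
    {name: sdata.tables[name] for name in ["visium1", "visium2", "visium3"]},
    label="element",   # remembers each row's element of origin
    index_unique="-",  # keeps duplicated indices distinguishable
)

# splitting the merged table back into per-element tables
tables = {
    name: merged[merged.obs["element"] == name]
    for name in merged.obs["element"].unique()
}
```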
Code example 2: Visium + Xenium
Currently, for representing the Visium + Xenium data, we need multiple `SpatialData` objects and we also use `AnnData` layers and `.obsm` to store the results of aggregation. For instance, this is the `visium_roi_sdata` object produced by the Visium + Xenium notebook "00":

Note in particular that:
- there are multiple `SpatialData` objects, one for the Xenium replicate 1 and one for replicate 2;
- aggregation results end up spread across `table.layers['xe_rep1_cells']`, `table.layers['xe_rep1_tx']` and `table.obsm['xe_rep1_celltype_major']`.
Using `MuData` tables would keep things more organized, something like this:
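A hedged sketch of such a `MuData` table (the modality names and `AnnData` variables are illustrative):

```python
from mudata import MuData

table = MuData({
    "visium": visium_counts,         # AnnData: original Visium counts
    "xe_rep1_cells": xe_rep1_cells,  # AnnData: aggregated Xenium rep1 cells
    "xe_rep1_tx": xe_rep1_tx,        # AnnData: aggregated Xenium rep1 transcripts
})
```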
Notes:

- we would no longer need multiple `SpatialData` objects;
- if we don't support `MuData` but only `AnnData`, we can still obtain something similar to the above by duplicating the `shapes` layer. Suboptimal, but the duplicated data has low memory impact compared to the images, so it's a good temporary compromise.

Further considerations
Rows annotating multiple regions
Sometimes we would like to have rows of a table annotating multiple instances in different regions (e.g. the same cell represented as a circle, as a polygon, or as a label). We had some discussions in #34 and #99 and on Zulip, but I came to the conclusion that the easiest solution is just to duplicate the table, having the same table annotate each regions object. In particular, this solves the problem of having different indices, or having more or fewer instances in one regions object. Matching indices can always be recomputed on the fly using `geopandas.sjoin()` or equivalent functions for raster data that we will need to write anyway to extend `aggregate()`.

Subset by var/obs
One of the arguments for having one single table instead of multiple tables was to support syntaxes like this one (see more here #43):

But the reality is that we haven't implemented such a function yet, and implementing these things can be really messy (as shown in the helper function discussed above to subset and reorder a table by a regions object). Furthermore, the implementation would also be made complex by the fact that the `cell_type` information is not necessarily found in the table, but could also be located in a column of the `GeoDataFrame`, as discussed above.

The syntax above would instead simply become something like this, which also feels very natural:
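A hedged sketch of how that subset could look under the proposal, relying on the shared index between the table and the shapes (element and column names are illustrative):

```python
table = sdata.tables["cells"]
neurons = table[table.obs["cell_type"] == "neuron"]
neuron_shapes = sdata.shapes["cells"].loc[neurons.obs_names]
```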
Other linked issues
Linked discussion issues:
Issues that will no longer be relevant with this new design:

- `Table.validate()` #130 — reason: we drop the `region`, `region_key` and `instance_key` metadata for the in-memory representation (they will be used for IO only).