scverse datastructure for AIRR data #327
Comments
Hi All, I am active in the AIRR Standards group, can try and answer questions. We are looking at integrating scirpy into the iReceptor Gateway (gateway.ireceptor.org) to do analysis on AIRR Cell data - so have a vested interest... 8-) The AIRR Cell schema was just announced at the AIRR meeting this week (https://www.antibodysociety.org/the-airr-community/meetings/airr-community-meeting-vi-exploring-new-frontiers/), with a tagged (v1.4) release expected on github in the next few weeks. This is an "Experimental" schema, meaning that we think it is pretty stable, but will probably need some minor changes before it is finalized, probably in an AIRR Standard v2.0 release sometime in the "near" future. |
The Cell schema is intended to represent a |
Here are the current docs on the master branch, which will be tagged with a v1.4 release soon. |
One of the outstanding things we need to resolve in the AIRR Standard is what, if any, "efficient" file format should be supported. Currently the default standard for all AIRR objects is a JSON representation of the standard objects. For We have an about-to-be-released version of our AIRR repository (https://github.com/sfu-ireceptor/turnkey-service-php/tree/production-v4) that can store Cell and CellExpression data, but currently what you get out is a very big, verbose JSON file. It works, but... |
BTW, I think it would be interesting to decorate a cell's adata.obs with AIRR |
Hi @bcorrie, thanks for reaching out, I'm definitely interested in working with the AIRR community to make this as interoperable as possible. Actually scirpy internally uses something very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of The internal representation is one side of the coin, the other is representing the AIRR data as part of either an
Well, how about Note that reading/writing gene expression formats would be out-of-scope for scirpy, this should be handled by scanpy or AnnData, or possibly by a future scverse IO package. It's also worth noting here that the CZI is currently looking into standardizing Feature Observation Matrices (FOM) for single-cell data. Maybe it would be worthwhile for the AIRR community and them to get in touch to talk about an AIRR-specific extension. Their public presence is still very sparse, but maybe @ivirshup can tell more about this initiative and make the contact. Best, |
Yes, AnnData seems logical and along the lines of what I was thinking... Definitely do not want YAF (yet another format) 8-) Like you, this is likely out of scope for the AIRR Community - at least in the sense that the obvious thing for us to do would be to have the AIRR python library (https://github.com/airr-community/airr-standards/tree/master/lang/python) use scanpy to read and write these formats. The CZI FOM is interesting as well... |
Very nice - one would hope that these would be closely aligned, and hopefully we can now make sure they are completely aligned 8-) |
Draft proposal for an awkward-array-based data structure
I've been playing around with the draft implementation of awkward array support in AnnData (scverse/anndata#647). This allows to store in An "awkward array" can be directly created from a list of AirrRearrangement dictionaries, e.g.
import awkward as ak
import scirpy as ir
airr_cells = ir.io.to_airr_cells(adata)
chains_as_awkarr = ak.Array([cell.chains for cell in airr_cells])
To get, e.g. all
It is therefore straightforward and computationally efficient to retrieve individual elements. In summary: We simply take AIRR compliant data, put it in an efficient data structure and store it in AnnData. Not a lot new to define from our side.
Caveats and possible solutions
1. Primary and Secondary chains
The list of chains does not know the concept of primary and secondary chains. This can be solved by adding an index array:
The awkward array can then be sliced
to retrieve e.g. the junction AA sequences for the primary VJ chains.
2. Getting values for plotting
The current solution to put everything in obs was, amongst others, a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway. If they are, here are some solutions to do so
3. Where to store the data
The current draft implementation only supports awkward arrays in It was also previously suggested to use mudata as AIRR data can be considered a modality. Since we don't have any data for I'm looking forward to feedback, especially from people that develop ecosystem packages for |
@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a
|
@bcorrie from discussion with the SOMA team, the intent was to keep anndata and SOMA (formerly FOM) pretty consistent with each other. Ideally they are the same thing eventually. Though I haven't heard much on SOMA development recently, so hopefully we are still on the same page here.
@grst, I've recently heard people wanting to treat their rearrangement data as a separate modality. E.g. have an AnnData with just rearrangement info which could then be merged with other modalities. I think the main point here was to avoid having to arbitrarily associate AIRR data with another modality. I think @bio-la and @crichgriffin could elaborate on this. |
Wouldn't it anyway work with both anndata and mudata to use
# anndata
ir.tl.something(adata)
# anndata, override obsm_key
ir.tl.something(adata, airr_key="ir")
# mudata
ir.tl.something(mdata.mod["AIRR"])
Or would you prefer something like
# mudata only, use `.mod["AIRR"]` by default
ir.tl.something(mdata)
# ... or override default modality
ir.tl.something(mdata, mod="myairr") |
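Supporting both call styles from the snippets above could look roughly like the following sketch. `resolve_airr` and its parameter names (`airr_mod`, `airr_key`) are hypothetical and not part of scirpy; the stand-in objects only mimic the `.obsm` and `.mod` attributes of AnnData and MuData so that the example runs without either package installed.

```python
from types import SimpleNamespace


def resolve_airr(data, airr_mod="airr", airr_key="airr"):
    """Return the per-cell AIRR container from an AnnData-like or MuData-like object.

    Hypothetical helper: MuData-like objects are recognized by their `.mod`
    attribute, everything else is treated as AnnData-like.
    """
    if hasattr(data, "mod"):
        if airr_mod not in data.mod:
            raise KeyError(f"modality {airr_mod!r} not found")
        data = data.mod[airr_mod]
    if airr_key not in data.obsm:
        raise KeyError(f"obsm key {airr_key!r} not found")
    return data.obsm[airr_key]


# Stand-ins for AnnData / MuData, used only to exercise the helper.
adata = SimpleNamespace(obsm={"airr": "chains-of-adata"})
mdata = SimpleNamespace(mod={"airr": adata, "gex": SimpleNamespace(obsm={})})

print(resolve_airr(adata))                   # AnnData-like input
print(resolve_airr(mdata))                   # MuData-like input, default modality
print(resolve_airr(mdata, airr_mod="airr"))  # modality given explicitly
```

With a helper like this, every tool function can accept either container, and the choice of what to advertise as the default becomes a documentation question only.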
I don't remember the details, but one of the issues was with subsetting. Wanting to filter AIRR obs, but not the RNA. I think it would work with AnnData right now, I think the issue was more about wanting to use scirpy with a |
In the context of multimodal data and using MuData, I like the idea of keeping the AIRR data separate from any specific modality. My current workflow is to store the scirpy anndata object as a separate modality in muon, where I also have rna and adt data, as separate modalities. But I agree that it doesn't make a lot of sense, since the scirpy anndata object is mostly empty. I agree with @grst that storing airr data could potentially work in both AnnData and MuData by using |
fill the corresponding rows with I think it should be easy to support both. Just need to think what to advertise as default in the tutorial. |
All of this sounds awesome. I just have a naive question: how well/fast does it deal with situations where there are 100k+ cells? would |
That's a fair point, I need to try this out, but it should be very fast. Awkward array claims to be similarly scalable as numpy arrays. |
Hi All, sorry for the lack of input - both fighting "the big C" and travelling to meetings (I don't recommend trying to do both at the same time) the last several weeks. I am not an AnnData expert - yet - but here are my observations thus far. We are experimenting with Conga and it annotates each AnnData cell with heavy and light chain VDJ/CDR3 in the .obs of the object. Other tools like CellTypist also populate the .obs with other data. This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be an 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice? It seems to make sense to me that the community might encourage this??? |
Yes, it occurred to me that plotting might be challenging using obsm. I personally think this would be quite important. I would anticipate using the obsm['airr'] to store not only rearrangement annotations, but also what we call repertoire annotations (by repertoire AIRR means study/subject/sample metadata). For example, it is quite easy for us to generate a pool of cell data from many subjects in a study, possibly across disease conditions (healthy, mild covid, severe covid). We would pool that data and create a single h5ad file. We would want the 'disease_state' and 'subject_id' to be in the AIRR obsm object so we could visualize cells based on these fields. Same for the AIRR 'tissue' field and a range of others potentially. It seems like a bad idea to have this difficult. |
A quick example - this is from a combined Conga/CellTypist analysis, with the .obs annotated with Conga and CellTypist fields. It is trivial to visualize VDJ annotations per cell next to CellTypist majority voting - at the same time - because the columns are added to .obs. If they were in .obsm and the visualizations could not make use of this easily, this would be challenging... 'vb', 'jb', and 'va' are from Conga, while 'majority_voting' is from CellTypist. I just threw these data sets together today to try and explore this data for our early experiments with CellTypist. So please be patient with the data oddities with TR call against projected B-cells. I have yet to validate anything from this image in terms of it making sense - and any and all errors are mine 8-) The main point is that these experimentations are easy because the visualizations are easy. |
This seems to make sense to me as a first pass, although I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see #327 (comment)) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable. One could have adata.obsm['airr_rearrangement'] and adata.obsm['airr_repertoire'] - but maybe too clunky? Another related link that cells potentially have is to "Receptors" - which are known B/T cell receptors that have a specific antigen/epitope specificity (think of the B and T cell specificity info in IEDB). Again, maybe worth capturing. |
For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to If I'm understanding the awkward array correctly, and I may not be, this would be the same as using the "record" array implementation to populate PS: |
where to store the data (
|
One of the tools we are integrating with (and therefore exporting to) is Conga - that is exactly our use case 8-) |
@grst said:
If I had my druthers, I would also have that index be the chain type instead of the locus, because that more naturally fits with how you're likely to work with the data (avoiding the awkwardness of always indexing on
However, this has nomenclature problems and is very B cell centric. There's no comparable terminology to I'm also not certain it makes sense to combine analyses for TRA and TRG. Hence,
In-house only. It doesn't appear in a public package (that I'm aware of). And it is not part of the AIRR Standards efforts or the AIRR Data Commons stuff @bcorrie is talking about. So there's no need to conform to this data structure if you have something better in mind. |
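As a small illustration of indexing chains by something other than the raw locus: one receptor-agnostic grouping is by recombination type, since TRA, TRG, IGK and IGL loci rearrange V and J segments only, while TRB, TRD and IGH also include a D segment. The sketch below is an assumption about how such an index could be built; the function name `chain_indices` and the "VJ"/"VDJ" labels are illustrative and not taken from the comment above.

```python
# Loci grouped by recombination type (V-J only vs. V-D-J).
VJ_LOCI = {"TRA", "TRG", "IGK", "IGL"}
VDJ_LOCI = {"TRB", "TRD", "IGH"}


def chain_indices(chains):
    """Group the positions of a cell's chains by recombination type.

    `chains` is a list of AIRR-Rearrangement-like dicts with a "locus" field.
    Chains with an unknown locus are ignored. Hypothetical helper.
    """
    indices = {"VJ": [], "VDJ": []}
    for position, chain in enumerate(chains):
        locus = chain.get("locus")
        if locus in VJ_LOCI:
            indices["VJ"].append(position)
        elif locus in VDJ_LOCI:
            indices["VDJ"].append(position)
    return indices


cell = [
    {"locus": "IGH", "junction_aa": "CARDY"},
    {"locus": "IGK", "junction_aa": "CQQYT"},
    {"locus": "IGL", "junction_aa": "CQSYD"},
]
print(chain_indices(cell))
```

This avoids heavy/light terminology for T cells, but it does put TRA and TRG into the same group, which is exactly the kind of combination questioned above.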
Very interesting discussion and thanks for looping me in! As of right now, scRepertoire is appending a portion of the filtered TCR/BCR alignments to the meta data of a single cell object (either Seurat or SingleCellExperiment). It is not necessarily ideal as it does not preserve AIRR format or other information that users have been requesting, like cdr1/2 sequences. In terms of implementing a change for consistency - from the R side of things, SingleCellExperiment functions as a SummarizedExperiment and is similar to the python equivalent. However, I don't think Seurat data format would be compatible with the array you are proposing. But I can do some investigating |
I decided to go with the If anyone has reservations against this approach, now would be a good time to speak up, otherwise it might be too late. |
I'm finally making some progress on the new datastructure (#356). Here is how it is currently working out:
I still need to iron out a few details, and most importantly, update the documentation and tutorial to reflect all these changes. There will be an automatic check and a conversion function to transfer data from the previous format into the new data structure. |
The new datastructure is now available in the You can install it using
Please also check out
|
@grst this is amazing thank you so much for implementing this ❤️ I can see in the tutorials how I can access Happy to open a PR to make this happen 🙂 |
I agree - great to see this moving forward - thanks @grst and others. @gszep we talked about Repertoire metadata earlier (#327 (comment)) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case. |
There is an example in the docs (https://scverse.org/scirpy/tags/v0.13.0rc1/tutorials/tutorial_io.html#Combining-multiple-samples) where multiple samples are combined, and the equivalent of the AIRR sample_id is set for multiple samples in the obs data. In this case the obs_name is used to assign a sample obs. So the trick would be to load in an AIRR Repertoire file and process it to do something similar based on the repertoire_id in the Rearrangement data. One day soon I hope to get to doing something like this for the Single Cell downloads from the iReceptor Gateway. Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data. |
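The "trick" described above could be sketched with plain pandas as follows. This assumes the Repertoire metadata has already been flattened into one row per `repertoire_id` (the real AIRR Repertoire JSON is nested, and flattening it is not shown); the column names `subject_id` and `disease_state` and all values are illustrative.

```python
import pandas as pd

# Per-cell table, as it might look in adata.obs after reading Rearrangement
# data that carries a repertoire_id for each cell (toy values).
obs = pd.DataFrame(
    {"repertoire_id": ["r1", "r1", "r2"]},
    index=["cell1", "cell2", "cell3"],
)

# Flattened Repertoire metadata, one row per repertoire (assumed layout).
repertoires = pd.DataFrame(
    {
        "repertoire_id": ["r1", "r2"],
        "subject_id": ["subject_a", "subject_b"],
        "disease_state": ["healthy", "severe covid"],
    }
)

# Broadcast the repertoire-level fields to every cell of that repertoire.
# `join(..., on=...)` keeps the cell index of `obs` intact.
annotated = obs.join(repertoires.set_index("repertoire_id"), on="repertoire_id")
print(annotated)
```

Once such columns are part of the per-cell table, cells can be grouped or colored by subject or disease state like any other annotation.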
Thank you for your reply. I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so. In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the |
I am unfortunately not an anndata/scirpy expert either, I come from the AIRR side. My simple answer would be that it depends on whether you want to compare the repertoires. If yes, then having them in a single object is probably best (as in the example - you have to worry about batch effects). That way you can slice and dice across any of the metadata fields from the Repertoire. I don't yet have any experience with how well scirpy and related tools will scale when you are throwing together data sets of this size. Any scirpy experts want to comment? |
Thanks. Scaling is my main concern. If I want to filter my Repertoires before using their Rearrangements I'm not sure it's efficient to use a single object. I guess one might be able to use the concatenate feature which will put multiple AnnData objects into a single object if you require it. It would be great if someone from the scirpy side could offer a more informed view than mine! |
Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object. |
i can help answer some of these questions (disclaimer - i still need time to familiarise myself with the new scirpy data structure but i'm familiar with the other main bits):
The main thing is that the anndata.obs slot index is
Yes. the thing to look out for in the airr table is that your
The |
The HDF5 file format was designed to have lazy constant time random access to any contiguous slices of your data. If these advantages are exposed in
|
Thanks for all your comments!
These are usually just stored as additional columns in
I'm definitely open to add more reader functions to load other AIRR schemas than I'll separately comment on scalability later. |
Regarding scalability: Different steps in the scirpy pipeline are subject to different limitations. My goal is to improve the scalability of scirpy such that analysis of (few) millions of cells is conveniently possible on a single workstation (e.g. >200GB RAM, >30 cores). This is tracked here: #370. For anything >10M cells, I believe we need to move to out-of-memory and out-of-core approaches. Solutions for this are still being figured out on the AnnData side as @gszep has pointed out. AnnData's current "backed mode" does not support To get a better idea of the current limitations of scalability, I'll share here my experiments with omniscope's longitudinal COVID19 dataset with 8M TCR-beta chains (and no gene expression data):
The real bottlenecks of the scirpy workflow are further downstream:
|
Any chance you have an iReceptor Gateway account 8-) I just did a download of the data from one subject from a cancer study from Yost et al (http://doi.org/DOI:10.1038/s41591-019-0522-3). The download contains three Repertoires from a single subject at different time points, two pre-treatment and one post-treatment. The data will have a rearrangement file, a cell file (AIRR JSON format), and a GEX file (AIRR JSON format) (as well as some other files). Each will contain all of the data of one type from all three Repertoires. That is the rearrangement file will contain rearrangements from all three repertoires (actually six repertoires in the rearrangement case, as they are split into TRA and TRB for each time point). It is a 10X study with ~4500 Cells across all three time points. The ZIP file is 180MB so probably want to share it outside of github. |
Happy to announce that the new datastructure is now rolled out as part of scirpy v0.13. |
Excellent! |
Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at scverse/scanpy#1387.
The challenge with scAIRR data is that 1 cell can have n chains. Up to four of them are biologically meaningful, but there could be more for technical reasons.
The current pragmatic solution is to store all fields in
adata.obs
.adata.obs
. Also, serializing excess chains is not really elegant.
New options are
adata.obs
.
The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.