anndata.concat with backed anndata objects #793
See here for a quick demo of how this can be done with a little hacking: https://discourse.scverse.org/t/concat-anndata-objects-on-disk/400/2. This mainly gets around loading …
As an update, this should now be possible using dask-backed arrays. I think we need to consider what exactly we want here. I suspect we may want to avoid backed mode entirely for this process. Instead we could just say that we're concatenating two stores, which we would then load using dask or a form of backed mode. I'd really like to be able to use … cc: @syelman
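For a concrete picture of the dask route, here is a minimal sketch; the store paths, the zarr layout, and a dense X are all illustrative assumptions, not something specified in this thread:

```python
# Minimal sketch of concatenating via dask-backed arrays, assuming two zarr
# stores ("a.zarr", "b.zarr") with dense X. Nothing is read eagerly; the
# final write streams through the chunks.
import anndata as ad
import dask.array as da
import zarr

def lazy_adata(path):
    g = zarr.open(path, mode="r")
    # da.from_zarr wraps the on-disk array without loading it
    return ad.AnnData(X=da.from_zarr(g["X"]))

merged = ad.concat([lazy_adata("a.zarr"), lazy_adata("b.zarr")], axis=0)
merged.write_zarr("merged.zarr")  # chunks are read and written incrementally
```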
Hi @ivirshup,
Sorry for missing this. I think reading from dask could be a very reasonable solution, but I'm not sure if that's going to work easily for sparse arrays.
I don't think this function actually needs to take AnnData objects as arguments right now. I think this could start off with … As some more context: I would kinda like to deprecate backed mode and replace it with something a little more general, e.g. not only having … It could be worth trying to work off the branch Ilan is using for a lazy representation of dataframes.
We do! See https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html
This could be worth it. I'd be curious how materializing the dataframe into the resultant store works. Up to you if you want to give this a shot.
Ok, I think I now understand what we need here. To clarify, we don't necessarily need a lazy intermediate object, right? When you said out of core, I was thinking we wouldn't load anything into memory unless we really needed it. But that doesn't seem to be the case: we will definitely load all of these objects into memory, just not all at once?
Since we don't necessarily need lazy intermediate objects, not using dask would be smoother imo. I will read more about the read_remote thing to see how it might help in our case. UPDATE: I have some additional questions below.
I am asking because I am trying to understand in what way postdata and remote AnnData would inform my decisions. I also discovered this read_dispatched thing.
These aren't necessary, no. It could be useful to have an object interface for the on-disk store, but you can also absolutely just work on the stores directly.
At the moment, yes. The basic feature here is allowing us to concatenate AnnDatas that we would not be able to concatenate in memory. So the goal is the lowest peak memory usage possible. An important consideration here: we probably never want to load a complete … In the future, some of this should be possible without needing to copy any data. Using things like kerchunk or HDF5 virtual datasets, we should eventually be able to create a new combined object that is composed of references to the input objects. But for the first go-around: new file, copies of data. Tbh, I think you could even start with doing an in-memory concatenation of everything except …
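To make the "new file, copies of data" idea concrete, here is a hedged sketch for the dense-X case only; the block size, store layout, and use of zarr are illustrative assumptions:

```python
# Stream-concatenate dense X matrices along the obs axis, holding at most
# one row block in memory at a time. Layout and chunking are assumptions.
import zarr

def concat_X_on_disk(in_paths, out_path, block=1024):
    xs = [zarr.open(p, mode="r")["X"] for p in in_paths]
    n_obs = sum(x.shape[0] for x in xs)
    n_var = xs[0].shape[1]
    out = zarr.open(out_path, mode="a")
    X_out = out.create_dataset(
        "X", shape=(n_obs, n_var), chunks=(block, n_var), dtype=xs[0].dtype
    )
    row = 0
    for x in xs:
        for start in range(0, x.shape[0], block):
            stop = min(start + block, x.shape[0])
            X_out[row + start : row + stop] = x[start:stop]  # one block in RAM
        row += x.shape[0]
```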
Very likely could be the case. It's just that dask could help handle some of the "chunk by chunk" operations. Also, whether you want to implement the logic of …
It is a representation of an AnnData that aims to avoid loading in any data. This means you have the structure of the object, which can be helpful here. It also has a dataframe representation with important features that dask lacks, like being able to tell what shape it is. Also, I believe the SparseDataset class is used in AnnDataRemote, and it already has some code for doing out-of-core concatenation. As mentioned above: no need to use this if you think it overcomplicates things.
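For reference, a hedged sketch of the out-of-core sparse concatenation SparseDataset enables; the import location has moved between anndata versions, so anndata.experimental.sparse_dataset, the file names, and the block size are assumptions here:

```python
# Append one on-disk CSR matrix to another in row blocks, so peak memory
# stays at roughly one block. File names and block size are illustrative.
import h5py
from anndata.experimental import sparse_dataset

block = 1000
with h5py.File("a.h5ad", "r") as a, h5py.File("out.h5ad", "w") as out:
    a.copy("X", out)  # seed the output with the first X group, copied on disk
with h5py.File("b.h5ad", "r") as b, h5py.File("out.h5ad", "a") as out:
    X_out = sparse_dataset(out["X"])
    X_in = sparse_dataset(b["X"])
    for start in range(0, X_in.shape[0], block):
        stop = min(start + block, X_in.shape[0])
        X_out.append(X_in[start:stop])  # each block is a scipy.sparse matrix
```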
I think this depends a bit on how you intend to use …

```python
elif iospec.encoding_type == "array":
    return da.from_zarr(elem)
elif iospec.encoding_type == "dataframe":
    return read_anndata_df(elem)
```

… However, it would not help if you want to actually construct an AnnData, due to the issues we've had supporting dask dataframes.
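For context, the fragment above would sit inside a read_dispatched callback along these lines; read_anndata_df is the hypothetical lazy dataframe reader discussed earlier, not an existing anndata function, and the store path is illustrative:

```python
# Sketch of a read_dispatched callback: dense arrays load lazily via dask,
# dataframes go through a hypothetical lazy reader, and everything else
# falls back to the default reading function.
import dask.array as da
import zarr
from anndata.experimental import read_dispatched

def callback(func, elem_name, elem, iospec):
    if iospec.encoding_type == "array":
        return da.from_zarr(elem)      # lazy dense array
    elif iospec.encoding_type == "dataframe":
        return read_anndata_df(elem)   # hypothetical lazy dataframe reader
    return func(elem)                  # default behaviour for everything else

adata = read_dispatched(zarr.open("store.zarr", mode="r"), callback)
```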
Specifically responding to:
> Can they? I don't think I saw that in the tutorials... Could you confirm and share a link?
I would assume so? I expect going through something like …
I'm not sure I'm exactly getting the question. What do you mean by "a pair" here?
I meant a join pair.
I asked this question to try to understand what differentiates postdata and remote AnnData, i.e. whether it is just the access interface or not. Anyway, I will start by getting more familiar with read_dispatched and related tooling.
This has been implemented for anndata 0.10 with: …
Waiting on some usage docs to close this issue for good.
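Until those docs land, a minimal hedged usage sketch of the shipped API; the paths are illustrative and the exact keyword set of anndata.experimental.concat_on_disk may differ:

```python
# Concatenate two stores file-to-file, never loading either AnnData fully
# into memory. Both .h5ad and zarr inputs are supported.
from anndata.experimental import concat_on_disk

concat_on_disk(["a.h5ad", "b.h5ad"], "merged.h5ad")
```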
Fixed in #1161. If you also want to get the gist's contents somewhere, maybe we should add a new issue.
I would like to be able to concatenate anndata objects without loading all the underlying data into memory.