anndata.concat with backed anndata objects #793

Closed
joshua-gould opened this issue Jun 24, 2022 · 10 comments

Comments

@joshua-gould
Contributor

I would like to be able to concatenate anndata objects without loading all the underlying data into memory.

@ivirshup
Member

See here for a quick demo of how this can be done with a little hacking: https://discourse.scverse.org/t/concat-anndata-objects-on-disk/400/2.

This mainly gets around loading X, but that should be the main source of memory usage. I think we can go for an early implementation of this in the experimental module in an upcoming release.

@ivirshup modified the milestones: 0.9, 0.10.0 (Jan 23, 2023)
@ivirshup
Member

As an update, this should now be possible using dask backed arrays.

I think we need to consider what exactly we want to get here. I suspect we may want to avoid backed mode entirely for this process. Instead we could just say that we're concatenating two stores, which we would then load using dask or a form of backed mode.
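
For a dense X stored as zarr, that route looks roughly like this (untested sketch; the paths, the eager obs/var reads, and write support for dask-backed X are assumptions):

    import anndata as ad
    import dask.array as da
    import zarr
    from anndata.experimental import read_elem

    def read_lazy_dense(path):
        """Open a zarr-backed AnnData with X loaded lazily as a dask array."""
        g = zarr.open(path, mode="r")
        return ad.AnnData(
            X=da.from_zarr(g["X"]),   # lazy, backed by the zarr chunks
            obs=read_elem(g["obs"]),  # obs/var are small, read them eagerly
            var=read_elem(g["var"]),
        )

    adatas = [read_lazy_dense(p) for p in ("adata1.zarr", "adata2.zarr")]
    merged = ad.concat(adatas, join="inner")  # X stays a lazy dask array
    merged.write_zarr("result.zarr")          # needs an anndata version that can write dask arrays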

I'd really like to be able to use kerchunk here, and avoid copying data. However, this may require being able to have zarr stores with irregular chunk sizes, something I hope we can push forward early this year (see ZEP 0003, discussion)

cc: @syelman

@zacharylau10

> See here for a quick demo of how this can be done with a little hacking: https://discourse.scverse.org/t/concat-anndata-objects-on-disk/400/2.
>
> This mainly gets around loading X, but that should be the main source of memory usage. I think we can go for an early implementation of this in the experimental module in an upcoming release.

Hi @ivirshup,
Concatenating h5ad files on disk is very helpful for large files, but slicing speed differs between h5ad files generated in memory and on disk.
Here are the three ways I generated h5ad files:
1. directly on disk
2. concat on disk, then read into memory and save again
3. concat in memory

Here is the code I used to test slicing speed (originally posted as a screenshot).
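
A minimal version of that test might look like this (file names and the slice taken are placeholders, not the original screenshot's code):

    import time
    import anndata as ad

    files = {
        "concat directly on disk": "concat_on_disk.h5ad",
        "on-disk concat, resaved from memory": "concat_on_disk_resaved.h5ad",
        "concat in memory": "concat_in_memory.h5ad",
    }

    for label, path in files.items():
        adata = ad.read_h5ad(path, backed="r")
        t0 = time.perf_counter()
        subset = adata[adata.obs_names[:1000]].to_memory()  # time slicing out 1000 cells
        print(f"{label}: {time.perf_counter() - t0:.3f}s")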

@selmanozleyen
Member

selmanozleyen commented Feb 14, 2023

Hi @ivirshup,
I will start working on this. To specify the requirements, I have some questions:

  • From what I understand, you'd like to let Dask handle the concatenation. Is this correct? And for this we would need to find a way to load data into dask without copies.
  • How are we going to avoid backed mode, or what do you mean by that?
  • Wouldn't it be easier if we supported reading .zarr and h5ad as Dask arrays?
  • Additional question: for dataframes, can't we add partial support for Dask dataframes just for this functionality?

@ivirshup
Member

ivirshup commented Mar 1, 2023

> you'd like to let Dask handle the concatenation. Is this correct? And for this we would need to find a way to load data into dask without copies.

Sorry for missing this. I think reading from dask could be a very reasonable solution, but I'm not sure if that's going to work easily for sparse arrays.

> How are we going to avoid backed mode, or what do you mean by that?

I don't think this function actually needs to take AnnData objects as arguments right now. I think this could start off with: concatenate_from_disk(["adata1.h5ad", "adata2.h5ad"], out="result.h5ad").

As some more context: I would kinda like to deprecate backed mode, and replace it with something a little more general. E.g. not only having X be backed, probably not allowing partial updates of backed objects. This could be like what @ilan-gold is working on with #924 or like the Shadows from @gtca's postdata.

It could be worth trying to work off the branch Ilan is using for a lazy representation of dataframes.

> Wouldn't it be easier if we supported reading .zarr and h5ad as Dask arrays?

We do! See https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html

> Additional question: for dataframes, can't we add partial support for Dask dataframes just for this functionality?

This could be worth it. I'd be curious how materializing the dataframe into the resultant store works. Up to you if you want to give this a shot.

@selmanozleyen
Member

selmanozleyen commented Mar 2, 2023

> I don't think this function actually needs to take AnnData objects as arguments right now. I think this could start off with: concatenate_from_disk(["adata1.h5ad", "adata2.h5ad"], out="result.h5ad").

Ok, I think I now understand what we need here. To clarify, we don't necessarily need a lazy intermediate object, right? When saying out of core, I was thinking we wouldn't load anything into memory unless we really needed to, but that doesn't seem to be the case. We will definitely load all these objects into memory, albeit not all at once?

> Sorry for missing this. I think reading from dask could be a very reasonable solution, but I'm not sure if that's going to work easily for sparse arrays.

Since we don't necessarily need lazy intermediate objects, not using dask would be smoother imo.

I will read more about the read_remote thing to see how it might help in our case.

UPDATE: I have some additional questions below

  • How would AnnDataRemote help with out of core concatenation?
  • It would help with loading dask arrays remotely, but what about the other attributes? Can we somehow do the same for dataframes?
  • Please correct me if I am wrong: Shadow objects can do the concatenation on disk, while AnnDataRemote can do it remotely. If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

I am asking because I am trying to understand in which aspect postdata and remote anndata would help my decisions.

I also discovered this read_dispatched thing.

  • Do you think it would be easy with the current API to write a function that calls a write partially every time it reads a pair partially?

@ivirshup
Member

ivirshup commented Mar 2, 2023

> To clarify, we don't necessarily need a lazy intermediate object, right?

These aren't necessary, no. It could be useful to have an object interface for the on-disk store, but you can also absolutely just work on the stores directly.

> We will definitely load all these objects into memory, albeit not all at once?

At the moment, yes. The basic feature here is allowing us to concatenate AnnDatas that we would not be able to concatenate in memory, so the goal is the lowest possible peak memory usage. An important consideration here: we probably never want to load a complete X or layer into memory at once. Ideally we are able to go chunk by chunk through the input files.

In the future some of this should be possible without needing to copy any data. Using things like kerchunk or hdf5 virtual datasets we should eventually be able to create a new combined object, which is composed of references to the input objects.
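
As a toy illustration of the no-copy idea with HDF5 virtual datasets (dense X and a shared var space assumed; this is not what anndata does today, just the underlying mechanism):

    import h5py

    sources = ["adata1.h5ad", "adata2.h5ad"]  # placeholder inputs with dense X
    shapes = []
    for p in sources:
        with h5py.File(p, "r") as f:
            shapes.append(f["X"].shape)

    n_obs = sum(s[0] for s in shapes)
    n_var = shapes[0][1]

    # The combined X is a view that references the source files; no data is copied.
    layout = h5py.VirtualLayout(shape=(n_obs, n_var), dtype="float32")
    row = 0
    for p, s in zip(sources, shapes):
        layout[row : row + s[0]] = h5py.VirtualSource(p, "X", shape=s)
        row += s[0]

    with h5py.File("combined_X.h5", "w") as f:
        f.create_virtual_dataset("X", layout)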

But for the first go-around: new file, copies of data.

Tbh, I think you could even start by doing an in-memory concatenation of everything except X and layers, saving that, then adding those in place to the result.
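
Something along these lines, for dense X and inputs that share the same var (sketch only; the encoding attributes are my reading of the current on-disk spec):

    import anndata as ad
    import h5py

    in_paths = ["adata1.h5ad", "adata2.h5ad"]  # placeholder inputs
    out_path = "result.h5ad"
    chunk_rows = 10_000

    # 1) Concatenate everything except X in memory (obs/var are comparatively small).
    parts = []
    for p in in_paths:
        a = ad.read_h5ad(p, backed="r")  # X stays on disk
        parts.append(ad.AnnData(obs=a.obs.copy(), var=a.var.copy()))
    result = ad.concat(parts, join="inner")
    result.write_h5ad(out_path)

    # 2) Stream X into the result file chunk by chunk.
    with h5py.File(out_path, "a") as out:
        n_obs, n_var = result.shape
        X_out = out.create_dataset("X", shape=(n_obs, n_var), dtype="float32")
        X_out.attrs["encoding-type"] = "array"    # so anndata recognises the element
        X_out.attrs["encoding-version"] = "0.2.0"
        row = 0
        for p in in_paths:
            with h5py.File(p, "r") as f:
                X_in = f["X"]
                for start in range(0, X_in.shape[0], chunk_rows):
                    stop = min(start + chunk_rows, X_in.shape[0])
                    X_out[row + start : row + stop] = X_in[start:stop]
                row += X_in.shape[0]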

> Since we don't necessarily need lazy intermediate objects, not using dask would be smoother imo.

Very likely could be the case. It's just that dask could help handle some of the "chunk by chunk" operations. Also consider whether you want to implement the logic of pd.concat yourself, or leave that to dask. I think we should just start with how="inner", so this shouldn't be very complicated.


> • How would AnnDataRemote help with out of core concatenation?

It is a representation of an AnnData that aims to avoid loading any data. This means you have the structure of the object, which can be helpful here. It also has a dataframe representation with important features that dask lacks, like being able to tell what shape it is. Also, I believe the SparseDataset class is used in AnnDataRemote, which already has some code for doing out of core concatenation.

As mentioned above: no need to use this if you think it overcomplicates things.

> • It would help with loading dask arrays remotely, but what about the other attributes? Can we somehow do the same for dataframes?

I think this depends a bit on how you intend to use read_dispatched. If you want to call it on individual arrays and dataframes (like a read_elem that returns out of core objects) then dataframes should be fine. You could have lines like:

    ...
    elif iospec.encoding_type == "array":
        return da.from_zarr(elem)
    elif iospec.encoding_type == "dataframe":
        return read_anndata_df(elem)
    ...

However, it would not help if you want to actually construct an anndata, due to the issues we've had supporting dask dataframes.
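
A fuller version of that callback, along the lines of the {read,write}_dispatched tutorial (sketch; assumes a zarr store, and reads dataframes and sparse matrices eagerly):

    import zarr
    import dask.array as da
    from anndata.experimental import read_dispatched, read_elem

    def read_lazy(store_path):
        z = zarr.open(store_path, mode="r")

        def callback(func, elem_name, elem, iospec):
            if iospec.encoding_type in ("dataframe", "csr_matrix", "csc_matrix"):
                # Read these eagerly and stop the dispatcher from recursing into them.
                return read_elem(elem)
            elif iospec.encoding_type == "array":
                # Dense arrays become lazy dask arrays backed by the zarr chunks.
                return da.from_zarr(elem)
            # Everything else (the top-level group, mappings, ...) keeps going
            # through the dispatched reader.
            return func(elem)

        return read_dispatched(z, callback)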


> • Please correct me if I am wrong: Shadow objects can do the concatenation on disk, while AnnDataRemote can do it remotely. If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

Specifically responding to:

> Shadow objects can do the concatenation on disk

Can they? I don't think I saw that in the tutorials... Could you confirm and share a link?

> If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

I would assume so? I expect going through something like fsspec would be easier than mounting all sorts of stores on the filesystem.


> I also discovered this read_dispatched thing.
> Do you think it would be easy with the current API to write a function that calls a write partially every time it reads a pair partially?

I'm not sure I'm exactly getting the question. What do you mean by "a pair" here?

@selmanozleyen
Member

> I'm not sure I'm exactly getting the question. What do you mean by "a pair" here?

I meant a join pair.

> I would assume so? I expect going through something like fsspec would be easier than mounting all sorts of stores on the filesystem.

I asked this question to try to understand what differentiates postdata and remote anndata, i.e. whether it is just the access interface or not.

Anyway, I will start by getting more familiar with read_dispatched and related machinery.

@ivirshup
Member

ivirshup commented Oct 4, 2023

This has been implemented for anndata 0.10 with anndata.experimental.concat_on_disk.

Waiting on some usage docs to close this issue for good.
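
For anyone landing here before those docs are up, basic usage looks roughly like this (the join argument is mirrored from anndata.concat and may differ in detail):

    from anndata.experimental import concat_on_disk

    # Concatenate two on-disk AnnData stores without loading X into memory.
    concat_on_disk(
        ["adata1.h5ad", "adata2.h5ad"],  # input h5ad or zarr stores
        "result.h5ad",                   # output store
        join="inner",
    )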

@flying-sheep
Member

Fixed in #1161. If you also want to get the gist’s contents somewhere, maybe we should add a new issue.
