anndata.concat with backed anndata objects #793

Closed
joshua-gould opened this issue Jun 24, 2022 · 10 comments

Comments

@joshua-gould
Contributor

I would like to be able to concatenate anndata objects without loading all the underlying data into memory.

@ivirshup
Member

See here for a quick demo of how this can be done with a little hacking: https://discourse.scverse.org/t/concat-anndata-objects-on-disk/400/2.

This mainly gets around loading X, but that should be the main source of memory usage. I think we can go for an early implementation of this in the experimental module in an upcoming release.

@ivirshup modified the milestones: 0.9, 0.10.0 (Jan 23, 2023)
@ivirshup
Member

As an update, this should now be possible using dask backed arrays.

I think we need to consider what exactly we want to get here. I suspect we may want to avoid backed mode entirely for this process. Instead we could just say that we're concatenating two stores, which we would then load using dask or a form of backed mode.
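
For a dense X stored as zarr, that route looks roughly like this (untested sketch; the paths, the eager obs/var reads, and write support for dask-backed X are assumptions):

    import anndata as ad
    import dask.array as da
    import zarr
    from anndata.experimental import read_elem

    def read_lazy_dense(path):
        """Open a zarr-backed AnnData with X loaded lazily as a dask array."""
        g = zarr.open(path, mode="r")
        return ad.AnnData(
            X=da.from_zarr(g["X"]),   # lazy, backed by the zarr chunks
            obs=read_elem(g["obs"]),  # obs/var are small, read them eagerly
            var=read_elem(g["var"]),
        )

    adatas = [read_lazy_dense(p) for p in ("adata1.zarr", "adata2.zarr")]
    merged = ad.concat(adatas, join="inner")  # X stays a lazy dask array
    merged.write_zarr("result.zarr")          # needs an anndata version that can write dask arrays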

I'd really like to be able to use kerchunk here, and avoid copying data. However, this may require being able to have zarr stores with irregular chunk sizes, something I hope we can push forward early this year (see ZEP 0003, discussion)

cc: @syelman

@zacharylau10

> See here for a quick demo of how this can be done with a little hacking: https://discourse.scverse.org/t/concat-anndata-objects-on-disk/400/2.
>
> This mainly gets around loading X, but that should be the main source of memory usage. I think we can go for an early implementation of this in the experimental module in an upcoming release.

Hi @ivirshup,
Concatenating h5ad files on disk is very helpful for large files, but slicing speed differs between h5ad files generated in memory and on disk.
Here are the three ways I generated h5ad files:
1. directly on disk
2. concat on disk, then read into memory and save again
3. concat in memory

Here is the code I used to test slicing speed (originally posted as a screenshot).
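
A minimal version of that test might look like this (file names and the slice taken are placeholders, not the original screenshot's code):

    import time
    import anndata as ad

    files = {
        "concat directly on disk": "concat_on_disk.h5ad",
        "on-disk concat, resaved from memory": "concat_on_disk_resaved.h5ad",
        "concat in memory": "concat_in_memory.h5ad",
    }

    for label, path in files.items():
        adata = ad.read_h5ad(path, backed="r")
        t0 = time.perf_counter()
        subset = adata[adata.obs_names[:1000]].to_memory()  # time slicing out 1000 cells
        print(f"{label}: {time.perf_counter() - t0:.3f}s")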

@selmanozleyen
Member

selmanozleyen commented Feb 14, 2023

Hi @ivirshup,
I will start working on this. To specify the requirements, I have some questions:

  • From what I understand, you'd like to let Dask handle the concatenation. Is this correct? And for this we would need to find a way to load data into dask without copies.
  • How are we going to avoid backed mode, or what do you mean by that?
  • Wouldn't it be easier if we supported reading .zarr and h5ad as Dask arrays?
  • Additional question: for dataframes, can't we add partial support for Dask dataframes just for this functionality?

@ivirshup
Member

ivirshup commented Mar 1, 2023

> you'd like to let Dask handle the concatenation. Is this correct? And for this we would need to find a way to load data into dask without copies.

Sorry for missing this. I think reading from dask could be a very reasonable solution, but I'm not sure if that's going to work easily for sparse arrays.

> How are we going to avoid backed mode, or what do you mean by that?

I don't think this function actually needs to take AnnData objects as arguments right now. I think this could start off with: concatenate_from_disk(["adata1.h5ad", "adata2.h5ad"], out="result.h5ad").

As some more context: I would kinda like to deprecate backed mode, and replace it with something a little more general. E.g. not only having X be backed, probably not allowing partial updates of backed objects. This could be like what @ilan-gold is working on with #924 or like the Shadows from @gtca's postdata.

It could be worth trying to work off the branch Ilan is using for a lazy representation of dataframes.

> Wouldn't it be easier if we supported reading .zarr and h5ad as Dask arrays?

We do! See https://anndata.readthedocs.io/en/latest/tutorials/notebooks/%7Bread%2Cwrite%7D_dispatched.html

> Additional question: for dataframes, can't we add partial support for Dask dataframes just for this functionality?

This could be worth it. I'd be curious how materializing the dataframe into the resultant store works. Up to you if you want to give this a shot.

@selmanozleyen
Member

selmanozleyen commented Mar 2, 2023

> I don't think this function actually needs to take AnnData objects as arguments right now. I think this could start off with: concatenate_from_disk(["adata1.h5ad", "adata2.h5ad"], out="result.h5ad").

Ok, I think I now understand what we need here. To clarify, we don't necessarily need a lazy intermediate object, right? When saying out of core, I was thinking we wouldn't load anything into memory unless we really needed to, but that doesn't seem to be the case. We will definitely load all these objects into memory, albeit not all at once?

> Sorry for missing this. I think reading from dask could be a very reasonable solution, but I'm not sure if that's going to work easily for sparse arrays.

Since we don't necessarily need lazy intermediate objects, not using dask would be smoother imo.

I will read more about the read_remote thing to see how it might help in our case.

UPDATE: I have some additional questions below

  • How would AnnDataRemote help with out of core concatenation?
  • It would help with loading dask arrays remotely, but what about the other attributes? Can we somehow do the same for dataframes?
  • Please correct me if I am wrong: Shadow objects can do the concatenation on disk, while AnnDataRemote can do it remotely. If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

I am asking because I am trying to understand in which aspect postdata and remote anndata would help my decisions.

I also discovered this read_dispatched thing.

  • Do you think it would be easy with the current API to write a function that calls a write partially every time it reads a pair partially?

@ivirshup
Member

ivirshup commented Mar 2, 2023

> To clarify, we don't necessarily need a lazy intermediate object, right?

These aren't necessary, no. It could be useful to have an object interface for the on-disk store, but you can also absolutely just work on the stores directly.

> We will definitely load all these objects into memory, albeit not all at once?

At the moment, yes. The basic feature here is allowing us to concatenate AnnDatas that we would not be able to concatenate in memory, so the goal is the lowest possible peak memory usage. An important consideration here: we probably never want to load a complete X or layer into memory at once. Ideally we are able to go chunk by chunk through the input files.

In the future some of this should be possible without needing to copy any data. Using things like kerchunk or hdf5 virtual datasets we should eventually be able to create a new combined object, which is composed of references to the input objects.
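
As a toy illustration of the no-copy idea with HDF5 virtual datasets (dense X and a shared var space assumed; this is not what anndata does today, just the underlying mechanism):

    import h5py

    sources = ["adata1.h5ad", "adata2.h5ad"]  # placeholder inputs with dense X
    shapes = []
    for p in sources:
        with h5py.File(p, "r") as f:
            shapes.append(f["X"].shape)

    n_obs = sum(s[0] for s in shapes)
    n_var = shapes[0][1]

    # The combined X is a view that references the source files; no data is copied.
    layout = h5py.VirtualLayout(shape=(n_obs, n_var), dtype="float32")
    row = 0
    for p, s in zip(sources, shapes):
        layout[row : row + s[0]] = h5py.VirtualSource(p, "X", shape=s)
        row += s[0]

    with h5py.File("combined_X.h5", "w") as f:
        f.create_virtual_dataset("X", layout)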

But for the first go-around: new file, copies of data.

Tbh, I think you could even start by doing an in-memory concatenation of everything except X and layers, saving that, then adding those in place to the result.
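
Something along these lines, for dense X and inputs that share the same var (sketch only; the encoding attributes are my reading of the current on-disk spec):

    import anndata as ad
    import h5py

    in_paths = ["adata1.h5ad", "adata2.h5ad"]  # placeholder inputs
    out_path = "result.h5ad"
    chunk_rows = 10_000

    # 1) Concatenate everything except X in memory (obs/var are comparatively small).
    parts = []
    for p in in_paths:
        a = ad.read_h5ad(p, backed="r")  # X stays on disk
        parts.append(ad.AnnData(obs=a.obs.copy(), var=a.var.copy()))
    result = ad.concat(parts, join="inner")
    result.write_h5ad(out_path)

    # 2) Stream X into the result file chunk by chunk.
    with h5py.File(out_path, "a") as out:
        n_obs, n_var = result.shape
        X_out = out.create_dataset("X", shape=(n_obs, n_var), dtype="float32")
        X_out.attrs["encoding-type"] = "array"    # so anndata recognises the element
        X_out.attrs["encoding-version"] = "0.2.0"
        row = 0
        for p in in_paths:
            with h5py.File(p, "r") as f:
                X_in = f["X"]
                for start in range(0, X_in.shape[0], chunk_rows):
                    stop = min(start + chunk_rows, X_in.shape[0])
                    X_out[row + start : row + stop] = X_in[start:stop]
                row += X_in.shape[0]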

> Since we don't necessarily need lazy intermediate objects, not using dask would be smoother imo.

Very likely could be the case. It's just that dask could help handle some of the "chunk by chunk" operations. Also consider whether you want to implement the logic of pd.concat yourself, or leave that to dask. I think we should just start with how="inner", so this shouldn't be very complicated.


> • How would AnnDataRemote help with out of core concatenation?

It is a representation of an AnnData that aims to avoid loading any data. This means you have the structure of the object, which can be helpful here. It also has a dataframe representation with important features that dask lacks, like being able to tell what shape it is. Also, I believe the SparseDataset class is used in AnnDataRemote, which already has some code for doing out of core concatenation.

As mentioned above: no need to use this if you think it overcomplicates things.

> • It would help with loading dask arrays remotely, but what about the other attributes? Can we somehow do the same for dataframes?

I think this depends a bit on how you intend to use read_dispatched. If you want to call it on individual arrays and dataframes (like a read_elem that returns out of core objects) then dataframes should be fine. You could have lines like:

    ...
    elif iospec.encoding_type == "array":
        return da.from_zarr(elem)
    elif iospec.encoding_type == "dataframe":
        return read_anndata_df(elem)
    ...

However, it would not help if you want to actually construct an anndata, due to the issues we've had supporting dask dataframes.
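
A fuller version of that callback, along the lines of the {read,write}_dispatched tutorial (sketch; assumes a zarr store, and reads dataframes and sparse matrices eagerly):

    import zarr
    import dask.array as da
    from anndata.experimental import read_dispatched, read_elem

    def read_lazy(store_path):
        z = zarr.open(store_path, mode="r")

        def callback(func, elem_name, elem, iospec):
            if iospec.encoding_type in ("dataframe", "csr_matrix", "csc_matrix"):
                # Read these eagerly and stop the dispatcher from recursing into them.
                return read_elem(elem)
            elif iospec.encoding_type == "array":
                # Dense arrays become lazy dask arrays backed by the zarr chunks.
                return da.from_zarr(elem)
            # Everything else (the top-level group, mappings, ...) keeps going
            # through the dispatched reader.
            return func(elem)

        return read_dispatched(z, callback)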


> • Please correct me if I am wrong: Shadow objects can do the concatenation on disk, while AnnDataRemote can do it remotely. If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

Specifically responding to:

> Shadow objects can do the concatenation on disk

Can they? I don't think I saw that in the tutorials... Could you confirm and share a link?

> If we can somehow access the remote directory from the filesystem, Shadow objects would also work remotely, right?

I would assume so? I expect going through something like fsspec would be easier than mounting all sorts of stores on the filesystem.


> I also discovered this read_dispatched thing.
> Do you think it would be easy with the current API to write a function that calls a write partially every time it reads a pair partially?

I'm not sure I'm exactly getting the question. What do you mean by "a pair" here?

@selmanozleyen
Member

> I'm not sure I'm exactly getting the question. What do you mean by "a pair" here?

I meant a join pair.

> I would assume so? I expect going through something like fsspec would be easier than mounting all sorts of stores on the filesystem.

I asked this question to try to understand what differentiates postdata and remote anndata, i.e. whether it is just the access interface or not.

Anyway, I will start by getting more familiar with read_dispatched and related machinery.

@ivirshup
Member

ivirshup commented Oct 4, 2023

This has been implemented for anndata 0.10 with anndata.experimental.concat_on_disk.

Waiting on some usage docs to close this issue for good.
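
For anyone landing here before those docs are up, basic usage looks roughly like this (the join argument is mirrored from anndata.concat and may differ in detail):

    from anndata.experimental import concat_on_disk

    # Concatenate two on-disk AnnData stores without loading X into memory.
    concat_on_disk(
        ["adata1.h5ad", "adata2.h5ad"],  # input h5ad or zarr stores
        "result.h5ad",                   # output store
        join="inner",
    )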

@flying-sheep
Member

Fixed in #1161. If you also want to get the gist’s contents somewhere, maybe we should add a new issue.
