(feat): allow `join=outer` for `concat_on_disk` #1504

Merged
ilan-gold merged 13 commits into main from ig/concat_on_disk_outer_join on Jul 4, 2024

Conversation

ilan-gold (Contributor)

codecov bot commented May 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.45%. Comparing base (2c9e400) to head (ff68f99).
Report is 73 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1504      +/-   ##
==========================================
+ Coverage   84.44%   84.45%   +0.01%     
==========================================
  Files          36       36              
  Lines        5747     5745       -2     
==========================================
- Hits         4853     4852       -1     
+ Misses        894      893       -1     
Files with missing lines            Coverage    Δ
src/anndata/experimental/merge.py   91.79% <ø>  (+0.42%) ⬆️

@ilan-gold ilan-gold marked this pull request as ready for review May 22, 2024 14:19
Comment on lines -539 to -542
if join != "inner":
    raise NotImplementedError("only inner join is currently supported")
ilan-gold (Contributor, Author)

I am trying to become more familiar with the code, but it's shocking this was all that was needed. That being said, the tests do pass and I looked at the results so 🤷

ilan-gold (Contributor, Author)

I mean, looking at the code path, it makes some sense. concat_on_disk constructs sensible-looking reindexers for the alt_dim axis and then passes those on, so the join argument shouldn't be needed anywhere else, despite the option to use it.
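
For illustration, a minimal sketch of the call this change unlocks; the file names here are hypothetical:

from anndata.experimental import concat_on_disk

# Before this PR, join="outer" raised NotImplementedError (the removed guard
# above); now the union along the alt_dim axis is kept, with missing values
# filled, without loading X into memory.
concat_on_disk(["a.h5ad", "b.h5ad"], "merged.h5ad", join="outer")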


zimea commented May 22, 2024

On a first run, the resulting object didn't contain the "var" dimension. Could you check whether you can replicate it?

ilan-gold (Contributor, Author)

@zimea #1505 is what you want, but there are clearly some problems there. So if you need your var, you can do something like this for each AnnData:

import anndata as ad
from anndata import concat

my_adatas = []
for path in my_paths:  # my_paths: the same files passed to concat_on_disk
    adata = ad.read_h5ad(path)
    del adata.X  # drop the big matrix; only the annotations are needed
    my_adatas.append(adata)

# merged_on_disk: the concat_on_disk result, read back from disk
merged_in_memory = concat(my_adatas, merge="...")
merged_on_disk.var = merged_in_memory.var

which should be manageable unless your obsm is also very large, in which case I would del that as well. This isn't pretty, but until I also fix #1505 it should work. You will also need to choose a merge strategy based on what you want (see the sketch below): https://anndata.readthedocs.io/en/stable/generated/anndata.concat.html
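
For concreteness, a hedged sketch of the merge strategies that link documents; which one is right depends on your data:

# Strategies from the anndata.concat documentation linked above:
merged_same = concat(my_adatas, merge="same")      # keep var columns identical across all inputs
merged_unique = concat(my_adatas, merge="unique")  # keep columns with one unique value per key
merged_first = concat(my_adatas, merge="first")    # take each column's first occurrence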

ilan-gold (Contributor, Author)

Ok @ivirshup, I think I have a slightly better sense of what you mean when you talk about concat_on_disk + dask. It seems like concat_on_disk won't disappear, but it will be vastly simplified by using dask, for which we need #1469.

For example, I have been looking into adding merge support, and the main stumbling block there appears to be that doing anything with the SparseDataset classes (i.e., reindexing + writing) is difficult, which is why

def _gen_slice_to_append(
    datasets: Sequence[BaseCompressedSparseDataset],
    reindexers,
    max_loaded_elems: int,
    axis=0,
    fill_value=None,
):
    for ds, ri in zip(datasets, reindexers):
        n_slices = ds.shape[axis] * ds.shape[1 - axis] // max_loaded_elems
        if n_slices < 2:
            # Small enough: load the whole dataset and reindex it in one go.
            yield (csr_matrix, csc_matrix)[axis](
                ri(to_memory(ds), axis=1 - axis, fill_value=fill_value)
            )
        else:
            # Too large for the element cap: stream it in slices along `axis`,
            # reindexing each slice separately.
            slice_size = max_loaded_elems // ds.shape[1 - axis]
            if slice_size == 0:
                slice_size = 1
            rem_slices = ds.shape[axis]
            idx = 0
            while rem_slices > 0:
                ds_part = None
                if axis == 0:
                    ds_part = ds[idx : idx + slice_size, :]
                elif axis == 1:
                    ds_part = ds[:, idx : idx + slice_size]
                yield (csr_matrix, csc_matrix)[axis](
                    ri(ds_part, axis=1 - axis, fill_value=fill_value)
                )
                rem_slices -= slice_size
                idx += slice_size
exists. So I think that, to move forward with the various merge types, we should merge #1469 and then integrate read_dask_as_elem here. Sound good?
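
To make the slicing arithmetic above concrete, a small worked example with purely illustrative numbers (none of these appear in the PR):

n_rows, n_cols = 100_000, 2_000  # a hypothetical CSR dataset, axis=0
max_loaded_elems = 10_000_000    # the element cap passed to the generator

n_slices = n_rows * n_cols // max_loaded_elems  # 200_000_000 // 10_000_000 == 20
slice_size = max_loaded_elems // n_cols         # 5_000 rows loaded per slice
assert n_slices == 20 and slice_size == 5_000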

@ilan-gold ilan-gold self-assigned this May 29, 2024
@flying-sheep flying-sheep modified the milestones: 0.10.8, 0.10.9 Jun 18, 2024
selmanozleyen (Member) left a comment

Looks good. Happy to see this was somewhat smooth.

flying-sheep (Member) left a comment

Looks good. It tests correctness, and if performance turns out to be a problem, we can address that later.

@ilan-gold ilan-gold merged commit 686786d into main Jul 4, 2024
14 of 15 checks passed
@ilan-gold ilan-gold deleted the ig/concat_on_disk_outer_join branch July 4, 2024 08:54

lumberbot-app bot commented Jul 4, 2024

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it:
git checkout 0.10.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 686786db35cafd8cbd7ebcc0128d5239fb06bce7
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #1504: (feat): allow `join=outer` for `concat_on_disk`'
  4. Push to a named branch:
git push YOURFORK 0.10.x:auto-backport-of-pr-1504-on-0.10.x
  5. Create a PR against branch 0.10.x; I would have named this PR:

"Backport PR #1504 on branch 0.10.x ((feat): allow join=outer for concat_on_disk)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

Development

Successfully merging this pull request may close these issues.

concat_on_disk outer join