Run test_count_call_alleles on Cubed (alternative approach) #1254

Merged — 7 commits merged into sgkit-dev:main from dist-array on Sep 10, 2024

Conversation

tomwhite (Collaborator) commented on Sep 6, 2024

See #908

This takes an alternative approach to the one in #1249: it uses map_blocks on both Dask and Cubed, rather than Xarray's apply_ufunc. Keeping the Dask code path the same avoids the possible issue of the Dask graph changing (originally noted in #871) that was seen in #1249.

I think this is the more pragmatic path forward as it allows us to experiment with Cubed by making minimal changes to Dask. (It also doesn't preclude using apply_ufunc in the future.)
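To make the approach concrete, here is a minimal sketch (not the PR's actual code; the function and variable names are illustrative) of counting alleles per call with map_blocks:

```python
import numpy as np
import dask.array as da

def count_alleles(g, n_alleles):
    # g is one block of genotype calls with shape (variants, samples, ploidy);
    # return per-call allele counts with shape (variants, samples, n_alleles)
    v, s, _ = g.shape
    out = np.zeros((v, s, n_alleles), dtype=np.uint8)
    for a in range(n_alleles):
        out[..., a] = (g == a).sum(axis=-1)
    return out

gt = da.from_array(
    np.array([[[0, 1], [1, 1]], [[0, 0], [0, 1]]], dtype=np.int8),
    chunks=(1, 2, 2),  # ploidy dimension is a single chunk
)
# map_blocks applies count_alleles to each block independently; the same
# call pattern works on a cubed.Array, which is what makes this portable.
counts = da.map_blocks(count_alleles, gt, n_alleles=2, dtype=np.uint8)
```

In this toy case the ploidy axis happens to be replaced by an alleles axis of the same length, so no `chunks`/`drop_axis` arguments are needed; real code would have to declare the output chunk shape.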

Would love to get your thoughts @jeromekelleher and @timothymillar.

tomwhite (Collaborator, Author) commented on Sep 6, 2024

One issue this work has highlighted is whether we should allow chunking in core dimensions (ploidy, in this case). With Dask we currently do allow ploidy to be chunked, but this only works because we rely on Dask's map_blocks to transparently concatenate chunks along chunked core dimensions. The Dask documentation at https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html actually warns against relying on this:

Due to memory-size-constraints, it is often not advisable to use drop_axis on an axis that is chunked.

So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

I bring it up in this PR since Cubed will never concatenate chunks in this way, which exposed the issue.
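For illustration, a sketch of the implicit concatenation in question (assuming Dask behaves as its docs describe; this is not code from the PR):

```python
import dask.array as da

# the ploidy axis (last) is split into two size-1 chunks
g = da.ones((4, 3, 2), chunks=(2, 3, 1))

# Dask concatenates the ploidy chunks before calling the function, so the
# reduction sees the whole core dimension. This is the behaviour the Dask
# docs warn against relying on, and which Cubed does not provide at all.
s = da.map_blocks(lambda b: b.sum(axis=-1), g, drop_axis=2, dtype=g.dtype)
```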

jeromekelleher (Collaborator)

> So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

Agreed - there's no good reason to allow this. 2D chunks are more than enough complication here!

@tomwhite tomwhite marked this pull request as ready for review September 6, 2024 12:57
tomwhite (Collaborator, Author) commented on Sep 9, 2024

Added a check for chunking in the ploidy dimension.
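The check itself is not shown in the thread; a minimal sketch of what such a guard might look like (the function name `ensure_ploidy_not_chunked` is hypothetical):

```python
import dask.array as da

def ensure_ploidy_not_chunked(arr, ploidy_axis=-1):
    # Hypothetical guard: fail fast rather than rely on Dask's implicit
    # concatenation along a chunked core dimension
    if len(arr.chunks[ploidy_axis]) > 1:
        raise ValueError("dataset must not be chunked in the ploidy dimension")
    return arr

ok = ensure_ploidy_not_chunked(da.zeros((4, 3, 2), chunks=(2, 3, 2)))
bad = da.zeros((4, 3, 2), chunks=(2, 3, 1))  # ploidy split into two chunks
# ensure_ploidy_not_chunked(bad) would raise ValueError
```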

timothymillar (Collaborator)

Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.
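For reference, the two styles are close to interchangeable; a minimal sketch with an illustrative function (not from sgkit):

```python
import dask.array as da

def sum_ploidy(g):
    # reduce over the trailing core dimension
    return g.sum(axis=-1)

g = da.ones((4, 3, 2), chunks=(2, 3, 2))

# da.gufunc wraps the function once and returns a callable...
summed_a = da.gufunc(sum_ploidy, signature="(k)->()", output_dtypes=float)(g)

# ...while da.apply_gufunc applies it directly, which mirrors the
# xarray/cubed calling convention more closely.
summed_b = da.apply_gufunc(sum_ploidy, "(k)->()", g, output_dtypes=float)
```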

tomwhite (Collaborator, Author)

> Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.

Thanks @timothymillar! That sounds good to me. My plan is to get the aggregation functions needed for QC working under Cubed next: sample_stats, variant_stats, hardy_weinberg_test.

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Sep 10, 2024
@tomwhite tomwhite merged commit c963f5e into sgkit-dev:main Sep 10, 2024
9 of 11 checks passed
@tomwhite tomwhite deleted the dist-array branch September 10, 2024 10:13