Run test_count_call_alleles on Cubed (alternative approach) #1254

Merged — 7 commits merged into sgkit-dev:main from dist-array on Sep 10, 2024

Conversation

tomwhite (Collaborator) commented on Sep 6, 2024

See #908

This takes an alternative approach to the one in #1249: it uses map_blocks on both Dask and Cubed, rather than Xarray's apply_ufunc. Keeping the Dask code path the same avoids the possible issue of the Dask graph changing (originally noted in #871) that was seen in #1249.

I think this is the more pragmatic path forward as it allows us to experiment with Cubed by making minimal changes to Dask. (It also doesn't preclude using apply_ufunc in the future.)
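To make the approach concrete, here is a minimal sketch (not the PR's actual code; the function and variable names are illustrative) of counting alleles per call with map_blocks:

```python
import numpy as np
import dask.array as da

def count_alleles(g, n_alleles):
    # g is one block of genotype calls with shape (variants, samples, ploidy);
    # return per-call allele counts with shape (variants, samples, n_alleles)
    v, s, _ = g.shape
    out = np.zeros((v, s, n_alleles), dtype=np.uint8)
    for a in range(n_alleles):
        out[..., a] = (g == a).sum(axis=-1)
    return out

gt = da.from_array(
    np.array([[[0, 1], [1, 1]], [[0, 0], [0, 1]]], dtype=np.int8),
    chunks=(1, 2, 2),  # ploidy dimension is a single chunk
)
# map_blocks applies count_alleles to each block independently; the same
# call pattern works on a cubed.Array, which is what makes this portable.
counts = da.map_blocks(count_alleles, gt, n_alleles=2, dtype=np.uint8)
```

In this toy case the ploidy axis happens to be replaced by an alleles axis of the same length, so no `chunks`/`drop_axis` arguments are needed; real code would have to declare the output chunk shape.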

Would love to get your thoughts @jeromekelleher and @timothymillar.

tomwhite (Collaborator, Author) commented on Sep 6, 2024

One issue this work has highlighted is whether we should allow chunking in core dimensions (ploidy, in this case). With Dask we currently do allow ploidy to be chunked, but this only works because we rely on Dask's map_blocks to transparently concatenate chunks along chunked core dimensions. The Dask documentation at https://docs.dask.org/en/stable/generated/dask.array.map_blocks.html actually warns against relying on this:

Due to memory-size-constraints, it is often not advisable to use drop_axis on an axis that is chunked.

So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

I bring it up in this PR since Cubed will never concatenate chunks in this way, which exposed the issue.
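For illustration, a sketch of the implicit concatenation in question (assuming Dask behaves as its docs describe; this is not code from the PR):

```python
import dask.array as da

# the ploidy axis (last) is split into two size-1 chunks
g = da.ones((4, 3, 2), chunks=(2, 3, 1))

# Dask concatenates the ploidy chunks before calling the function, so the
# reduction sees the whole core dimension. This is the behaviour the Dask
# docs warn against relying on, and which Cubed does not provide at all.
s = da.map_blocks(lambda b: b.sum(axis=-1), g, drop_axis=2, dtype=g.dtype)
```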

jeromekelleher (Collaborator)

> So I think we should explicitly disallow this case, and fail if the input dataset is chunked in the ploidy dimension. There is no reason to chunk in this dimension, and our VCF converters in sgkit and bio2zarr never do.

Agreed - there's no good reason to allow this. 2D chunks are more than enough complication here!

@tomwhite tomwhite marked this pull request as ready for review September 6, 2024 12:57
tomwhite (Collaborator, Author) commented on Sep 9, 2024

Added a check for chunking in the ploidy dimension.
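The check itself is not shown in the thread; a minimal sketch of what such a guard might look like (the function name `ensure_ploidy_not_chunked` is hypothetical):

```python
import dask.array as da

def ensure_ploidy_not_chunked(arr, ploidy_axis=-1):
    # Hypothetical guard: fail fast rather than rely on Dask's implicit
    # concatenation along a chunked core dimension
    if len(arr.chunks[ploidy_axis]) > 1:
        raise ValueError("dataset must not be chunked in the ploidy dimension")
    return arr

ok = ensure_ploidy_not_chunked(da.zeros((4, 3, 2), chunks=(2, 3, 2)))
bad = da.zeros((4, 3, 2), chunks=(2, 3, 1))  # ploidy split into two chunks
# ensure_ploidy_not_chunked(bad) would raise ValueError
```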

timothymillar (Collaborator)

Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.
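For reference, the two styles are close to interchangeable; a minimal sketch with an illustrative function (not from sgkit):

```python
import dask.array as da

def sum_ploidy(g):
    # reduce over the trailing core dimension
    return g.sum(axis=-1)

g = da.ones((4, 3, 2), chunks=(2, 3, 2))

# da.gufunc wraps the function once and returns a callable...
summed_a = da.gufunc(sum_ploidy, signature="(k)->()", output_dtypes=float)(g)

# ...while da.apply_gufunc applies it directly, which mirrors the
# xarray/cubed calling convention more closely.
summed_b = da.apply_gufunc(sum_ploidy, "(k)->()", g, output_dtypes=float)
```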

tomwhite (Collaborator, Author)

> Sorry for the radio silence - I was on leave. This approach looks good to me. There are a couple of cases where we use da.gufunc instead of da.map_blocks. But those should be trivial to replace with da.apply_gufunc if we want to match xarray/cubed.

Thanks @timothymillar! That sounds good to me. My plan is to get the aggregation functions needed for QC working under Cubed next: sample_stats, variant_stats, hardy_weinberg_test.

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Sep 10, 2024
@tomwhite tomwhite merged commit c963f5e into sgkit-dev:main Sep 10, 2024
9 of 11 checks passed
@tomwhite tomwhite deleted the dist-array branch September 10, 2024 10:13