Avoid dependencies between chunks in variants dimensions caused by unnecessary dask usage #871

Closed
timothymillar opened this issue Jul 7, 2022 · 1 comment · Fixed by #872
Labels
question Further information is requested

Comments

@timothymillar
Collaborator

I've noticed that count_call_alleles and some methods that use cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for observed_heterozygosity on a dataset with 10 chunks in the variants dimension looks like this:

[Image: task graph, obshet_old]

In count_call_alleles, this is a result of using da.empty to indicate the number of alleles to a gufunc. In observed_heterozygosity (and also diversity), it is caused by forcing the sample_cohort array to be a dask array, which doesn't achieve much because we immediately call compute on that array to get the number of cohorts. Replacing both of these cases with numpy arrays results in the following equivalent task graph:
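To make the count_call_alleles case concrete, here is a minimal sketch of the pattern, not the actual sgkit code. The helper count_alleles_block is a hypothetical plain-numpy stand-in for the real gufunc, and the array sizes are made up. The only difference between the two calls is whether the "number of alleles" template is a dask array or a numpy array:

```python
import dask.array as da
import numpy as np


def count_alleles_block(genotypes, allele_template):
    """Count each allele per call; only the length of allele_template is used."""
    alleles = np.arange(allele_template.shape[0])
    # (variants, samples, ploidy, 1) == (alleles,), then sum over the ploidy axis
    return (genotypes[..., None] == alleles).sum(axis=2).astype(np.uint8)


# Hypothetical toy data: 20 variants, 3 samples, diploid, 4 possible alleles.
n_alleles, ploidy = 4, 2
rng = np.random.default_rng(0)
calls = rng.integers(0, n_alleles, size=(20, 3, ploidy))
G = da.from_array(calls, chunks=(2, 3, ploidy))  # 10 chunks along variants
out_chunks = (G.chunks[0], G.chunks[1], (n_alleles,))


def count_with(template):
    """Apply the block function with the given allele-count template."""
    return da.map_blocks(
        count_alleles_block,
        G,
        template,
        chunks=out_chunks,
        drop_axis=2,
        new_axis=2,
        dtype=np.uint8,
    )


# Dask template: its single chunk is a task that every output chunk depends on.
counts_dask = count_with(da.empty(n_alleles, dtype=np.uint8))
# Numpy template: passed to each block as a plain value, so no shared task.
counts_numpy = count_with(np.empty(n_alleles, dtype=np.uint8))
```

Both arrays compute to the same counts; the difference is only in the graph, which can be inspected with counts_dask.visualize() and counts_numpy.visualize() if graphviz is installed.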

[Image: task graph, obshet_new]

My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make such a change? I guess it makes the use of the sample_cohort array a little bit opaque.
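For the sample_cohort case, a rough sketch of what I mean, again not the actual sgkit code. The helper cohort_mean_block, the cohort labels and the statistic are all hypothetical. The cohort labels stay in a numpy array, the number of cohorts is computed eagerly from it, and the labels are handed to each block as a plain value:

```python
import dask.array as da
import numpy as np


def cohort_mean_block(values, cohort_ids, n_cohorts):
    """Average a (variants, samples) block within each cohort; -1 means no cohort."""
    out = np.full((values.shape[0], n_cohorts), np.nan)
    for c in range(n_cohorts):
        members = cohort_ids == c
        if members.any():
            out[:, c] = values[:, members].mean(axis=1)
    return out


# Hypothetical per-sample cohort labels, kept in memory as a numpy array.
sample_cohort = np.array([0, 0, 1, 1, 1, -1])
# The cohort count is needed for the output shape, so compute it eagerly.
n_cohorts = int(sample_cohort.max()) + 1

# Hypothetical per-call statistic, chunked only along the variants axis.
stat = da.from_array(np.arange(60, dtype=float).reshape(10, 6), chunks=(2, 6))

cohort_stat = da.map_blocks(
    cohort_mean_block,
    stat,
    sample_cohort,  # numpy array: no extra task shared between chunks
    n_cohorts,
    chunks=(stat.chunks[0], (n_cohorts,)),
    dtype=np.float64,
    meta=np.empty((0, 0), dtype=np.float64),  # skip meta inference on empty input
)
```

The slightly opaque part is that sample_cohort is now always materialised in memory, but it only has one entry per sample, so that seems acceptable.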

@tomwhite
Collaborator

Interesting - thanks for digging into this. It looks like a good change to me.
