Add `GroupBy.shuffle_to_chunks()` #9320

dcherian · 2024-08-07T02:34:50Z

This adds some new API to shuffle an Xarray object. Shuffling means we sort so that members of a group occur in the same chunk, with the possibility of multiple groups in a single chunk.

gb = ds.groupby(..)
shuffled : Dataset = gb.distributed_shuffle()
shuffled.groupby(...).quantile()

I've also added shuffle_by to DataArray and Dataset. This generalizes sortby, and lets you persist a shuffled Xarray object to disk.

ds.shuffle_by(grouper)

xref Groupby-map is slow with out of order indices #9220, xref Using the shuffle primitive in Xarray #9546
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst
wait on dask 2024.08.1
upstream flox release
add chunks to signature.
error if group is chunked

cc @phofl

xarray/core/groupby.py

xarray/namedarray/daskmanager.py

xarray/core/groupby.py

xarray/tests/test_groupby.py

xref pydata/xarray#9320

* main: Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) try to fix scheduled hypothesis test (pydata#9358)

max-sixty · 2024-11-11T20:08:00Z

xarray/core/groupby.py

+        "Shuffle" means the object is sorted so that all group members occur sequentially,
+        in the same chunk. Multiple groups may occur in the same chunk.
+        This method is particularly useful for chunked arrays (e.g. dask, cubed).
+        particularly when you need to map a function that requires all members of a group
+        to be present in a single chunk. For chunked array types, the order of appearance
+        is not guaranteed, but will depend on the input chunking.


Is a single group limited to a single chunk? Assuming so, if we get one giant chuck, could that present any performance problems?

The other chunks get "auto" reshaped. This is controlled by the chunks kwarg, which only takes "auto" at the moment.

https://docs.dask.org/en/latest/generated/dask.array.shuffle.html

max-sixty · 2024-11-11T20:17:22Z

xarray/core/resample.py

+           With resampling it is a lot better to use ``.chunk`` instead of ``.shuffle``,
+           since one can only resample a sorted time coordinate.


I'm sure I'm being slow here — but doesn't .shuffle claim to sort the array?

Or this means "when resampling with a dimension other than the chunking dimension"?

Yes this is confusing.

Resampling only works with sorted axes, so shuffling is basically identical to rechunking. Maybe I should just delete this line?

max-sixty · 2024-11-11T20:30:42Z

ATM shuffle_by returns the shuffled Dataset or DataArray. groupby.shuffle() returns a new Groupby object that uses the shuffled data.

I am not very attached to this. FYO Stephan mentioned sortby as a possible alternate entrypoint instead.

Yes, this makes sense.

Is there a case for automatically running the equivalent of .shuffle_by on a chunked array when we run .sortby? And then we wouldn't need .shuffle_by? Or there are enough times when changing the chunk sizes would be confusing that it's not worthwhile?

(For context: I'm keen to keep the interface of our main objects really small & orthogonal. This seems useful enough that it is worth adding a method for, but wanted to see if there's any way to get the benefits without even adding a method...)

* main: fix html repr indexes section (pydata#9768) Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (pydata#9763) unpin array-api-strict, as issues are resolved upstream (pydata#9762) rewrite the `min_deps_check` script (pydata#9754) CI runs ruff instead of pep8speaks (pydata#9759) Specify copyright holders in main license file (pydata#9756) Compress PNG files (pydata#9747) Dispatch to Dask if nanquantile is available (pydata#9719) Updates to Dask page in Xarray docs (pydata#9495) http:// → https:// (pydata#9748) Discard useless `!s` conversion in f-string (pydata#9752) Apply ruff/flake8-simplify rule SIM401 (pydata#9749) Use micromamba 1.5.10 where conda is needed (pydata#9737) pin array-api-strict<=2.1 (pydata#9751) Reorganise ruff rules (pydata#9738) use new conda-forge package pydap-server (pydata#9741)

dcherian · 2024-11-13T02:20:45Z

We could also just have GroupBy.shuffle() -> Dataset | DataArray.

Then

# Save the groupby kwargs since we need it twice
groupers = {"label": UniqueGrouper()}

# get the shuffled dataset
shuffled: Dataset = ds.groupby(groupers).shuffle()

# Now group the shuffled dataset the same way and 
# map a UDF that expects all members of a group in memory at the same time.
shuffled.groupby(groupers).map(UDF)

Alternative names could be GroupBy.as_sorted() -> Dataset | DataArray, GroupBy.sorted_object() -> Dataset | DataArray.

After thinking about it a while, sortby is similar but different w.r.t chunking. It's vectorized indexing with an index you get from calling argsort, and dask/cubed is free to rechunk as they please. For example sortby(dim=UniqueGrouper()) when all(chunksizes["dim"] == 1) is basically free in terms of rechunking, but with a shuffle, you'd get the same labels in the same chunk so that chunking would have to change (assuming duplicated dim labels). We could have sortby a grouper object basically provide np.concatenate(_group_indices) as the indexer.

This is all fairly advanced stuff, so we can start with the smallest possible API surface, which would be GroupBy.shuffle IMO (or any of the other names above)

cc @shoyer

max-sixty · 2024-11-13T16:55:54Z

Very much agree with all of your message — sortby being a bit different, starting with GroupBy.shuffle

We could also just have GroupBy.shuffle() -> Dataset | DataArray.

(If it were GroupBy.shuffle() -> GroupBy, then I guess .shuffle_by would be equivalent to .groupby(groupers).shuffle().map(lambda x: x)...)

shoyer · 2024-11-13T18:04:16Z

Would it be too verbose to call this distributed_shuffle? Shuffling just has a very specific meaningful in distributed computing that is quite different from the generic meaning of "shuffle" for rearranging data.

max-sixty · 2024-11-13T19:13:47Z

Would it be too verbose to call this distributed_shuffle? Shuffling just has a very specific meaningful in distributed computing that is quite different from the generic meaning of "shuffle" for rearranging data.

For the method is on the GroupBy object, is shuffle less ambiguous? FWIW I had correctly presumed it meant its actual meaning given it was associated with group by

Otherwise we could workshop distributed_shuffle to something terser :)

* main: Bump minimum versions (pydata#9796) Namespace-aware `xarray.ufuncs` (pydata#9776) Add prettier and pygrep hooks to pre-commit hooks (pydata#9644) `rolling.construct`: Add `sliding_window_kwargs` to pipe arguments down to `sliding_window_view` (pydata#9720) Bump codecov/codecov-action from 4.6.0 to 5.0.2 in the actions group (pydata#9793) Buffer types (pydata#9787) Add download stats badges (pydata#9786) Fix open_mfdataset for list of fsspec files (pydata#9785) add 'User-Agent'-header to pooch.retrieve (pydata#9782) Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771) fix cf decoding of grid_mapping (pydata#9765) Allow wrapping `np.ndarray` subclasses (pydata#9760) Optimize polyfit (pydata#9766) Use `map_overlap` for rolling reductions with Dask (pydata#9770)

dcherian · 2024-11-19T20:28:31Z

OK I went with distributed_shuffle returning a shuffled Xarray object. Any further groupby operations will need to repeat the groupby statement:
e.g.

gb = array.groupby_bins("dim_0", bins=bins, **cut_kwargs)

shuffled: DataArray = gb.distributed_shuffle()

gb = shuffle.groupby_bins("dim_0", bins=bins, **cut_kwargs)

I couldn't think of a shorter name, but happy to change it.

dcherian · 2024-11-19T21:15:29Z

how about shuffle_to_chunks?

max-sixty · 2024-11-19T22:32:10Z

Looks good! (No strong view on the exact name...)

dcherian · 2024-11-20T23:57:56Z

OK I'm going with shuffle_to_chunks. Will merge in ~24 hours. It's been around a while now :)

shoyer · 2024-11-21T02:20:12Z

Sounds good!

…

On Wed, Nov 20, 2024 at 3:58 PM Deepak Cherian ***@***.***> wrote: OK I'm going with shuffle_to_chunks. Will merge in ~12 hours. It's been around a while now :) — Reply to this email directly, view it on GitHub <#9320 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJJFVXAWGTXX62ULVGF6L32BUOZVAVCNFSM6AAAAABMDO3MWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZG44DEMBZGY> . You are receiving this because you were mentioned.Message ID: ***@***.***>

max-sixty · 2024-11-21T02:31:08Z

Really like the new name!

Add GroupBy.shuffle()

3bc51bd

dcherian commented Aug 7, 2024

View reviewed changes

xarray/core/groupby.py Outdated Show resolved Hide resolved

dcherian force-pushed the groupby-shuffle branch from 58d01b2 to 1df705e Compare August 7, 2024 02:55

Cleanup

60d7619

dcherian force-pushed the groupby-shuffle branch from 1df705e to 60d7619 Compare August 7, 2024 03:00

dcherian mentioned this pull request Aug 7, 2024

Groupby-map is slow with out of order indices #9220

Open

dcherian added 2 commits August 7, 2024 09:23

Cleanup

d1429cd

fix

31fc00e

dcherian commented Aug 7, 2024

View reviewed changes

xarray/namedarray/daskmanager.py Outdated Show resolved Hide resolved

phofl mentioned this pull request Aug 7, 2024

Ensure that array-shuffle with range-like full indexer is a no-op dask/dask#11281

Closed

TomNicholas added topic-groupby topic-dask topic-chunked-arrays Managing different chunked backends, e.g. dask labels Aug 13, 2024

TomNicholas mentioned this pull request Aug 13, 2024

Groupby Shuffle primitive cubed-dev/cubed#552

Open

dcherian added 2 commits August 13, 2024 16:18

return groupby instance from shuffle

4583853

Fix nD by

abd9dd2

dcherian commented Aug 13, 2024

View reviewed changes

xarray/core/groupby.py Outdated Show resolved Hide resolved

dcherian commented Aug 13, 2024

View reviewed changes

xarray/tests/test_groupby.py Outdated Show resolved Hide resolved

dcherian and others added 3 commits August 13, 2024 20:35

Merge branch 'main' into groupby-shuffle

6b820aa

Skip if no dask

0d70656

fix tests

fafb937

dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 14, 2024

Fix bug with NaNs in by and method='blockwise'

b15990e

xref pydata/xarray#9320

dcherian mentioned this pull request Aug 14, 2024

Fix bug with NaNs in by and method='blockwise' xarray-contrib/flox#384

Merged

dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 14, 2024

Fix bug with NaNs in by and method='blockwise'

15eab72

xref pydata/xarray#9320

dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 14, 2024

Fix bug with NaNs in by and method='blockwise'

55c8897

xref pydata/xarray#9320

dcherian added a commit to xarray-contrib/flox that referenced this pull request Aug 14, 2024

Fix bug with NaNs in by and method='blockwise' (#384)

a9f51a1

xref pydata/xarray#9320

dcherian added 4 commits August 14, 2024 13:42

Add chunks to signature

a08450e

FIx self

d0cd218

Another Self fix

4edc976

max-sixty reviewed Nov 11, 2024

View reviewed changes

dcherian added 3 commits November 19, 2024 13:18

Remove shuffle_by

d467bc6

shuffle -> distributed_shuffle

231533c

dcherian changed the title ~~Add GroupBy.shuffle(), Dataset.shuffle_by(), DataArray.shuffle_by()~~ Add GroupBy.distributed_shuffle() Nov 19, 2024

return xarray object from distributed_shuffle

c77d7c5

dcherian force-pushed the groupby-shuffle branch from 74d7bb0 to c77d7c5 Compare November 19, 2024 20:32

fix

bccacfe

dcherian force-pushed the groupby-shuffle branch from 6951e6a to bccacfe Compare November 19, 2024 20:43

dcherian added 2 commits November 19, 2024 13:54

fix doctest

2d4392a

fix api

003e9f2

dcherian removed the needs review label Nov 19, 2024

Rename to shuffle_to_chunks

0f80c81

dcherian changed the title ~~Add GroupBy.distributed_shuffle()~~ Add GroupBy.shuffle_to_chunks() Nov 20, 2024

dcherian added the plan to merge Final call for comments label Nov 20, 2024

dcherian mentioned this pull request Nov 21, 2024

release notes for v2024.11.0 #9794

Draft

6 tasks

update docs

88bef5d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `GroupBy.shuffle_to_chunks()` #9320

Add `GroupBy.shuffle_to_chunks()` #9320

dcherian commented Aug 7, 2024 •

edited

Loading

max-sixty Nov 11, 2024

dcherian Nov 13, 2024

max-sixty Nov 11, 2024

dcherian Nov 13, 2024

max-sixty commented Nov 11, 2024

dcherian commented Nov 13, 2024 •

edited

Loading

max-sixty commented Nov 13, 2024

shoyer commented Nov 13, 2024

max-sixty commented Nov 13, 2024

dcherian commented Nov 19, 2024 •

edited

Loading

dcherian commented Nov 19, 2024

max-sixty commented Nov 19, 2024

dcherian commented Nov 20, 2024 •

edited

Loading

shoyer commented Nov 21, 2024 via email

max-sixty commented Nov 21, 2024

		With resampling it is a lot better to use ``.chunk`` instead of ``.shuffle``,
		since one can only resample a sorted time coordinate.

Add GroupBy.shuffle_to_chunks() #9320

Are you sure you want to change the base?

Add GroupBy.shuffle_to_chunks() #9320

Conversation

dcherian commented Aug 7, 2024 • edited Loading

max-sixty Nov 11, 2024

Choose a reason for hiding this comment

dcherian Nov 13, 2024

Choose a reason for hiding this comment

max-sixty Nov 11, 2024

Choose a reason for hiding this comment

dcherian Nov 13, 2024

Choose a reason for hiding this comment

max-sixty commented Nov 11, 2024

dcherian commented Nov 13, 2024 • edited Loading

max-sixty commented Nov 13, 2024

shoyer commented Nov 13, 2024

max-sixty commented Nov 13, 2024

dcherian commented Nov 19, 2024 • edited Loading

dcherian commented Nov 19, 2024

max-sixty commented Nov 19, 2024

dcherian commented Nov 20, 2024 • edited Loading

shoyer commented Nov 21, 2024 via email

max-sixty commented Nov 21, 2024

Add `GroupBy.shuffle_to_chunks()` #9320

Add `GroupBy.shuffle_to_chunks()` #9320

dcherian commented Aug 7, 2024 •

edited

Loading

dcherian commented Nov 13, 2024 •

edited

Loading

dcherian commented Nov 19, 2024 •

edited

Loading

dcherian commented Nov 20, 2024 •

edited

Loading