-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GroupBy.shuffle_to_chunks()
#9320
base: main
Are you sure you want to change the base?
Conversation
58d01b2
to
1df705e
Compare
1df705e
to
60d7619
Compare
* main: Revise (pydata#9366) Fix rechunking to a frequency with empty bins. (pydata#9364) whats-new entry for dropping python 3.9 (pydata#9359) drop support for `python=3.9` (pydata#8937) Revise (pydata#9357) try to fix scheduled hypothesis test (pydata#9358)
"Shuffle" means the object is sorted so that all group members occur sequentially, | ||
in the same chunk. Multiple groups may occur in the same chunk. | ||
This method is particularly useful for chunked arrays (e.g. dask, cubed). | ||
particularly when you need to map a function that requires all members of a group | ||
to be present in a single chunk. For chunked array types, the order of appearance | ||
is not guaranteed, but will depend on the input chunking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a single group limited to a single chunk? Assuming so, if we get one giant chuck, could that present any performance problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other chunks get "auto" reshaped. This is controlled by the chunks
kwarg, which only takes "auto" at the moment.
https://docs.dask.org/en/latest/generated/dask.array.shuffle.html
xarray/core/resample.py
Outdated
With resampling it is a lot better to use ``.chunk`` instead of ``.shuffle``, | ||
since one can only resample a sorted time coordinate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure I'm being slow here — but doesn't .shuffle
claim to sort the array?
Or this means "when resampling with a dimension other than the chunking dimension"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is confusing.
Resampling only works with sorted axes, so shuffling is basically identical to rechunking. Maybe I should just delete this line?
Yes, this makes sense. Is there a case for automatically running the equivalent of (For context: I'm keen to keep the interface of our main objects really small & orthogonal. This seems useful enough that it is worth adding a method for, but wanted to see if there's any way to get the benefits without even adding a method...) |
* main: fix html repr indexes section (pydata#9768) Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (pydata#9763) unpin array-api-strict, as issues are resolved upstream (pydata#9762) rewrite the `min_deps_check` script (pydata#9754) CI runs ruff instead of pep8speaks (pydata#9759) Specify copyright holders in main license file (pydata#9756) Compress PNG files (pydata#9747) Dispatch to Dask if nanquantile is available (pydata#9719) Updates to Dask page in Xarray docs (pydata#9495) http:// → https:// (pydata#9748) Discard useless `!s` conversion in f-string (pydata#9752) Apply ruff/flake8-simplify rule SIM401 (pydata#9749) Use micromamba 1.5.10 where conda is needed (pydata#9737) pin array-api-strict<=2.1 (pydata#9751) Reorganise ruff rules (pydata#9738) use new conda-forge package pydap-server (pydata#9741)
We could also just have Then # Save the groupby kwargs since we need it twice
groupers = {"label": UniqueGrouper()}
# get the shuffled dataset
shuffled: Dataset = ds.groupby(groupers).shuffle()
# Now group the shuffled dataset the same way and
# map a UDF that expects all members of a group in memory at the same time.
shuffled.groupby(groupers).map(UDF) Alternative names could be After thinking about it a while, This is all fairly advanced stuff, so we can start with the smallest possible API surface, which would be cc @shoyer |
Very much agree with all of your message —
(If it were |
Would it be too verbose to call this |
For the method is on the Otherwise we could workshop |
* main: Bump minimum versions (pydata#9796) Namespace-aware `xarray.ufuncs` (pydata#9776) Add prettier and pygrep hooks to pre-commit hooks (pydata#9644) `rolling.construct`: Add `sliding_window_kwargs` to pipe arguments down to `sliding_window_view` (pydata#9720) Bump codecov/codecov-action from 4.6.0 to 5.0.2 in the actions group (pydata#9793) Buffer types (pydata#9787) Add download stats badges (pydata#9786) Fix open_mfdataset for list of fsspec files (pydata#9785) add 'User-Agent'-header to pooch.retrieve (pydata#9782) Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771) fix cf decoding of grid_mapping (pydata#9765) Allow wrapping `np.ndarray` subclasses (pydata#9760) Optimize polyfit (pydata#9766) Use `map_overlap` for rolling reductions with Dask (pydata#9770)
GroupBy.distributed_shuffle()
OK I went with gb = array.groupby_bins("dim_0", bins=bins, **cut_kwargs)
shuffled: DataArray = gb.distributed_shuffle()
gb = shuffle.groupby_bins("dim_0", bins=bins, **cut_kwargs) I couldn't think of a shorter name, but happy to change it. |
74d7bb0
to
c77d7c5
Compare
6951e6a
to
bccacfe
Compare
how about |
Looks good! (No strong view on the exact name...) |
GroupBy.distributed_shuffle()
GroupBy.shuffle_to_chunks()
OK I'm going with shuffle_to_chunks. Will merge in ~24 hours. It's been around a while now :) |
Sounds good!
…On Wed, Nov 20, 2024 at 3:58 PM Deepak Cherian ***@***.***> wrote:
OK I'm going with shuffle_to_chunks. Will merge in ~12 hours. It's been
around a while now :)
—
Reply to this email directly, view it on GitHub
<#9320 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVXAWGTXX62ULVGF6L32BUOZVAVCNFSM6AAAAABMDO3MWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOBZG44DEMBZGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Really like the new name! |
This adds some new API to shuffle an Xarray object. Shuffling means we sort so that members of a group occur in the same chunk, with the possibility of multiple groups in a single chunk.
I've also added
shuffle_by
to DataArray and Dataset. This generalizessortby
, and lets you persist a shuffled Xarray object to disk.whats-new.rst
api.rst
2024.08.1
chunks
to signature.group
is chunkedcc @phofl