Handle extra indexes for zarr region writes #8904
Alternatively perhaps #8877 is too liberal and we should only drop indexes that are in

In general, I think we end up regretting this sort of implicit behaviour. It confuses a /lot/ of people. What do you think of keeping the error but adding a nice copy-pasteable

cc @max-sixty

Perhaps. Curious to hear other perspectives on this, and whether there are applications out there where people actually want to write indexes while doing region writes. I can't think of any. It's also always unsafe for parallel writes with

Personally I find it more confusing that I have to drop the otherwise essential coordinate labels on my arrays just to get a region write to work. My MO with all other xarray ops is to keep and utilize these whenever possible, so this feels like a departure from that. The current error message already returns the suggestion to add
Yes this! I agree both that the existing error is annoying and that magic behavior can be really confusing. At the cost of a small amount of perf and some additional code, would checking that the indexes match be a good balance? That way it only raises an error if the index on disk differs from the version in memory.
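A minimal sketch of what such a check could look like, using plain pandas indexes rather than xarray internals. The helper name `check_region_index` and its arguments are hypothetical and only illustrate the idea; this is not how xarray implements anything.

```python
import pandas as pd


def check_region_index(on_disk: pd.Index, in_memory: pd.Index, region: slice) -> None:
    """Raise only if the in-memory index disagrees with the stored index.

    Hypothetical helper: `on_disk` stands in for the full index already
    written to the store, `in_memory` for the index of the block being written.
    """
    expected = on_disk[region]
    if not expected.equals(in_memory):
        raise ValueError(
            f"index for region {region} differs from the index on disk: "
            f"expected {list(expected)}, got {list(in_memory)}"
        )


stored = pd.Index([0, 10, 20, 30, 40])

# Matching labels: the check passes silently and the write could proceed.
check_region_index(stored, pd.Index([10, 20]), slice(1, 3))

# Mismatched labels: the check raises instead of silently dropping the index.
try:
    check_region_index(stored, pd.Index([10, 25]), slice(1, 3))
except ValueError as err:
    print(err)
```

The cost mentioned above shows up here as the extra read of the stored index for every region that is written.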
Or what about an opt-in kwarg,

Actually I think #8460 might dominate this: by returning the dataset to write, it's moving the proposed magic out of

So my preference would be to try and push that through. If that were to merge, would this still be helpful?
I agree that #8460 should generally be the recommended way once that's merged. I think I do have some patterns where that doesn't totally apply though and I'm still going to need the

x = DataArray(dims=["time", "lat", "lon"])
y = DataArray(dims=["time", "lat", "lon"])
template = DataArray(dims=["lat", "lon", "parameter"])
xr.initialize_zarr(template, path)
# logic similar to:
# https://github.com/pydata/xarray/blob/a529f1d5b03279e88e3703b5a02a784838533d2c/xarray/tests/test_backends.py#L5708
blocks = get_block_slices()
for region in blocks:
    result = model.fit(x.sel(region), y.sel(region))
    # I could assign result.data to its region in the template here,
    # but that doesn't really make sense because I want to flush to disk and discard
    result.to_zarr(path, region=region)
I guess my main point with this PR is that #8877 already introduced the majority of the "magic" to drop all indexes. If we're doing that, I don't see any reason to raise when non-region indexes are passed. But if we think #8877 went overboard, then indeed this change doesn't make sense and we should probably revert to doing something like:
whats-new.rst
Small follow up to #8877. If we're going to drop the indices anyways for region writes, we may as well not raise if they are still in the dataset. This makes the user experience of region writes simpler:
I find this annoying because I often have a dataset with a bunch of unrelated indexes and have to remember which ones to drop, or use some verbose set logic. I thought #8877 might have already done this, but not quite. By just reordering the point at which we drop indices, we can now skip this. We still raise if data vars are passed that don't overlap with the region.

cc @dcherian
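For illustration, the kind of manual workaround described above might look like the sketch below. The helper `drop_non_region_vars` and the toy dataset are assumptions made for this example, not code from the PR; the sketch only prepares the block and leaves the actual write as a comment, since it assumes a store at a hypothetical `path` that was initialized beforehand.

```python
import numpy as np
import xarray as xr


def drop_non_region_vars(ds: xr.Dataset, region: dict) -> xr.Dataset:
    """Drop every variable (including index coordinates) that shares
    no dimension with the region. Hypothetical helper for illustration."""
    region_dims = set(region)
    to_drop = [
        name
        for name, var in ds.variables.items()
        if not region_dims & set(var.dims)
    ]
    return ds.drop_vars(to_drop)


ds = xr.Dataset(
    {"temp": (("x", "y"), np.zeros((4, 3)))},
    coords={"x": np.arange(4), "y": [10, 20, 30]},
)
region = {"x": slice(0, 2)}

# Select the block, then strip the unrelated "y" index by hand.
subset = drop_non_region_vars(ds.isel(region), region)
print(sorted(subset.variables))

# The block could then be written with something like:
# subset.to_zarr(path, region=region, mode="r+")
```

With the change proposed here, the explicit drop step would no longer be needed for indexes, because they are dropped before the region check runs.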