Refactor concat to use merge for non-concatenated variables #3239
Conversation
Force-pushed 7afbe51 to 1b4858f
Force-pushed da23a33 to d3191d9
Force-pushed d3191d9 to 9dc340e
Force-pushed 25b8659 to c114143
Ready for review!
xarray/tests/test_combine.py (Outdated)
@@ -567,8 +567,8 @@ def test_combine_concat_over_redundant_nesting(self):

    def test_combine_nested_but_need_auto_combine(self):
        objs = [Dataset({"x": [0, 1]}), Dataset({"x": [2], "wall": [0]})]
        with raises_regex(ValueError, "cannot be combined"):
            combine_nested(objs, concat_dim="x")
        # with raises_regex(ValueError, "cannot be combined"):
This works now. With this PR, I think concat is very similar to auto_combine.
Actually this works because wall and x are both indexes, and since combine calls align, this will work.
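A minimal sketch of the case being discussed, assuming the post-refactor behaviour described above (the datasets mirror the test snippet):

```python
import xarray as xr

# both "x" and "wall" become index coordinates here, so alignment can reconcile them
objs = [xr.Dataset({"x": [0, 1]}), xr.Dataset({"x": [2], "wall": [0]})]

# previously this raised "cannot be combined"; with the refactor the
# non-concatenated index "wall" is handled via merge/align instead
combined = xr.combine_nested(objs, concat_dim="x")
print(combined)
```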
Now with docs! I updated the mfdataset and parallel reading sections in
xarray/core/concat.py (Outdated)

    if variables_to_merge:
        to_merge = []
        for ds in datasets:
            if variables_to_merge - set(ds.variables):
Note that set difference is not symmetric. I think you might want:

-    if variables_to_merge - set(ds.variables):
+    if variables_to_merge != set(ds.variables):
I think this is right. We are looking for variables that need to be merged but aren't present in ds.
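A small standalone illustration of why the asymmetric difference is the right check here (the sets below are made up):

```python
# variables that concat wants to hand off to merge
variables_to_merge = {"temp", "precip"}

# a dataset that already contains all of them, plus the concat dimension
ds_variables = {"temp", "precip", "x"}

# asymmetric difference: non-empty only when a variable that must be merged is missing from ds
print(variables_to_merge - ds_variables)   # set()  -> nothing missing
# plain inequality would also trigger when ds merely has *extra* variables
print(variables_to_merge != ds_variables)  # True
```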
xarray/core/merge.py (Outdated)

    # TODO: add preservation of attrs into fillna
    out = getattr(out, combine_method)(var)
    out.attrs = var.attrs
    equals_list.append(getattr(out, compat)(var.compute()))
Is there a reason why you're now explicitly computing everything ahead of time? In the previous implementation, at least we would not compute everything if there was a failure in comparison between the first two variables.
It seems like the ideal thing to do (in the case of dask) might be to build up a single computation graph and call compute once? Certainly we can save that for later, though.
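To illustrate the "single graph, one compute" idea with plain dask arrays (a rough sketch, not the code in this PR):

```python
import dask
import dask.array as da
import numpy as np

ref = da.from_array(np.arange(6, dtype=float).reshape(2, 3), chunks=2)
candidates = [ref + 0, ref * 1]

# build one lazy boolean per comparison; nothing is evaluated yet
lazy_checks = [da.all(ref == other) for other in candidates]

# a single compute() call evaluates every comparison in one graph,
# so shared inputs are only loaded/computed once
results = dask.compute(*lazy_checks)
print([bool(r) for r in results])  # [True, True]
```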
> Is there a reason why you're now explicitly computing everything ahead of time?
This was a mistake. I've fixed it.
I don't see any changes to this logic?
We do have inconsistent behavior here currently (on master), which your PR would resolve:
- Inside merge(), variables are computed after doing the equality comparison.
- Inside concat(), variables are computed before doing the equality comparison.
We could certainly change this for consistency, but I'm not 100% sure which is the better option.
Ah now I understand, we want to check attrs, shapes etc. before comparing values. How about

    if equals is None:
        out = out.compute()
        for var in variables[1:]:
            equals = getattr(out, compat)(var)
            if not equals:
                break

This still passes the dask test (test_concat_loads_variables in test_dask.py), but out is still being computed before doing any checks. Not sure how to get around this.
This looks pretty reasonable to me. @crusaderky any thoughts?
I guess the proper solution (for a different PR) is to have a function that loops through a list of variables and compares everything but values; and then if needed, loops again and compares values.
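A rough sketch of that two-pass idea (a hypothetical helper, not code from this PR): compare cheap metadata across all variables first, and only load and compare values when every cheap check passes.

```python
def variables_all_equal(variables, compat="equals"):
    """Hypothetical two-pass comparison of xarray Variables.

    Pass 1 compares metadata only (dims, shape, dtype) and never triggers
    a dask compute; pass 2 loads the reference values and compares them.
    """
    first = variables[0]

    # pass 1: cheap, metadata-only checks
    for var in variables[1:]:
        if (var.dims != first.dims
                or var.shape != first.shape
                or var.dtype != first.dtype):
            return False

    # pass 2: only now touch the data
    first = first.compute()
    for var in variables[1:]:
        if not getattr(first, compat)(var):
            return False
    return True
```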
Test failure is #3300
Thanks for the review @shoyer
This PR adds two changes:
- Refactors concat so that non-concatenated variables are passed to merge.
- concat gains a compat kwarg which tells merge how to merge things. concat's previous behaviour was effectively compat='equals', which is now the default. Add compat="override" to skip equality checking in merge. concat's data_vars='minimal' and coords='minimal' options are now more useful because you can specify compat='override' to skip equality checking and just pick the first one (see the sketch below). Previously this would raise an error if any of the variables differed by floating point noise.
- Passes black . && mypy . && flake8
- Fully documented, including whats-new.rst for all changes and api.rst for new API
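A hedged usage sketch of the compat="override" + data_vars="minimal" combination described above (the dataset contents and variable names are made up for illustration):

```python
import xarray as xr

# "scale" is a non-concatenated variable that differs only by floating point noise
ds1 = xr.Dataset({"a": ("x", [1.0, 2.0]), "scale": 0.1})
ds2 = xr.Dataset({"a": ("x", [3.0, 4.0]), "scale": 0.1 + 1e-16})

# with the default compat="equals" and data_vars="minimal" this would raise,
# because the two "scale" values are not exactly equal;
# compat="override" skips the equality check and keeps "scale" from the first dataset
out = xr.concat([ds1, ds2], dim="x", data_vars="minimal", compat="override")
print(float(out["scale"]))  # 0.1
```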