Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid auto creation of indexes in concat #8872

Merged
merged 81 commits into from
May 8, 2024

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Mar 25, 2024

If we create a Coordinates object using the concatenated result_indexes, and pass that to the Dataset constructor, we can explicitly set the correct indexes from the start, instead of auto-creating the wrong ones and then trying to overwrite them with the correct indexes later (which is what the current implementation does).

@TomNicholas TomNicholas changed the title Concat avoid index auto creation Avoid auto creation of indexes in concat Mar 25, 2024
@TomNicholas
Copy link
Member Author

TomNicholas commented Mar 25, 2024

All the tests in test_concat.py pass, but this change causes one failure in test_groupby.py:

FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}

xarray/core/concat.py Outdated Show resolved Hide resolved
Comment on lines 668 to 671
result_indexes[dim] = index

# TODO: add indexes at Dataset creation (when it is supported)
result = result._overwrite_indexes(result_indexes)
Copy link
Member

@benbovy benbovy Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing those lines doesn't break any existing test because the line above result[dim] = index_vars[dim] actually re-creates a default PandasIndex when assigning the new dim variable. However, this unnecessarily re-creates a new index (or re-wrap an existing one) and this may not work in the future if we allow passing a custom xarray index as dim argument to concat.

It would be better to explicitly add both index and index_vars to result. Best way would be to assign them to result_indexes and coord_vars respectively before constructing the Coordinates object and then the result object, unless there are cases where result.drop_vars(unlabeled_dims) would delete the index coordinate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benbovy. I think your comment might explain the behaviour I just noticed in zarr-developers/VirtualiZarr#18 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've addressed your comment now @benbovy

@TomNicholas
Copy link
Member Author

TomNicholas commented Mar 28, 2024

#8884 is now merged into this PR but all that should do is make the MergeError I'm currently seeing get emitted slightly earlier.

EDIT: Since reverted in light of #8886

@@ -978,6 +979,34 @@ def test_concat_str_dtype(self, dtype, dim) -> None:

assert np.issubdtype(actual.x2.dtype, dtype)

def test_concat_avoids_index_auto_creation(self) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to myself: The reason this test didn't catch the problem described in zarr-developers/VirtualiZarr#18 (comment) is because this test checks that concatenating datasets that start without indexes stay without indexes, whereas that problem is from concatenating datasets with indexes but having the coordinate variables be silently replaced by IndexVariable objects created from the index data.

@TomNicholas
Copy link
Member Author

The groupby test is still failing

FAILED xarray/tests/test_groupby.py::test_groupby_drops_nans - ValueError: no coordinate variables found for these indexes: {'id'}

This is weird, because it seems groupby creates a call to concat (on line 1861 of groupby.py) in which it attempts to concatenate along dim='id', but id exists as an index and as a data variable on the inputs, but not as a coordinate variable. The Coordinates constructor I added inside concat then complains about this situation. I can't tell whether this is a bug in groupby's input to concat that I've exposed (i.e. it should have made 'id' a coord), or an edge case for concat to handle.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't spot anything important, either.

It looks like the parameter type spec nit has been undone, though.

xarray/core/concat.py Outdated Show resolved Hide resolved
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
xarray/core/dataarray.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@TomNicholas TomNicholas enabled auto-merge (squash) May 8, 2024 19:11
@TomNicholas TomNicholas merged commit 6057128 into pydata:main May 8, 2024
28 checks passed
@TomNicholas
Copy link
Member Author

Thank you both! ❤️

@TomNicholas TomNicholas deleted the concat-avoid-index-auto-creation branch May 8, 2024 19:39
andersy005 pushed a commit that referenced this pull request May 10, 2024
* test not creating indexes on concatenation

* construct result dataset using Coordinates object with indexes passed explicitly

* remove unnecessary overwriting of indexes

* ConcatenatableArray class

* use ConcatenableArray in tests

* add regression tests

* fix by performing check

* refactor assert_valid_explicit_coords and rename dims->sizes

* Revert "add regression tests"

This reverts commit beb665a.

* Revert "fix by performing check"

This reverts commit 22f361d.

* Revert "refactor assert_valid_explicit_coords and rename dims->sizes"

This reverts commit 55166fc.

* fix failing test

* possible fix for failing groupby test

* Revert "possible fix for failing groupby test"

This reverts commit 6e9ead6.

* test expand_dims doesn't create Index

* add option to not create 1D index in expand_dims

* refactor tests to consider data variables and coordinate variables separately

* test expand_dims doesn't create Index

* add option to not create 1D index in expand_dims

* refactor tests to consider data variables and coordinate variables separately

* fix bug causing new test to fail

* test index auto-creation when iterable passed as new coordinate values

* make test for iterable pass

* added kwarg to dataarray

* whatsnew

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "refactor tests to consider data variables and coordinate variables separately"

This reverts commit ba5627e.

* Revert "add option to not create 1D index in expand_dims"

This reverts commit 95d453c.

* test that concat doesn't raise if create_1d_index=False

* make test pass by passing create_1d_index down through concat

* assert that an UnexpectedDataAccess error is raised when create_1d_index=True

* eliminate possibility of xarray internals bypassing UnexpectedDataAccess error by accessing .array

* update tests to use private versions of assertions

* create_1d_index->create_index

* Update doc/whats-new.rst

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* Rename create_1d_index -> create_index

* fix ConcatenatableArray

* formatting

* whatsnew

* add new create_index kwarg to overloads

* split vars into data_vars and coord_vars in one loop

* avoid mypy error by using new variable name

* warn if create_index=True but no index created because dimension variable was a data var not a coord

* add string marks in warning message

* regression test for dtype changing in to_stacked_array

* correct doctest

* Remove outdated comment

* test we can skip creation of indexes during shape promotion

* make shape promotion test pass

* point to issue in whatsnew

* don't create dimension coordinates just to drop them at the end

* Remove ToDo about not using Coordinates object to pass indexes

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* get rid of unlabeled_dims variable entirely

* move ConcatenatableArray and similar to new file

* formatting nit

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* renamed create_index -> create_index_for_new_dim in concat

* renamed create_index -> create_index_for_new_dim in expand_dims

* fix incorrect arg name

* add example to docstring

* add example of using new kwarg to docstring of expand_dims

* add example of using new kwarg to docstring of concat

* re-nit the nit

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

* more instances of the nit

* fix docstring doctest formatting nit

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
andersy005 added a commit that referenced this pull request May 10, 2024
* main:
  Avoid auto creation of indexes in concat (#8872)
  Fix benchmark CI (#9013)
  Avoid extra read from disk when creating Pandas Index. (#8893)
  Add a benchmark to monitor performance for large dataset indexing (#9012)
  Zarr: Optimize `region="auto"` detection (#8997)
  Trigger CI only if code files are modified. (#9006)
  Fix for ruff 0.4.3 (#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (#9004)
  Speed up localize (#8536)
  Simplify fast path (#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (#5733) (#8991)
  Fix syntax error in test related to cupy (#9000)
andersy005 added a commit to hmaarrfk/xarray that referenced this pull request May 10, 2024
* backend-indexing:
  Trigger CI only if code files are modified. (pydata#9006)
  Enable explicit use of key tuples (instead of *Indexer objects) in indexing adapters and explicitly indexed arrays (pydata#8870)
  add `.oindex` and `.vindex` to `BackendArray` (pydata#8885)
  temporary enable CI triggers on feature branch
  Avoid auto creation of indexes in concat (pydata#8872)
  Fix benchmark CI (pydata#9013)
  Avoid extra read from disk when creating Pandas Index. (pydata#8893)
  Add a benchmark to monitor performance for large dataset indexing (pydata#9012)
  Zarr: Optimize `region="auto"` detection (pydata#8997)
  Trigger CI only if code files are modified. (pydata#9006)
  Fix for ruff 0.4.3 (pydata#9007)
  Port negative frequency fix for `pandas.date_range` to `cftime_range` (pydata#8999)
  Bump codecov/codecov-action from 4.3.0 to 4.3.1 in the actions group (pydata#9004)
  Speed up localize (pydata#8536)
  Simplify fast path (pydata#9001)
  Add argument check_dims to assert_allclose to allow transposed inputs (pydata#5733) (pydata#8991)
  Fix syntax error in test related to cupy (pydata#9000)
jsignell added a commit to jsignell/VirtualiZarr that referenced this pull request May 10, 2024
TomNicholas pushed a commit to zarr-developers/VirtualiZarr that referenced this pull request May 10, 2024
* Install xarray from main now that [#8872](pydata/xarray#8872) has merged

* Remove note in docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-combine combine/concat/merge
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants