Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add set_xindex and drop_indexes methods #6971

Merged
merged 29 commits into from
Sep 28, 2022

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Aug 31, 2022

This PR adds Dataset and DataArray .set_xindex and .drop_indexes methods (the latter is also discussed in #4366). I've cherry picked the relevant commits in the scipy22 branch and added a few more commits. This PR also allows passing build options to any Index.

Some comments and open questions:

  • Should we make the index_cls argument of set_xindex optional?

    • I.e., set_index(coord_names, index_cls=None, **options) where a pandas index is created by default (or a pandas multi-index if several coordinate names are given), provided that the coordinate(s) are valid 1-d candidates.
    • This would be redundant with the existing set_index method, but this would be convenient if we later depreciate it.
  • Should we depreciate set_index and reset_index? I think we should, but probably not at this point yet.

  • There's a special case for multi-indexes where set_xindex(["foo", "bar"], PandasMultiIndex) adds a dimension coordinate in addition to the "foo" and "bar" level coordinates so that it is consistent with the rest of Xarray. I find it a bit annoying, though. Probably another motivation for depreciating this dimension coordinate.

  • In this PR I also imported the Index base class in Xarray's root namespace.

    • It is needed for custom indexes and it's just a little more convenient than importing it from xarray.core.indexes.
    • Should we do the same for PandasIndex and PandasMultiIndex subclasses? Maybe if one wants to create a custom index inheriting from it. PandasMultiIndex factory methods could be also useful if we depreciate passing pd.MultiIndex objects as DataArray / Dataset coordinates.

@benbovy
Copy link
Member Author

benbovy commented Aug 31, 2022

Also, since .drop_indexes is new API I didn't feel the need to implement the old behavior regarding pandas multi-indexes (restored in #6592 and #6798 but deprecated anyway). @dcherian what do you think?

@benbovy
Copy link
Member Author

benbovy commented Aug 31, 2022

BTW, viewing pull-request doc builds on RTD seems broken? Clicking on the "Details" link of the corresponding check leads to a 404.

@benbovy benbovy mentioned this pull request Aug 31, 2022
49 tasks
xarray/core/dataarray.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm super excited to see this in - then we get to finally play with the new indexes (if I understand this correctly).

# coordinates do not conflict), but let's not allow this for now
indexed_coords = set(coord_names) & set(self._indexes)

if indexed_coords:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that you cannot use coords in more than one indexes? (I am not sure how important this is but could imagine a use case where lat & lon are used as 1D indexes and in a KDTree).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right, allow multiple indexes per coordinate would make many things much harder.

There are indeed some examples (like the one you mention) where it could be useful to have multiple indexes. But I think it could be solved by either switching between indexes (if building them is not too expensive) or via a custom "meta-index" that would encapsulate both kinds of indexes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough - thanks for the clarification!

Try setting a pandas (multi-)index by default.
@benbovy
Copy link
Member Author

benbovy commented Sep 7, 2022

Should we make the index_cls argument of set_xindex optional?

I ended up doing it. It is convenient for setting a pandas index for a non-dimension coordinate, which is currently not possible to do with set_index(). For unindexed dimension coordinates (e.g., now possible after renaming coordinates or dimensions), I find the syntax set_index(x="x") a bit weird compared to set_xindex("x").

xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
xarray/core/dataset.py Outdated Show resolved Hide resolved
@TomNicholas
Copy link
Member

In this PR I also imported the Index base class in Xarray's root namespace.

It is needed for custom indexes and it's just a little more convenient than importing it from xarray.core.indexes.
Should we do the same for PandasIndex and PandasMultiIndex subclasses? Maybe if one wants to create a custom index inheriting from it. PandasMultiIndex factory methods could be also useful if we depreciate passing pd.MultiIndex objects as DataArray / Dataset coordinates.

Have you thought about whether we might want to expose a separate public xarray.indexes namespace? Then as the list of helpers for creating custom index objects grows they could live in there, so we might have xarray.Index, but xarray.indexes.PandasIndex, xarray.indexes.PandasMultiIndex, xarray.indexes.PeriodicBoundaryIndex, xarray.indexes.OtherDomainAgnosticIndex etc. all listed in the docs API page ?

@benbovy
Copy link
Member Author

benbovy commented Sep 14, 2022

Have you thought about whether we might want to expose a separate public xarray.indexes namespace?

Yes I've been thinking about it and I agree I find it cleaner than exposing all of this in Xarray's main namespace. There's a few (minor) cons, though:

  • I think the indexes.py and indexing.py modules and their content are well located in core
  • We could create a xarray/indexes/__init__.py and import there a few "public" classes from core, but is it worth it? I'm not sure if the number of Xarray built-in indexes will grow much beyond PandasIndex and PandasMultiIndex. Perhaps it's preferable not?
  • Things like CFTimeIndex are already imported in Xarray's main namespace

@TomNicholas
Copy link
Member

TomNicholas commented Sep 14, 2022 via email

coord_names: Hashable | Sequence[Hashable],
index_cls: type[Index] | None = None,
**options,
) -> Dataset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Dataset:
) -> T_Dataset:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy is not happy with this:

xarray/tests/test_dataset.py:3307: error: Argument 1 to "set_xindex" of "Dataset" has incompatible type "List[str]"; expected "Hashable"  [arg-type]
xarray/tests/test_dataset.py:3307: note: Following member(s) of "List[str]" have conflicts:
xarray/tests/test_dataset.py:3307: note:     __hash__: expected "Callable[[], int]", got "None"
xarray/tests/test_dataset.py:3307: note: Protocol member Hashable.__hash__ expected instance variable, got classe variabl

#6971 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings are sequences apparently:

isinstance("str", typing.Sequence)
Out[63]: True

Try out CoordNames = Union[str, Iterable[Hashable]] seems to be succesful in #7048.
It would be nice if we aligned these tricky types so try to use named variables for repeated arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically a str is also an Iterable of Hashable :P
But the typing community is quite relaxed about violating that fact.
So as long as you don't need the two types to be "perpendicular" it should work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using a named variable like CoordNames. The tricky thing here is that the order is important. Do we use Sequence in Xarray in that case? I guess we would need to define two variables for each case where the order does / doesn't matter?

Also, I don't remember whether a single coordinate name should be str or Hashable. Should we treat it like a single dimension name or not?

I feel like this issue should be addressed more globally in Xarray than within the scope of this PR. Perhaps better to move on and merge this PR before the next release?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we try to move to str | Iterable [Hashable] for "one or more dims", and Hashable for a single dim.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.

Probably not in all cases? For example, with DataArray.__init__(..., dims: str | Iterable[Hashable]) the type checker would allow passing a set. Recently I had to figure out what was going on with xr.DataArray(data=np.zeros((10, 5)), dims={'x', 'time'}), which mypy should actually catch with Sequence[Hashable]. Slightly off-topic: should we have two variables Dims and OrderedDims defined in xarray.core.types?

Same issue here for coordinate names. str | Sequence[Hashable] seems to work well, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should a set not be allowed?
It's already since quite some time that the order is preserved? I think all built-in Iterables have conserved order, and internally we convert to tuple anyway?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the order is preserved for sets (unlike dicts). This is what I can get with CPython 3.9 / Xarray v2022.6.0:

print(xr.DataArray(data=np.zeros((2, 3)), dims={'x', 'time'}))
# <xarray.DataArray (time: 2, x: 3)>
# array([[0., 0., 0.],
#        [0., 0., 0.]])
# Dimensions without coordinates: time, x

tuple({'x', 'time'})
# ('time', 'x')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops you are right, that was dicts.
Then indeed we need to distinguish between dims and ordered dims.

xarray/core/dataarray.py Outdated Show resolved Hide resolved
@benbovy benbovy mentioned this pull request Sep 23, 2022
@benbovy
Copy link
Member Author

benbovy commented Sep 27, 2022

In the last commit I added the xarray.indexes namespace from which we can import Index, PandasIndex and PandasMultiIndex.

Thanks everyone for the feedback and review!

I think this is ready to merge, if we agree to address the coord_names typing issue in another PR?

@benbovy benbovy merged commit e678a1d into pydata:main Sep 28, 2022
@benbovy benbovy deleted the add-set-xindex-and-drop-indexes branch December 8, 2022 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Public API for setting new indexes: add a set_xindex method?
7 participants