Add set_xindex and drop_indexes methods #6971
It allows passing options to the constructor of a custom index class (if any). The **options arguments of Dataset.set_xindex() are passed through. Also add type annotations to set_xindex().
BTW, viewing pull-request doc builds on RTD seems broken? Clicking on the "Details" link of the corresponding check leads to a 404.
I'm super excited to see this in - then we get to finally play with the new indexes (if I understand this correctly).
# coordinates do not conflict), but let's not allow this for now
indexed_coords = set(coord_names) & set(self._indexes)

if indexed_coords:
Does this mean that you cannot use coords in more than one index? (I am not sure how important this is, but I could imagine a use case where lat & lon are used both as 1D indexes and in a KDTree.)
Yes, that's right. Allowing multiple indexes per coordinate would make many things much harder.
There are indeed some examples (like the one you mention) where it could be useful to have multiple indexes. But I think it could be solved by either switching between indexes (if building them is not too expensive) or via a custom "meta-index" that would encapsulate both kinds of indexes.
Fair enough - thanks for the clarification!
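For readers following along, the "one index per coordinate" rule discussed above can be sketched in plain Python. This is only an illustration of the set-intersection check in the diff excerpt, not the code in this PR: `find_already_indexed`, `check_can_set_index`, the error message, and the `existing` mapping are hypothetical names and values.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable, Mapping


def find_already_indexed(
    coord_names: Iterable[Hashable],
    indexes: Mapping[Hashable, object],
) -> set[Hashable]:
    """Return the requested coordinate names that already have an index."""
    return set(coord_names) & set(indexes)


def check_can_set_index(
    coord_names: Iterable[Hashable],
    indexes: Mapping[Hashable, object],
) -> None:
    """Hypothetical helper: raise if any requested coordinate is already indexed."""
    indexed_coords = find_already_indexed(coord_names, indexes)
    if indexed_coords:
        names = ", ".join(sorted(map(str, indexed_coords)))
        raise ValueError(f"those coordinates already have an index: {names}")


# Stand-in for a dataset's coordinate-name -> index mapping (values are placeholders).
existing = {"lat": "PandasIndex", "lon": "PandasIndex"}
conflicts = find_already_indexed(["lat", "lon", "time"], existing)
```

Under this rule, building a KDTree over `lat` and `lon` would first require dropping their existing 1D indexes, which matches the "switch between indexes" workaround suggested above.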
Try setting a pandas (multi-)index by default.
I ended up doing it. It is convenient for setting a pandas index for a non-dimension coordinate, which is currently not possible to do with |
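Have you thought about whether we might want to expose a separate public xarray.indexes namespace?
Yes I've been thinking about it and I agree I find it cleaner than exposing all of this in Xarray's main namespace. There are a few (minor) cons, though:
- I think the indexes.py and indexing.py modules and their content are well located in core
- We could create a xarray/indexes/__init__.py and import there a few "public" classes from core, but is it worth it? I'm not sure if the number of Xarray built-in indexes will grow much beyond PandasIndex and PandasMultiIndex. Perhaps it's preferable not?
- Things like CFTimeIndex are already imported in Xarray's main namespace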
I personally would still choose to put indexes stuff in a separate
namespace, just because it's neater, but I can see it's borderline.
…On Wed, 14 Sep 2022, 06:33 Benoit Bovy, ***@***.***> wrote:
Have you thought about whether we might want to expose a separate public
xarray.indexes namespace?
Yes I've been thinking about it and I agree I find it cleaner than
exposing all of this in Xarray's main namespace. There's a few (minor)
cons, though:
- I think the indexes.py and indexing.py modules and their content are
well located in core
- We could create a xarray/indexes/__init__.py and import there a few
"public" classes from core, but is it worth it? I'm not sure if the
number of Xarray built-in indexes will grow much beyond PandasIndex
and PandasMultiIndex. Perhaps it's preferable not?
- Things like CFTimeIndex are already imported in Xarray's main
namespace
xarray/core/dataset.py
coord_names: Hashable | Sequence[Hashable],
index_cls: type[Index] | None = None,
**options,
) -> Dataset:
- ) -> Dataset:
+ ) -> T_Dataset:
mypy is not happy with this:
xarray/tests/test_dataset.py:3307: error: Argument 1 to "set_xindex" of "Dataset" has incompatible type "List[str]"; expected "Hashable" [arg-type]
xarray/tests/test_dataset.py:3307: note: Following member(s) of "List[str]" have conflicts:
xarray/tests/test_dataset.py:3307: note: __hash__: expected "Callable[[], int]", got "None"
xarray/tests/test_dataset.py:3307: note: Protocol member Hashable.__hash__ expected instance variable, got class variable
strings are sequences apparently:
isinstance("str", typing.Sequence)
Out[63]: True
Try out CoordNames = Union[str, Iterable[Hashable]], which seems to be successful in #7048.
It would be nice if we aligned these tricky types, so try to use named variables for repeated arguments.
Technically a str is also an Iterable of Hashable :P
But the typing community is quite relaxed about violating that fact.
So as long as you don't need the two types to be "perpendicular" it should work.
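To make the overlap concrete, here is a small standard-library sketch of why `Hashable | Sequence[Hashable]` is awkward and how a `str | Iterable[Hashable]` argument is typically normalized at runtime. `normalize_coord_names` is a hypothetical helper written for this illustration, not a function from Xarray.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable, Sequence

# A str is hashable, a sequence, and an iterable of one-character strings,
# so it matches both sides of "Hashable | Sequence[Hashable]".
str_is_hashable = isinstance("foo", Hashable)
str_is_sequence = isinstance("foo", Sequence)

# A list is a sequence but not hashable (its __hash__ is None), which is
# what the mypy error about "List[str]" vs "Hashable" points at.
list_is_hashable = isinstance(["foo", "bar"], Hashable)


def normalize_coord_names(
    coord_names: str | Iterable[Hashable],
) -> tuple[Hashable, ...]:
    """Hypothetical helper: treat a lone str as one name, not as characters."""
    if isinstance(coord_names, str):
        return (coord_names,)
    return tuple(coord_names)
```

The explicit `isinstance(..., str)` branch is what keeps `"foo"` from being split into `("f", "o", "o")`, since the type annotation alone cannot exclude that reading.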
+1 for using a named variable like CoordNames. The tricky thing here is that the order is important. Do we use Sequence in Xarray in that case? I guess we would need to define two variables for each case where the order does / doesn't matter?
Also, I don't remember whether a single coordinate name should be str or Hashable. Should we treat it like a single dimension name or not?
I feel like this issue should be addressed more globally in Xarray than within the scope of this PR. Perhaps better to move on and merge this PR before the next release?
Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.
Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.
Probably not in all cases? For example, with DataArray.__init__(..., dims: str | Iterable[Hashable]) the type checker would allow passing a set. Recently I had to figure out what was going on with xr.DataArray(data=np.zeros((10, 5)), dims={'x', 'time'}), which mypy should actually catch with Sequence[Hashable]. Slightly off-topic: should we have two variables Dims and OrderedDims defined in xarray.core.types?
Same issue here for coordinate names. str | Sequence[Hashable] seems to work well, though.
Why should a set not be allowed?
Hasn't the order been preserved for quite some time now? I think all built-in Iterables have conserved order, and internally we convert to tuple anyway?
I don't think the order is preserved for sets (unlike dicts). This is what I can get with CPython 3.9 / Xarray v2022.6.0:
import numpy as np
import xarray as xr
print(xr.DataArray(data=np.zeros((2, 3)), dims={'x', 'time'}))
# <xarray.DataArray (time: 2, x: 3)>
# array([[0., 0., 0.],
# [0., 0., 0.]])
# Dimensions without coordinates: time, x
tuple({'x', 'time'})
# ('time', 'x')
Whoops you are right, that was dicts.
Then indeed we need to distinguish between dims and ordered dims.
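The distinction between "dims" and "ordered dims" can be sketched with the standard library alone. The aliases below reuse the names floated in this thread, but their definitions here are an assumption made for illustration, and `is_ordered` is a hypothetical runtime check.

```python
from __future__ import annotations

from collections.abc import Hashable, Iterable, Sequence
from typing import Union

# Hypothetical aliases: the names come from the discussion, the definitions do not.
Dims = Union[str, Iterable[Hashable]]
OrderedDims = Union[str, Sequence[Hashable]]

names = ["x", "time"]

# Dicts keep insertion order (guaranteed since Python 3.7) ...
from_dict = tuple(dict.fromkeys(names))
# ... but sets make no ordering promise, so this tuple's order is arbitrary.
from_set = tuple(set(names))

# A set is iterable but not a Sequence, so a Sequence-based annotation
# lets a type checker reject it where order matters.
set_is_sequence = isinstance(set(names), Sequence)


def is_ordered(dims: Dims) -> bool:
    """Hypothetical runtime check mirroring the OrderedDims alias."""
    return isinstance(dims, (str, Sequence))
```

With such a split, an argument like `dims` in a constructor would take the ordered alias, while arguments that only need membership (for example, a set of names to drop) could keep the looser one.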
In the last commit I added the
Thanks everyone for the feedback and review! I think this is ready to merge, if we agree to address the
whats-new.rst
api.rst
This PR adds Dataset and DataArray .set_xindex and .drop_indexes methods (the latter is also discussed in #4366). I've cherry-picked the relevant commits in the scipy22 branch and added a few more commits. This PR also allows passing build options to any Index.
Some comments and open questions:
- Should we make the index_cls argument of set_xindex optional? set_index(coord_names, index_cls=None, **options) where a pandas index is created by default (or a pandas multi-index if several coordinate names are given), provided that the coordinate(s) are valid 1-d candidates. [...] set_index method, but this would be convenient if we later deprecate it.
- Should we deprecate set_index and reset_index? I think we should, but probably not at this point yet.
- There's a special case for multi-indexes where set_xindex(["foo", "bar"], PandasMultiIndex) adds a dimension coordinate in addition to the "foo" and "bar" level coordinates so that it is consistent with the rest of Xarray. I find it a bit annoying, though. Probably another motivation for deprecating this dimension coordinate.
- In this PR I also imported the Index base class in Xarray's root namespace. [...] xarray.core.indexes. [...] PandasIndex and PandasMultiIndex subclasses? Maybe if one wants to create a custom index inheriting from it. PandasMultiIndex factory methods could also be useful if we deprecate passing pd.MultiIndex objects as DataArray / Dataset coordinates.
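A minimal usage sketch of the two methods described above, assuming an installed Xarray release that includes this PR and that a pandas index is created by default when no index class is given (the behavior discussed in the conversation). The dataset and its values are made up for illustration.

```python
import xarray as xr

# "x" is a dimension coordinate and gets a default index; "foo" is a
# non-dimension coordinate along "x" and starts without an index.
ds = xr.Dataset(
    coords={
        "x": ("x", [10, 20, 30]),
        "foo": ("x", ["a", "b", "c"]),
    }
)
has_index_before = "foo" in ds.xindexes

# Build an index for "foo" so it can be used for label-based selection.
indexed = ds.set_xindex("foo")
has_index_after = "foo" in indexed.xindexes
x_at_b = indexed.sel(foo="b")["x"].item()

# Drop the index again; the coordinate itself is kept.
dropped = indexed.drop_indexes("foo")
```

Both methods return a new object rather than modifying `ds` in place, so the original dataset still has no index on `foo` afterwards.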