Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug when querying unnamed dataarray #5493

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jun 18, 2021

There might be a slightly neater way to do this, but this works.

Comment on lines 4493 to 4494
name = _THIS_ARRAY if self.name is None else self.name
ds = self._to_dataset_whole(name=name, shallow_copy=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why we don't just use _to_temp_dataset / _from_temp_dataset regardless of the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_to_dataset_whole is just the same thing as _to_temp_dataset except it accepts the shallow_copy argument, which is was already being used here.

Using _to_dataset_whole with name=None will fail (so I can't just pass self.name because that will be None for an unnamed dataarray). Using _to_dataset_whole without passing a name fails in the same way.

Then trying to use _from_temp_dataset fails with

    def _from_temp_dataset(
        self, dataset: Dataset, name: Union[Hashable, None, Default] = _default
    ) -> "DataArray":
>       variable = dataset._variables.pop(_THIS_ARRAY)
E       KeyError: <this-array>

so I think it will fail for named dataarrays?

Copy link
Collaborator

@keewis keewis Jun 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the shallow_copy=True as this is a temporary dataset (and it's the usual pattern for delegating to the Dataset implementation). No need to use the name parameter for _from_temp_dataset, either, as it's being restored from self.

Edit: actually, the failure for _from_temp_dataset fails because of the name parameter to _to_dataset_whole. In total, I'd suggest to use the same pattern as in sel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to use the same pattern as in sel

That doesn't work for query because query needs the name of the dataarray to be a variable on the temporary dataset, otherwise it can't evaluate queries involving the dataarray name (such as x="a > 5" where da.name = 'a'). The tests will fail with this error if you try that:

>               raise UndefinedVariableError(key, is_local) from err
E               pandas.core.computation.ops.UndefinedVariableError: name 'a' is not defined

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically in order for query to work you have to provide a name to construct the temporary dataset with, which rules out using _to_temp_dataset.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed. However, if I had to choose between "self" for unnamed and the name for named DataArray objects or always "self" I'd choose the latter because it's easier to use (and the implementation would be easier)? Not sure if we should try to support both (i.e. allow referencing by name and using "self")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. What I just wrote supports both individually: You use
the name when it's named, and self when it isn't. But if it's named you
can't use self. I suppose you could support using self even when it's named
by examining the query and replacing parts of it before giving it to
ds.query?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I added support for using 'self' even when the dataarray has a name, by using .replace(). It works, but makes the implementation more complicated again.

Copy link
Member Author

@TomNicholas TomNicholas Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to slightly simplify this would be to always rename all dataarray names to 'self' before querying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break if there's a non-dimensional coordinate variable named self won't it? I'm not sure there's good solution though, perhaps ... :D

@max-sixty max-sixty mentioned this pull request Jun 21, 2021
5 tasks
"""

ds = self._to_dataset_whole(shallow_copy=True)
ds = ds.query(
if self.name is None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line causes a mypy error

xarray/core/dataarray.py:4510: error: Incompatible types in assignment (expression has type "Hashable", variable has type "str")  [assignment]
Found 1 error in 1 file (checked 142 source files)

@TomNicholas TomNicholas mentioned this pull request Jul 8, 2021
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

query not possible on unnamed DataArray
4 participants