Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

Open
delgadom opened this issue Dec 29, 2021 · 15 comments · Fixed by #6126
Open

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

delgadom opened this issue Dec 29, 2021 · 15 comments · Fixed by #6126

Comments

@delgadom
Copy link
Contributor

delgadom commented Dec 29, 2021

Throwing this out there - happy to be shot down if people are opposed.

Current behavior / griping

Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        return bool(self.data_vars)

This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.

Currently, the behavior of Dataset.__bool__ is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.

In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see #6122 and #6123.

This has been discussed before - see #4290. This discussion focused on the question "should bool(xr.Dataset({'a': False})) return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.

Proposed behavior

I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.

In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:

In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))

~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1532     @final
   1533     def __nonzero__(self):
-> 1534         raise ValueError(
   1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I'll eat my hat :)

Implementation

This could be as simple as raising an error on ds.__bool__, something like:

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        raise ValueError(
            "The truth value of a Dataset is ambiguous. Reduce the data "
            "variables to a scalar value with any(ds.values()) or "
            "all(ds.values())."
        )

The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:

        assert bool(ds)

This would need to be changed to:

        with pytest.raises(ValueError):
            bool(ds)

If this sounds good, I can submit a PR with these changes.

@Illviljan
Copy link
Contributor

A Dataset is more similar to a dict or pd.DataFrame.

DataFrame has a similar error, same cooks I suppose:

bool(pd.DataFrame())
*** ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

bool(pd.DataFrame([[0, 2], [0, 4]], columns=['A', 'B']))
*** ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

dict works the same way as a dataset, if something exist in it is True:

bool({})
False

bool({'a': False})
True

I see "if not empty do x"-checks all the time with dicts in python code,
Is it that strange to follow the behavior of dict?

@max-sixty
Copy link
Collaborator

I definitely empathize with the tradeoff here. That you found xarray's test's were making this error is fairly damning.

But the biggest impediment to changing this behavior is that Dataset follows the Mapping protocol, which has this behavior. One nice feature of xarray is that we follow python's protocols where possible, and that includes truthiness for dict-like / Mapping objects. Notably pd.DataFrame objects only partially implement the protocol, including .values.

If there's a synthesis of keeping the truthiness while reducing the chance of these mistakes, that would be very welcome.

I'm not sure this is an improvement, but in the example converting to a DataArray gets away from the truthiness issue: (result.min(...) >= 1.5).to_array().all()

(I wrote this before seeing @Illviljan 's response, which is very similar)

@delgadom
Copy link
Contributor Author

Yeah… I do understand how it’s currently working and why, and the behavior is certainly intuitive to those who appreciate the mapping inheritance.

That said, I feel I have to make a last stand argument because this trips people up quite often (on my team and elsewhere). I haven’t yet come across an example of anyone using this correctly, but I see users misusing it all the time.

The examples and behavior you’re showing @Illviljan seem to me like more the natural result of an implementation detail than a critical principle of the dataset design. While it’s obvious why bool(ds) resolves to True or False in the current implementation, I still think it’s not used and is out of step with how most users think of a dataset. The fact that nothing in the package requires a change in order to pass tests (the broken tests and the pathological case aside) reaffirms for me that this is not a useful feature.

I don’t know much about the mapping protocol or how closely it must be followed. Is the idea here that packages building on xarray (or interoperability features in e.g. numpy or dask) depend on a strict adherence to the full spec?

@delgadom
Copy link
Contributor Author

If there's a synthesis of keeping the truthiness while reducing the chance of these mistakes, that would be very welcome.

@max-sixty im not sure what this would look like. Do you mean a warning or are you hinting that the bar that would need to be met is a silver bullet that preserves bool(ds) but somehow isn’t confusing?

@shoyer
Copy link
Member

shoyer commented Dec 29, 2021

The original intention here was definitely to honor the Mapping protocol exactly, but I agree that bool() is hard to use correctly on Dataset objects. I would support deprecating this.

@delgadom delgadom mentioned this issue Dec 29, 2021
3 tasks
@delgadom
Copy link
Contributor Author

I realize this may be a larger discussion, but the implementation is so easy I went ahead and filed a PR that issues a PendingDeprecationWarning in Dataset.__bool__.

@max-sixty
Copy link
Collaborator

max-sixty commented Jan 1, 2022

Would this mean that Dataset doesn't inherit from Mapping (in the end-state, after deprecation)? Or that we inherit from Mapping but override bool to raise an error?

@shoyer
Copy link
Member

shoyer commented Jan 4, 2022

@dcherian So I guess we decided to go for it? :)

@max-sixty We can still inherit from Mapping and just explicitly raise TypeError inside Dataset.__bool__ rather than using the default implementation. This is certainly a slight violation from the Mapping protocol, but xarray.Dataset is already a little inconsistent (e.g., we override __eq__ inconsistently with the default mapping behavior). Overall I think it's probably a worthwhile the more user-friendly choice?

@dcherian
Copy link
Contributor

dcherian commented Jan 4, 2022

Sorry! I thought we had consensus but perhaps not?

Shall we revert? That said it's PendingDeprecationWarning so it should only be raised in tests, which seems OK?

@dcherian dcherian reopened this Jan 4, 2022
@max-sixty
Copy link
Collaborator

TBC, I am fine merging! I would lightly vote against it, but weigh my vote far below @shoyer 's.

And more broadly let's not wait for everyone to agree on everything!

@Illviljan
Copy link
Contributor

I do wonder at what point a mapping isn't a mapping anymore?

For example DataFrames aren't considered mappings:

isinstance(df, collections.abc.Mapping)
Out[4]: False

And if we are to follow pandas example maybe we should just remove the Mapping inheritance?

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):

@shoyer
Copy link
Member

shoyer commented Jan 5, 2022

After discussing this a little more, I am on the fence about whether this change this is a good idea. If we can't come to concensus about expected behavior, then that is probably an indication that we should leave things the same to avoid churn.

@shoyer
Copy link
Member

shoyer commented Jan 5, 2022

I made a Twitter poll, let's see what that says 😄

https://twitter.com/xarray_dev/status/1478776987925684224?s=20

@dopplershift
Copy link
Contributor

$0.02 from the peanut gallery is that my mental model of Dataset is that it's a dictionary on steroids, so following the Mapping protocol (including __bool__) makes sense to me. I put "I don't know" for @shoyer's poll, but if the question was "should" then I would have said "number of variables".

While I'm not going to sit here and argue that if ds: is a common operation and important feature (I don't often open files that result in empty datasets), I'd personally want a truly compelling argument for breaking from the Mapping protocol. Right now it's "Dataset works like dict plus some stuff". It would change to be "Dataset works like dict for all things except __bool__, plus some stuff". The latter to me requires more mental effort since I'm no longer strictly supplementing past experience.

@jhamman
Copy link
Member

jhamman commented Jan 8, 2022

I'm also late to the party but I would say I fall squarely in the Dataset is a dict-like camp. If we remove __bool__, should we also remove __len__? Basically, everything @dopplershift aligns with my perspective here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants