bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

delgadom · 2021-12-29T02:35:39Z

Throwing this out there - happy to be shot down if people are opposed.

Current behavior / griping

Currently, coercing a dataset to a boolean invokes ds.__bool__, which in turn calls bool(ds.data_vars):

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        return bool(self.data_vars)

This has the unfortunate property of returning True as long as there is at least one data_variable, regardless of the contents.

Currently, the behavior of Dataset.__bool__ is, at least as far as I've seen, never helpful but frequently unhelpful. I've seen (and written) tests written for DataArrays being passed a Dataset and suddenly the tests are meaningless so many times. Conversely, I've never found a legitimate use case for bool(ds). As far as I can tell, this is essentially the same as len(ds.data_vars) > 0.

In fact, while testing out my proposed changes below on a fork, I found two tests in the xarray test suite that had succumbed to this issue: see #6122 and #6123.

This has been discussed before - see #4290. This discussion focused on the question "should bool(xr.Dataset({'a': False})) return False?". I agree that it's not clear when it should be false and picking a behavior which deviates from Mapping feels arbitrary and gross.

Proposed behavior

I'm proposing that the API be changed, so that bool(xr.Dataset({'a': False})) raise an error, similar to the implementation in pd.Series.

In this implementation in pandas, attempting to evaluate even a single-element series as a boolean raises an error:

In [14]: bool(pd.Series([False]))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-b0ad7f4d9277> in <module>
----> 1 bool(pd.Series([False]))

~/miniconda3/envs/rhodium-env/lib/python3.9/site-packages/pandas/core/generic.py in __nonzero__(self)
   1532     @final
   1533     def __nonzero__(self):
-> 1534         raise ValueError(
   1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I understand hesitancy around changing the core API. That said, if anyone can find an important, correct use of bool(ds) in the wild I'll eat my hat :)

Implementation

This could be as simple as raising an error on ds.__bool__, something like:

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
    ...
    def __bool__(self) -> bool:
        raise ValueError(
            "The truth value of a Dataset is ambiguous. Reduce the data "
            "variables to a scalar value with any(ds.values()) or "
            "all(ds.values())."
        )

The only other change that would be needed is an assertion that directly calls bool(ds) in test_dataset::TestDataset.test_properties, which checks for the exact behavior I'm changing:

        assert bool(ds)

This would need to be changed to:

        with pytest.raises(ValueError):
            bool(ds)

If this sounds good, I can submit a PR with these changes.

The text was updated successfully, but these errors were encountered:

Illviljan · 2021-12-29T07:49:07Z

A Dataset is more similar to a dict or pd.DataFrame.

DataFrame has a similar error, same cooks I suppose:

bool(pd.DataFrame())
*** ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

bool(pd.DataFrame([[0, 2], [0, 4]], columns=['A', 'B']))
*** ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

dict works the same way as a dataset, if something exist in it is True:

bool({})
False

bool({'a': False})
True

I see "if not empty do x"-checks all the time with dicts in python code,
Is it that strange to follow the behavior of dict?

max-sixty · 2021-12-29T10:53:53Z

I definitely empathize with the tradeoff here. That you found xarray's test's were making this error is fairly damning.

But the biggest impediment to changing this behavior is that Dataset follows the Mapping protocol, which has this behavior. One nice feature of xarray is that we follow python's protocols where possible, and that includes truthiness for dict-like / Mapping objects. Notably pd.DataFrame objects only partially implement the protocol, including .values.

If there's a synthesis of keeping the truthiness while reducing the chance of these mistakes, that would be very welcome.

I'm not sure this is an improvement, but in the example converting to a DataArray gets away from the truthiness issue: (result.min(...) >= 1.5).to_array().all()

(I wrote this before seeing @Illviljan 's response, which is very similar)

delgadom · 2021-12-29T16:18:35Z

Yeah… I do understand how it’s currently working and why, and the behavior is certainly intuitive to those who appreciate the mapping inheritance.

That said, I feel I have to make a last stand argument because this trips people up quite often (on my team and elsewhere). I haven’t yet come across an example of anyone using this correctly, but I see users misusing it all the time.

The examples and behavior you’re showing @Illviljan seem to me like more the natural result of an implementation detail than a critical principle of the dataset design. While it’s obvious why bool(ds) resolves to True or False in the current implementation, I still think it’s not used and is out of step with how most users think of a dataset. The fact that nothing in the package requires a change in order to pass tests (the broken tests and the pathological case aside) reaffirms for me that this is not a useful feature.

I don’t know much about the mapping protocol or how closely it must be followed. Is the idea here that packages building on xarray (or interoperability features in e.g. numpy or dask) depend on a strict adherence to the full spec?

delgadom · 2021-12-29T16:22:10Z

If there's a synthesis of keeping the truthiness while reducing the chance of these mistakes, that would be very welcome.

@max-sixty im not sure what this would look like. Do you mean a warning or are you hinting that the bar that would need to be met is a silver bullet that preserves bool(ds) but somehow isn’t confusing?

shoyer · 2021-12-29T19:30:29Z

The original intention here was definitely to honor the Mapping protocol exactly, but I agree that bool() is hard to use correctly on Dataset objects. I would support deprecating this.

delgadom · 2021-12-29T21:09:12Z

I realize this may be a larger discussion, but the implementation is so easy I went ahead and filed a PR that issues a PendingDeprecationWarning in Dataset.__bool__.

max-sixty · 2022-01-01T01:50:07Z

Would this mean that Dataset doesn't inherit from Mapping (in the end-state, after deprecation)? Or that we inherit from Mapping but override bool to raise an error?

shoyer · 2022-01-04T01:59:46Z

@dcherian So I guess we decided to go for it? :)

@max-sixty We can still inherit from Mapping and just explicitly raise TypeError inside Dataset.__bool__ rather than using the default implementation. This is certainly a slight violation from the Mapping protocol, but xarray.Dataset is already a little inconsistent (e.g., we override __eq__ inconsistently with the default mapping behavior). Overall I think it's probably a worthwhile the more user-friendly choice?

dcherian · 2022-01-04T02:09:11Z

Sorry! I thought we had consensus but perhaps not?

Shall we revert? That said it's PendingDeprecationWarning so it should only be raised in tests, which seems OK?

max-sixty · 2022-01-04T09:58:51Z

TBC, I am fine merging! I would lightly vote against it, but weigh my vote far below @shoyer 's.

And more broadly let's not wait for everyone to agree on everything!

Illviljan · 2022-01-04T12:04:42Z

I do wonder at what point a mapping isn't a mapping anymore?

For example DataFrames aren't considered mappings:

isinstance(df, collections.abc.Mapping)
Out[4]: False

And if we are to follow pandas example maybe we should just remove the Mapping inheritance?

xarray/xarray/core/dataset.py

Line 584 in 60754fd

class Dataset(DataWithCoords, DatasetArithmetic, Mapping):

shoyer · 2022-01-05T17:05:14Z

After discussing this a little more, I am on the fence about whether this change this is a good idea. If we can't come to concensus about expected behavior, then that is probably an indication that we should leave things the same to avoid churn.

shoyer · 2022-01-05T17:16:56Z

I made a Twitter poll, let's see what that says 😄

https://twitter.com/xarray_dev/status/1478776987925684224?s=20

dopplershift · 2022-01-07T21:16:20Z

$0.02 from the peanut gallery is that my mental model of Dataset is that it's a dictionary on steroids, so following the Mapping protocol (including __bool__) makes sense to me. I put "I don't know" for @shoyer's poll, but if the question was "should" then I would have said "number of variables".

While I'm not going to sit here and argue that if ds: is a common operation and important feature (I don't often open files that result in empty datasets), I'd personally want a truly compelling argument for breaking from the Mapping protocol. Right now it's "Dataset works like dict plus some stuff". It would change to be "Dataset works like dict for all things except __bool__, plus some stuff". The latter to me requires more mental effort since I'm no longer strictly supplementing past experience.

jhamman · 2022-01-08T00:18:34Z

I'm also late to the party but I would say I fall squarely in the Dataset is a dict-like camp. If we remove __bool__, should we also remove __len__? Basically, everything @dopplershift aligns with my perspective here.

delgadom mentioned this issue Dec 29, 2021

Deprecate bool(ds) #6126

Merged

3 tasks

dcherian closed this as completed in #6126 Jan 3, 2022

dcherian reopened this Jan 4, 2022

headtr1ck added topic-error reporting needs discussion labels Mar 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

delgadom commented Dec 29, 2021 •

edited by max-sixty

Loading

Illviljan commented Dec 29, 2021

max-sixty commented Dec 29, 2021

delgadom commented Dec 29, 2021

delgadom commented Dec 29, 2021

shoyer commented Dec 29, 2021

delgadom commented Dec 29, 2021

max-sixty commented Jan 1, 2022 •

edited

Loading

shoyer commented Jan 4, 2022

dcherian commented Jan 4, 2022

max-sixty commented Jan 4, 2022

Illviljan commented Jan 4, 2022

shoyer commented Jan 5, 2022

shoyer commented Jan 5, 2022

dopplershift commented Jan 7, 2022

jhamman commented Jan 8, 2022

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

bool(ds) should raise a "the truth value of a Dataset is ambiguous" error #6124

Comments

delgadom commented Dec 29, 2021 • edited by max-sixty Loading

Current behavior / griping

Proposed behavior

Implementation

Illviljan commented Dec 29, 2021

max-sixty commented Dec 29, 2021

delgadom commented Dec 29, 2021

delgadom commented Dec 29, 2021

shoyer commented Dec 29, 2021

delgadom commented Dec 29, 2021

max-sixty commented Jan 1, 2022 • edited Loading

shoyer commented Jan 4, 2022

dcherian commented Jan 4, 2022

max-sixty commented Jan 4, 2022

Illviljan commented Jan 4, 2022

shoyer commented Jan 5, 2022

shoyer commented Jan 5, 2022

dopplershift commented Jan 7, 2022

jhamman commented Jan 8, 2022

delgadom commented Dec 29, 2021 •

edited by max-sixty

Loading

max-sixty commented Jan 1, 2022 •

edited

Loading