Migrate datatree.py module into xarray.core. #8789

owenlittlejohns · 2024-02-27T22:27:16Z

This PR migrates the datatree.py module to xarray/core/datatree.py, as part of the ongoing effort to merge xarray-datatree into xarray itself.

Most of the changes are import-path changes and type hints, but there is one minor change to the methods available on the DataTree class: to_array has been renamed to to_dataarray, to align with the method on Dataset. (See conversation here)

This PR was initially published as a draft here.

Completes migration step for datatree/datatree.py (see "Track merging datatree into xarray", #8572)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

welcome · 2024-02-27T22:27:19Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

owenlittlejohns · 2024-02-27T22:35:29Z

xarray/core/datatree.py

@@ -77,7 +73,7 @@
 # """


-T_Path = Union[str, NodePath]
+T_Path = str | NodePath


Looks like mypy and the Python 3.9 tests don't like this change. I can revert it if that sounds best?

Yes that syntax won't work on 3.9 so we can't use it for now.

I think you can add the pragma at the top? from __future__ import annotations. Maybe it's technically a pragma.

I think it was already there, so I'm guessing the custom type here is not covered by __future__.annotations in the same way that type hints are?
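A minimal sketch of the distinction being guessed at here, assuming a stand-in `NodePath` class (the real one lives in xarray's treenode module). A type alias is an ordinary assignment that is evaluated at import time, so the future import cannot help it. An annotation, by contrast, is stored as a string under `from __future__ import annotations` and is not evaluated when the function is defined. The quoted annotation below imitates that behaviour by hand.

```python
from pathlib import PurePosixPath
from typing import Union


class NodePath(PurePosixPath):
    """Stand-in for xarray's NodePath, for illustration only."""


# A type alias is a normal assignment: the right-hand side runs at import
# time. Writing `str | NodePath` here raises TypeError on Python 3.9, so the
# alias has to use typing.Union.
T_Path = Union[str, NodePath]


# A string annotation is never evaluated at definition time, which is how
# every annotation behaves under `from __future__ import annotations`.
# The `|` syntax is therefore harmless here, even on Python 3.9.
def as_node_path(path: "str | NodePath") -> NodePath:
    return NodePath(path)


print(as_node_path("/group/child").name)  # child
```

This would explain why the hints elsewhere in the module were accepted while the `T_Path` alias was not.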

Okay - that looks better but still not perfect.

I'm working on the mypy issues still present, including things like annotations in the test files. I see some other things that I will try to debug tomorrow.

that's what I get for guessing without looking.

flamingbear

Just two random things I forgot to mention earlier.

owenlittlejohns · 2024-02-28T22:47:20Z

xarray/core/datatree.py

Thanks - I added these changes in: 2c5e54c

owenlittlejohns

There are still some things to work out, but I wanted to update the PR with some changes, to get some feedback and guidance.

owenlittlejohns · 2024-03-01T01:17:44Z

xarray/core/datatree.py

@@ -169,33 +166,32 @@ def update(self, other) -> NoReturn:
        )

    # FIXME https://github.com/python/mypy/issues/7328
-    @overload
-    def __getitem__(self, key: Mapping) -> Dataset:  # type: ignore[misc]
+    @overload  # type: ignore[override]


There are a few new type: ignore[override] annotations on DatasetView methods. @flamingbear and I spent a while today trying to disentangle things here.

Essentially, the issue comes down to DatasetView inheriting from Dataset, which uses typing.Self as a return type in a few places. This would mean DatasetView would have a return type of DatasetView. But, instead, these return types are explicitly a Dataset. (mypy has some examples mentioning that subclasses shouldn't have more generalised return types than their parents, which seems related - thanks @flamingbear for the reference)

I'm not sure that just ignoring the error is the best thing to do, but I don't think we had a better idea for an implementation. I think @TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.

Hmm. So I guess I have violated the Liskov substitution principle here... Does that represent an actual possible failure mode? Like are there any places where "if you can use a Dataset you can use a DatasetView" fails? I suppose passing a DatasetView into any function that attempts to call __setitem__ would violate this.
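The failure mode described above can be sketched with two hypothetical stand-in classes (`Store` and `ReadOnlyStore` are illustrative names, not xarray classes). A function written against the parent's contract assumes item assignment works, and that assumption breaks when the immutable subclass is substituted.

```python
class Store:
    """Stand-in for a mutable, mapping-like class such as Dataset."""

    def __init__(self) -> None:
        self._data = {}

    def __setitem__(self, key, value) -> None:
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]


class ReadOnlyStore(Store):
    """Stand-in for an immutable subclass such as DatasetView."""

    def __setitem__(self, key, value) -> None:
        raise AttributeError("mutation of the read-only view is not allowed")


def add_default(store: Store) -> None:
    # Written against the parent's contract: assumes item assignment works.
    store["default"] = 0


add_default(Store())  # fine

try:
    add_default(ReadOnlyStore())  # accepted by a type checker, fails at runtime
except AttributeError as err:
    print(f"substitution failed: {err}")
```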

Are there alternative designs that would give us immutability without DatasetView being a subclass of Dataset? One alternative might be something like

class FrozenDataset:
    _dataset: Dataset

    def __init__(self, ds):
        self._dataset = ds

    def __getattr__(self, name):
        # Forward the method call to the Dataset class
        if name in ['__setitem__', ...]:
            raise AttributeError("Mutation of the DatasetView is not allowed, ...")
        else:
            return getattr(self._dataset, name)

but then that would not pass an isinstance check, i.e. isinstance(dt.ds, Dataset) would return False.
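A runnable sketch of that wrapper idea, using a tiny stand-in `Dataset` rather than the real xarray class. The blocked method name (`update`) and the `_BLOCKED` attribute are chosen only for illustration. It shows both the forwarding and the `isinstance` consequence.

```python
class Dataset:
    """Tiny stand-in for xarray.Dataset, for illustration only."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def keys(self):
        return self.variables.keys()

    def update(self, other):
        self.variables.update(other)


class FrozenDataset:
    # Hypothetical set of mutating methods to block.
    _BLOCKED = frozenset({"update"})

    def __init__(self, ds):
        self._dataset = ds

    def __getattr__(self, name):
        # Only called when normal lookup fails, so `_dataset` itself is
        # found on the instance without recursing into this method.
        if name in self._BLOCKED:
            raise AttributeError(f"{name!r} is not allowed on a frozen dataset")
        return getattr(self._dataset, name)


frozen = FrozenDataset(Dataset({"temp": 1}))
print(list(frozen.keys()))          # ['temp']
print(isinstance(frozen, Dataset))  # False
```

Any caller that branches on `isinstance(obj, Dataset)` would treat the wrapper as a foreign object, which is the trade-off raised above.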

> @TomNicholas has mentioned a future step of reworking the class inheritance, but I'm not sure if that would also cover this.

I was just talking about having both Dataset and DataTree inherit from a common DataMapping class that holds variables. But I don't think that would cover this, as that DataMapping should also be returning Self.

This is much more like the Frozen class that xarray sometimes uses.

So looking at this in Owen's absence. It seems like there are two paths.

1. Use a FrozenDataset-style wrapper that fails the isinstance check?

2. Override the mypy errors, on the understanding that DatasetView is a library-internal implementation and we know not to misuse it?

The first seems like the right thing to do, but I don't know what breaks with the isinstance check or why that would be necessary.
Is there some ABC that both Dataset and DatasetView could implement to pass the isinstance check and is that a big change? (edit: I'm thinking no. Also, this may show that I'm not afraid to ask the dumb questions in public.)

I'm going to make it a wrapper, and isinstance(dt.ds, Dataset) will return False.

Coming back to this with things that have been tried (and sadly not succeeded to date):

First up, I tried to implement the suggested FrozenDataset wrapper. There was an issue with using __getattr__: it can't be used to intercept magic methods, because Python looks those up on the type and bypasses __getattr__ when they are invoked implicitly.

Next: Trying to implement a Metaclass to allow interception of magic methods (initially inspired by this snippet). This proved tricky to do (I didn't quite get it fully working) and felt very much like it was an overly complicated solution for the problem we were trying to solve.

Next: Trying a mix-in to overwrite the affected methods. I put a time-box on this attempt as we want to unblock the migration. I did not get this working in the allocated time.

Lastly: Conceding defeat and adding the type ignore statements to cover this.

Not ideal, but DatasetView and the DataTree.ds property have both been in use for the last couple of years without significant issues. I opened #8855 to capture that we need to work on a better fix at a later date.
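The first obstacle in the list above can be reproduced in a few lines. This is a sketch with hypothetical `Inner` and `Wrapper` classes, not the code that was actually tried. Explicit attribute access reaches `__getattr__`, but implicit special-method calls such as `len(w)` or `w[key]` are resolved on the type and never get there.

```python
class Inner:
    def __getitem__(self, key):
        return f"value for {key}"

    def __len__(self):
        return 3


class Wrapper:
    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Forward anything not found on the wrapper itself.
        return getattr(self._inner, name)


w = Wrapper(Inner())

# Explicit attribute access goes through __getattr__ and is forwarded.
print(w.__len__())  # 3

# Implicit use looks the method up on type(w) and skips __getattr__.
for operation in (lambda: len(w), lambda: w["temp"]):
    try:
        operation()
    except TypeError as err:
        print(f"not forwarded: {err}")
```

A wrapper therefore has to define each special method explicitly, or generate them, which is what pushes the design towards metaclasses or mix-ins.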

owenlittlejohns · 2024-03-01T01:24:30Z

xarray/core/datatree.py

@@ -636,31 +629,31 @@ def __array__(self, dtype=None):
            "invoking the `to_array()` method."
        )

-    def __repr__(self) -> str:
-        return formatting.datatree_repr(self)
+    def __repr__(self) -> str:  # type: ignore[override]


This is another annotation I wanted to call out. I think it makes sense here. NamedNode.__repr__ has an optional level kwarg, which isn't an argument in repr_datatree. repr_datatree is using RenderTree, which does have a maxLevel, but I don't think that's quite the same thing. (If it is, though, we could edit DataTree.__repr__ to match the signature of NamedNode.__repr__, and then pass it on through)
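A rough sketch of the signature mismatch, with simplified stand-in classes (the bodies are invented for illustration, and only the presence of the optional `level` argument is taken from the comment above). Dropping a parameter that the parent accepts is the kind of override that mypy is expected to report.

```python
class NamedNode:
    def __init__(self, name):
        self.name = name

    # The parent accepts an optional `level` argument.
    def __repr__(self, level=0):
        return "  " * level + f"NamedNode({self.name!r})"


class DataTree(NamedNode):
    # The child no longer accepts `level`, so a call that is valid on the
    # parent is invalid here; hence the ignore comment.
    def __repr__(self):  # type: ignore[override]
        return f"DataTree({self.name!r})"


print(NamedNode("a").__repr__(level=1))
print(repr(DataTree("root")))

try:
    DataTree("root").__repr__(level=1)
except TypeError as err:
    print(f"child rejects the parent's call: {err}")
```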

I don't think it properly maps to maxLevel.

owenlittlejohns · 2024-03-01T01:26:05Z

xarray/core/datatree.py


        if d:
            # Populate tree with children determined from data_objects mapping
            for path, data in d.items():
                # Create and set new node
                node_name = NodePath(path).name
-                if isinstance(data, cls):
+                if isinstance(data, DataTree):


So this was largely to make mypy happy, but I think it makes sense. Just using cls, I don't think mypy was realising that objects that were DataTree instances couldn't get to the else statement.
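A simplified sketch of the narrowing pattern, assuming a hypothetical `Tree` class that returns a flat list of nodes instead of building a real hierarchy. Checking against the named class gives the type checker a concrete type for each branch.

```python
from typing import Mapping, Optional, Union


class Tree:
    def __init__(self, name: Optional[str] = None, data: Optional[dict] = None) -> None:
        self.name = name
        self.data = data

    def copy(self) -> "Tree":
        return Tree(self.name, self.data)

    @classmethod
    def from_dict(cls, d: Mapping[str, Union["Tree", dict, None]]) -> list:
        nodes = []
        for path, data in d.items():
            node_name = path.rsplit("/", 1)[-1]
            if isinstance(data, Tree):
                # Checking against the named class lets a type checker treat
                # `data` as a Tree here and as `dict | None` in the else branch.
                new_node = data.copy()
                new_node.name = node_name
            else:
                new_node = cls(name=node_name, data=data)
            nodes.append(new_node)
        return nodes


nodes = Tree.from_dict({"/a": Tree(data={"x": 1}), "/a/b": {"y": 2}})
print([node.name for node in nodes])  # ['a', 'b']
```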

owenlittlejohns · 2024-03-01T01:27:17Z

xarray/core/datatree.py

@@ -1064,14 +1059,18 @@ def from_dict(

        # First create the root node
        root_data = d.pop("/", None)
-        obj = cls(name=name, data=root_data, parent=None, children=None)
+        if isinstance(root_data, DataTree):


The extra code here is a duplication of what is happening for the child nodes. mypy was unhappy with the types for the root_data, in case it was a DataTree instance.

owenlittlejohns · 2024-03-01T01:29:00Z

xarray/tests/test_datatree.py

        assert dt.name is None

    def test_bad_names(self):
        with pytest.raises(TypeError):
-            DataTree(name=5)
+            DataTree(name=5)  # type: ignore[arg-type]


I have added a few mypy annotations like this. I think they make sense, because the tests are explicitly checking what happens when you use the wrong type of argument.

Yes you always have to do this within tests that check for TypeErrors
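A dependency-free sketch of the same pattern, with a hypothetical `Node` class and a plain try/except standing in for `pytest.raises`. The call is deliberately ill-typed, so the ignore comment is scoped to the one error code the checker would report.

```python
from typing import Optional


class Node:
    def __init__(self, name: Optional[str] = None) -> None:
        if name is not None and not isinstance(name, str):
            raise TypeError("node name must be a string or None")
        self.name = name


def test_bad_names() -> None:
    try:
        # Deliberately wrong argument type: the runtime check is what is
        # under test, so the static error is silenced on this line only.
        Node(name=5)  # type: ignore[arg-type]
    except TypeError:
        return
    raise AssertionError("expected TypeError for a non-string name")


test_bad_names()
```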

Cool beans. I didn't want people to think I'd just thrown in type: ignore statements to quieten down mypy. (I'd thought about it a bit, and then thrown in type: ignore statements to quieten down mypy 😉)

owenlittlejohns · 2024-03-01T01:32:40Z

xarray/tests/test_datatree.py

-        results = DataTree(name="results", data=data)
-        xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]])
+        results: DataTree = DataTree(name="results", data=data)
+        xrt.assert_identical(results[["temp", "p"]], data[["temp", "p"]])  # type: ignore[index]


I think this is reasonable - this and test_getitem_dict_like_selection_access_to_dataset are using Iterable types (a list and a dictionary), which currently would raise exceptions (and the tests are marked as xfail). Other options I considered included writing tests that would make sure the expected assertions were raised, but it feels like the actual desired behaviour is TBD, and that the best thing to do was to leave things as they were.

owenlittlejohns · 2024-03-01T01:35:51Z

xarray/core/datatree.py

@@ -446,7 +440,7 @@ def ds(self) -> DatasetView:
        return DatasetView._from_node(self)



This is really a comment for L430:

def ds(self) -> DatasetView:

This makes sense (because of wanting an immutable version of the Dataset), but it's causing the bulk of the remaining mypy issues in the tests, where Dataset objects are being directly assigned to DataTree.ds. That said, I think I'm a little surprised this is causing an issue, given the signature below for the @ds.setter. I'd love some guidance on resolving those mypy errors!

I'm also surprised that causes a problem... Is this pattern valid typing in general for properties?

class A: ...

class B(A): ...

class C:
    @property
    def foo(self) -> B: ...

    @foo.setter
    def foo(self, value: A): ...

that's fine until you set C.foo with A.

class A: ...

class B(A): ...

class C:
    @property
    def foo(self) -> B:
        return B()

    @foo.setter
    def foo(self, value: A):
        pass

nope = C()
nope.foo = A()

output:

> mypy demo/demo7.py
demo/demo7.py:20: error: Incompatible types in assignment (expression has type "A", variable has type "B") [assignment]

And I think this is the open issue for fixing/ignoring this issue
python/mypy#3004

But it seems like it's not high priority since it's been open for 7 years.

owenlittlejohns · 2024-03-01T01:42:57Z

xarray/core/datatree.py

-    def update(self, other: Dataset | Mapping[str, DataTree | DataArray]) -> None:
+    def update(
+        self, other: Dataset | Mapping[str, DataTree | DataArray | Variable]
+    ) -> None:
        """
        Update this node's children and / or variables.



This is a comment for L937 (can't add it directly)...

This is another mypy issue because the type of k can be a Hashable (from iterating through a Dataset) or a str, but DataTree.name needs a str.

There seem to be a couple of options:

Cast k as a string.

Check that k is a string, and raise an exception if it isn't.

Happy to do either (or something else). I couldn't think of an immediate issue with casting a Hashable to a string, but wanted to check (in case there might be some chance of a weird collision between e.g. 1 and "1").
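The two options can be compared side by side. This is a sketch with hypothetical helper names (`coerce_names`, `require_str_names`), not code from the PR. It shows that the collision worry is real: casting silently merges `1` and `"1"`.

```python
from typing import Dict, Hashable


def coerce_names(children: Dict[Hashable, object]) -> Dict[str, object]:
    """Option 1: cast every key to str (colliding keys are silently merged)."""
    return {str(key): value for key, value in children.items()}


def require_str_names(children: Dict[Hashable, object]) -> Dict[str, object]:
    """Option 2: reject any key that is not already a str."""
    checked = {}
    for key, value in children.items():
        if not isinstance(key, str):
            raise TypeError(
                f"node name must be a str, got {type(key).__name__}: {key!r}"
            )
        checked[key] = value
    return checked


children = {1: "integer-named node", "1": "string-named node"}

# Both keys become "1", so the later entry overwrites the earlier one.
print(coerce_names(children))  # {'1': 'string-named node'}

try:
    require_str_names(children)
except TypeError as err:
    print(f"rejected: {err}")
```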

This is symptomatic of a more general issue that technically all the keys in DataTree should be Hashable to match what xarray "mostly" supports (see #8294 (comment)).

@headtr1ck do you have any thoughts on supporting Hashable for names of child nodes in DataTree?

I think we probably have to support Hashable at some point, otherwise any operation that combines a DataTree with a Dataset or DataArray will be a nightmare to type.

Okay thanks. In that case I defer to Owen on whether it would be easier to do that in this PR or a follow-up one.

I'm helping out while Owen is at a conference. I was going to see if it were easy to add this change to handle Hashable and quickly ran into the PurePosixPath's pathsegments needing to be PathLike.

I think if we want to use NodePath for traversing and altering the tree, the paths are going to need to be coerced into strings.

DataTree's _name can be changed to Hashable, but when it's calling NamedNode's __init__ I think that property setter will have to coerce to a string.
But I can't think of what kind of problems that is going to cause, beyond, say, someone using (7,) as a key/path as well as "(7,)" and having collisions.

~~I think we can handle that by checking collisions. but I'm not 100% sure yet. Am I missing something obvious?~~

I think I was missing something obvious (just making the DataTree's _name Hashable is not going to help here, it will also get converted by the NamedNode pieces.)

This is going to be a question for tomorrow's meeting.

In that meeting we discussed how having Hashables in datatree is possibly more trouble than it is worth, because (a) you can't form a valid path out of a list of hashables (but a list of strings is always fine), and (b) names of groups can't be hashable-like in netCDF or definitely not in Zarr anyway, so there doesn't seem like much of a use case for hashable-named groups at least.
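Point (a) can be illustrated with the standard library alone. The sketch below uses `pathlib.PurePosixPath` directly and a hypothetical `build_path` helper, on the assumption that a path type built on it behaves the same way for non-string segments.

```python
from pathlib import PurePosixPath


def build_path(names):
    """Join node names into one absolute POSIX-style path."""
    return PurePosixPath("/", *names)


# A list of strings always forms a valid path.
print(build_path(["results", "run1", "temperature"]))  # /results/run1/temperature

# Arbitrary hashables (a tuple, an int) are rejected as path segments.
try:
    build_path(["results", (7,), 2024])
except TypeError as err:
    print(f"cannot build a path: {err}")
```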

For now it was decided to see if we could type: ignore our way to an initial implementation that does not support hashables in datatree. Since we can explicitly forbid hashables at tree creation time, this hopefully isn't a ridiculous idea. #8836 was made to track the intent to revisit this.

I did not really follow all the discussions about this PR.
But for me, not accepting Hashables sounds reasonable (even though just the thought of a tree-like data structure sounds like this should be possible for any Hashable names, e.g. a node object with a Hashable name and a number of node children). But agreed, it makes things like getitem unnecessarily complicated.

Probably the best for now is some type ignores or casts.
In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. Dataset[str].

> I did not really follow all the discussions about this PR.

Yeah sorry - a lot of them happened over zoom (see the March 19th meeting notes here).

> sounds like this should be possible for any Hashable names

The problem is not the tree construction, it's serialization (because I can't guarantee being able to make a single valid unix path out of .join'ing all those hashables).

> Probably the best for now is some type ignores or casts.

Cool, that's what we've done.

> In the future I anyway have plans to make Dataset a generic class in variable names (and dimension names for that matter). Then this problem can be solved by returning, e.g. Dataset[str].

That would be awesome!

flamingbear · 2024-03-01T22:17:07Z

xarray/core/datatree.py

@@ -1449,7 +1448,7 @@ def merge_child_nodes(self, *paths, new_path: T_Path) -> DataTree:

    # TODO some kind of .collapse() or .flatten() method to merge a subtree

-    def as_array(self) -> DataArray:
+    def as_dataarray(self) -> DataArray:


Don't forget a quick update to whats-new.rst for this change.

Great catch.

flamingbear · 2024-03-04T17:42:46Z

doc/whats-new.rst

@@ -32,6 +32,10 @@ New Features
 Breaking changes
 ~~~~~~~~~~~~~~~~

+- ``Datatree``'s ``as_array`` renamed ``to_dataarray`` to align with ``Dataset``. (:pull:`8789`)


@TomNicholas should this have been kept out of breaking changes? mostly because it's not actually released? Wasn't sure where I should keep it. reference

Yeah we never had a need before for "breaking changes that aren't breaking yet but will be, but only relevant for previous users of another package" 😅

These datatree breaking changes really only need to be written down somewhere, even a GH issue, so that we can point to them all at once when it comes time to do the grand reveal.

#8807 and reverted. 😬

So this is where we are moving forward with the assumption that DataTree nodes are always named with a string. In this section of `update`, even though we know the key is a str, mypy refuses. I chose an explicit recast over mypy ignores; tell me why that's wrong?

owenlittlejohns · 2024-03-21T03:00:00Z

@TomNicholas - I think this PR is at a point now where it can be reviewed again in earnest. 🤞

TomNicholas

We should get this merged. Most of the changes are small and/or typing (I'm impressed you got mypy to pass!). I only have one substantive comment (about mixins).

TomNicholas · 2024-03-26T04:51:23Z

xarray/core/datatree.py

+from xarray.datatree_.datatree.common import TreeAttrAccessMixin
+from xarray.datatree_.datatree.formatting import datatree_repr
+from xarray.datatree_.datatree.formatting_html import (
+    datatree_repr as datatree_repr_html,
+)
+from xarray.datatree_.datatree.mapping import (
+    TreeIsomorphismError,
+    check_isomorphic,
+    map_over_subtree,
+)
+from xarray.datatree_.datatree.ops import (
    DataTreeArithmeticMixin,
    MappedDatasetMethodsMixin,
    MappedDataWithCoords,
 )
-from .render import RenderTree
-from xarray.core.treenode import NamedNode, NodePath, Tree
+from xarray.datatree_.datatree.render import RenderTree


Might it make sense to not actually import and use any of these xarray.datatree_.datatree imports in this PR? For example, the DataTree object should still pass 99% of its tests without inheriting from the TreeAttrAccessMixin. That way we are still being explicit about what has and has not been "merged and approved".

I think you are losing some of the testing if you do that. Right now we're still collecting and running all of the tests in datatree_. If you pull out all of the pieces that aren't migrated you'll lose that testing. I think we're explicit about what is merged and approved by what is not in datatree_ anymore. And none of this is visible to a user yet. Let's talk in today's meeting and you can convince me otherwise.

TomNicholas · 2024-03-26T16:05:45Z

With @shoyer's blessing (in the meeting) I hereby merge this PR

welcome · 2024-03-26T16:05:56Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

owenlittlejohns commented Feb 27, 2024

TomNicholas added the topic-DataTree label (Related to the implementation of a DataTree class) Feb 28, 2024

flamingbear reviewed Feb 28, 2024

owenlittlejohns commented Mar 1, 2024

flamingbear reviewed Mar 1, 2024

flamingbear reviewed Mar 1, 2024

owenlittlejohns and others added 10 commits March 4, 2024 10:40

Migrate datatree.py module into xarray.core.

ed47ffd

Add correct PR reference to whats-new.rst.

0784196

Revert to using Union in datatree.py.

647582d

Catch remaining unfixed import path.

26f3e61

Fix easier mypy annotations in datatree.py and test_datatree.py.

5e575d7

Straggling mypy change in datatree.py.

6b7a15f

datatree.py comment clean-up.

738bf28

More mypy corrections in datatree.py and test_datatree.py.

6eaa021

Removes unnecessary dict wrapper.

0397e67

DAS-2062: renames as_array -> to_dataarray

c45c56a

flamingbear force-pushed the DAS-2062-migrate-datatree-module-pr branch from 94b17f9 to c45c56a Compare March 4, 2024 17:40

flamingbear reviewed Mar 4, 2024

flamingbear added 6 commits March 4, 2024 10:46

DAS-2062: Updates doc string for Datatree.to_zarr

b333b1d

Accurately reflects the default value now.

DAS-2062: reverts what-new.rst

34e00bd

for "breaking changes that aren't breaking yet but will be, but only relevant for previous users of another package"

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

c171470

DAS-2062: clarify wording in comment.

a0d3702

Change Datatree.to_dataarray to call correctly

869103b

We updated the name but not the function.

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

379bc5c

flamingbear mentioned this pull request Mar 14, 2024

DataTree should support Hashable names. #8836

Open

flamingbear and others added 3 commits March 14, 2024 16:41

Clarify DataTree's names are still strings now.

12590fb

DAS-2062

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

8c3ba13

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

5cc7c41

owenlittlejohns mentioned this pull request Mar 19, 2024

DatasetView class breaks Liskov's rule. #8855

Open

owenlittlejohns added 4 commits March 19, 2024 14:39

Ignore mypy errors for DataTree.ds assignment.

ffa5f71

Fix DataTree.update type hint.

881af78

Final mypy issue - ignore DataTree.get override.

3fc4796

Update contributors in whats-new.rst

14b5c02

owenlittlejohns and others added 3 commits March 21, 2024 08:07

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

3aaf837

Fix GitHub handle.

ce68416

Merge branch 'main' into DAS-2062-migrate-datatree-module-pr

0ae54cb

TomNicholas approved these changes Mar 26, 2024

TomNicholas merged commit 473b87f into pydata:main Mar 26, 2024
29 checks passed

owenlittlejohns deleted the DAS-2062-migrate-datatree-module-pr branch March 26, 2024 17:15

TomNicholas mentioned this pull request Apr 9, 2024

Track merging datatree into xarray #8572

Closed

27 tasks
