Reimplement DataTree aggregations #9589

Merged: 15 commits, Oct 13, 2024

@shoyer (Member) commented Oct 7, 2024

They now allow for dimensions that are missing on particular nodes, and use Xarray's standard generate_aggregations machinery, like aggregations for DataArray and Dataset.

shoyer added 2 commits October 7, 2024 20:58
They now allow for dimensions that are missing on particular nodes, and
use Xarray's standard generate_aggregations machinery, like aggregations
for DataArray and Dataset.

Fixes pydata#8949, pydata#8963

@shoyer requested a review from TomNicholas, October 7, 2024 12:01
@TomNicholas added the topic-DataTree (Related to the implementation of a DataTree class) label, Oct 7, 2024

@TomNicholas (Member) left a comment

Really cranking out the PRs here @shoyer !

This looks great, my only questions are the same as in #9588.

Comment on lines +1639 to +1648
        for node in self.subtree:
            reduce_dims = [d for d in node._node_dims if d in dims]
            node_result = node.dataset.reduce(
                func,
                reduce_dims,
                keep_attrs=keep_attrs,
                keepdims=keepdims,
                numeric_only=numeric_only,
                **kwargs,
            )
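
The loop above reduces each node over only those requested dimensions that the node actually has. A minimal pure-Python sketch of that filtering step follows, using a hypothetical mapping of node paths to dimension names in place of a real DataTree. The names `node_dims` and `dims_to_reduce_per_node` are illustrative assumptions and are not part of xarray.

```python
def dims_to_reduce_per_node(node_dims, dims):
    """Return, for each node, the requested dims that exist on that node.

    node_dims: hypothetical mapping of node path -> tuple of dim names.
    dims: set of dim names requested for the aggregation.
    """
    return {
        path: [d for d in node_dim_names if d in dims]
        for path, node_dim_names in node_dims.items()
    }


# Toy tree: the child node has no "time" dimension.
node_dims = {
    "/": ("time", "x"),
    "/child": ("x", "y"),
}

# Requesting a reduction over "time" and "x" skips "time" on the child
# instead of raising an error for the missing dimension.
result = dims_to_reduce_per_node(node_dims, {"time", "x"})
print(result)
```

In this sketch a node that lacks a requested dimension simply gets a shorter list, which mirrors the idea of allowing dimensions that are missing on particular nodes.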

Member:

Why does the concern in #9588 (comment) not apply here too?

Member Author:

#9588 has special logic for handling cases where coordinates that formerly had an index are reduced to scalars. That can't happen for aggregation.

@@ -830,6 +830,26 @@ def drop_dims_from_indexers(
)


def dim_arg_to_dims_set(dim: Dims, all_dims: Collection) -> set:

Collaborator:

What exactly is the difference to parse_dims (except that it returns a tuple, which is hardly a reason to add a new method)?

Member Author:

Good catch! I switched to using parse_dims instead.

Member Author:

Hmm, it seems a bit more efficient to keep things as sets. For now I've added parse_dims_as_set, which looks like a slightly better fit for the one other use of parse_dims that I could find.

Were there other intended uses for parse_dims and parse_ordered_dims? I was surprised to only find one use of parse_dims inside Xarray.

Collaborator:

The intention was to ultimately use this in all methods that expect one or more dims.
But given that every method handles multiple dims somewhat differently, nobody was brave enough to change that, because it would be a somewhat breaking change.
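
To illustrate the set-based normalization discussed in this thread, here is a hedged sketch of a helper that turns a dimension argument into a set and validates it against the known dimensions. It is a standalone illustration: `dims_as_set` is a hypothetical name, and its treatment of `None`, `...`, and the error message are assumptions rather than the behavior of xarray's parse_dims or parse_dims_as_set.

```python
from collections.abc import Collection, Hashable, Iterable


def dims_as_set(dim, all_dims: Collection[Hashable]) -> set:
    """Normalize a dim argument into a set of dimension names.

    Assumed conventions (for illustration only):
    - None or ... means "all dimensions"
    - a single string means one dimension
    - any other iterable is treated as several dimensions
    """
    if dim is None or dim is ...:
        return set(all_dims)
    if isinstance(dim, str):
        dims = {dim}
    elif isinstance(dim, Iterable):
        dims = set(dim)
    else:
        # A single non-string hashable, e.g. an integer dimension name.
        dims = {dim}
    missing = dims - set(all_dims)
    if missing:
        raise ValueError(
            f"dimensions {sorted(missing, key=str)} not found in "
            f"{sorted(all_dims, key=str)}"
        )
    return dims
```

Keeping the result as a set makes membership checks such as `d in dims` cheap, at the cost of losing the order in which the caller listed the dimensions.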

@@ -1607,3 +1616,35 @@ def to_zarr(
compute=compute,
**kwargs,
)

def _get_all_dims(self) -> set:

Collaborator:

That sounds quite useful. Maybe this should be exposed as a public API (maybe a property)?

Member Author:

I'm open to someone adding this later!

@headtr1ck (Collaborator), Oct 10, 2024:

I have to admit that I have never used DataTree... But what exactly does DataTree.dims return, and how is it different from this?

Edit: is it full tree vs subtree dims?

Member Author:

DataTree.dims returns all dimensions defined at the base level of the tree. This method also returns dimensions defined on descendant nodes.
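
The distinction can be sketched with a toy tree. The `Node` class and `all_dims` function below are hypothetical stand-ins, not xarray's DataTree or `_get_all_dims`; they only illustrate the difference between the dimensions on one node and the union over that node and its descendants.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree node with its own dimensions."""

    dims: tuple
    children: dict = field(default_factory=dict)


def all_dims(node: Node) -> set:
    """Union of dims on this node and on every descendant node."""
    result = set(node.dims)
    for child in node.children.values():
        result |= all_dims(child)
    return result


root = Node(dims=("time",), children={"child": Node(dims=("x", "y"))})

print(root.dims)               # dims defined on the root node only
print(sorted(all_dims(root)))  # dims from the root and its descendants
```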

@shoyer force-pushed the datatree-aggregation branch from daead56 to 830f797, October 10, 2024 09:25

@TomNicholas (Member):

@shoyer FYI the test failures are real - parse_dims_as_set apparently breaks some expected error messages and also causes a typing error. I'm happy to finish this off if you're busy?

@shoyer (Member Author) commented Oct 11, 2024 via email

@TomNicholas (Member):

I've fixed the typing errors and the test failures. I will merge this tomorrow unless anyone has any objections to my fixes.

@TomNicholas added the plan to merge (Final call for comments) label, Oct 13, 2024
@shoyer merged commit 707231e into pydata:main, Oct 13, 2024
35 checks passed

@shoyer (Member Author) commented Oct 13, 2024

Looks great, thanks Tom!

@TomNicholas mentioned this pull request Oct 13, 2024

Labels: plan to merge (Final call for comments), topic-DataTree (Related to the implementation of a DataTree class)

Successfully merging this pull request may close these issues.

Mapping DataTree methods over nodes with variables for which the args are invalid
3 participants