Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typing of abstract base classes #7020

Closed
wants to merge 9 commits into from
Closed

Conversation

headtr1ck
Copy link
Collaborator

This PR adds some typing to several abstract base classes that are used in xarray.

Most of it is working, only one major point I could not figure out:

What is the type of NDArrayMixin.array???
I would appreciate it if someone that has more insight into this would help me.

Several minor open points:

  • What is the return value of ExplicitlyIndexed.__getitem__
  • What is the return value of ExplicitlyIndexed.transpose
  • What is the return value of AbstractArray.data
  • Variable.values seems to be able to return scalar values which is incompatible with the AbstractArray definition.

Overall it seems that typing has helped to find some problems again :)

Mypy should fail for tests, I did not adopt them yet, want to solve the outstanding issues first.

@headtr1ck
Copy link
Collaborator Author

The inheritance in the indexes is quite convoluted.
Trying to fix the runtime issues concerning abstract methods missing opens another rabbit hole ...
Don't know if I can fix that without some type:ignores or Anys :/

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks again!

There's an error around the abstract transpose method, which is causing lots of tests to fail, but that aside let's merge!

self: T_ExplicitlyIndexed, key: Any
) -> T_ExplicitlyIndexed | NumpyIndexingAdapter | np.ndarray | np.datetime64 | np.timedelta64:
...
# TODO: fix the return type? Something more generic?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't it also return something like a str, assuming the index holds strings? Or am I misunderstanding this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the PandasIndexingAdapter which has some special casing for datetimes.

Anyway it is quite confusing to me, when you return an Index and when you return a np.ndarray (or scalar for that matter).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreee it is confusing.

I think these annotations are quite valuable though. Can we merge a version of this PR that just documents current behaviour? Then we can work to clean it up in a different PR.

@headtr1ck
Copy link
Collaborator Author

There's an error around the abstract transpose method, which is causing lots of tests to fail, but that aside let's merge!

I tried removing the abstract transpose method from the Baseclass but then several mypy complaints appeared.
I think one has to add overloads to as_indexable, which however I failed to to.
I got complaints that np.ndarray and pd.Index have overlapping types, might be a mypy bug?

@dcherian
Copy link
Contributor

What is the return value of ExplicitlyIndexed.getitem

I think this is the mess I ran in to in a different PR: #6874 (comment) . It's basically another ExplicitlyIndexed array except when it wraps a BackendArray, in which case you get a concrete array. IMO it should always be an ExplicitlyIndexed array.

@headtr1ck
Copy link
Collaborator Author

I have now removed the requirement to implement a transpose method as only the adapter classes implemented it but not the backend classes (it required two # type:ignores...).
Maybe it would make sense to have two separate base classes for these two use cases?

@@ -51,9 +51,13 @@ def __init__(self, variable_name, datastore):
self.datastore = datastore
self.variable_name = variable_name
array = self.get_variable().data
self.shape = array.shape
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment describing why this was necessary please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't really understand why.
The Abstract Array class defines an abstract shape property.

Mypy infers correctly that setting self.shape in the constructor fulfills this requirement. (Liskov is ok here, since the base class wants at least a readable attribute, and does not care about writable)

However the ABCMeta class does not seem to understand this and raises an error at runtime (even worse than just while type checking...)

I have just noticed that the tests are still failing since many other backends have the same issue.

Not sure what to do with this, maybe it is ok to force external backends to adhere to this standard as well? At least they will notice quickly that something is broken since they cannot create their classes anymore...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was some discussion about this in #6239 as well.

Copy link
Collaborator Author

@headtr1ck headtr1ck Sep 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, the abc inspection checks at a per-class level, so instance attributes cannot be resolved.
I think there are two solutions to this:

  1. Force subclasses to define shape as a property. Should be easy for internal classes, external ones might have to change or they stop working.
  2. Don't use abstractmethod for shape and simply use shape: tuple[int, ...] in the Baseclass to tell type checkers that such a property should exist.

Nr. 1 might be the cleaner solution since shape cannot be set anyway without breaking things.
Unfortunately it could break external libs.

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for deep diving into this!

This is definitely one of the darker corners of Xarray.

@@ -577,19 +583,20 @@ class NDArrayMixin(NdimSizeLenMixin):
"""

__slots__ = ()
array: Any # TODO: what is this type, maybe ExplicitlyIndexed?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that implements dtype (returning np.dtype), shape (returning tuple[int, ...]) and __getitem__ (not exactly sure what the return value needs to be, but it probbaly doesn't matter for this interface). I guess is some very basic version of a "duck numpy array," of the sort that is often used for loading data from different storage backends.

As an aside, the comment in the second paragraph of the docstring able overriding properties should probably be deleted: "A subclass should... set the array property and override one or more of dtype, shape and __getitem__."

Copy link
Collaborator Author

@headtr1ck headtr1ck Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think a Protocol would be the correct solution here.
Maybe a SupportsMinimalArray.
The return type of __getitem__ could be Any for now, would be still an improvement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'll do that in a different PR to keep the changes smaller. Quite sure such a change will lead to many many different problems that need fixing.

self: T_ExplicitlyIndexed, key: Any
) -> T_ExplicitlyIndexed | NumpyIndexingAdapter | np.ndarray | np.datetime64 | np.timedelta64:
...
# TODO: fix the return type? Something more generic?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really figure this out! I would love if we could constrain Xarray's interface more tightly here to simplify the programming model and reduce bugs. Type annotations are probably key!


@property
@abstractmethod
def data(self) -> Any: # TODO: type me?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The informal guarantee is that .data is numpy duck array, i.e., something with an interface matching np.ndarray. Unfortunately, this is a somewhat ill-defined notation at present from a type-system perspective.

In practice, Xarray only guarantees that .data passes these checks:

xarray/xarray/core/utils.py

Lines 259 to 270 in 63ba862

def is_duck_array(value: Any) -> bool:
if isinstance(value, np.ndarray):
return True
return (
hasattr(value, "ndim")
and hasattr(value, "shape")
and hasattr(value, "dtype")
and (
(hasattr(value, "__array_function__") and hasattr(value, "__array_ufunc__"))
or hasattr(value, "__array_namespace__")
)
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think again we could introduce some Protocol here, a bit messy with the two special cases of array_namespace vs function but maybe simply a Union of two Protocols could work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe such a simple Protocol will not solve the typing issues here, since we probably expect that it supports most numpy ndarray methods...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this problem.
I think the really best solution would be to make DataArryay a Generic Type which indicates the underlying datastructure (something like DataArray[np.ndarray]).

Only two problems:

there is no way to represent the data type of coordinates independently and also for Datasets things don't work if the underlying data type is mixed.

This kind of typing would make it more convoluted for beginners (maybe it is ok though).

Maybe I open an extra issue for this.

return _as_array_or_item(self._data)
return _as_array_or_item(
self._data
) # type:ignore[return-value] # TODO: fix this mess
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

See these comments for context:

This function mostly exists because 0-dimensional ndarrays with
dtype=datetime64 are broken :(
https://github.com/numpy/numpy/issues/4337
https://github.com/numpy/numpy/issues/7619
TODO: remove this (replace with np.asarray) once these issues are fixed

numpy/numpy#4337 is actually fixed now, so perhaps we could indeed simply use np.asarray here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second issue is still open.
But to me it seems that everywhere it is expected that Variable.values returns a np.ndarray, so I am not sure where this special casing is required.
Is there a test where this special case can be seen?

@headtr1ck
Copy link
Collaborator Author

I think the implementation of a protocol for duckarrays should be done in another PR.

The mess of ExplicitlyIndexed.__getitem__ should also be fixed in another PR, I don't feel comfortable enough yet with this part of xarray to touch things there :)

@headtr1ck
Copy link
Collaborator Author

Already one problem is showing from "enforcing" a shape property by setting is as an abstract property.
cfgrib comes with its own backend which is used instead of xarrays internal one and it is still using a shape attribute.

Anyone actually know why a abstract property cannot be fulfilled by setting an attribute? At the end I require at least a readable variable, but do not care whether it is settable or not...

Alternative:
Do not use abstract properties but the "old-fashioned" way of raising a NotImplementedError in the ABC instead.

Copy link
Contributor

@Illviljan Illviljan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ABC run, I find them easier to read. https://docs.python.org/3/whatsnew/3.4.html#abc

@@ -144,24 +145,49 @@ def wrapped_func(self, dim=None, **kwargs): # type: ignore[misc]
).strip()


class AbstractArray:
class AbstractArray(metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class AbstractArray(metaclass=ABCMeta):
class AbstractArray(ABC):

Inheriting from ABC doesn't look as scary as metaclass=ABCMeta

@@ -1,6 +1,7 @@
from __future__ import annotations

import warnings
from abc import ABCMeta, abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from abc import ABCMeta, abstractmethod
from abc import ABC, abstractmethod

@@ -235,7 +269,7 @@ def sizes(self: Any) -> Frozen[Hashable, int]:
return Frozen(dict(zip(self.dims, self.shape)))


class AttrAccessMixin:
class AttrAccessMixin(metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class AttrAccessMixin(metaclass=ABCMeta):
class AttrAccessMixin(ABC):

@@ -46,6 +46,7 @@
import re
import sys
import warnings
from abc import ABCMeta, abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from abc import ABCMeta, abstractmethod
from abc import ABC, abstractmethod

@@ -541,15 +542,20 @@ def __repr__(self) -> str:
return f"{type(self).__name__}({list(self)!r})"


class NdimSizeLenMixin:
class NdimSizeLenMixin(metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class NdimSizeLenMixin(metaclass=ABCMeta):
class NdimSizeLenMixin(ABC):

try:
return self.shape[0]
except IndexError:
raise TypeError("len() of unsized object")


class NDArrayMixin(NdimSizeLenMixin):
class NDArrayMixin(NdimSizeLenMixin, metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class NDArrayMixin(NdimSizeLenMixin, metaclass=ABCMeta):
class NDArrayMixin(NdimSizeLenMixin, ABC):

Not sure about the order here. Maybe it's the opposite?

@@ -3,12 +3,24 @@
import enum
import functools
import operator
from abc import ABCMeta, abstractmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from abc import ABCMeta, abstractmethod
from abc import ABC, abstractmethod

T_ExplicitlyIndexed = TypeVar("T_ExplicitlyIndexed", bound="ExplicitlyIndexed")


class ExplicitlyIndexed(metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ExplicitlyIndexed(metaclass=ABCMeta):
class ExplicitlyIndexed(ABC):


class ExplicitlyIndexedNDArrayMixin(NDArrayMixin, ExplicitlyIndexed):

class ExplicitlyIndexedNDArrayMixin(NDArrayMixin, ExplicitlyIndexed, metaclass=ABCMeta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
class ExplicitlyIndexedNDArrayMixin(NDArrayMixin, ExplicitlyIndexed, metaclass=ABCMeta):
class ExplicitlyIndexedNDArrayMixin(NDArrayMixin, ExplicitlyIndexed, ABC):

last or first position?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants