Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Type annotations for Index #36708

Open
itamarst opened this issue Sep 28, 2020 · 9 comments
Open

ENH: Type annotations for Index #36708

itamarst opened this issue Sep 28, 2020 · 9 comments
Labels
Enhancement Typing type annotations, mypy/pyright type checking

Comments

@itamarst
Copy link

itamarst commented Sep 28, 2020

Is your feature request related to a problem?

As described in #26766, it would be good to have type annotations for Index.

Describe the solution you'd like

I would like the type of the publicly-exposed sub-objects to be part of the Index type. For example, these two Index instances contain Timestamp from the user's perspective, regardless of the internal implementation:

>>> import pandas as pd
>>> import numpy as np
>>> import datetime
>>> pd.Index([np.datetime64(100, "s")])[0]
Timestamp('1970-01-01 00:01:40')
>>> pd.Index([datetime.datetime(2020, 9, 28)])[0]
Timestamp('2020-09-28 00:00:00')

Because the fact Index returns different subclasses of itself, getting the type checker to can acknowledge that correctly is tricky. What's more, you'll notice a naive "Index[S] based on fact it's created with List[S]" won't work: Index([np.datetime64(100, "s"]) contains Timestamp instances, at least as far as the user is concerned, and Timestamp is very much not a np.datetime64.

Here is the only solution I've come up with that works; see also python/mypy#9482, there is no way at the moment to have this work without breaking up Index into a parent class that does __new__ and a subclass that does all the work.

The basic idea is that you have a protocol, IndexType. This is a sketch, because demonstrating this with real code would be that much harder:

from typing import TypeVar, Generic, List, Union, overload
from typing_extensions import Protocol
from datetime import datetime

T = TypeVar("T", covariant=True)  # need to look into why covariant is required, might not be, not fundamental 
S = TypeVar("S")

class datetime64(int):
    """Stand-in for np.datetime64."""


class IndexType(Protocol[T]):
    def first(self) -> T: ...


class Index:

    @overload
    def __new__(cls, values: List[datetime64]) -> "Datetime64Index": ...
    @overload
    def __new__(cls, values: List[datetime]) -> "Datetime64Index": ...
    @overload
    def __new__(cls, values: List[S]) -> "DefaultIndex[S]": ...

    def __new__(cls, values):
        if type(values[0]) in (datetime, datetime64):
            cls = Datetime64Index
        else:
            cls = DefaultIndex
        return object.__new__(cls)


class DefaultIndex(Index, Generic[S]):
    def __init__(self, values: List[S]):
        self.values = values

    def first(self) -> S:
        return self.values[0]


class Datetime64Index(DefaultIndex):

    def __init__(self, values: Union[List[datetime], List[datetime64]]):
        self.values : List[datetime64] = [
            datetime64(o.timestamp()) if isinstance(o, datetime) else o
            for o in values
        ]

    def first(self) -> datetime:
        return datetime.fromtimestamp(self.values[0])


# Should work
a: IndexType[datetime] = Index([datetime64(100)])
b: IndexType[datetime] = Index([datetime(2000, 10, 20)])
c: IndexType[bool] = Index([True])

# Should complain
d: IndexType[datetime] = Index(["a"])
e: IndexType[bool] = Index(["a"])

API breaking implications

Hopefully nothing.

Describe alternatives you've considered

I tried lots and lots of other ways of structuring this. None of them worked, except this variant.

Additional context

Part of my motivation here is to help use type checking so that users can check whether switching from Pandas to Pandas-alikes like Modin/Dask/etc.. works, by having everyone use matching type annotations.

As such, just saying "this API accepts an Index" is not good enough, because some Pandas APIs have e.g. special cases for Index[bool], you really do need to have some way of indicating the Index type for annotations to be sufficiently helpful.

What I'd like

Some feedback on whether this approach is something you all would be OK with. If so, I can try to implement it for the real Index classes.

@itamarst itamarst added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 28, 2020
@jreback
Copy link
Contributor

jreback commented Sep 28, 2020

we have almost 0 support for actually containing np.datetime64 in an index itself
so your ask is very confusing

we support DatetimeIndex for this purpose

@itamarst
Copy link
Author

I am aware that the index can't contain an np.datetime64, that's the motivation for this particular design. So let me try to back and up and explain how I got to this design.

To begin with, a naive typing scheme for Index would look like this:

class Index(Generic[T]):
    def __init__(self, List[T]): ...

    def __getitem__(self, i) -> T: ...

We can try it out, and it seems to work just fine initially:

>>> from pandas import Index
>>> i = Index([True, False])
>>> i[0]
True
>>> i[1]
False
>>> i = Index(["hello", "world"])
>>> i[0]
'hello'

Success! Except...

>>> from numpy import datetime64
>>> i = Index([datetime64(100000, "s")])
>>> i[0]
Timestamp('1970-01-02 03:46:40')
>>> isinstance(i[0], datetime64)
False

So it turns out the naive typing scheme doesn't work: you need to deal with the fact that the items you get out of the index might be a different type than what went in. And my sketch in my initial issue description is an attempt to demonstrate how one might be able to do it.

@simonjayhawkins
Copy link
Member

Thanks @itamarst we always welcome PRs adding type annotations.

A couple of points to note.

error: Missing type parameters for generic type "DefaultIndex" [type-arg] with --disallow-any-generics. We don't yet have this on CI but normally request that type parameters are added during review. This will give errors such as Return type "datetime" of "first" incompatible with return type "datetime64" in supertype "DefaultIndex" [override] or similar depending on the type parameter.

we currently support Python 3.7 and don't have typing_extensions as a required dependency.

@itamarst
Copy link
Author

Could you expand on "We don't yet have this on CI but normally request that type parameters are added during review." I don't understand what that means, sorry. And why are you disallowing generics?

@simonjayhawkins
Copy link
Member

from https://mypy.readthedocs.io/en/stable/command_line.html#cmdoption-mypy-disallow-any-generics

This flag disallows usage of generic types that do not specify explicit type parameters.

in the prototype above class Datetime64Index(DefaultIndex) is equivalent to class Datetime64Index(DefaultIndex[Any])

Is the Any hiding other issues with this prototype?

@itamarst
Copy link
Author

Ah, I see. It ought to be subclassing DefaultIndex[datetime], yeah.

Do you have infrastructure for testing type checking? Part of what would need doing is assertions saying "this is type checked as correct, that is type checked as incorrect".

@simonjayhawkins
Copy link
Member

Do you have infrastructure for testing type checking?

we run mypy on the codebase in CI for checking internal consistency, see for example https://github.com/pandas-dev/pandas/runs/1183480483

Part of what would need doing is assertions saying "this is type checked as correct, that is type checked as incorrect".

We don't have any checks for the public facing api (only from internal calls to public functions)

#28142 has been opened for discussion on how types will be made available for pandas users. I don't recall any discussion on a test suite for testing the public api.

In our type annotation journey, even after lots of effort and type annotations added, there is still so much to do. pandas is a large codebase. we still have many unannotated functions not being checked (we have many modules with check_untyped_defs=False), none of our cython code has stubs, so we don't have our core libraries typed, and all our numpy usages currently resolve to Any.

@simonjayhawkins simonjayhawkins added Typing type annotations, mypy/pyright type checking and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 29, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Sep 29, 2020
@itamarst
Copy link
Author

itamarst commented Oct 7, 2020

One useful thing is that NumPy master now has type annotations, and next release will hopefully make that public, so that should help some.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@twoertwein
Copy link
Member

pandas-stubs recently added support for Index[int] and so on by making IndexOpsMixin generic.

It will probably take quite some time until this change makes it into pandas (waiting to see whether people report issues about it on pandas-stubs and it would be a rather large PR).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

No branches or pull requests

5 participants