Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review/Overhaul __array__ and view methods #23569

Closed
jbrockmendel opened this issue Nov 8, 2018 · 12 comments · Fixed by #23593
Closed

Review/Overhaul __array__ and view methods #23569

jbrockmendel opened this issue Nov 8, 2018 · 12 comments · Fixed by #23593
Labels
Blocker Blocking issue or pull request for an upcoming release Compat pandas objects compatability with Numpy or Python functions Internals Related to non-user accessible pandas implementation
Milestone

Comments

@jbrockmendel
Copy link
Member

#23524 fixes a problem with DatetimeIndex.__array__; we should do a thorough check of the __array__ methods on other classes to make sure (and test) that they make sense

@jorisvandenbossche jorisvandenbossche changed the title Review/Overhauld __array__ and view methods Review/Overhaul __array__ and view methods Nov 8, 2018
@jorisvandenbossche
Copy link
Member

So question is what do we want to support in __array__ (for the datetime-like EAs):

  • default np.array(EA) will return an array of M8/m8 dtype, compatible with what it does for Series/Index (on master this currenlty gives object dtype, due to using the default iteration), and object array for PeriodArray
  • np.array(EA, dtype=object) converts to object array of pandas Timestamp/Timedelta scalars.

But do we also want to support np.array(EA, dtype=int) to convert to i8/ordinals ?

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Nov 9, 2018
@jorisvandenbossche
Copy link
Member

But do we also want to support np.array(EA, dtype=int) to convert to i8/ordinals ?

For datetime/timedelta, this will happen automatically if we do nothing. So the only other option is to raise an error, but that does not seem necessary I think.
The question is still whether to explicitly enable it for PeriodArray.

@gfyoung gfyoung added Internals Related to non-user accessible pandas implementation Compat pandas objects compatability with Numpy or Python functions labels Nov 11, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 25, 2018
@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Dec 2, 2018
@TomAugspurger
Copy link
Contributor

Sorry I missed this earlier. I'm concerned about the default behavior on DatetimeIndex.__array__ and Series[datetime64[ns, tz]] with tz-aware data. Both of these convert to UTC, drop the tz, and return an M8[ns]-dtype array

In [7]: idx = pd.date_range('2000', periods=4, tz='US/Central')

In [8]: ser = pd.Series(idx)

In [9]: np.asarray(ser)
Out[9]:
array(['2000-01-01T06:00:00.000000000', '2000-01-02T06:00:00.000000000',
       '2000-01-03T06:00:00.000000000', '2000-01-04T06:00:00.000000000'],
      dtype='datetime64[ns]')

In [10]: np.asarray(idx)
Out[10]:
array(['2000-01-01T06:00:00.000000000', '2000-01-02T06:00:00.000000000',
       '2000-01-03T06:00:00.000000000', '2000-01-04T06:00:00.000000000'],
      dtype='datetime64[ns]')

This unfortunately conflicts with __iter__.

In [11]: list(ser)
Out[11]:
[Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')]

In [12]: list(idx)
Out[12]:
[Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
 Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')]

Do we care about that (I do)? Should we deprecate the current behavior?

diff --git a/pandas/core/arrays/datetimes.py b/pandas/core/arrays/datetimes.py
index e731dd33f..0f0577b1a 100644
--- a/pandas/core/arrays/datetimes.py
+++ b/pandas/core/arrays/datetimes.py
@@ -404,6 +404,9 @@ class DatetimeArrayMixin(dtl.DatetimeLikeArrayMixin,
     # Array-Like / EA-Interface Methods
 
     def __array__(self, dtype=None):
+        if dtype is None:
+            warnings.warn("bad", FutureWarning)
+            dtype = object
         if is_object_dtype(dtype):
             return np.array(list(self), dtype=object)
         elif is_int64_dtype(dtype):

usage:

In [6]: np.asarray(idx)
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py:408: FutureWarning: bad
  warnings.warn("bad", FutureWarning)
Out[6]:
array([Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')],
      dtype=object)

In [7]: np.asarray(ser)
/Users/taugspurger/sandbox/pandas/pandas/core/arrays/datetimes.py:408: FutureWarning: bad
  warnings.warn("bad", FutureWarning)
Out[7]:
array([Timestamp('2000-01-01 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-02 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-03 00:00:00-0600', tz='US/Central', freq='D'),
       Timestamp('2000-01-04 00:00:00-0600', tz='US/Central', freq='D')],
      dtype=object)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 4, 2018

As an example of the kind of strange knock-on effects this has, see https://github.com/pandas-dev/pandas/pull/23601/files#diff-425b4da47d01dc33d86c5c697e196b70R1128.

Ideally we'd just write array = np.asarray(values) there, but we had to special case categorical there, since it dispatches to .categories.__array__, which did the "wrong" thing for tz-aware datetime data.

@jreback
Copy link
Contributor

jreback commented Dec 4, 2018

I agree here. np.asarray() should preserve dtype if possible. so ok with making this consistent.

@TomAugspurger TomAugspurger added the Blocker Blocking issue or pull request for an upcoming release label Dec 8, 2018
@TomAugspurger
Copy link
Contributor

Is changing DatetimeIndex__array__ and `Series[datetime64[ns, tz]] to be warn a blocker for 0.24.0 (I think it should be).

I have a branch started. Just need to go through and fix up all the warnings.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 2, 2019

Picking this up now. I'm it's an annoyingly large diff to get things through.

I'm going with @jorisvandenbossche's suggestion to

  1. Make DTA.__array__ have the future behavior now (tz-aware -> object-dtype ndarray)
  2. Implement the deprecation in DatetimeIndex.__array__ and Series.__array__.

This proved to be much easier that doing the deprecation that DatetimeArray.__array__, and modifying the callers to pass dtype= to avoid the warning / avoid changing behavior. In some cases, it was hard to determine what dtype should be passed.

Unfortunately, Series.__array__ uses Series.get_values, so I need to plumb the dtype keyword all the way through to internals. I think I had to add the dtype keyword to get_values and to_dense as well. It's unavoidable I think. Will try to have a PR later today so people can take a look.

@jorisvandenbossche
Copy link
Member

Unfortunately, Series.array uses Series.get_values

In light of deprecating Series.get_values, we can directly use the internals here (or, do np.array(Series.array) ?)

@TomAugspurger
Copy link
Contributor

I didn't realize Series.get_values was being deprecated.

But that's a decent idea I'll try exploring. IIUC the main purpose of get_values was special handling for Categorical and Sparse at the block level, which should no longer be necessary.

@jorisvandenbossche
Copy link
Member

I didn't realize Series.get_values was being deprecated.

I don't think there is a PR, but it certainly is on the wishlist :)

@jorisvandenbossche
Copy link
Member

See #19617. I think the discussion there is mainly resolved as we now have the explicit .array and np.array/.to_numpy() to cover the two use cases.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 3, 2019 via email

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 3, 2019
This deprecates the current behvior when converting tz-aware Series
or Index to an ndarray. Previously, we converted to M8[ns], throwing
away the timezone information. In the future, we will return an
object-dtype array filled with Timestamps, each of which has the correct
tz.

```python
In [1]: import pandas as pd; import numpy as np

In [2]: ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))

In [3]: np.asarray(ser)
/bin/ipython:1: FutureWarning: Converting timezone-aware DatetimeArray to timezone-naive ndarray with 'datetime64[ns]' dtype. In the future, this will return an ndarray with 'object' dtype where each element is a 'pandas.Timestamp' with the correct 'tz'.
        To accept the future behavior, pass 'dtype=object'.
        To keep the old behavior, pass 'dtype="datetime64[ns]"'.
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3
Out[3]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
```

xref pandas-dev#23569
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Compat pandas objects compatability with Numpy or Python functions Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants