API: return Index instead of array from DatetimeIndex field accessors (GH15022) #15589

jorisvandenbossche · 2017-03-06T09:43:49Z

closes API: let DatetimeIndex date/time components return a new Index instead of array #15022
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This changes the datetime field accessors of a DatetimeIndex (and PeriodIndex, etc.) to return an Index object instead of a plain array. For example:

# PR

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx
Out[2]: 
DatetimeIndex(['2015-01-01 00:00:00', '2015-01-01 10:00:00',
               '2015-01-01 20:00:00', '2015-01-02 06:00:00',
               '2015-01-02 16:00:00'],
              dtype='datetime64[ns]', freq='10H')

In [3]: idx.hour
Out[3]: Int64Index([0, 10, 20, 6, 16], dtype='int64')

instead of

# master

In [1]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [2]: idx.hour
Out[2]: array([ 0, 10, 20,  6, 16], dtype=int32)
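
A minimal sketch of what the change means for calling code, assuming a pandas version that includes this PR. The exact Index subclass and dtype shown in the repr differ between pandas versions, so the sketch only checks the container type. Passing a Timedelta as freq is a choice made here to avoid depending on a particular frequency alias.

```python
import numpy as np
import pandas as pd

# A Timedelta is used for freq so the example does not rely on an alias spelling.
idx = pd.date_range("2015-01-01", periods=5, freq=pd.Timedelta(hours=10))

hours = idx.hour                # an Index once this change is in place
as_array = np.asarray(hours)    # explicit conversion for code that needs an ndarray

print(type(hours).__name__, list(hours))
```

Code that relied on ndarray-only behaviour can convert explicitly with np.asarray, as in the last assignment.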

jorisvandenbossche · 2017-03-06T09:58:15Z

One failing test I am not sure what to do about. On master, the following preserves the name:

In [1]: idx = pd.date_range("2015-01-01", periods=10, freq='10H', name='name!')

In [2]: idx.map(lambda x: x.hour)
Out[2]: Int64Index([0, 10, 20, 6, 16, 2, 12, 22, 8, 18], dtype='int64', name='name!')

but with this PR the name is no longer preserved. The reason is the DatetimeIndex.map implementation: if the function returns an Index, that Index is returned as-is; otherwise Index.map is used, which passes the attributes through.
There also seems to be a bug in the map implementation:

pandas/pandas/tseries/base.py

Lines 331 to 343 in 5067708

    
    def map(self, f):
        try:
            result = f(self)
            # Try to use this result if we can
            if isinstance(result, np.ndarray):
                self._shallow_copy(result)
            if not isinstance(result, Index):
                raise TypeError('The map function must return an Index object')
            return result
        except Exception:
            return self.asobject.map(f)

(On line 337, _shallow_copy is called, but nothing is done with the result.) cc @nateyoder
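
To illustrate the point about the discarded result, here is a stand-alone sketch, not the pandas implementation. The function name map_preferring_index is hypothetical, pd.Index(..., name=...) stands in for the private _shallow_copy call, and astype(object) stands in for the asobject attribute of the quoted code.

```python
import numpy as np
import pandas as pd


def map_preferring_index(index, f):
    """Hypothetical free-function version of the quoted map method."""
    try:
        result = f(index)
        if isinstance(result, np.ndarray):
            # Keep the wrapped result (the quoted code drops it) and carry the name.
            result = pd.Index(result, name=index.name)
        if not isinstance(result, pd.Index):
            raise TypeError("The map function must return an Index object")
        return result
    except Exception:
        # Fall back to element-wise mapping over the boxed values.
        return index.astype(object).map(f)


idx = pd.date_range("2015-01-01", periods=3,
                    freq=pd.Timedelta(hours=10), name="name!")

# f works on the whole index and returns an ndarray: it is wrapped and named.
vectorised = map_preferring_index(idx, lambda x: np.asarray(x.hour))

# f only works on scalars (an index has no isoformat): the fallback is used.
elementwise = map_preferring_index(idx, lambda ts: ts.isoformat())
```

In this sketch the ndarray branch preserves the name explicitly, which is the behaviour the failing test expects.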

jorisvandenbossche · 2017-03-06T10:42:44Z

Maybe also more in general: should those field accessors preserve the index name?

jreback · 2017-03-06T13:08:26Z

these should have the name of the Series themselves (e.g. the name of the values)

jreback · 2017-03-06T13:12:29Z

pandas/tseries/common.py

@@ -106,6 +106,8 @@ def _delegate_property_get(self, name):
        elif not is_list_like(result):
            return result

+        result = np.asarray(result)
+


hmm, why are you converting back to an ndarray here? I don't think this is necessary

the take_1d 2 lines below needs an array, not an index.
I could only convert it specifically for that, but thought it couldn't do harm to put it here, as it is otherwise passed to Series as values, so will be converted to array anyway.

you can change it to

result.take(...) which will handle this. It was done this way because it was an array originally.

Another reason to convert to an array is so that Series does not take a copy of the values (which it does if you pass an Index object I think)

jorisvandenbossche · 2017-03-06T13:13:54Z

these should have the name of the Series themselves

There is no Series here, it is only about Index

jreback · 2017-03-06T13:15:12Z

pandas/tseries/index.py

@@ -77,16 +77,19 @@ def f(self):

            result = tslib.get_start_end_field(values, field, self.freqstr,
                                               month_kw)
+            result = self._maybe_mask_results(result, convert='float64')
+


I think you can have a single result = self._maybe_mask_results(result, convert='float64') just before returning; it won't do anything to something w/o nan's anyhow (and is more clear code)

the problem is with the weekday_name, which gives strings, and for this the astype('float64') will fail

And I am also not sure why is_leap_year is treated differently, but converting missing values back to NaN would be an API change, as for some reason that attribute currently keeps its missing values as False:

In [14]: idx = pd.DatetimeIndex(['2012-01-01', pd.NaT, '2013-01-01'])

In [15]: idx.is_leap_year
Out[15]: Index([True, False, False], dtype='object')

About is_leap_year: this was done on purpose in #13739, citing @sinhrks "pd.NaT.is_leap_year results in False, as I think users want bool array."

But this does not seem very consistent with the other is_ methods (I would keep this for another issue/PR).

oh, these are a bug in _mask_missing_values then. It needs to ignore object and boolean dtypes. (or better yet, only work on is_numeric_dtype).

If you can't get it to work (in the time you have allowed), lmk and i'll take a look.
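
The suggestion to mask only numeric results can be sketched as follows. This is an illustration under stated assumptions, not the pandas helper: mask_missing is a hypothetical name, and np.issubdtype(..., np.number) is used because it excludes both boolean and object dtypes.

```python
import numpy as np


def mask_missing(result, mask):
    """Hypothetical helper: put NaN at masked positions, numeric results only."""
    result = np.asarray(result)
    mask = np.asarray(mask, dtype=bool)
    # Booleans and strings (object dtype) are returned untouched.
    if not np.issubdtype(result.dtype, np.number) or not mask.any():
        return result
    out = result.astype("float64")  # NaN needs a float container
    out[mask] = np.nan
    return out


mask = np.array([False, True, False])
numeric = mask_missing(np.array([2012, 1, 2013]), mask)
boolean = mask_missing(np.array([True, False, False]), mask)
names = mask_missing(np.array(["Sunday", None, "Tuesday"], dtype=object), mask)
```

With a check like this, string fields such as weekday_name never reach the float conversion that fails in the discussion above.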

jreback · 2017-03-06T13:16:05Z

pandas/tseries/index.py


-        return self._maybe_mask_results(result, convert='float64')
+        return Index(result)


name=self.values.name

jreback · 2017-03-06T13:17:25Z

Maybe also more in general: should those field accessors preserve the index name?

yes

jreback · 2017-03-06T13:18:02Z

obviously when finished, need a sub-section in whatsnew for this, it is technically an API change, though actually should be back-compat

jreback · 2017-03-06T13:40:52Z

these should have the name of the Series themselves
There is no Series here, it is only about Index

no what I mean is the name of the result index should be the name of the original Series values

IOW

In [16]: s = Series(pd.date_range('20130101',periods=3), name='foo')

In [17]: s.dt.day
Out[17]: 
0    1
1    2
2    3
Name: foo, dtype: int64

In [18]: Index(s.dt.day, name='foo')
Out[18]: Int64Index([1, 2, 3], dtype='int64', name='foo')

jorisvandenbossche · 2017-03-06T13:44:30Z

Sorry, I still don't understand. Do you mean that eg the s.index.day attribute takes s.name as its name (instead of s.index.name)?
But an Index can live completely independent of a Series, so I don't see why this should be the case (or do we have examples of that somewhere else in pandas?)

jreback · 2017-03-06T13:46:53Z

Sorry, I still don't understand. Do you mean that eg the s.index.day attribute takes s.name as its name (instead of s.index.name)?
But an Index can live completely independent of a Series, so I don't see why this should be the case (or do we have examples of that somewhere else in pandas?)

yes of course, you are working on the values, so you return the values .name attribute. this is standard practice, for example any type of operation.

This is de-facto the same as doing.

In [1]: s = Series([1,2,3],index=Index(list('abc'), name='bar'), name='foo')

In [2]: s
Out[2]: 
bar
a    1
b    2
c    3
Name: foo, dtype: int64

In [3]: pd.Index(s)
Out[3]: Int64Index([1, 2, 3], dtype='int64', name='foo')

jorisvandenbossche · 2017-03-06T13:49:08Z

@jreback the starting object in this PR is an index, not a series. So the values I pass to Index are coming from an Index, not from a Series.
So I suppose it has to take the name of the Index, but there is no Series involved here.

(self.name is the Index name)

jreback · 2017-03-06T13:53:06Z

@jorisvandenbossche this only affects the delegates (which is always a Series). NOT directly from the index (though actually that should also propagate the name).
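
The two cases that were talked past each other in this exchange can be put side by side. The sketch below assumes a pandas version that includes this PR; the names "bar" and "foo" are just the labels used earlier in the thread.

```python
import pandas as pd

idx = pd.date_range("2015-01-01", periods=3,
                    freq=pd.Timedelta(hours=10), name="bar")

# Directly on the index: the resulting Index carries the index name.
index_name = idx.hour.name

# Through the .dt delegate the values live in a Series, so its name is used.
s = pd.Series(idx, name="foo")
series_name = s.dt.hour.name

print(index_name, series_name)
```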

chris-b1 · 2017-03-06T16:02:02Z

@jorisvandenbossche - I don't feel strongly about this, but given that the dt accessors return a like shaped array, wouldn't it make sense to wrap the results back in a Series? E.g., no different than this:

In [35]: s = pd.Series(['a', 'b', 'c'])

In [36]: s.str.upper()
Out[36]: 
0    A
1    B
2    C
dtype: object

jorisvandenbossche · 2017-03-06T16:06:07Z

@chris-b1 this PR is about Index, not Series (will add better description at the top and whatsnew to make this more clear). So the equivalent example is:

In [55]: s = pd.Index(['a', 'b', 'c'])

In [56]: s.str.upper()
Out[56]: Index(['A', 'B', 'C'], dtype='object')

So in fact I make the datetime fields more consistent with the str methods: the former currently return an array, while the string methods already return the result wrapped in an Index.

chris-b1 · 2017-03-06T16:07:55Z

Oh, yep that makes sense then, sorry I basically only read the title.

jorisvandenbossche · 2017-03-06T16:12:58Z

Ah, yes :-) updated the title to make that more clear (although it is not only for DatetimeIndex, but also PeriodIndex and TimedeltaIndex). And that reminds me, I don't think I already changed this for TimedeltaIndex

TODO:

same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

jreback · 2017-03-06T16:23:24Z

same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

this should be for all datetime-like accessors I think (no exclusions).

jreback · 2017-03-07T14:12:30Z

pandas/tests/indexes/timedeltas/test_timedelta.py

@@ -509,6 +509,10 @@ def test_fields(self):
        tm.assert_series_equal(s.dt.seconds, Series(
            [10 * 3600 + 11 * 60 + 12, np.nan], index=[0, 1]))

+        # preserve name (GH15589)


might be better to add something to

pandas/tests/indexes/datetimelike.py. These are inherited by all of the datetimelike test indexes.

The only problem is that they don't have a common field attribute.

I now ensured I have a test for each of period, timedelta, datetime that checks the name preservation, but indeed, ideally would have a test in datetimelike.py for that.

The only problem is that they don't have a common field attribute.

you should simply run it for index._datetimelike_ops which are defined per-class

but no big deal

That would indeed be a possibility, but I just checked and e.g. freq is also included in this list (and it has a different return type). So I would have to start skipping those, which would also not be that clean.

yeah prob should just define these as fixtures I think, then would make it really easy

https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_datetime_values.py#L30
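
One way to get a shared name-preservation check without a common field attribute is to pair each index type with one of its own fields. This is only a sketch of the idea, written as a plain loop rather than as pytest fixtures; the cases list and check_name_preserved are hypothetical names, and it assumes a pandas version that includes this PR.

```python
import pandas as pd

# Each datetime-like index is paired with a field that exists on that type.
cases = [
    (pd.date_range("2015-01-01", periods=3, name="x"), "day"),
    (pd.period_range("2015-01", periods=3, freq="M", name="x"), "month"),
    (pd.timedelta_range("1 days", periods=3, name="x"), "days"),
]


def check_name_preserved(index, field):
    """Return True if the field accessor gives an Index with the same name."""
    result = getattr(index, field)
    return isinstance(result, pd.Index) and result.name == index.name


results = {type(index).__name__: check_name_preserved(index, field)
           for index, field in cases}
print(results)
```

The same pairs could be supplied through a parametrized fixture, which is closer to what is suggested above.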

jreback · 2017-03-07T14:13:05Z

pandas/tseries/period.py

@@ -52,7 +52,8 @@
 def _field_accessor(name, alias, docstring=None):
    def f(self):
        base, mult = _gfc(self.freq)
-        return get_period_field_arr(alias, self._values, base)
+        result = get_period_field_arr(alias, self._values, base)
+        return Index(result)


name=self.name

yes, still busy :-)

jorisvandenbossche · 2017-03-07T14:49:17Z

same change for TimedeltaIndex field accessors for consistency? (-> days, seconds, total_seconds)

this should be for all datetime-like accessors I think (no exclusions).

Yep, I did that, but also noticed that there are quite some other operations on Index objects that also return an array. Maybe we should have a more general discussion on where to draw the line (but that is for another issue)

jreback · 2017-03-07T14:52:55Z

Yep, I did that, but also noticed that there are quite some other operations on Index objects that also return an array. Maybe we should have a more general discussion on where to draw the line (but that is for another issue)

yes pls create an issue (maybe with checkboxes)?

jorisvandenbossche · 2017-03-08T08:48:26Z

@jreback the failing test is one where a boolean Series is now object dtyped. This is because we don't have a boolean index, so it gets object dtype. But if it is then converted to a Series, this keeps object dtype

In [24]: idx = pd.date_range("2015-01-01", periods=5, freq='10H')

In [25]: idx.is_month_start
Out[25]: Index([True, True, True, False, False], dtype='object')

In [26]: pd.Series(idx).dt.is_month_start
Out[26]: 
0     True
1     True
2     True
3    False
4    False
dtype: object

Is there a good way to deal with this? (I can infer the dtype when it is object within the Properties delegator)

jorisvandenbossche · 2017-03-08T08:53:11Z

That is actually a side effect of this PR I did not consider. Returning an object index with booleans is not really good ..

On second thoughts, it is actually totally not acceptable, because filtering with a mask (boolean indexing) does not work anymore.
So if I want to keep this PR, I will have to distinguish the return type (array vs Index) based on the dtype of the result (bool vs numerical/string), unless we add bool support to Index.

jreback · 2017-03-08T11:21:33Z

when we have a comparison method that returns a boolean array we just return the array directly

see _add_comparison_methods in indexes/base

so i would do the same here, just return the ndarray
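
The reason for keeping boolean fields as plain arrays is that they are mostly used as masks. A small sketch, assuming a pandas version that includes this PR:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2015-01-01", periods=5, freq=pd.Timedelta(hours=10))

mask = idx.is_month_start   # boolean ndarray, not an object-dtype Index
first_day = idx[mask]       # boolean indexing keeps working

print(mask.dtype, len(first_day))
```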


jorisvandenbossche · 2017-03-22T13:29:14Z

@jreback updated this, if you could have a look again

This PR has the consequence that it introduces an inconsistency between the return type of different datetime field accessors (-> array for boolean fields and Index for all others). So we have to be sure we are OK with introducing this.

jreback · 2017-03-22T13:36:54Z

@jorisvandenbossche yes will put some comments.

FYI don't cancel any travis jobs....testing the deduping auto cancellation.

jreback · 2017-03-22T13:37:39Z

This PR has the consequence that it introduces an inconsistency between the return type of different datetime field accessors (-> array for boolean fields and Index for all others). So we have to be sure we are OK with introducing this.

as I said before, I think this is ok. but let me look.

jreback

minor comments. suggestion for consolidating how fields are referenced a bit.

jreback · 2017-03-22T13:37:53Z

doc/source/whatsnew/v0.20.0.txt

@@ -471,6 +471,38 @@ New Behavior:

   s.map(lambda x: x.hour)

+


can you add a ref here

jreback · 2017-03-22T13:38:15Z

doc/source/whatsnew/v0.20.0.txt

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The several datetime-related attributes (see :ref:`here <timeseries.components>`
+for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously


double-backticks on DatetimeIndex etc.

jreback · 2017-03-22T13:38:34Z

doc/source/whatsnew/v0.20.0.txt

+The several datetime-related attributes (see :ref:`here <timeseries.components>`
+for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously
+returned numpy arrays, now they will return a new Index object (:issue:`15022`).
+Only in case of a boolean field, still a boolean array is returned to support


only in the case of a

The last sentence is awkward, see if you can reword.

maybe explicitly list the Index boolean methods? (e.g. is_quarter_start.....)

jreback · 2017-03-22T13:42:17Z

pandas/tests/scalar/test_timestamp.py

+
+        # boolean fields
+        fields = ['is_leap_year']
+        # other boolean fields like 'is_month_start' and 'is_month_end'


I suppose let's make an issue for this NaT enhancement?

Yes, will open an issue for that.

jreback · 2017-03-22T13:44:27Z

pandas/tseries/index.py

@@ -64,6 +64,7 @@ def f(self):
            if self.tz is not utc:
                values = self._local_timestamps()

+        # boolean accessors -> return array


I think it might be worth it to add something like this:

class DatetimeIndex....:
    _boolean_ops = ['is_month_start'......]
    _datetimelike_ops = [....] + _boolean_ops

then you can use that here.

In principle, that is indeed cleaner. But, the problem is that I would still have to distinguish here in another way, as the is_leap_year is also a boolean one, but has to be processed differently. So not sure if that is then worth it.

why is is_leap_year different? seems that it should be the same

that's wrong in the code, it can be treated exactly like the others.

Because the handling of NaNs is different (that is related to the other issue of NaT not having the boolean fields, will open an issue about that). For is_leap_year (which returns False for NaT), the handling of missing values in self._maybe_mask_results(result, convert='float64') would return the wrong result.

ahh ok. pls open a new issue and I will do a followup to fix this up. It is much too special-casey. So good to go when you are ready.

jreback · 2017-03-22T13:45:46Z

pandas/tseries/index.py

        elif field in ['is_leap_year']:
            # no need to mask NaT
            return libts.get_date_field(values, field)
+
+        # non-boolean accessors -> return Index
+        elif field in ['weekday_name']:


same as above: maybe list this in DatetimeIndex, e.g. _other_ops = ['weekday_name'] or something,
just to avoid explicitly listing these in two places.
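
The consolidation proposed in these review comments could look roughly like the sketch below. It is a stand-in class, not the DatetimeIndex code: FieldOps, _field_ops and return_kind are hypothetical names, and the field lists are abbreviated for illustration.

```python
class FieldOps:
    """Hypothetical stand-in for the class attributes proposed in the review."""

    _boolean_ops = ["is_month_start", "is_month_end", "is_leap_year"]
    _other_ops = ["weekday_name"]
    _field_ops = ["year", "month", "day", "hour"]
    _datetimelike_ops = _field_ops + _other_ops + _boolean_ops

    @classmethod
    def return_kind(cls, field):
        """Decide the container type from the lists instead of inline literals."""
        if field in cls._boolean_ops:
            return "ndarray"   # boolean fields stay plain arrays
        if field in cls._datetimelike_ops:
            return "Index"     # everything else is wrapped in an Index
        raise AttributeError(field)


kinds = {name: FieldOps.return_kind(name)
         for name in ("hour", "weekday_name", "is_leap_year")}
print(kinds)
```

Each field name then appears in exactly one list, and the accessor factory only consults the lists. The separate NaT handling for is_leap_year discussed above would still need its own branch.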

jreback · 2017-03-22T17:20:55Z

@jorisvandenbossche let's merge this unless anything else (I'll rebase on top after).

codecov · 2017-03-22T18:32:22Z

Codecov Report

Merging #15589 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15589      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         143      143              
  Lines       49403    49407       +4     
==========================================
- Hits        44967    44960       -7     
- Misses       4436     4447      +11

Impacted Files	Coverage Δ
pandas/tseries/util.py	`100% <100%> (ø)`	⬆️
pandas/tseries/common.py	`88.09% <100%> (-1.07%)`	⬇️
pandas/tseries/converter.py	`62.95% <100%> (ø)`	⬆️
pandas/tseries/period.py	`92.67% <100%> (+0.01%)`	⬆️
pandas/tseries/tdi.py	`90.23% <100%> (ø)`	⬆️
pandas/tseries/index.py	`95.4% <100%> (ø)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/common.py	`90.96% <0%> (-0.34%)`	⬇️
pandas/core/frame.py	`97.86% <0%> (-0.1%)`	⬇️


jreback · 2017-03-22T18:47:08Z

thanks!

… (GH15022)

closes pandas-dev#15022

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes pandas-dev#15589 from jorisvandenbossche/api-dt-fields-index and squashes the following commits:

ffacd38 [Joris Van den Bossche] doc fixes
41728a9 [Joris Van den Bossche] FIX: boolean fields should still return array
6317b6b [Joris Van den Bossche] Add whatsnew
96ed069 [Joris Van den Bossche] Preserve name for PeriodIndex field accessors
cdf6cae [Joris Van den Bossche] Preserve name for DatetimeIndex field accessors
f2831e2 [Joris Van den Bossche] Update timedelta accessors
52f9008 [Joris Van den Bossche] Fix tests
41008c7 [Joris Van den Bossche] API: return Index instead of array from datetime field accessors (GH15022)

jorisvandenbossche added API Design Datetime Datetime data dtype labels Mar 6, 2017

jorisvandenbossche force-pushed the api-dt-fields-index branch from 3ba410a to fc6f593 Compare March 6, 2017 10:40

jreback reviewed Mar 6, 2017

jorisvandenbossche changed the title ~~[WIP] API: return Index instead of array from datetime field accessors (GH15022)~~ [WIP] API: return Index instead of array from DatetimeIndex field accessors (GH15022) Mar 6, 2017

jreback reviewed Mar 7, 2017

jorisvandenbossche added this to the 0.20.0 milestone Mar 7, 2017

jorisvandenbossche changed the title ~~[WIP] API: return Index instead of array from DatetimeIndex field accessors (GH15022)~~ API: return Index instead of array from DatetimeIndex field accessors (GH15022) Mar 7, 2017

jorisvandenbossche added 6 commits March 22, 2017 13:42

API: return Index instead of array from datetime field accessors (GH1…

41008c7

…5022)

Fix tests

52f9008

Update timedelta accessors

f2831e2

Preserve name for DatetimeIndex field accessors

cdf6cae

Preserve name for PeriodIndex field accessors

96ed069

Add whatsnew

6317b6b

jorisvandenbossche force-pushed the api-dt-fields-index branch from 094b6ab to dad30a2 Compare March 22, 2017 13:24

jreback approved these changes Mar 22, 2017

FIX: boolean fields should still return array

41728a9

jorisvandenbossche force-pushed the api-dt-fields-index branch from dad30a2 to 41728a9 Compare March 22, 2017 14:16

jorisvandenbossche mentioned this pull request Mar 22, 2017

Return value of boolean datetime fields of NaT: False or NaN ? #15781

Closed

doc fixes

ffacd38

jreback closed this in 1a266ee Mar 22, 2017

jreback mentioned this pull request Apr 4, 2017

API: support a BooleanIndex #15890

Closed

