-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: return Index instead of array from DatetimeIndex field accessors (GH15022) #15589
API: return Index instead of array from DatetimeIndex field accessors (GH15022) #15589
Conversation
One failing test I am not sure what to do about. On master, the following preserves the name:
but now not anymore. The reason for this is the Lines 331 to 343 in 5067708
shallow_copy is called, but nothing is done with the result) cc @nateyoder
|
3ba410a
to
fc6f593
Compare
Maybe also more in general: should those field accessors preserve the index name? |
these should have the name of the |
@@ -106,6 +106,8 @@ def _delegate_property_get(self, name): | |||
elif not is_list_like(result): | |||
return result | |||
|
|||
result = np.asarray(result) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why are you converting back to an ndarray here? I don't think this necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the take_1d
2 lines below needs an array, not an index.
I could only convert it specifically for that, but thought it couldn't do harm to put it here, as it is otherwise passed to Series as values, so will be converted to array anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can change it to
result.take(...)
which will handle this. It was one this way because it was an array originally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason to convert to an array is so that Series
does not take a copy of the values (which it does if you pass an Index object I think)
There is no Series here, it is only about Index |
pandas/tseries/index.py
Outdated
@@ -77,16 +77,19 @@ def f(self): | |||
|
|||
result = tslib.get_start_end_field(values, field, self.freqstr, | |||
month_kw) | |||
result = self._maybe_mask_results(result, convert='float64') | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can have a single result = self._maybe_mask_results(result, convert='float64')
just before returning; it won't do anything to something w/o nan's anyhow (and is more clear code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem is with the weekday_name, which gives strings, and for this the astype('float64')
will fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I am also not sure why the is_leap_year
is treated differently, but converting missing values back to NaN would be an API change, as for some reason that attribute currently keeps it missing values as False:
In [14]: idx = pd.DatetimeIndex(['2012-01-01', pd.NaT, '2013-01-01'])
In [15]: idx.is_leap_year
Out[15]: Index([True, False, False], dtype='object')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, these are a bug in _mask_missing_values
then. It needs to ignore object
and boolean
dtypes. (or better yet, only work on is_numeric_dtype
).
If you can't get it work (in time you have allowed), lmk and i'll take a look.
pandas/tseries/index.py
Outdated
|
||
return self._maybe_mask_results(result, convert='float64') | ||
return Index(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name=self.values.name
yes |
obviously when finished, need a sub-section in whatsnew for this, it is technically an API change, though actually should be back-compat |
no what I mean is the name of the result index should be the name of the original Series values IOW
|
Sorry, I still don't understand. Do you mean that eg the |
yes of course, you are working on the This is de-facto the same as doing.
|
@jreback the starting object in this PR is an index, not a series. So the values I pass to ( |
@jorisvandenbossche this only affects the delegates (which is always a Series). NOT directly from the index (though actually that should also propogate the name). |
@jorisvandenbossche - I don't feel strongly about this, but given that the dt accessors return a like shaped array, wouldn't it make sense to wrap the results back in a In [35]: s = pd.Series(['a', 'b', 'c'])
In [36]: s.str.upper()
Out[36]:
0 A
1 B
2 C
dtype: object |
@chris-b1 this PR is about Index, not Series (will add better description at the top and whatsnew to make this more clear). So the equivalent example is:
So in fact I make the datetime fields more consistent with the the str methods, as the first now return an array, while the string methods already return the result wrapped in an Index. |
Oh, yep that makes sense then, sorry I basically only read the title. |
Ah, yes :-) updated the title to make that more clear (although it is not only for DatetimeIndex, but also PeriodIndex and TimedeltaIndex). And that reminds me, I don't think I already changed this for TimedeltaIndex TODO:
|
this should be for all datetime-like accessors I think (no exclusions). |
@@ -509,6 +509,10 @@ def test_fields(self): | |||
tm.assert_series_equal(s.dt.seconds, Series( | |||
[10 * 3600 + 11 * 60 + 12, np.nan], index=[0, 1])) | |||
|
|||
# preserve name (GH15589) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better to add something to
pandas/tests/indexes/datetimelike.py. These are inherited by all of the datetimelike test indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem is that they don't have a common field attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now ensured I have a test for each of period, timedelta, datetime that checks the name preservation, but indeed, ideally would have a test in datetimelike.py for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only problem is that they don't have a common field attribute.
you should simply run it for index._datetimelike_ops
which are defined per-class
but no big deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would indeed be a possibility, and just checked and eg also freq
is included in this list (which has a different return type). So would start skipping those, which would also not be that clean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah prob should just define these as fixtures I think, then would make it really easy
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/series/test_datetime_values.py#L30
pandas/tseries/period.py
Outdated
@@ -52,7 +52,8 @@ | |||
def _field_accessor(name, alias, docstring=None): | |||
def f(self): | |||
base, mult = _gfc(self.freq) | |||
return get_period_field_arr(alias, self._values, base) | |||
result = get_period_field_arr(alias, self._values, base) | |||
return Index(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name=self.name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, still busy :-)
Yep, I did that, but also noticed that there are quite some other operations on Index objects that also return an array. Maybe we should have a more general discussion on where to draw the line (but that is for another issue) |
yes pls create an issue (maybe with checkboxes)? |
@jreback the failing test is one where a boolean Series is now object dtyped. This is because we don't have a boolean index, so it gets object dtype. But if it is then converted to a Series, this keeps object dtype
Is there a good way to deal with this? (I can infer the dtype when it is object within the Properties delegator) |
That is actually a side effect of this PR I did not consider. Returning an object index with booleans is not really good .. On second thoughts, it is actually totally not acceptable, because filtering with a mask (boolean indexing) does not work anymore. |
ehen we have a comparison method that returns a boolean array we just return the array directly see _add_comparison_methods in indexes/base so i would do the same here, just return the ndarray |
094b6ab
to
dad30a2
Compare
@jreback updated this, if you could have a look again This PR has the consequence that it introduces an inconsistency between the return type of different datetime field accessors (-> array for boolean fields and Index for all others). So we have to be sure we are OK with introducing this. |
@jorisvandenbossche yes will put some comments. FYI don't cancel any travis jobs....testing the deduping auto cancellation. |
as I said before, I think this is ok. but let me look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. suggestion for consolidating how fields are referenced a bit.
@@ -471,6 +471,38 @@ New Behavior: | |||
|
|||
s.map(lambda x: x.hour) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a ref here
doc/source/whatsnew/v0.20.0.txt
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The several datetime-related attributes (see :ref:`here <timeseries.components>` | ||
for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double-backticks on DatetimeIndex
etc.
doc/source/whatsnew/v0.20.0.txt
Outdated
The several datetime-related attributes (see :ref:`here <timeseries.components>` | ||
for an overview) of DatetimeIndex, PeriodIndex and TimedeltaIndex previously | ||
returned numpy arrays, now they will return a new Index object (:issue:`15022`). | ||
Only in case of a boolean field, still a boolean array is returned to support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only in the case of a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last sentence is awkward, see if you can reword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe explicity list the Index boolean methods? (e.g. is_quarter_start.....)
|
||
# boolean fields | ||
fields = ['is_leap_year'] | ||
# other boolean fields like 'is_month_start' and 'is_month_end' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose let's make an issue for this NaT enhancement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will open an issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -64,6 +64,7 @@ def f(self): | |||
if self.tz is not utc: | |||
values = self._local_timestamps() | |||
|
|||
# boolean accessors -> return array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be worth it to add something like this:
class DatetimeIndex....:
_boolean_ops = ['is_month_start'......]
_datetimelike_ops = [....] + _boolean_ops
then you can use that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, that is indeed cleaner. But, the problem is that I would still have to distinguish here in another way, as the is_leap_year
is also a boolean one, but has to be processed differently. So not sure if that is then worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is is_leap_year
different? seems that it should be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's wrong in the code, it can be treated exactly like the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the handling of NaNs is different (that is related to the other issue of NaT not having the boolean fields, will open an issue about that). For is_leap_year
(which returns False for NaT), the handling of missing values in self._maybe_mask_results(result, convert='float64')
would return the wrong result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh ok. pls open a new issue and I will do a followup to fixup this. It much too specially casey. So good to go when you are ready.
elif field in ['is_leap_year']: | ||
# no need to mask NaT | ||
return libts.get_date_field(values, field) | ||
|
||
# non-boolean accessors -> return Index | ||
elif field in ['weekday_name']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above maybe list this in DatetimeIndex, maybe _other_ops = ['weekday_name']
or something
just to avoid explicity listing these in two places.
dad30a2
to
41728a9
Compare
@jorisvandenbossche let's merge this unless anything else (I'll rebase on top after). |
Codecov Report
@@ Coverage Diff @@
## master #15589 +/- ##
==========================================
- Coverage 91.02% 90.99% -0.03%
==========================================
Files 143 143
Lines 49403 49407 +4
==========================================
- Hits 44967 44960 -7
- Misses 4436 4447 +11
Continue to review full report at Codecov.
|
thanks! |
… (GH15022) closes pandas-dev#15022 Author: Joris Van den Bossche <jorisvandenbossche@gmail.com> Closes pandas-dev#15589 from jorisvandenbossche/api-dt-fields-index and squashes the following commits: ffacd38 [Joris Van den Bossche] doc fixes 41728a9 [Joris Van den Bossche] FIX: boolean fields should still return array 6317b6b [Joris Van den Bossche] Add whatsnew 96ed069 [Joris Van den Bossche] Preserve name for PeriodIndex field accessors cdf6cae [Joris Van den Bossche] Preserve name for DatetimeIndex field accessors f2831e2 [Joris Van den Bossche] Update timedelta accessors 52f9008 [Joris Van den Bossche] Fix tests 41008c7 [Joris Van den Bossche] API: return Index instead of array from datetime field accessors (GH15022)
git diff upstream/master | flake8 --diff
This changes the datetime field accessors of a DatetimeIndex (and PeriodIndex, etc) to return an Index object instead of a plain array:
So for example:
instead of