Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix freq setter for DatetimeIndex/TimedeltaIndex and deprecate for PeriodIndex #20772

Merged
merged 2 commits into from
Apr 30, 2018

Conversation

jschendel
Copy link
Member

For DatetimeIndex/TimedeltaIndex:

  • Setting freq with a string alias now works
  • Frequencies are validated with invalid frequencies raising
    • Same behavior as when the constructor is called with an invalid frequency

For PeriodIndex:

  • Setting freq is deprecated in favor of asfreq
    • The behavior of the setter would be ambiguous without the extra asfreq params
  • Left the existing (buggy) freq setter behavior as-is
    • Didn't seem worthwhile to try fixing it
  • I don't think that the setter was widely used for PeriodIndex anyways
    • Didn't see any warnings in the tests

msg = ('Setting PeriodIndex.freq has been deprecated and will be '
'removed in a future version; use PeriodIndex.asfreq instead. '
'The PeriodIndex.freq setter is not guaranteed to work.')
warnings.warn(msg, FutureWarning, stacklevel=2)
Copy link
Member Author

@jschendel jschendel Apr 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we really need to deprecate setting this: it's user facing, but I think the only case it works is if you set to the existing frequency. Could potentially remove the setter entirely, causing an AttributeError to be raised, or could raise a custom AttributeError within the setter to give a better message directing users to .asfreq.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IMO it was buggy, and as you say, only if it was the same frequency, it was working. But given that corner case, it maybe does not hurt to do the deprecation I would say.

I think the custom AttributeError message is a good idea in the future.

@codecov
Copy link

codecov bot commented Apr 20, 2018

Codecov Report

Merging #20772 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20772      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         153      153              
  Lines       49313    49322       +9     
==========================================
+ Hits        45259    45268       +9     
  Misses       4054     4054
Flag Coverage Δ
#multiple 90.17% <100%> (ø) ⬆️
#single 41.9% <83.33%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/period.py 92.67% <100%> (+0.06%) ⬆️
pandas/core/indexes/datetimelike.py 96.79% <100%> (+0.09%) ⬆️
pandas/core/indexes/datetimes.py 95.73% <100%> (-0.04%) ⬇️
pandas/core/indexes/timedeltas.py 91.15% <100%> (-0.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 563a6ad...68b8b99. Read the comment docs.

@jreback jreback added Bug Frequency DateOffsets Deprecate Functionality to remove in pandas labels Apr 21, 2018
@jreback
Copy link
Contributor

jreback commented Apr 21, 2018

I would also deprecate settig for DTI/TDI, and direct to using .asfreq. I don't like the idea of disallowing on PI but not the others.

@jschendel
Copy link
Member Author

Can implement .asfreq and deprecate the setters for DTI/TDI in the next day or so.

cc @jorisvandenbossche : Thoughts on this? Your preference in the issue was to allow setting for DTI/TDI and not implement .asfreq.

@jorisvandenbossche
Copy link
Member

Yes, -1 on asfreq. As I tried to argue on the issue, it is something different as PeriodIndex.asfreq, so we should not call it the same. Directly setting the attribute is fine IMO if you need this.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the tz question, this looks good IMO

return None

on_freq = cls._generate(
index[0], None, len(index), None, freq, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is generating the index needed to know if it a correct frequence? Checking the inferred frequency is not enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, see it was also done like that below (still wondering though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's necessary, at least to a certain extent. There are cases where multiple frequencies are valid for a given set of dates, so you can have a valid frequency that's not the inferred frequency, e.g. ['2018-01-01', '2018-01-02', '2018-01-03'] could be calendar day or business day. Users can also define custom frequencies that would need to be validated but would not be the inferred frequency.

There are cases where you don't need to generate the entire index to reject invalid frequencies, e.g. if the second generated value doesn't match. But I don't immediately see how to get around generating the entire index to determine if a frequency is valid. I suppose you could maybe optimize memory usage with huge indexes by doing some type of elementwise validation, only keeping the current generated element in memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I see, that's for sure a valid case.

For me, improving the performance of the validation is not very high priority.

inferred = subarr.inferred_freq
if inferred != freq.freqstr:
on_freq = cls._generate(subarr[0], None, len(subarr), None,
freq, tz=tz, ambiguous=ambiguous)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing the tz in the new version is not needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, and I don't think it was actually needed in the old version either. At the point at which this is being done, subarr and it's elements should already be tz aware, and cls._generate will infer tz from subarr[0].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yep, indeed, that sounds correct (and tested it with a small example)

msg = ('Setting PeriodIndex.freq has been deprecated and will be '
'removed in a future version; use PeriodIndex.asfreq instead. '
'The PeriodIndex.freq setter is not guaranteed to work.')
warnings.warn(msg, FutureWarning, stacklevel=2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IMO it was buggy, and as you say, only if it was the same frequency, it was working. But given that corner case, it maybe does not hurt to do the deprecation I would say.

I think the custom AttributeError message is a good idea in the future.

@jorisvandenbossche
Copy link
Member

I don't like the idea of disallowing on PI but not the others.

I think this difference is acceptable, because freq also has a different notion for both: for PI the freq is required (you cannot have a PI without a freq), and the underlying int values are dependent on this freq. So changing the freq is either buggy because the values and the freq no longer match (current behaviour), or it converts the underlying int values to match the new freq (but this is what asfreq already does, and since this alters the data, I think should not be done by setting an attribute).
On the other hand, for DTI and TDI, the freq is not required, and setting it allows you to add one. But this only changes the freq to the inferred_freq and does not alter the data (as otherwise an error is raised).

@jorisvandenbossche jorisvandenbossche added this to the 0.23.0 milestone Apr 27, 2018
@jorisvandenbossche
Copy link
Member

(note: I tagged for 0.23, but it does not need to hold up the rc)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code change looks fine. I still think should deprecate setting the freq on DTI/TDI. I don't see much value there and the inconcistenty with that you cannot set PI is pretty jaring. Is there an actual good usecase for setting DTI/TDI? If there is we should actually do this via a function rather than an attribute to make it more explicit.

@TomAugspurger
Copy link
Contributor

(note: I tagged for 0.23, but it does not need to hold up the rc)

If this is going in 0.23, then I'd like it to be in the RC. Ideally we only merge bugfixes for the RC to the 0.23 release branch.

@jschendel
Copy link
Member Author

I more or less have code ready that addresses @jreback's comments, but am a bit hesitant to push since there isn't agreement on if the freq setter should be deprecated, and if so what the name of the alternative method would be.

@jschendel
Copy link
Member Author

jschendel commented Apr 29, 2018

This commit would go on top of the current changes in this PR to address @jreback's comments, in case anyone cares too look while things are still being discussed here: jschendel@2fca775

The CI services that are hooked to my repo all ran successfully, and I didn't see any warnings related to my changes in the logs. Didn't want to mess around with this PR too much while there's still ongoing discussion.

@jorisvandenbossche
Copy link
Member

So I have a clear -1 on re-using the name asfreq for this, as I explained before (asfreq on Period is doing something different, I haven't see anybody giving answers to that argument).

If we want to keep the possibility of setting the freq, I don't mind it being the direct setting of the freq attribute (I don't think it is worth to introduce another method for this)

the inconcistenty with that you cannot set PI is pretty jaring.

I don't mind allowing setting on a PI. But if we follow the same rules as for setting freq on DTI/TDI (= only a freq that matches the data, so it does not change the interpretation of the data), then for PI this would mean that only the freq it already has will work, all other freqs will raise an exception.
I am fine with allowing that if you really want that for consistency, but it's not particularly useful.

@jreback
Copy link
Contributor

jreback commented Apr 29, 2018

So I have a clear -1 on re-using the name asfreq for this, as I explained before (asfreq on Period is doing something different, I haven't see anybody giving answers to that argument).

Can you repeat I can't find what you said.

I am fine the the PI change here. The issue comes down to whether we should allow one to change freq on the immutable DTI/TDI. I see no good reason for this. If you want to set the freq, simply create another one, or we make availble .asfreq on TDI/DTI (which I am little less excited about). I think having a method only on PI is fine. However the setting of an attribute is just plane wrong (yes we allow it for .name, but recently removed it for .tz as that was buggy).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 29, 2018

I have said it here #20772 (comment) and here #20678 (comment). But it actually just boils down to "for PI asfreq changes the underlying data (it is for conversion of one frequency to another), for DTI/TDI setting the freq just updates an informative attribute that does not change the data at all"

However the setting of an attribute is just plane wrong

There is nothing "plain wrong" about setting an attribute, it is perfectly valid and depending on the use case often even preferable python code (I think often having a lot of "set_.." methods is seen as "old-java-style" while just having a property with getter and setter under the hood is more pythonic).

@jreback
Copy link
Contributor

jreback commented Apr 29, 2018

However the setting of an attribute is just plane wrong

On an immutable object this is True. We have 1 and only 1 attribute that we allow to be changed on an Index and that is the name. All others give you back a new object. Why break this paradigm?

Further I don't buy your argument. Whether .asfreq happens to do work on a PI, and doesn't on a TDI/DTI is not a compelling argument.

Is there a compelling use-case to actually set freq (whether inplace or return a new object) on a DTI/TDI?

@jschendel
Copy link
Member Author

Is there a compelling use-case to actually set freq (whether inplace or return a new object) on a DTI/TDI?

I made this PR after noticing a few SO questions that worked around the buggy setting behavior. That doesn't necessarily there's actually a valid use case for it, as there may be more direct workarounds for the issues people are having, but at the very least people are trying to use it.

I think one of the issues is that people generally default to using pd.to_datetime for parsing, which as far as I can tell does not have the ability to set a frequency on creation, and so they then want to set a frequency after creation. This might be more of an argument for a freq parameter in pd.to_datetime though. There also isn't much documentation for the freq='infer' option of the DTI constructor, so I also see people wanting to set the inferred frequency after creation, without actually knowing what that frequency is (again, more of a documentation issue).

I could also imagine a use case where one has messy data which needs to be cleaned first, prior to actually setting the freq. So a workflow like parse -> clean -> set frequency. Some cleaning operations may take care of the setting the freq, but there are cases where multiple frequencies are possible, so the correct one may not be set, or a custom frequency may be desired, or maybe no frequency is desired and a user wants to remove by setting None.

Whether .asfreq happens to do work on a PI, and doesn't on a TDI/DTI is not a compelling argument.

To add another data point, Series.asfreq and DataFrame.asfreq will also do work on an underlying DTI (doesn't appear to work for TDI).

Not particularly invested in being for/against reusing the asfreq name though, as there are downsides to both without a clear winner in my mind (potential for confusion with reuse vs. API bloat for new method).

@jreback
Copy link
Contributor

jreback commented Apr 29, 2018

@jschendel thanks for the color. But why are people actually setting the freq. This doesn't actually do anything (I don't think). Anytime we need the freq we actually infer it (again for DTI/TDI)

@jschendel
Copy link
Member Author

jschendel commented Apr 30, 2018

Setting the freq enables a couple operations that fail without one:

In [2]: dti = pd.DatetimeIndex(['20180101', '20180102', '20180103'])

In [3]: dti
Out[3]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)

In [4]: dti.shift(1)
---------------------------------------------------------------------------
NullFrequencyError: Cannot shift with no freq

In [5]: dti + 4
---------------------------------------------------------------------------
NullFrequencyError: Cannot shift with no freq

In [6]: dti = pd.DatetimeIndex(['20180101', '20180102', '20180103'], freq='B')

In [7]: dti.shift(1)
Out[7]: DatetimeIndex(['2018-01-02', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='B')

In [8]: dti + 4
Out[8]: DatetimeIndex(['2018-01-05', '2018-01-08', '2018-01-09'], dtype='datetime64[ns]', freq='B')

I've seen people use operations like this to enable some shorthand for inserting the previous/next logical value into an index, e.g. dti.insert(0, dti[0] - 1)

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

right, we allow an integer to stand in an represent the freq. ok I would say these are ok usecases (if prob not very well documented). but that is not sufficient reason to actually allow setting of an attribute on an immutable object (rather than just having a function give you a new one). its just bad semantics.

I would be ok with

dti.asfreq('D'), makes it pretty explict. We have methods for setting things like tz's, why not freq.

@jorisvandenbossche
Copy link
Member

On an immutable object this is True. We have 1 and only 1 attribute that we allow to be changed on an Index and that is the name. All others give you back a new object. Why break this paradigm?

Exactly because the freq does not change the index, so the index is still immutable, except for those informative attributes like name and freq. So that in sense, freq is very similar to name for a DTI/TDI.

To add another data point, Series.asfreq and DataFrame.asfreq will also do work on an underlying DTI (doesn't appear to work for TDI).

For me that is another reason not to use asfreq, as for Series this is indeed a basic form of resample. So the name asfreq is already too much overloaded, no need to add yet another meaning.

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

ok, so I would be on board with deprecating .freq setting as in this PR, and adding a .set_freq() as an alternative (DTI/TDI, I don't think this makes sense for PI).

This would be much more in line with our handling of names in any event. The PI changes are fine.

Agree that .asfreq might be confusing here for DTI/TDI.

@TomAugspurger TomAugspurger mentioned this pull request Apr 30, 2018
71 tasks
@jorisvandenbossche
Copy link
Member

@jreback Are you fine with merging this now as is? The changes it does are already good (fixing a clear bug in the current setting functionality + deprecation for PI), and the other changes you propose in your last comment can be done separately I think (on top of this PR).

(just as way to include this in the rc)

@jreback
Copy link
Contributor

jreback commented Apr 30, 2018

yes ok to merge now (and let’s open an issue though)

@TomAugspurger
Copy link
Contributor

Opened #20886 for the rest.

@TomAugspurger TomAugspurger merged commit d274d0b into pandas-dev:master Apr 30, 2018
@TomAugspurger
Copy link
Contributor

Thanks all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Deprecate Functionality to remove in pandas Frequency DateOffsets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: freq setter for DatetimeIndex/TimedeltaIndex/PeriodIndex is poorly supported
4 participants